• A Free Database of University Web Links: Data Collection Issues

      Thelwall, Mike (2003)
      This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With the increasing interest in web links among information and computer scientists, this is an attempt to make available raw data for research that is not reliant upon the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues in running an accurate web crawler are also discussed. Access is also given to the normally hidden crawler stop list, with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is not socially or empirically desirable because of the existence of database-generated areas of the web and the proliferation of mirroring.
      This paper describes a free set of databases of the link structures of the university web sites from a selection of countries, as created by a specialist information science web crawler.
    • Disciplinary Differences in Academic Web Presence – A Statistical Study of the UK

      Thelwall, Mike; Price, Liz (Walter de Gruyter, 2003)
      The Web has become an important tool for scholars to publicise their activities and disseminate their findings. In the information age, those who do not use it risk being bypassed. In this paper we introduce a statistical technique to assess the extent to which the broad spectrum of research areas is visible online in UK universities. Five broad subject categories are used for research, and inlink counts are used as indicators of online visibility or impact. The approach is designed to give more complete subject coverage than previous studies and to avoid the conceptual difficulties of a page classification approach, although one is used for triangulation. The results suggest that Science and Engineering dominate university Web presences, but with Humanities and Arts also achieving a high presence relative to their size, showing that high Web impact does not have to be restricted to the sciences. Research funding bodies should now consider whether action needs to be taken to ensure that opportunities are not being missed in the lower Web impact areas.
    • Which academic subjects have most online impact? A pilot study and a new classification process

      Thelwall, Mike; Vaughan, Liwen; Cothey, Viv; Li, Xuemei; Smith, Alastair G. (MCB UP Ltd, 2003)
      The use of the Web by academic researchers is discipline-dependent and highly variable. It is increasingly central for sharing information, disseminating results and publicising research projects. This pilot study seeks to identify the subjects that have the most impact on the Web, and to look for national differences in online subject visibility. The highest impact sites were from computing, but there were major national differences in the impact of engineering and technology sites. Another difference was that Taiwan had more high impact non-academic sites hosted by universities. As a pilot study, the classification process itself was also investigated and the problems of applying subject classification to academic Web sites discussed. The study draws out a number of issues in this regard that have no simple solutions and that point to the need to interpret the results with caution.