• A Free Database of University Web Links: Data Collection Issues

      Thelwall, Mike (2003)
      This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With the increasing interest in web links among information and computer scientists, this is an attempt to make available raw data for research that does not rely on the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues in running an accurate web crawler are discussed. Access is also given to the normally hidden crawler stop list, with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is not socially or empirically desirable because of the existence of database-generated areas of the web and the proliferation of mirroring.
      This paper describes a free set of databases of the link structures of the university web sites from a selection of countries, as created by a specialist information science web crawler.
    • Commercial Web sites: lost in cyberspace?

      Thelwall, Mike (MCB UP Ltd, 2000)
      How easy are business Web sites for potential customers to find? This paper reports on a survey of 60,087 Web sites from 42 of the major general and commercial domains around the world to extract statistics about their design and rate of search engine registration. Search engines are used by the majority of Web surfers to find information on the Web. However, 23 per cent of business Web sites in the survey were not registered at all in the five major search engines tested and 82 per cent were not registered in at least one, missing a sizeable potential audience. There are some simple steps that should also be taken to help a Web site to be indexed properly in search engines, primarily the use of HTML META tags for indexing, but only about a third of the site home pages in the survey used them. Wide national variations were found for both indexing and META tag inclusion.
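      The abstract above mentions HTML META tags for indexing without showing what such a check involves. The following is a minimal sketch, not the survey's actual method: it assumes "META tags for indexing" means `keywords` and `description` tags with non-empty content, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these two META names are the ones treated as "indexing" tags.
INDEXING_NAMES = {"keywords", "description"}


class MetaTagFinder(HTMLParser):
    """Collects the indexing-related META tag names found in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = (attributes.get("content") or "").strip()
        # Only count tags that actually carry some content.
        if name in INDEXING_NAMES and content:
            self.found.add(name)


def indexing_meta_tags(html_text):
    """Return the set of indexing META tag names present in html_text."""
    finder = MetaTagFinder()
    finder.feed(html_text)
    return finder.found


home_page = (
    '<html><head><title>Shop</title>'
    '<meta name="Keywords" content="shoes, boots">'
    '<meta name="description" content="A small shoe shop.">'
    '</head><body></body></html>'
)
print(sorted(indexing_meta_tags(home_page)))
```

      A home page returning an empty set would count as one without indexing META tags under this assumed definition.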
    • Effective websites for small and medium-sized enterprises

      Thelwall, Mike (MCB UP Ltd, 2000)
      In the UK, millions are now online and many are prepared to use the Internet to make and influence purchasing decisions. Businesses should, therefore, consider whether the Internet could provide them with a new marketing opportunity. Although increasing numbers of businesses now have a website, there seems to be a quality problem that is leading to missed opportunities, particularly for smaller enterprises. This belief is backed up by an automated survey of 3,802 predominantly small UK business sites, believed to be by far the largest of its kind to date. Analysis of the results reveals widespread problems in relation to search engines. Most Internet users find new sites through search engines, yet over half of the sites checked were not registered in the largest one, Yahoo!, and could therefore be missing a sizeable percentage of potential customers. The underlying problem with business sites is the lack of maturity of the medium as evidenced by the focus on technological issues amongst designers and the inevitable lack of Web-business experience of managers. Designers need to take seriously the usability of the site, its design and its ability to meet the business goals of the client. These issues are perhaps being taken up less than in the related discipline of software engineering, probably owing to the relative ease of website creation. Managers need to dictate the objectives of their site, but also, in the current climate, cannot rely even on professional website design companies and must be capable of evaluating the quality of their site themselves. Finally, educators need to ensure that these issues are emphasised to the next generation of designers and managers in order that the full potential of the Internet for business can be realised.
    • New versions of PageRank employing alternative Web document models

      Thelwall, Mike; Vaughan, Liwen (Emerald Group Publishing Limited, 2004)
      Introduces several new versions of PageRank (the link-based Web page ranking algorithm), based on an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based on directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms based on these alternatives were used to rank four sets of Web pages, and the resulting rankings were compared with human subjects’ rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites, but did not work well in ranking pages from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.
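      The idea of ranking aggregated documents rather than single pages can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithms: it collapses page-level links to directory or domain level, drops links internal to a document, and then runs a standard PageRank iteration with a damping factor of 0.85. The function names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def aggregate_links(page_links, level="domain"):
    """Collapse page-to-page links into links between aggregated documents."""

    def document(url):
        parts = urlparse(url)
        if level == "page":
            return url
        if level == "directory":
            # Keep the path up to and including the last slash.
            return parts.netloc + parts.path.rsplit("/", 1)[0] + "/"
        return parts.netloc  # domain level

    edges = set()
    for source, target in page_links:
        s, t = document(source), document(target)
        if s != t:  # links inside one document are ignored
            edges.add((s, t))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Standard PageRank by power iteration over a set of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    outlinks = {node: [] for node in nodes}
    for s, t in edges:
        outlinks[s].append(t)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not outlinks[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for s in nodes:
            for t in outlinks[s]:
                new_rank[t] += damping * rank[s] / len(outlinks[s])
        rank = new_rank
    return rank


page_links = [
    ("http://a.example/x/1.html", "http://b.example/index.html"),
    ("http://a.example/x/2.html", "http://b.example/y/p.html"),
    ("http://a.example/x/1.html", "http://a.example/x/2.html"),
    ("http://c.example/", "http://b.example/index.html"),
]
domain_ranks = pagerank(aggregate_links(page_links, level="domain"))
print(max(domain_ranks, key=domain_ranks.get))
```

      Changing `level` to `"directory"` or `"page"` reruns the same ranking on a different document model, which is the kind of comparison the abstract describes.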
    • Subject gateway sites and search engine ranking

      Thelwall, Mike (MCB UP Ltd, 2002)
      The spread of subject gateway sites can have an impact on the other major Web information retrieval tool: the commercial search engine. This is because gateway sites perturb the link structure of the Web, something used to rank matches in search engine results pages. The success of Google means that its PageRank algorithm for ranking the importance of Web pages is an object of particular interest, and it is one of the few published ranking algorithms. Although highly mathematical, PageRank admits a simple underlying explanation that allows an analysis of its impact on Web spaces. It is shown that under certain stated assumptions gateway sites can actually decrease the PageRank of their targets. Suggestions are made for gateway site designers and other Web authors to minimise this.
    • Web impact factors and search engine coverage

      Thelwall, Mike (MCB UP Ltd, 2000)
      Search engines index only a proportion of the web, and this proportion is not determined randomly but by algorithms that take into account the properties that impact factors measure. A survey was conducted to test the coverage of search engines and to decide whether their partial coverage is indeed an obstacle to using them to calculate web impact factors. The results indicate that search engine coverage, even of large national domains, is extremely uneven and would be likely to lead to misleading calculations.
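      To see why uneven coverage matters, consider a short sketch. It assumes the common definition of a web impact factor as the number of pages linking to a site divided by the number of pages in the site; the abstract itself does not state a formula. All counts and coverage fractions below are invented for illustration and are not taken from the survey.

```python
def web_impact_factor(inlinking_pages, site_pages):
    """Impact factor as inlinking pages divided by pages in the site (assumed definition)."""
    if site_pages <= 0:
        raise ValueError("site_pages must be positive")
    return inlinking_pages / site_pages


def observed_impact_factor(inlinking_pages, site_pages, inlink_coverage, site_coverage):
    """Impact factor as seen through an engine that indexes only part of each count."""
    return web_impact_factor(inlinking_pages * inlink_coverage, site_pages * site_coverage)


# Hypothetical site: 1,000 pages, linked to by 200 external pages.
true_value = web_impact_factor(200, 1000)

# Hypothetical engine indexing half of the linking pages but a tenth of the site.
seen_value = observed_impact_factor(200, 1000, inlink_coverage=0.5, site_coverage=0.1)

print(true_value, seen_value)
```

      When the two coverage fractions are equal they cancel and the ratio is unaffected; the distortion comes from coverage that differs between the site and the pages linking to it.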