A Free Database of University Web Links: Data Collection Issues
dc.contributor.author | Thelwall, Mike | |
dc.date.accessioned | 2008-01-10T12:55:31Z | |
dc.date.available | 2008-01-10T12:55:31Z | |
dc.date.issued | 2003 | |
dc.identifier.citation | Cybermetrics: International Journal of Scientometrics, Informetrics and Bibliometrics, 2002/3, 6/7(1): Paper 2 | |
dc.identifier.issn | 1137-5019 | |
dc.identifier.uri | http://hdl.handle.net/2436/15919 | |
dc.description | Metadata only. Full text available at above link. | |
dc.description.abstract | This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With the increasing interest in web links among information and computer scientists, this is an attempt to make available raw data for research that is not reliant upon the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues concerning running an accurate web crawler are also discussed. Access is also given to the normally hidden crawler stop list, with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is neither socially nor empirically desirable, because of the existence of database-generated areas of the web and the proliferation of mirroring. | |
dc.language.iso | en | |
dc.relation.url | http://www.cindoc.csic.es/cybermetrics/articles/v6i1p2.html | |
dc.subject | Websites | |
dc.subject | Universities | |
dc.subject | Academic websites | |
dc.subject | Web impact factors | |
dc.subject | Search engines | |
dc.subject | Web crawlers | |
dc.subject | Web links | |
dc.title | A Free Database of University Web Links: Data Collection Issues | |
dc.type | Journal article | |
html.description.abstract | This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With the increasing interest in web links among information and computer scientists, this is an attempt to make available raw data for research that is not reliant upon the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues concerning running an accurate web crawler are also discussed. Access is also given to the normally hidden crawler stop list, with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is neither socially nor empirically desirable, because of the existence of database-generated areas of the web and the proliferation of mirroring. |