• A Free Database of University Web Links: Data Collection Issues

      Thelwall, Mike (2003)
      This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With the increasing interest in web links by information and computer scientists this is an attempt to make available raw data for research that is not reliant upon the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues concerning running an accurate web crawler are also discussed. Access is also given to the normally hidden crawler stop list with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is not socially or empirically desirable because of the existence of database-generated areas of the web and the proliferation of the phenomenon of mirroring. This paper describes a free set of databases of the link structures of the university web sites from a selection of countries, as created by a specialist information science web crawler.
    • A layered approach for investigating the topological structure of communities in the Web.

      Thelwall, Mike (MCB UP Ltd, 2003)
      A layered approach for identifying communities in the Web is presented and explored by applying the Flake exact community identification algorithm to the UK academic Web. Although community or topic identification is a common task in information retrieval, a new perspective is developed by: the application of alternative document models, shifting the focus from individual pages to aggregated collections based upon Web directories, domains and entire sites; the removal of internal site links; and the adaptation of a new fast algorithm to allow fully-automated community identification using all possible single starting points. The overall topology of the graphs in the three least-aggregated layers was first investigated and found to include a large number of isolated points but, surprisingly, with most of the remainder being in one huge connected component, exact proportions varying by layer. The community identification process then found that the number of communities far exceeded the number of topological components, indicating that community identification is a potentially useful technique, even with random starting points. Both the number and size of communities identified were dependent on the parameter of the algorithm, with very different results being obtained in each case. In conclusion, the UK academic Web is embedded with layers of non-trivial communities and, if it is not unique in this, then there is the promise of improved results for information retrieval algorithms that can exploit this additional structure, and the application of the technique directly to partially automate Web metrics tasks such as that of finding all pages related to a given subject hosted by a single country's universities.
    • A Longitudinal Study of Academic Web Links: Identifying and Explaining Change

      Payne, Nigel (University of Wolverhampton, 2007)
      A problem common to all current web link analyses is that, as the web is continuously evolving, any web-based study may be out of date by the time it is published in academic literature. It is therefore important to know how web link analysis results vary over time, with a low rate of variation lengthening the amount of time corresponding to a tolerable loss in quality. Moreover, given the lack of research on how academic web spaces change over time, from an information science perspective it would be interesting to see what patterns and trends could be identified by longitudinal research, and the study of university web links seems to provide a convenient means by which to do so. The aim of this research is to identify and track changes in three academic webs (UK, Australia and New Zealand) over time, tracking various aspects of academic webs including site size and overall linking characteristics, and to provide theoretical explanations of the changes found. This should therefore provide some insight into the stability of previous and future webometric analyses. Alternative Document Models (ADMs), created with the purpose of reducing the extent to which anomalies occur in counts of web links at the page level, have been used extensively within webometrics as an alternative to using the web page as the basic unit of analysis. This research carries out a longitudinal study of ADMs in an attempt to ascertain which model gives the most consistent results when applied to the UK, Australia and New Zealand academic web spaces over the last six years. The results show that the domain ADM gives the most consistent results with the directory ADM also giving more reliable results than are evident when using the standard page model.
Aggregating at the site (or university) level appears to provide less consistent results than using the page as the standard unit of measure, and this finding holds true over all three academic webs and for each time period examined over the last six years. The question of whether university web sites publish the same kind of information and use the same kind of hyperlinks year on year is important from the perspective of interpreting the results of academic link analyses, because changes in link types over time would also force interpretations of link analyses to change over time. This research uses a link classification exercise to identify temporal changes in the distribution of different types of academic web links, using three academic web spaces in the years 2000 and 2006. Significant increases in ‘research oriented’, ‘social/leisure’ and ‘superficial’ links were identified as well as notable decreases in the ‘technical’ and ‘personal’ links. Some of these changes identified may be explained by general changes in the management of university web sites and some by more wide-spread Internet trends, e.g., dynamic pages, blogs and social networking. The increase in the proportion of research-oriented links is particularly hopeful for future link analysis research. Identifying quantitative trends in the UK, Australian and New Zealand academic webs from 2000 to 2005 revealed that the number of static pages and links in each of the three academic webs appears to have stabilised as far back as 2001. This stabilisation may be partly due to an increase in dynamic pages which are normally excluded from webometric analyses. In response to the problem for webometricians due to the constantly changing nature of the Internet, the results presented here are encouraging evidence that webometrics for academic spaces may have a longer-term validity than would have been previously assumed. 
The relationship between university inlinks and research activity indicators over time was examined, as well as the reasons for individual universities experiencing significant increases and decreases in inlinks over the last six years. The findings indicate that between 66% and 70% of outlinks remain the same year on year for all three academic web spaces, although this stability conceals large individual differences. Moreover, there is evidence of a level of stability over time for university site inlinks when measured against research. Surprisingly, however, inlink counts can vary significantly from year to year for individual universities, for reasons unrelated to research, underlining that webometric results should be interpreted cautiously at the level of individual universities. Therefore, on average since 2001 the university web sites of the UK, Australia and New Zealand have been relatively stable in terms of size and linking patterns, although this hides a constant renewing of old pages and areas of the sites. In addition, the proportion of research-related links seems to be slightly increasing. Whilst the former suggests that webometric results are likely to have a surprisingly long shelf-life, perhaps closer to five years than one year, the latter suggests that webometrics is going to be increasingly useful as a tool to track research online. While there have already been many studies involving academic web spaces, and much work has been carried out on the web from a longitudinal perspective, this thesis concentrates on filling a critical gap in current webometric research by combining the two and undertaking a longitudinal study of academic webs. In comparison with previous web-related longitudinal studies this thesis makes a number of novel contributions.
Some of these stem from extending established webometric results, either by introducing a longitudinal aspect (looking at how various academic web metrics such as research activity indicators, site size or inlinks change over time) or by their application to other countries. Other contributions are made by combining traditional webometric methods (e.g. combining topical link classification exercises with longitudinal study) or by identifying and examining new areas for research (for example, dynamic pages and non-HTML documents). No previous web-based longitudinal studies have focused on academic links and so the main findings that (for UK, Australian and New Zealand academic webs between 2000 and 2006) certain academic link types exhibit changing patterns over time, approximately two-thirds of outlinks remain the same year on year and the number of static pages and links appears to have stabilised are both significant and novel.
    • An initial exploration of the link relationship between UK university Web sites.

      Thelwall, Mike (MCB UP Ltd, 2002)
      Aggregates of links are of interest to information scientists in the same way as citation counts are: as potential sources of data from which new knowledge can be mined. Builds on the recent discovery of a correlation between a Web link count measure and the research quality of British universities by applying a range of multivariate statistical techniques to counts of links between pairs of universities. This represents an initial attempt at developing an understanding of this phenomenon. Extracts plausible results. Also uses the techniques to identify outliers in the data, some of which were verified by being tracked down to identifiable Web phenomena. This is an important outcome because successful anomaly identification is a precondition to more effective analysis of this kind of data. The identification of groupings is encouraging evidence that Web links between universities can be mined for significant results, although it is clear that more methodological development is needed, if any but the simplest patterns are to be extracted. Finally, based upon the types of patterns extracted, argues that none of the methods used are capable of fully analysing link structures on their own.
    • Disciplinary Differences in Academic Web Presence – A Statistical Study of the UK

      Thelwall, Mike; Price, Liz (Walter de Gruyter, 2003)
      The Web has become an important tool for scholars to publicise their activities and disseminate their findings. In the information age, those who do not use it risk being bypassed. In this paper we introduce a statistical technique to assess the extent to which the broad spectrum of research areas are visible online in UK universities. Five broad subject categories are used for research, and inlink counts are used as indicators of online visibility or impact. The approach is designed to give more complete subject coverage than previous studies and to avoid the conceptual difficulties of a page classification approach, although one is used for triangulation. The results suggest that Science and Engineering dominate university Web presences, but with Humanities and Arts also achieving a high presence relative to its size, showing that high Web impact does not have to be restricted to the sciences. Research funding bodies should now consider whether action needs to be taken to ensure that opportunities are not being missed in the lower Web impact areas.
    • Do the Web sites of higher rated scholars have significantly more online impact?

      Thelwall, Mike; Harries, Gareth (Wiley, 2004)
      The quality and impact of academic Web sites are of interest to many audiences, including the scholars who use them and Web educators who need to identify best practice. Several large-scale European Union research projects have been funded to build new indicators for online scientific activity, reflecting recognition of the importance of the Web for scholarly communication. In this paper we address the key question of whether higher rated scholars produce higher impact Web sites, using the United Kingdom as a case study and measuring scholars' quality in terms of university-wide average research ratings. Methodological issues concerning the measurement of the online impact are discussed, leading to the adoption of counts of links to a university's constituent single domain Web sites from an aggregated counting metric. The findings suggest that universities with higher rated scholars produce significantly more Web content but with a similar average online impact. Higher rated scholars therefore attract more total links from their peers, but only by being more prolific, refuting earlier suggestions. It can be surmised that general Web publications are very different from scholarly journal articles and conference papers, for which scholarly quality does associate with citation impact. This has important implications for the construction of new Web indicators, for example that online impact should not be used to assess the quality of small groups of scholars, even within a single discipline.
    • Evidence for the existence of geographic trends in university web site interlinking

      Thelwall, Mike (MCB UP Ltd, 2002)
      The Web is an important medium for scholarly communication of various types, perhaps eventually to replace entirely some traditional mechanisms such as print journals. Yet hyperlinks, the Web analogue of citations, are much more varied in use, and existing citation techniques are difficult to generalise to the new medium. In this context, one new challenging object of study is the modern multi-faceted, multi-genre, partly unregulated university Web site. This paper develops a methodology to analyse the patterns of interlinking between university Web sites and uses it to indicate that the degree of interlinking decreases with distance, at least in the UK. This is perhaps not in itself a surprising result, despite claims of a paradigm shift from the traditional virtual college towards collaboratories, but the methodology developed can also be used to refine existing Web link metrics to produce more powerful tools for comparing groups of sites.
    • Finding similar academic Web sites with links, bibliometric couplings and colinks

      Thelwall, Mike; Wilkinson, David (Elsevier, 2004)
      A common task in both Webmetrics and Web information retrieval is to identify a set of Web pages or sites that are similar in content. In this paper we assess the extent to which links, colinks and couplings can be used to identify similar Web sites. As an experiment, a random sample of 500 pairs of domains from the UK academic Web was taken and human assessments of site similarity, based upon content type, were compared against ratings for the three concepts. The results show that using a combination of all three gives the highest probability of identifying similar sites, but surprisingly this was only a marginal improvement over using links alone. Another unexpected result was that high values for either colink counts or couplings were associated with only a small increased likelihood of similarity. The principal advantage of using couplings and colinks was found to be greater coverage in terms of a much larger number of pairs of sites being connected by these measures, instead of increased probability of similarity. In information retrieval terminology, this is improved recall rather than improved precision.
    • Graph structure in three national academic Webs: Power laws with anomalies

      Thelwall, Mike; Wilkinson, David (Wiley, 2003)
      The graph structures of three national university publicly indexable Webs from Australia, New Zealand, and the UK were analyzed. Strong scale-free regularities for page indegrees, outdegrees, and connected component sizes were in evidence, resulting in power laws similar to those previously identified for individual university Web sites and for the AltaVista-indexed Web. Anomalies were also discovered in most distributions and were tracked down to root causes. As a result, resource driven Web sites and automatically generated pages were identified as representing a significant break from the assumptions of previous power law models. It follows that attempts to track average Web linking behavior would benefit from using techniques to minimize or eliminate the impact of such anomalies.
    • Hyperlinks as a data source for science mapping

      Harries, Gareth; Wilkinson, David; Price, Liz; Fairclough, Ruth; Thelwall, Mike (Sage, 2004)
      Hyperlinks between academic web sites, like citations, can potentially be used to map disciplinary structures and identify evidence of connections between disciplines. In this paper we classified a sample of links originating in three different disciplines: maths, physics and sociology. Links within a discipline were found to be different in character to links between pages in different disciplines. There were also disciplinary differences in both types of link. As a consequence, we argue that interpretations of web science maps covering multiple disciplines will need to be sensitive to the contexts of the links mapped.
    • Linguistic patterns of academic Web use in Western Europe

      Thelwall, Mike; Tang, Rong; Price, Liz (Springer, 2003)
      A survey of linguistic dimensions of Web site hosting and interlinking of the universities of sixteen European countries is described. The results show that English is the dominant language both for linking pages and for all pages. In a typical country approximately half the pages were in English and half in one or more national languages. Normalised interlinking patterns showed three trends: 1) international interlinking throughout Europe in English, and additionally in Swedish in Scandinavia; 2) linking between countries sharing a common language, and 3) countries extensively hosting international links in their own major languages. This provides evidence for the multilingual character of academic use of the Web in Western Europe, at least outside the UK and Eire. Evidence was found that Greece was significantly linguistically isolated from the rest of the EU but that outsiders Norway and Switzerland were not.
    • Motivations for academic web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication

      Wilkinson, David; Harries, Gareth; Thelwall, Mike; Price, Liz (Sage, 2003)
      The need to understand authors’ motivations for creating links between university web sites is addressed by a survey of a random collection of 414 such links from the ac.uk domain. A classification scheme was created and applied to this collection. Obtaining inter-classifier agreement as to the single main link creation cause was very difficult because of multiple potential motivations and the fluidity of genre on the Web. Nevertheless, it was clear that, whilst the vast majority, over 90%, was created for broadly scholarly reasons, only two were equivalent to journal citations. It is concluded that academic web link metrics will be dominated by a range of informal types of scholarly communication. Since formal communication can be extensively studied through citation analysis, this provides an exciting new window through which to investigate a facet of a previously obscured type of communication activity.
    • Three target document range metrics for university web sites

      Thelwall, Mike; Wilkinson, David (Wiley, 2003)
      Three new metrics are introduced that measure the range of use of a university Web site by its peers through different heuristics for counting links targeted at its pages. All three give results that correlate significantly with the research productivity of the target institution. The directory range model, which is based upon summing the number of distinct directories targeted by each other university, produces the most promising results of any link metric yet. Based upon an analysis of changes between models, it is suggested that range models measure essentially the same quantity as their predecessors but are less susceptible to spurious causes of multiple links and are therefore more robust.
    • Which academic subjects have most online impact? A pilot study and a new classification process

      Thelwall, Mike; Vaughan, Liwen; Cothey, Viv; Li, Xuemei; Smith, Alastair G. (MCB UP Ltd, 2003)
      The use of the Web by academic researchers is discipline-dependent and highly variable. It is increasingly central for sharing information, disseminating results and publicising research projects. This pilot study seeks to identify the subjects that have the most impact on the Web, and look for national differences in online subject visibility. The highest impact sites were from computing, but there were major national differences in the impact of engineering and technology sites. Another difference was that Taiwan had more high impact non-academic sites hosted by universities. As a pilot study, the classification process itself was also investigated and the problems of applying subject classification to academic Web sites discussed. The study draws out a number of issues in this regard, which have no simple solutions and point to the need to interpret the results with caution.
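
The directory range model mentioned in the "Three target document range metrics for university web sites" entry above is described as summing the number of distinct directories targeted by each other university. A minimal sketch of that counting rule follows. It is an illustration only, not the authors' implementation: the function names, the (source site, target URL) input format and the rule of matching a site by host-name suffix are all assumptions made here.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def directory_of(url: str) -> str:
    """Return host plus directory path of a URL (file name and query dropped)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    directory = path if path.endswith("/") else posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return parts.netloc.lower() + directory


def directory_range(links, target_site):
    """Sum, over source sites, the distinct target directories each one links to.

    links: iterable of (source_site, target_url) pairs (assumed input format).
    target_site: hypothetical site label, matched as a suffix of the URL host.
    """
    dirs_by_source = defaultdict(set)
    for source_site, target_url in links:
        if source_site == target_site:
            continue  # links within the target site itself are not counted
        host = urlsplit(target_url).netloc.lower()
        if host == target_site or host.endswith("." + target_site):
            dirs_by_source[source_site].add(directory_of(target_url))
    return sum(len(dirs) for dirs in dirs_by_source.values())


# Made-up example data: two pages in the same directory count once per source.
example_links = [
    ("a.ac.uk", "http://www.b.ac.uk/research/index.html"),
    ("a.ac.uk", "http://www.b.ac.uk/research/staff.html"),
    ("a.ac.uk", "http://www.b.ac.uk/teaching/"),
    ("c.ac.uk", "http://www.b.ac.uk/research/index.html"),
    ("b.ac.uk", "http://www.b.ac.uk/research/index.html"),
    ("c.ac.uk", "http://www.d.ac.uk/"),
]
print(directory_range(example_links, "b.ac.uk"))
```

Because repeated links from one source into the same directory collapse to a single count, a sketch like this shows why a range model would be less sensitive to spurious multiple links than a raw link count.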