• Mendeley readership altmetrics for medical articles: An analysis of 45 fields

      Wilson, Paul; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY UK (Wiley Blackwell, 2015-05)
    • Methodologies for crawler based Web surveys.

      Thelwall, Mike (MCB UP Ltd, 2002)
      There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well-known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
    • Microsoft Academic automatic document searches: accuracy for journal articles and suitability for citation analysis

      Thelwall, Mike (Elsevier, 2017-11-22)
      Microsoft Academic is a free academic search engine and citation index that is similar to Google Scholar but can be automatically queried. Its data is potentially useful for bibliometric analysis if it is possible to search effectively for individual journal articles. This article compares different methods to find journal articles in its index by searching for a combination of title, authors, publication year and journal name and uses the results for the widest published correlation analysis of Microsoft Academic citation counts for journal articles so far. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy to find articles with DOIs is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search for them by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% incorrect matches. The remaining articles seem to be mainly not indexed by Microsoft Academic or indexed with a different language version of their title. From the matches, Scopus citation counts and Microsoft Academic counts have an average Spearman correlation of 0.95, with the lowest for any single field being 0.63. Thus, Microsoft Academic citation counts are almost universally equivalent to Scopus citation counts for articles that are not recent but there are national biases in the results.
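      The agreement between the two citation sources is reported above as a Spearman correlation. As a minimal sketch of how such a rank correlation can be computed (not the author's code; the function names `average_ranks` and `spearman` and the sample counts are illustrative assumptions), ties are given their average rank and a Pearson correlation is then taken over the ranks:

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical citation counts for the same five articles in two sources
scopus_counts = [12, 0, 5, 30, 5]
other_counts = [10, 1, 4, 41, 6]
print(round(spearman(scopus_counts, other_counts), 3))
```

      Because citation counts are highly skewed, a rank-based measure of this kind is less sensitive to a few very highly cited articles than a Pearson correlation on the raw counts would be.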
    • Monitoring Twitter strategies to discover resonating topics: The case of the UNDP

      Thelwall, Mike; Cugelman, Brian (EPI - El Profesional de la información, 2017-08-02)
      Many organizations use social media to attract supporters, disseminate information and advocate change. Services like Twitter can theoretically deliver messages to a huge audience that would be difficult to reach by other means. This article introduces a method to monitor an organization’s Twitter strategy and applies it to tweets from United Nations Development Programme (UNDP) accounts. The Resonating Topic Method uses automatic analyses with free software to detect successful themes within the organization’s tweets, categorizes the most successful tweets, and analyses a comparable organization to identify new successful strategies. In the case of UNDP tweets from November 2014 to March 2015, the results confirm the importance of official social media accounts as well as those of high profile individuals and general supporters. Official accounts seem to be more successful at encouraging action, which is a critical aspect of social media campaigning. An analysis of Oxfam found a successful social media approach that the UNDP had not adopted, showing the value of analyzing other organizations to find potential strategy gaps.
    • Motivations for academic web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication

      Wilkinson, David; Harries, Gareth; Thelwall, Mike; Price, Liz (Sage, 2003)
      The need to understand authors’ motivations for creating links between university web sites is addressed by a survey of a random collection of 414 such links from the ac.uk domain. A classification scheme was created and applied to this collection. Obtaining inter-classifier agreement as to the single main link creation cause was very difficult because of multiple potential motivations and the fluidity of genre on the Web. Nevertheless, it was clear that, whilst the vast majority, over 90%, was created for broadly scholarly reasons, only two were equivalent to journal citations. It is concluded that academic web link metrics will be dominated by a range of informal types of scholarly communication. Since formal communication can be extensively studied through citation analysis, this provides an exciting new window through which to investigate a facet of a previously obscured type of communication activity.
    • Multi-document summarization of news articles using an event-based framework

      Ou, Shiyan; Khoo, Christopher S.G.; Goh, Dion H. (Emerald, 2006)
      Purpose – The purpose of this research is to develop a method for automatic construction of multi-document summaries of sets of news articles that might be retrieved by a web search engine in response to a user query. Design/methodology/approach – Based on cross-document discourse analysis, an event-based framework is proposed for integrating and organizing information extracted from different news articles. It has a hierarchical structure in which the summarized information is presented at the top level and more detailed information is given at the lower levels. A tree-view interface was implemented for displaying a multi-document summary based on the framework. A preliminary user evaluation was performed by comparing the framework-based summaries against sentence-based summaries. Findings – In a small evaluation, all the human subjects preferred the framework-based summaries to the sentence-based summaries. This indicates that the event-based framework is an effective way to summarize a set of news articles reporting an event or a series of relevant events. Research limitations/implications – Limited to event-based news articles only, not applicable to news critiques and other kinds of news articles. A summarization system based on the event-based framework is being implemented. Practical implications – Multi-document summarization of news articles can adopt the proposed event-based framework. Originality/value – An event-based framework for summarizing sets of news articles was developed and evaluated using a tree-view interface for displaying such summaries.
    • Multiword units in machine translation and translation technology

      Mitkov, Ruslan; Monti, Johanna; Corpas Pastor, Gloria; Seretan, Violeta (John Benjamins, 2018-07-20)
      The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but we believe that there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully. In this chapter, we present a survey of the field with particular reference to Machine Translation and Translation Technology.
    • Mutual terminology extraction using a statistical framework

      Ha, Le An; Mitkov, Ruslan; Corpas Pastor, Gloria (Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), 2008-06-16)
      In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to assess the proposition that automatic alignment can play an active role in bilingual terminology extraction and translate it into mutual bilingual terminology extraction. The results indicate that such models are valid and can show that mutual bilingual terminology extraction is indeed a viable approach.
    • National Scientific Performance Evolution Patterns: Retrenchment, Successful Expansion, or Overextension

      Thelwall, Mike; Levitt, Jonathan M. (Wiley-Blackwell, 2017-11-17)
      National governments would like to preside over an expanding and increasingly high impact science system but are these two goals largely independent or closely linked? This article investigates the relationship between changes in the share of the world’s scientific output and changes in relative citation impact for 2.6 million articles from 26 fields in the 25 countries with the most Scopus-indexed journal articles from 1996 to 2015. There is a negative correlation between expansion and relative citation impact but their relationship varies. China, Spain, Australia, and Poland were successful overall across the 26 fields, expanding both their share of the world’s output and its relative citation impact, whereas Japan, France, Sweden and Israel had decreased shares and relative citation impact. In contrast, the USA, UK, Germany, Italy, Russia, Netherlands, Switzerland, Finland, and Denmark all enjoyed increased relative citation impact despite a declining share of publications. Finally, India, South Korea, Brazil, Taiwan, and Turkey all experienced sustained expansion but a recent fall in relative citation impact. These results may partly reflect changes in the coverage of Scopus and the selection of fields.
    • New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.
    • New versions of PageRank employing alternative Web document models

      Thelwall, Mike; Vaughan, Liwen (Emerald Group Publishing Limited, 2004)
      Introduces several new versions of PageRank (the link based Web page ranking algorithm), based on an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based on directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms introduced based on these alternatives were used to rank four sets of Web pages. The ranking results were compared with human subjects’ rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites; however, it did not work well for ranking pages from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.
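      The idea of ranking aggregated documents rather than individual pages can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that pages are collapsed to their domain or directory before a standard power-iteration PageRank is run, and the function names, the damping factor of 0.85 and the example URLs are all illustrative assumptions.

```python
from urllib.parse import urlparse


def aggregate_links(page_links, level="domain"):
    """Collapse page-to-page links into links between aggregated documents."""
    def unit(url):
        parsed = urlparse(url)
        if level == "domain":
            return parsed.netloc
        if level == "directory":
            # Keep the path up to and including its last slash
            return parsed.netloc + parsed.path.rsplit("/", 1)[0] + "/"
        return url  # any other level: treat each page as its own document

    edges = set()
    for source, target in page_links:
        s, t = unit(source), unit(target)
        if s != t:  # ignore links inside one aggregated document
            edges.add((s, t))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a set of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {n: [] for n in nodes}
    for s, t in edges:
        out_links[s].append(t)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Documents with no out-links spread their rank over all documents
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical page-level links between three made-up sites
links = [
    ("http://a.example/x/p1.html", "http://b.example/index.html"),
    ("http://a.example/y/p2.html", "http://b.example/about.html"),
    ("http://b.example/index.html", "http://c.example/"),
]
print(pagerank(aggregate_links(links, level="domain")))
```

      With domain-level aggregation the two links from a.example to b.example count once, which is the kind of effect that distinguishes the alternative document models from page-level counting.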
    • News stories as evidence for research? BBC citations from articles, books and Wikipedia

      Kousha, Kayvan; Thelwall, Mike (John Wiley & Sons, 2017-07-17)
      Although news stories target the general public and are sometimes inaccurate, they can serve as sources of real-world information for researchers. This article investigates the extent to which academics exploit journalism using content and citation analyses of online BBC News stories cited by Scopus articles. A total of 27,234 Scopus-indexed publications have cited at least one BBC News story, with a steady annual increase. Citations from arts and humanities (2.8% of publications in 2015) and social sciences (1.5%) were more likely than citations from medicine (0.1%) and science (<0.1%). Surprisingly, half of the sampled Scopus-cited science and technology (53%) and medicine and health (47%) stories were based on academic research, rather than otherwise unpublished information, suggesting that researchers have chosen a lower quality secondary source for their citations. Nevertheless, the BBC News stories that were most frequently cited by Scopus, Google Books and Wikipedia introduced new information from many different topics, including politics, business, economics, statistics, and reports about events. Thus, news stories are mediating real world knowledge into the academic domain, a potential cause for concern.
    • Nine terminology extraction Tools: Are they useful for translators?

      Costa, Hernani; Zaretskaya, Anna; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (MultiLingual, 2016-04-01)
      Terminology extraction tools (TETs) have become an indispensable resource in education, research and business. Today, users can find a great variety of terminology extraction tools of all kinds, and they all offer different features. Among many other areas, these tools are especially helpful in professional translation settings. We do not know, however, if the existing tools have all the necessary features for this kind of work. In search of an answer, we looked at nine selected tools available on the market to find out if they provide translators’ favorite features.
    • NLP-enhanced self-study learning materials for quality healthcare in Europe

      Urbano Mendaña, Míriam; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam; Aguado de Cea, G; Aussenac-Gilles, N; Nazarenko, A; Szulman, S (Université Paris 13, 2013-10)
      In this paper we present an overview of the TELL-ME project, which aims to develop innovative e-learning tools and self-study materials for teaching vocationally-specific languages to healthcare professionals, helping them to communicate at work. The TELL-ME e-learning platform incorporates a variety of NLP techniques to provide an array of diverse work-related exercises, self-assessment tools and an interactive dictionary of key vocabulary and concepts aimed at medics for Spanish, English and German. A prototype of the e-learning platform is currently under evaluation.
    • Not all international collaboration is beneficial: The Mendeley readership and citation impact of biochemical research collaboration

      Sud, Pardeep; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1SB UK (Wiley Blackwell, 2015-05-13)
      Biochemistry is a highly funded research area that is typified by large research teams and is important for many areas of the life sciences. This article investigates the citation impact and Mendeley readership impact of biochemistry research from 2011 in the Web of Science according to the type of collaboration involved. Negative binomial regression models are used that incorporate, for the first time, the inclusion of specific countries within a team. The results show that, holding other factors constant, larger teams robustly associate with higher impact research, but including additional departments has no effect and adding extra institutions tends to reduce the impact of research. Although international collaboration is apparently not advantageous in general, collaboration with the USA, and perhaps also with some other countries, seems to increase impact. In contrast, collaborations with some other nations associate with lower impact, although both findings could be due to factors such as differing national proportions of excellent researchers. As a methodological implication, simpler statistical models would have found international collaboration to be generally beneficial and so it is important to take into account specific countries when examining collaboration.
    • NP animacy identification for anaphora resolution

      Orasan, Constantin; Evans, Richard (American Association for Artificial Intelligence, 2007)
      In anaphora resolution for English, animacy identification can play an integral role in the application of agreement restrictions between pronouns and candidates, and as a result, can improve the accuracy of anaphora resolution systems. In this paper, two methods for animacy identification are proposed and evaluated using intrinsic and extrinsic measures. The first method is a rule-based one which uses information about the unique beginners in WordNet to classify NPs on the basis of their animacy. The second method relies on a machine learning algorithm which exploits a WordNet enriched with animacy information for each sense. The effect of word sense disambiguation on the two methods is also assessed. The intrinsic evaluation reveals that the machine learning method reaches human levels of performance. The extrinsic evaluation demonstrates that animacy identification can be beneficial in anaphora resolution, especially in the cases where animate entities are identified with high precision.
    • Object and subject Heavy-NP shift in Arabic

      Mohamed, Emad (Research in Corpus Linguistics, 2014-12-31)
      In order to examine whether Arabic has Heavy Noun Phrase Shifting (HNPS), I have extracted from the Prague Arabic Dependency Treebank a data set in which a verb governs either an object NP and an Adjunct Phrase (PP or AdvP) or a subject NP and an Adjunct Phrase. I have used binary logistic regression where the criterion variable is whether the subject/object NP shifts, and used as predictor variables heaviness (the number of tokens per NP, adjunct), part of speech tag, verb disposition (i.e. whether the verb has a history of taking double objects or sentential objects), NP number, NP definiteness, and the presence of referring pronouns in either the NP or the adjunct. The results show that only object heaviness and adjunct heaviness are useful predictors of object HNPS, while subject heaviness, adjunct heaviness, subject part of speech tag, definiteness, and adjunct head POS tags are active predictors of subject HNPS. I also show that HNPS can in principle be predicted from sentence structure.
    • Patent citation analysis with Google

      Kousha, Kayvan; Thelwall, Mike (Wiley-Blackwell, 2015-09-23)
      Citations from patents to scientific publications provide useful evidence about the commercial impact of academic research, but automatically searchable databases are needed to exploit this connection for large-scale patent citation evaluations. Google covers multiple different international patent office databases but does not index patent citations or allow automatic searches. In response, this article introduces a semiautomatic indirect method via Bing to extract and filter patent citations from Google to academic papers with an overall precision of 98%. The method was evaluated with 322,192 science and engineering Scopus articles from every second year for the period 1996–2012. Although manual Google Patent searches give more results, especially for articles with many patent citations, the difference is not large enough to be a major problem. Within Biomedical Engineering, Biotechnology, and Pharmacology & Pharmaceutics, 7% to 10% of Scopus articles had at least one patent citation but other fields had far fewer, so patent citation analysis is only relevant for a minority of publications. Low but positive correlations between Google Patent citations and Scopus citations across all fields suggest that traditional citation counts cannot substitute for patent citations when evaluating research.
    • Predicting reading difficulty for readers with autism spectrum disorder

      Evans, Richard; Yaneva, Victoria; Temnikova, Irina (European Language Resources Association, 2016-05-23)
      People with autism experience various reading comprehension difficulties, which is one explanation for the early school dropout, reduced academic achievement and lower levels of employment in this population. To overcome this issue, content developers who want to make their textbooks, websites or social media accessible to people with autism (and thus for every other user) but who are not necessarily experts in autism, can benefit from tools which are easy to use, which can assess the accessibility of their content, and which are sensitive to the difficulties that autistic people might have when processing texts/websites. In this paper we present a preliminary machine learning readability model for English developed specifically for the needs of adults with autism. We evaluate the model on the ASD corpus, which has been developed specifically for this task and is, so far, the only corpus for which readability for people with autism has been evaluated. The results show that our model outperforms the baseline, which is the widely-used Flesch-Kincaid Grade Level formula.
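      For reference, the baseline named above, the Flesch-Kincaid Grade Level, is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below illustrates the baseline formula only, not the authors' machine learning model; the vowel-group syllable counter is a rough assumption, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


print(flesch_kincaid_grade("The cat sat. It was happy on the warm mat."))
```

      The formula relies only on sentence length and syllable counts, which is why a model using richer features can plausibly improve on it for a specific group of readers.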
    • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

      Ha, Le; Yaneva, Victoria; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
      Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.