Recent Submissions

  • Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories

    Martín-Martín, Alberto; Orduna-Malea, Enrique; Thelwall, Mike; Delgado López-Cózar, Emilio (Elsevier, 2018-10-05)
    Despite citation counts from Google Scholar (GS), Web of Science (WoS), and Scopus being widely consulted by researchers and sometimes used in research evaluations, there is no recent or systematic evidence about the differences between them. In response, this paper investigates 2,448,055 citations to 2299 English-language highly-cited documents from 252 GS subject categories published in 2006, comparing GS, the WoS Core Collection, and Scopus. GS consistently found the largest percentage of citations across all areas (93%–96%), far ahead of Scopus (35%–77%) and WoS (27%–73%). GS found nearly all the WoS (95%) and Scopus (92%) citations. Most citations found only by GS were from non-journal sources (48%–65%), including theses, books, conference papers, and unpublished materials. Many were non-English (19%–38%), and they tended to be much less cited than citing sources that were also in Scopus or WoS. Despite the many unique GS citing sources, Spearman correlations between citation counts in GS and WoS or Scopus are high (0.78-0.99). They are lower in the Humanities, and lower between GS and WoS than between GS and Scopus. The results suggest that in all areas GS citation data is essentially a superset of WoS and Scopus, with substantial extra coverage.
  • A flexible framework for collocation retrieval and translation from parallel and comparable corpora

    Rivera, Oscar Mendoza; Mitkov, Ruslan; Corpas Pastor, Gloria (John Benjamins, 2018)
    This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable corpora. The methodology was developed with translators and language learners in mind. It is based on a phraseology framework, applies statistical techniques, and employs source tools and online resources. The collocation retrieval and translation has proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and how they can be better exploited to suit particular needs of target users.
  • Dissecting tweets in search of irony

    Rohanian, Omid; Taslimipoor, Shiva; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06-05)
    This paper describes the systems submitted to SemEval 2018 Task 3 “Irony detection in English tweets” for both subtasks A and B. The first system leveraging a combination of sentiment, distributional semantic, and text surface features is ranked third among 44 teams according to the official leaderboard of the subtask A. The second system with slightly different representation of the features ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast which leads to a vast coverage in detecting ironic content.
  • Does female-authored research have more educational impact than male-authored research?

    Thelwall, Mike (Levy Library Press, 2018-10-04)
    Female academics are more likely to be in teaching-related roles in some countries, including the USA. As a side effect of this, female-authored journal articles may tend to be more useful for students. This study assesses this hypothesis by investigating whether female first-authored research has more uptake in education than male first-authored research. Based on an analysis of Mendeley readers of articles from 2014 in five countries and 100 narrow Scopus subject categories, the results show that female-authored articles attract more student readers than male-authored articles in Spain, Turkey, the UK and USA but not India. They also attract fewer professorial readers in Spain, the UK and the USA, but not India and Turkey, and tend to be less popular with senior academics. Because the results are based on analysis of differences within narrow fields they cannot be accounted for by females working in more education-related disciplines. The apparent additional educational impact for female-authored research could be due to selecting more accessible micro-specialisms, however, such as health-related instruments within the instrumentation narrow field. Whatever the cause, the results suggest that citation-based research evaluations may undervalue the wider impact of female researchers.
  • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

    Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018)
  • Can museums find male or female audiences online with YouTube?

    Thelwall, Michael (Emerald, 2018)
    Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
  • Aggressive language identification using word embeddings and sentiment features

    Orasan, Constantin (Association for Computational Linguistics, 2018-06-25)
    This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that for texts similar to the ones in the training dataset Random Forests work best, whilst for texts which are different SVMs are a better choice. The evaluation also showed that despite its simplicity the method performs well when compared with more elaborated methods.
  • Do females create higher impact research? Scopus citations and Mendeley readers for articles from five countries

    Thelwall, Mike (Elsevier, 2018-09-01)
    There are known gender imbalances in participation in scientific fields, from female dominance of nursing to male dominance of mathematics. It is not clear whether there is also a citation imbalance, with some claiming that male-authored research tends to be more cited. No previous study has assessed gender differences in the readers of academic research on a large scale, however. In response, this article assesses whether there are gender differences in the average citations and/or Mendeley readers of academic publications. Field normalised logged Scopus citations and Mendeley readers from mid-2018 for articles published in 2014 were investigated for articles with first authors from India, Spain, Turkey, the UK and the USA in up to 251 fields with at least 50 male and female authors. Although female-authored research is less cited in Turkey (−4.0%) and India (−3.6%), it is marginally more cited in Spain (0.4%), the UK (0.4%), and the USA (0.2%). Female-authored research has fewer Mendeley readers in India (−1.1%) but more in Spain (1.4%), Turkey (1.1%), the UK (2.7%) and the USA (3.0%). Thus, whilst there may be little practical gender difference in citation impact in countries with mature science systems, the higher female readership impact suggests a wider audience for female-authored research. The results also show that the conclusions from a gender analysis depend on the field normalisation method. A theoretically informed decision must therefore be made about which normalisation to use. The results also suggest that arithmetic mean-based field normalisation is favourable to males.
  • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

    Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
    Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.
  • Bilingual contexts from comparable corpora to mine for translations of collocations

    Taslimipoor, Shiva (Springer, 2018-03-21)
    Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
  • Academic information on Twitter: A user survey

    Mohammadi, Ehsan; Thelwall, Mike; Kwasny, Mary; Holmes, Kristi L. (PLOS, 2018-05-17)
    Although counts of tweets citing academic papers are used as an informal indicator of interest, little is known about who tweets academic papers and who uses Twitter to find scholarly information. Without knowing this, it is difficult to draw useful conclusions from a publication being frequently tweeted. This study surveyed 1,912 users that have tweeted journal articles to ask about their scholarly-related Twitter uses. Almost half of the respondents (45%) did not work in academia, despite the sample probably being biased towards academics. Twitter was used most by people with a social science or humanities background. People tend to leverage social ties on Twitter to find information rather than searching for relevant tweets. Twitter is used in academia to acquire and share real-time information and to develop connections with others. Motivations for using Twitter vary by discipline, occupation, and employment sector, but not much by gender. These factors also influence the sharing of different types of academic information. This study provides evidence that Twitter plays a significant role in the discovery of scholarly information and cross-disciplinary knowledge spreading. Most importantly, the large numbers of non-academic users support the claims of those using tweet counts as evidence for the non-academic impacts of scholarly research
  • Assessing the teaching value of non-English academic books: The case of Spain

    Mas Bleda, Amalia; Thelwall, Mike (Consejo Superior de Investigaciones Científicas, 2018)
  • Leveraging large corpora for translation using the Sketch Engine

    Moze, Sarah; Krek, Simon (Cambridge University Press, 2018)
  • Co-saved, co-tweeted, and co-cited networks

    Didegah, Fereshteh; Thelwall, Mike; Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government; Aarhus University; Aarhus Denmark; Statistical Cybermetrics Research Group, University of Wolverhampton, Wulfruna Street; Wolverhampton WV1 1LY UK (Wiley-Blackwell, 2018-05-14)
    Counts of tweets and Mendeley user libraries have been proposed as altmetric alternatives to citation counts for the impact assessment of articles. Although both have been investigated to discover whether they correlate with article citations, it is not known whether users tend to tweet or save (in Mendeley) the same kinds of articles that they cite. In response, this article compares pairs of articles that are tweeted, saved to a Mendeley library, or cited by the same user, but possibly a different user for each source. The study analyzes 1,131,318 articles published in 2012, with minimum tweeted (10), saved to Mendeley (100), and cited (10) thresholds. The results show surprisingly minor overall overlaps between the three phenomena. The importance of journals for Twitter and the presence of many bots at different levels of activity suggest that this site has little value for impact altmetrics. The moderate differences between patterns of saving and citation suggest that Mendeley can be used for some types of impact assessments, but sensitivity is needed for underlying differences.
  • Dimensions: A Competitor to Scopus and the Web of Science?

    Thelwall, Mike (Elsevier, 2018-08)
    Dimensions is a partly free scholarly database launched by Digital Science in January 2018. Dimensions includes journal articles and citation counts, making it a potential new source of impact data. This article explores the value of Dimensions from an impact assessment perspective with an examination of Food Science research 2008-2018 and a random sample of 10,000 Scopus articles from 2012. The results include high correlations between citation counts from Scopus and Dimensions (0.96 by narrow field in 2012) as well as similar average counts. Almost all Scopus articles with DOIs were found in Dimensions (97% in 2012). Thus, the scholarly database component of Dimensions seems to be a plausible alternative to Scopus and the Web of Science for general citation analyses and for citation data in support of some types of research evaluations.
  • Early Mendeley readers correlate with later citation counts

    Thelwall, Mike (Springer, 2018-03)
    Counts of the number of readers registered in the social reference manager Mendeley have been proposed as an early impact indicator for journal articles. Although previous research has shown that Mendeley reader counts for articles tend to have a strong positive correlation with synchronous citation counts after a few years, no previous studies have compared early Mendeley reader counts with later citation counts. In response, this first diachronic analysis compares reader counts within a month of publication with citation counts after 20 months for ten fields. There were moderate or strong correlations in eight out of ten fields, with the two exceptions being the smallest categories (n=18, 36) with wide confidence intervals. The correlations are higher than the correlations between later citations and early citations, showing that Mendeley reader counts are more useful early impact indicators than citation counts.
  • Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network

    Thelwall, Mike (Springer, 2018-03)
    Preprint archives play an important scholarly communication role within some fields. The impact of archives and individual preprints are difficult to analyse because online repositories are not indexed by the Web of Science or Scopus. In response, this article assesses whether the new Microsoft Academic can be used for citation analysis of preprint archives, focusing on the Social Science Research Network (SSRN). Although Microsoft Academic seems to index SSRN comprehensively, it groups a small fraction of SSRN papers into an easily retrievable set that has variations in character over time, making any field normalisation or citation comparisons untrustworthy. A brief parallel analysis of arXiv suggests that similar results would occur for other online repositories. Systematic analyses of preprint archives are nevertheless possible with Microsoft Academic when complete lists of archive publications are available from other sources because of its promising coverage and citation results.
  • A comparison of title words for journal articles and Wikipedia pages: Coverage and stylistic differences?

    Thelwall, Mike; Sud, Pardeep (La Fundación Española para la Ciencia y la Tecnología (FECYT), 2018-02-12)
    This article assesses whether there are gaps in Wikipedia’s coverage of academic information and whether there are non-obvious stylistic differences from academic journal articles that Wikipedia users and editors should be aware of. For this, it analyses terms in the titles of journal articles that are absent from all English Wikipedia page titles for each of 27 Scopus subject categories. The results show that English Wikipedia has lower coverage of issues of interest to non-English nations and there are gaps probably caused by a lack of willing subject specialist editors in some areas. There were also stylistic disciplinary differences in the results, with some fields using synonyms of “analysing” that were ignored in Wikipedia, and others using the present tense in titles to emphasise research outcomes. Since Wikipedia is broadly effective at covering academic research topics from all disciplines, it might be relied upon by non-specialists. Specialists should therefore check for coverage gaps within their areas for useful topics and librarians should caution users that important topics may be missing.
  • Differences between journals and years in the proportions of students, researchers and faculty registering Mendeley articles

    Thelwall, Mike (Springer, 2018-07)
    This article contains two investigations into Mendeley reader counts with the same dataset. Mendeley reader counts provide evidence of early scholarly impact for journal articles, but reflect the reading of a relatively young subset of all researchers. To investigate whether this age bias is constant or varies by narrow field and publication year, this article compares the proportions of student, researcher and faculty readers for articles published 1996-2016 in 36 large monodisciplinary journals. In these journals, undergraduates recorded the newest research and faculty the oldest, with large differences between journals. The existence of substantial differences in the composition of readers between related fields points to the need for caution when using Mendeley readers as substitutes for citations for broad fields. The second investigation shows, with the same data, that there are substantial differences between narrow fields in the time taken for Scopus citations to be as numerous as Mendeley readers. Thus, even narrow field differences can impact on the relative value of Mendeley compared to citation counts.
  • Could scientists use Altmetric.com scores to predict longer term citation counts?

    Thelwall, Mike; Nevill, Tamara (Elsevier, 2018-02)
    Altmetrics from Altmetric.com are widely used by publishers and researchers to give earlier evidence of attention than citation counts. This article assesses whether Altmetric.com scores are reliable early indicators of likely future impact and whether they may also reflect non-scholarly impacts. A preliminary factor analysis suggests that the main altmetric indicator of scholarly impact is Mendeley reader counts, with weaker news, informational and social network discussion/promotion dimensions in some fields. Based on a regression analysis of Altmetric.com data from November 2015 and Scopus citation counts from October 2017 for articles in 30 narrow fields, only Mendeley reader counts are consistent predictors of future citation impact. Most other Altmetric.com scores can help predict future impact in some fields. Overall, the results confirm that early Altmetric.com scores can predict later citation counts, although less well than journal impact factors, and the optimal strategy is to consider both Altmetric.com scores and journal impact factors. Altmetric.com scores can also reflect dimensions of non-scholarly impact in some fields.

View more