Recent Submissions

  • Gender differences in research areas, methods and topics: Can people and thing orientations explain the results?

    Thelwall, Mike; Bailey, Carol; Tobin, Catherine; Bradshaw, Noel-Ann (Elsevier, 2019-12-31)
    Although the gender gap in academia has narrowed, females are underrepresented within some fields in the USA. Prior research suggests that the imbalances between science, technology, engineering and mathematics fields may be partly due to greater male interest in things and greater female interest in people, or to off-putting masculine cultures in some disciplines. To seek more detailed insights across all subjects, this article compares practising US male and female researchers between and within 285 narrow Scopus fields inside 26 broad fields from their first-authored articles published in 2017. The comparison is based on publishing fields and the words used in article titles, abstracts, and keywords. The results cannot be fully explained by the people/thing dimensions. Exceptions include greater female interest in veterinary science and cell biology and greater male interest in abstraction, patients, and power/control fields, such as politics and law. These may be due to other factors, such as the ability of a career to provide status or social impact or the availability of alternative careers. As a possible side effect of the partial people/thing relationship, females are more likely to use exploratory and qualitative methods and males are more likely to use quantitative methods. The results suggest that the necessary steps of eliminating explicit and implicit gender bias in academia are insufficient and might be complemented by measures to make fields more attractive to minority genders.
  • She’s Reddit: A source of statistically significant gendered interest information

    Thelwall, Mike; Stuart, Emma (Elsevier, 2018-12-31)
    Information about gender differences in interests is necessary to disentangle the effects of discrimination and choice when gender inequalities occur, such as in employment. This article assesses gender differences in interests within the popular social news and entertainment site Reddit. A method to detect terms that are statistically significantly used more by males or females in 181 million comments in 100 subreddits shows that gender affects both the selection of subreddits and activities within most of them. The method avoids the hidden gender biases of topic modelling for this task. Although the method reveals statistically significant gender differences in interests for topics that are extensively discussed on Reddit, it cannot give definitive causes, and imitation and sharing within the site mean that additional checking is needed to verify the results. Nevertheless, with care, Reddit can serve as a useful source of insights into gender differences in interests.
  • Do gendered citation advantages influence field participation? Four unusual fields in the USA 1996-2017

    Thelwall, Mike (Springer, 2018-09-29)
    Gender inequalities in science are an ongoing concern, but their current causes are not well understood. This article investigates four fields with unusual proportions of female researchers in the USA for their subject matter, according to some current theories. It assesses how their gender composition and gender differences in citation rates have changed over time. All fields increased their share of female first-authored research, but at varying rates. The results give no evidence of the importance of citations, despite their unusual gender characteristics. For example, the field with the highest share of female-authored research and the most rapid increase had the largest male citation advantage. Differing micro-specialisms seem more likely than bias to be a cause of gender differences in citation rates, when present.
  • Do prestigious Spanish scholarly book publishers have more teaching impact?

    Mas-Bleda, Amalia; Thelwall, Mike (Emerald Publishing Limited, 2018-10-10)
    Purpose: The purpose of this paper is to assess the educational value of prestigious and productive Spanish scholarly publishers based on mentions of their books in online scholarly syllabi. Design/methodology/approach: Syllabus mentions of 15,117 books from 27 publishers were searched for, manually checked and compared with Microsoft Academic (MA) citations. Findings: Most books published by Ariel, Síntesis, Tecnos and Cátedra have been mentioned in at least one online syllabus, indicating that their books have consistently high educational value. In contrast, few books published by the most productive publishers were mentioned in online syllabi. Prestigious publishers have both the highest educational impact based on syllabus mentions and the highest research impact based on MA citations. Research limitations/implications: The results might be different for other publishers. The online syllabus mentions found may be a small fraction of the syllabus mentions of the sampled books. Practical implications: Authors of Spanish-language social sciences and humanities books should consider general prestige when selecting a publisher if they want educational uptake for their work. Originality/value: This is the first study assessing book publishers based on syllabus mentions.
  • Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories

    Martín-Martín, Alberto; Orduna-Malea, Enrique; Thelwall, Mike; Delgado López-Cózar, Emilio (Elsevier, 2018-10-05)
    Despite citation counts from Google Scholar (GS), Web of Science (WoS), and Scopus being widely consulted by researchers and sometimes used in research evaluations, there is no recent or systematic evidence about the differences between them. In response, this paper investigates 2,448,055 citations to 2299 English-language highly-cited documents from 252 GS subject categories published in 2006, comparing GS, the WoS Core Collection, and Scopus. GS consistently found the largest percentage of citations across all areas (93%–96%), far ahead of Scopus (35%–77%) and WoS (27%–73%). GS found nearly all the WoS (95%) and Scopus (92%) citations. Most citations found only by GS were from non-journal sources (48%–65%), including theses, books, conference papers, and unpublished materials. Many were non-English (19%–38%), and they tended to be much less cited than citing sources that were also in Scopus or WoS. Despite the many unique GS citing sources, Spearman correlations between citation counts in GS and WoS or Scopus are high (0.78-0.99). They are lower in the Humanities, and lower between GS and WoS than between GS and Scopus. The results suggest that in all areas GS citation data is essentially a superset of WoS and Scopus, with substantial extra coverage.
  • A flexible framework for collocation retrieval and translation from parallel and comparable corpora

    Rivera, Oscar Mendoza; Mitkov, Ruslan; Corpas Pastor, Gloria (John Benjamins, 2018)
    This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable corpora. The methodology was developed with translators and language learners in mind. It is based on a phraseology framework, applies statistical techniques, and employs source tools and online resources. The collocation retrieval and translation has proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and how they can be better exploited to suit particular needs of target users.
  • Dissecting tweets in search of irony

    Rohanian, Omid; Taslimipoor, Shiva; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06-05)
    This paper describes the systems submitted to SemEval 2018 Task 3 “Irony detection in English tweets” for both subtasks A and B. The first system, which leverages a combination of sentiment, distributional semantic, and text surface features, ranked third among 44 teams on the official leaderboard for subtask A. The second system, which uses a slightly different representation of the features, ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast, which leads to broad coverage in detecting ironic content.
  • Does female-authored research have more educational impact than male-authored research?

    Thelwall, Mike (Levy Library Press, 2018-10-04)
    Female academics are more likely to be in teaching-related roles in some countries, including the USA. As a side effect of this, female-authored journal articles may tend to be more useful for students. This study assesses this hypothesis by investigating whether female first-authored research has more uptake in education than male first-authored research. Based on an analysis of Mendeley readers of articles from 2014 in five countries and 100 narrow Scopus subject categories, the results show that female-authored articles attract more student readers than male-authored articles in Spain, Turkey, the UK and USA but not India. They also attract fewer professorial readers in Spain, the UK and the USA, but not India and Turkey, and tend to be less popular with senior academics. Because the results are based on analysis of differences within narrow fields they cannot be accounted for by females working in more education-related disciplines. The apparent additional educational impact for female-authored research could be due to selecting more accessible micro-specialisms, however, such as health-related instruments within the instrumentation narrow field. Whatever the cause, the results suggest that citation-based research evaluations may undervalue the wider impact of female researchers.
  • Semantic discrimination based on knowledge and association

    Taslimipoor, Shiva; Rohanian, Omid; Ha, Le An; Corpas Pastor, Gloria; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06)
    This paper describes the system submitted to SemEval 2018 shared task 10 ‘Capturing Discriminative Attributes’. We use a combination of knowledge-based and co-occurrence features to capture the semantic difference between two words in relation to an attribute. We define scores based on association measures, ngram counts, word similarity, and ConceptNet relations. The system is ranked 4th (joint) on the official leaderboard of the task.
  • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

    Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018)
  • Can museums find male or female audiences online with YouTube?

    Thelwall, Michael (Emerald, 2018)
    Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
  • Aggressive language identification using word embeddings and sentiment features

    Orasan, Constantin (Association for Computational Linguistics, 2018-06-25)
    This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that, for texts similar to the ones in the training dataset, Random Forests work best, whilst for texts which are different, SVMs are a better choice. The evaluation also showed that, despite its simplicity, the method performs well when compared with more elaborate methods.
  • Do females create higher impact research? Scopus citations and Mendeley readers for articles from five countries

    Thelwall, Mike (Elsevier, 2018-09-01)
    There are known gender imbalances in participation in scientific fields, from female dominance of nursing to male dominance of mathematics. It is not clear whether there is also a citation imbalance, with some claiming that male-authored research tends to be more cited. No previous study has assessed gender differences in the readers of academic research on a large scale, however. In response, this article assesses whether there are gender differences in the average citations and/or Mendeley readers of academic publications. Field normalised logged Scopus citations and Mendeley readers from mid-2018 for articles published in 2014 were investigated for articles with first authors from India, Spain, Turkey, the UK and the USA in up to 251 fields with at least 50 male and female authors. Although female-authored research is less cited in Turkey (−4.0%) and India (−3.6%), it is marginally more cited in Spain (0.4%), the UK (0.4%), and the USA (0.2%). Female-authored research has fewer Mendeley readers in India (−1.1%) but more in Spain (1.4%), Turkey (1.1%), the UK (2.7%) and the USA (3.0%). Thus, whilst there may be little practical gender difference in citation impact in countries with mature science systems, the higher female readership impact suggests a wider audience for female-authored research. The results also show that the conclusions from a gender analysis depend on the field normalisation method. A theoretically informed decision must therefore be made about which normalisation to use. The results also suggest that arithmetic mean-based field normalisation is favourable to males.
  • Which US and European Higher Education Institutions are visible in ResearchGate and what affects their RG Score?

    Lepori, Benedetto; Thelwall, Michael; Hoorani, Bareerah Hafeez (Elsevier, 2018-07-19)
    While ResearchGate has become the most popular academic social networking site in terms of regular users, not all institutions have joined and the scores it assigns to academics and institutions are controversial. This paper assesses the presence in ResearchGate of higher education institutions in Europe and the US in 2017, and the extent to which institutional ResearchGate Scores reflect institutional academic impact. Most of the 2258 European and 4355 US higher educational institutions included in the sample had an institutional ResearchGate profile, with near universal coverage for PhD-awarding institutions found in the Web of Science (WoS). For non-PhD awarding institutions that did not publish, size (number of staff members) was most associated with presence in ResearchGate. For PhD-awarding institutions in WoS, presence in RG was strongly related to the number of WoS publications. In conclusion, a) institutional RG scores reflect research volume more than visibility and b) this indicator is highly correlated to the number of WoS publications. Hence, the value of RG Scores for institutional comparisons is limited.
  • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

    Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
    Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.
  • Bilingual contexts from comparable corpora to mine for translations of collocations

    Taslimipoor, Shiva (Springer, 2018-03-21)
    Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
  • Academic information on Twitter: A user survey

    Mohammadi, Ehsan; Thelwall, Mike; Kwasny, Mary; Holmes, Kristi L. (PLOS, 2018-05-17)
    Although counts of tweets citing academic papers are used as an informal indicator of interest, little is known about who tweets academic papers and who uses Twitter to find scholarly information. Without knowing this, it is difficult to draw useful conclusions from a publication being frequently tweeted. This study surveyed 1,912 users who have tweeted journal articles to ask about their scholarly-related Twitter uses. Almost half of the respondents (45%) did not work in academia, despite the sample probably being biased towards academics. Twitter was used most by people with a social science or humanities background. People tend to leverage social ties on Twitter to find information rather than searching for relevant tweets. Twitter is used in academia to acquire and share real-time information and to develop connections with others. Motivations for using Twitter vary by discipline, occupation, and employment sector, but not much by gender. These factors also influence the sharing of different types of academic information. This study provides evidence that Twitter plays a significant role in the discovery of scholarly information and cross-disciplinary knowledge spreading. Most importantly, the large numbers of non-academic users support the claims of those using tweet counts as evidence for the non-academic impacts of scholarly research.
  • Assessing the teaching value of non-English academic books: The case of Spain

    Mas Bleda, Amalia; Thelwall, Mike (Consejo Superior de Investigaciones Científicas, 2018-12-01)
  • Leveraging large corpora for translation using the Sketch Engine

    Moze, Sarah; Krek, Simon (Cambridge University Press, 2018)
  • Co-saved, co-tweeted, and co-cited networks

    Didegah, Fereshteh [Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government, Aarhus University, Aarhus, Denmark]; Thelwall, Mike [Statistical Cybermetrics Research Group, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, UK] (Wiley-Blackwell, 2018-05-14)
    Counts of tweets and Mendeley user libraries have been proposed as altmetric alternatives to citation counts for the impact assessment of articles. Although both have been investigated to discover whether they correlate with article citations, it is not known whether users tend to tweet or save (in Mendeley) the same kinds of articles that they cite. In response, this article compares pairs of articles that are tweeted, saved to a Mendeley library, or cited by the same user, but possibly a different user for each source. The study analyzes 1,131,318 articles published in 2012, with minimum tweeted (10), saved to Mendeley (100), and cited (10) thresholds. The results show surprisingly minor overall overlaps between the three phenomena. The importance of journals for Twitter and the presence of many bots at different levels of activity suggest that this site has little value for impact altmetrics. The moderate differences between patterns of saving and citation suggest that Mendeley can be used for some types of impact assessments, but sensitivity is needed for underlying differences.
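
    The comparison described in the abstract above, of article pairs linked by a shared tweeter, Mendeley saver, or citing source, can be illustrated with a small sketch. This is only a toy example under stated assumptions: the user and article identifiers are invented, the helper names `co_pairs` and `jaccard` are hypothetical, and Jaccard overlap is an illustrative choice rather than the measure used in the paper.

```python
from itertools import combinations


def co_pairs(user_items):
    """Return the set of unordered article pairs that share at least one user.

    user_items maps a (hypothetical) user or source id to the articles it
    tweeted, saved, or cited.
    """
    pairs = set()
    for items in user_items.values():
        # Sort so each unordered pair has one canonical representation.
        for a, b in combinations(sorted(set(items)), 2):
            pairs.add((a, b))
    return pairs


def jaccard(x, y):
    """Jaccard overlap of two pair sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Invented toy data, not taken from the study.
tweeted = {"u1": ["A", "B", "C"], "u2": ["B", "D"]}
saved = {"r1": ["A", "B"], "r2": ["C", "D"]}
cited = {"p1": ["A", "B"], "p2": ["B", "D"]}

co_tweeted = co_pairs(tweeted)
co_saved = co_pairs(saved)
co_cited = co_pairs(cited)

print("tweeted vs saved:", jaccard(co_tweeted, co_saved))
print("tweeted vs cited:", jaccard(co_tweeted, co_cited))
print("saved vs cited:", jaccard(co_saved, co_cited))
```

    A low overlap between two pair sets would indicate that the two activities tend to link different articles, which is the kind of pattern the abstract reports.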
