• Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh (Language Science Press, 2018-10-25)
    • The influence of highly cited papers on field normalised indicators

      Thelwall, Mike (Springer, 2019-01-05)
      Field normalised average citation indicators are widely used to compare countries, universities and research groups. The most common variant, the Mean Normalised Citation Score (MNCS), is known to be sensitive to individual highly cited articles but the extent to which this is true for a log-based alternative, the Mean Normalised Log Citation Score (MNLCS), is unknown. This article investigates country-level highly cited outliers for MNLCS and MNCS for all Scopus articles from 2013 and 2012. The results show that MNLCS is influenced by outliers, as measured by kurtosis, but at a much lower level than MNCS. The largest outliers were affected by the journal classifications, with the Science-Metrix scheme producing much weaker outliers than the internal Scopus scheme. The high Scopus outliers were mainly due to uncitable articles reducing the average in some humanities categories. Although outliers have a numerically small influence on the outcome for individual countries, changing indicator or classification scheme influences the results enough to affect policy conclusions drawn from them. Future field normalised calculations should therefore explicitly address the influence of outliers in their methods and reporting.
    • Grammatical annotation of historical Portuguese: Generating a corpus-based diachronic dictionary

      Bick, Eckhard; Zampieri, Marcos (Springer, 2016-09-03)
      In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method allows the creation of tailor-made standardization dictionaries for historical Portuguese with optional period or author frequencies.
    • Linguistic features of genre and method variation in translation: A computational perspective

      Lapshinova-Koltunski, Ekaterina; Zampieri, Marcos (Mouton De Gruyter, 2018-04-09)
      In this contribution we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus. For this purpose we use linguistically motivated features representing texts using a combination of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis on the main difference between genres and methods of translation.
    • Gender and research Publishing in India: Uniformly high inequality?

      Thelwall, Mike; Bailey, Carol; Makita, Meiko; Sud, Pardeep; Madalli, Devika P. (Elsevier, 2018-12-10)
      Gender inequalities have been a persistent feature of all modern societies. Although employment-related gender discrimination in various forms is legally prohibited, prejudice and violence against females have not been eradicated. Moreover, gendered social expectations can constrain the career choices of both males and females. Within academia, continuing gender imbalances have been found in many countries (Larivière, Ni, Gingras, Cronin, & Sugimoto, 2013), and particularly at senior levels (e.g., Ucal, O'Neil, & Toktas, 2015; Weisshaar, 2017; Winchester & Browning, 2015). India was the fifth largest research producer in 2017, according to Scopus, but has the highest United Nations Development Programme (UNDP) gender inequality index of the 30 largest research producers in Scopus (hdr.undp.org/en/data) and so is an important case for global science. Moreover, the complex web of influences that have led to women being underrepresented in science in India is not well understood (Gupta, 2015). The absence of basic information about gender inequalities is a serious limitation because gender issues in India differ from the better researched case of the USA, due to economic conditions, probably stronger family influences (Vindhya, 2007), greater female safety concerns (Vindhya, 2007), and differing cultural expectations (Chandrakar, 2014).
    • Gender differences in research areas, methods and topics: Can people and thing orientations explain the results?

      Thelwall, Mike; Bailey, Carol; Tobin, Catherine; Bradshaw, Noel-Ann (Elsevier, 2019-12-31)
      Although the gender gap in academia has narrowed, females are underrepresented within some fields in the USA. Prior research suggests that the imbalances between science, technology, engineering and mathematics fields may be partly due to greater male interest in things and greater female interest in people, or to off-putting masculine cultures in some disciplines. To seek more detailed insights across all subjects, this article compares practising US male and female researchers between and within 285 narrow Scopus fields inside 26 broad fields from their first-authored articles published in 2017. The comparison is based on publishing fields and the words used in article titles, abstracts, and keywords. The results cannot be fully explained by the people/thing dimensions. Exceptions include greater female interest in veterinary science and cell biology and greater male interest in abstraction, patients, and power/control fields, such as politics and law. These may be due to other factors, such as the ability of a career to provide status or social impact or the availability of alternative careers. As a possible side effect of the partial people/thing relationship, females are more likely to use exploratory and qualitative methods and males are more likely to use quantitative methods. The results suggest that the necessary steps of eliminating explicit and implicit gender bias in academia are insufficient and might be complemented by measures to make fields more attractive to minority genders.
    • She’s Reddit: A source of statistically significant gendered interest information

      Thelwall, Mike; Stuart, Emma (Elsevier, 2018-12-31)
      Information about gender differences in interests is necessary to disentangle the effects of discrimination and choice when gender inequalities occur, such as in employment. This article assesses gender differences in interests within the popular social news and entertainment site Reddit. A method to detect terms that are statistically significantly used more by males or females in 181 million comments in 100 subreddits shows that gender affects both the selection of subreddits and activities within most of them. The method avoids the hidden gender biases of topic modelling for this task. Although the method reveals statistically significant gender differences in interests for topics that are extensively discussed on Reddit, it cannot give definitive causes, and imitation and sharing within the site mean that additional checking is needed to verify the results. Nevertheless, with care, Reddit can serve as a useful source of insights into gender differences in interests.
    • Tweeting links to academic articles

      Thelwall, M.; Tsou, A.; Weingart, S.; Haustein, S. (2013-01-01)
      Academic articles are now frequently tweeted and so Twitter seems to be a useful tool for scholars to use to help keep up with publications and discussions in their fields. Perhaps as a result of this, tweet counts are increasingly used by digital libraries and journal websites as indicators of an article's interest or impact. Nevertheless, it is not known whether tweets are typically positive, neutral or critical, or how articles are normally tweeted. These are problems for those wishing to tweet articles effectively and for those wishing to know whether tweet counts in digital libraries should be taken seriously. In response, a pilot study content analysis was conducted of 270 tweets linking to articles in four journals, four digital libraries and two DOI URLs, collected over a period of eight months in 2012. The vast majority of the tweets echoed an article title (42%) or a brief summary (41%). One reason for summarising an article seemed to be to translate it for a general audience. Few tweets explicitly praised an article and none were critical. Most tweets did not directly refer to the article author, but some did and others were clearly self-citations. In summary, tweets containing links to scholarly articles generally provide little more than publicity, and so whilst tweet counts may provide evidence of the popularity of an article, the contents of the tweets themselves are unlikely to give deep insights into scientists' reactions to publications, except perhaps in special cases.
    • Do gendered citation advantages influence field participation? Four unusual fields in the USA 1996-2017

      Thelwall, Mike (Springer, 2018-09-29)
      Gender inequalities in science are an ongoing concern, but their current causes are not well understood. This article investigates four fields with unusual proportions of female researchers in the USA for their subject matter, according to some current theories. It assesses how their gender composition and gender differences in citation rates have changed over time. All fields increased their share of female first-authored research, but at varying rates. The results give no evidence of the importance of citations, despite their unusual gender characteristics. For example, the field with the highest share of female-authored research and the most rapid increase had the largest male citation advantage. Differing micro-specialisms seem more likely than bias to be a cause of gender differences in citation rates, when present.
    • Do prestigious Spanish scholarly book publishers have more teaching impact?

      Mas-Bleda, Amalia; Thelwall, Mike (Emerald Publishing Limited, 2018-10-10)
      Purpose: The purpose of this paper is to assess the educational value of prestigious and productive Spanish scholarly publishers based on mentions of their books in online scholarly syllabi. Design/methodology/approach: Syllabus mentions of 15,117 books from 27 publishers were searched for, manually checked and compared with Microsoft Academic (MA) citations. Findings: Most books published by Ariel, Síntesis, Tecnos and Cátedra have been mentioned in at least one online syllabus, indicating that their books have consistently high educational value. In contrast, few books published by the most productive publishers were mentioned in online syllabi. Prestigious publishers have both the highest educational impact based on syllabus mentions and the highest research impact based on MA citations. Research limitations/implications: The results might be different for other publishers. The online syllabus mentions found may be a small fraction of the syllabus mentions of the sampled books. Practical implications: Authors of Spanish-language social sciences and humanities books should consider general prestige when selecting a publisher if they want educational uptake for their work. Originality/value: This is the first study assessing book publishers based on syllabus mentions.
    • Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories

      Martín-Martín, Alberto; Orduna-Malea, Enrique; Thelwall, Mike; Delgado López-Cózar, Emilio (Elsevier, 2018-10-05)
      Despite citation counts from Google Scholar (GS), Web of Science (WoS), and Scopus being widely consulted by researchers and sometimes used in research evaluations, there is no recent or systematic evidence about the differences between them. In response, this paper investigates 2,448,055 citations to 2299 English-language highly-cited documents from 252 GS subject categories published in 2006, comparing GS, the WoS Core Collection, and Scopus. GS consistently found the largest percentage of citations across all areas (93%–96%), far ahead of Scopus (35%–77%) and WoS (27%–73%). GS found nearly all the WoS (95%) and Scopus (92%) citations. Most citations found only by GS were from non-journal sources (48%–65%), including theses, books, conference papers, and unpublished materials. Many were non-English (19%–38%), and they tended to be much less cited than citing sources that were also in Scopus or WoS. Despite the many unique GS citing sources, Spearman correlations between citation counts in GS and WoS or Scopus are high (0.78–0.99). They are lower in the Humanities, and lower between GS and WoS than between GS and Scopus. The results suggest that in all areas GS citation data is essentially a superset of WoS and Scopus, with substantial extra coverage.
    • A flexible framework for collocation retrieval and translation from parallel and comparable corpora

      Rivera, Oscar Mendoza; Mitkov, Ruslan; Corpas Pastor, Gloria (John Benjamins, 2018)
      This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable corpora. The methodology was developed with translators and language learners in mind. It is based on a phraseology framework, applies statistical techniques, and employs source tools and online resources. The collocation retrieval and translation has proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and how they can be better exploited to suit particular needs of target users.
    • Dissecting tweets in search of irony

      Rohanian, Omid; Taslimipoor, Shiva; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06-05)
      This paper describes the systems submitted to SemEval 2018 Task 3 “Irony detection in English tweets” for both subtasks A and B. The first system, which leverages a combination of sentiment, distributional semantic, and text surface features, ranked third among 44 teams on the official leaderboard for subtask A. The second system, which uses a slightly different representation of the features, ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast which leads to a vast coverage in detecting ironic content.
    • Does female-authored research have more educational impact than male-authored research?

      Thelwall, Mike (Levy Library Press, 2018-10-04)
      Female academics are more likely to be in teaching-related roles in some countries, including the USA. As a side effect of this, female-authored journal articles may tend to be more useful for students. This study assesses this hypothesis by investigating whether female first-authored research has more uptake in education than male first-authored research. Based on an analysis of Mendeley readers of articles from 2014 in five countries and 100 narrow Scopus subject categories, the results show that female-authored articles attract more student readers than male-authored articles in Spain, Turkey, the UK and USA but not India. They also attract fewer professorial readers in Spain, the UK and the USA, but not India and Turkey, and tend to be less popular with senior academics. Because the results are based on analysis of differences within narrow fields they cannot be accounted for by females working in more education-related disciplines. The apparent additional educational impact for female-authored research could be due to selecting more accessible micro-specialisms, however, such as health-related instruments within the instrumentation narrow field. Whatever the cause, the results suggest that citation-based research evaluations may undervalue the wider impact of female researchers.
    • Semantic discrimination based on knowledge and association

      Taslimipoor, Shiva; Rohanian, Omid; Ha, Le An; Corpas Pastor, Gloria; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06)
      This paper describes the system submitted to SemEval 2018 shared task 10 ‘Capturing Discriminative Attributes’. We use a combination of knowledge-based and co-occurrence features to capture the semantic difference between two words in relation to an attribute. We define scores based on association measures, ngram counts, word similarity, and ConceptNet relations. The system is ranked 4th (joint) on the official leaderboard of the task.
    • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

      Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018-10-31)
    • Can museums find male or female audiences online with YouTube?

      Thelwall, Michael (Emerald, 2018)
      Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
    • Avoiding Obscure Topics and Generalising Findings Produces Higher Impact Research

      Thelwall, Mike; University of Wolverhampton (Springer Berlin / Heidelberg, 2017-09-01)
      Much academic research is never cited and may be rarely read, indicating wasted effort from the authors, referees and publishers. One reason that an article could be ignored is that its topic is, or appears to be, too obscure to be of wide interest, even if excellent scholarship produced it. This paper reports a word frequency analysis of 874,411 English article titles from 18 different Scopus natural, formal, life and health sciences categories 2009-2015 to assess the likelihood that research on obscure (rarely researched) topics is less cited. In all categories examined, unusual words in article titles associate with below average citation impact research. Thus, researchers considering obscure topics may wish to reconsider, generalise their study, or choose a title that reflects the wider lessons that can be drawn. Authors should also consider including multiple concepts and purposes within their titles in order to attract a wider audience.
    • Aggressive language identification using word embeddings and sentiment features

      Orasan, Constantin (Association for Computational Linguistics, 2018-06-25)
      This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that for texts similar to the ones in the training dataset Random Forests work best, whilst for texts which are different SVMs are a better choice. The evaluation also showed that despite its simplicity the method performs well when compared with more elaborate methods.
    • Do females create higher impact research? Scopus citations and Mendeley readers for articles from five countries

      Thelwall, Mike (Elsevier, 2018-09-01)
      There are known gender imbalances in participation in scientific fields, from female dominance of nursing to male dominance of mathematics. It is not clear whether there is also a citation imbalance, with some claiming that male-authored research tends to be more cited. No previous study has assessed gender differences in the readers of academic research on a large scale, however. In response, this article assesses whether there are gender differences in the average citations and/or Mendeley readers of academic publications. Field normalised logged Scopus citations and Mendeley readers from mid-2018 for articles published in 2014 were investigated for articles with first authors from India, Spain, Turkey, the UK and the USA in up to 251 fields with at least 50 male and female authors. Although female-authored research is less cited in Turkey (−4.0%) and India (−3.6%), it is marginally more cited in Spain (0.4%), the UK (0.4%), and the USA (0.2%). Female-authored research has fewer Mendeley readers in India (−1.1%) but more in Spain (1.4%), Turkey (1.1%), the UK (2.7%) and the USA (3.0%). Thus, whilst there may be little practical gender difference in citation impact in countries with mature science systems, the higher female readership impact suggests a wider audience for female-authored research. The results also show that the conclusions from a gender analysis depend on the field normalisation method. A theoretically informed decision must therefore be made about which normalisation to use. The results also suggest that arithmetic mean-based field normalisation is favourable to males.
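
      Several abstracts in this list contrast arithmetic mean-based field normalisation (MNCS) with a log-based alternative (MNLCS). The sketch below is only an illustration of that contrast, not the authors' code. It assumes the commonly used definitions: each citation count c (or ln(1 + c) for the logged variant) is divided by the average of the same quantity over all articles in the same field, and the resulting ratios are averaged over a group's articles. The function name, the tuple layout and the toy data are hypothetical, and year matching is omitted for brevity.

```python
import math
from collections import defaultdict
from statistics import mean


def field_normalised_scores(articles, group):
    """Return (MNCS, MNLCS) for one group (e.g. a country).

    `articles` is a list of (field, group, citations) tuples (hypothetical layout).
    Assumes every field contains at least one cited article, so that the
    field averages used as denominators are non-zero.
    """
    raw = defaultdict(list)
    logged = defaultdict(list)
    for field, _, cites in articles:
        raw[field].append(cites)
        logged[field].append(math.log(1 + cites))

    # World (all-group) averages per field act as the normalisation baseline.
    raw_mean = {f: mean(v) for f, v in raw.items()}
    log_mean = {f: mean(v) for f, v in logged.items()}

    ncs = [c / raw_mean[f] for f, g, c in articles if g == group]
    nlcs = [math.log(1 + c) / log_mean[f] for f, g, c in articles if g == group]
    return mean(ncs), mean(nlcs)


# Toy data: group "A" has one extremely highly cited article.
toy = [
    ("physics", "A", 1000),
    ("physics", "A", 2),
    ("physics", "B", 3),
    ("physics", "B", 5),
    ("physics", "B", 4),
]
mncs_a, mnlcs_a = field_normalised_scores(toy, "A")
print(f"MNCS={mncs_a:.2f}  MNLCS={mnlcs_a:.2f}")
```

      In this toy case the single outlier pulls group A's MNCS well above its MNLCS, because the log transform compresses very large citation counts before averaging.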