Recent Submissions

  • Using gaze data to predict multiword expressions

    Rohanian, Omid; Taslimipoor, Shiva; Yaneva, Victoria; Ha, Le An (INCOMA Ltd, 2017-09-01)
    In recent years gaze data has been increasingly used to improve and evaluate NLP models due to the fact that it carries information about the cognitive processing of linguistic phenomena. In this paper we conduct a preliminary study towards the automatic identification of multiword expressions based on gaze features from native and non-native speakers of English. We report comparisons between a part-of-speech (POS) and frequency baseline to: i) a prediction model based solely on gaze data and ii) a combined model of gaze data, POS and frequency. In spite of the challenging nature of the task, best performance was achieved by the latter. Furthermore, we explore how the type of gaze data (from native versus non-native speakers) affects the prediction, showing that data from the two groups is discriminative to an equal degree. Finally, we show that late processing measures are more predictive than early ones, which is in line with previous research on idioms and other formulaic structures.
  • Should citations be counted separately from each originating section?

    Thelwall, Mike (Elsevier, 2019-04-03)
    Articles are cited for different purposes and differentiating between reasons when counting citations may therefore give finer-grained citation count information. Although identifying and aggregating the individual reasons for each citation may be impractical, recording the number of citations that originate from different article sections might illuminate the general reasons behind a citation count (e.g., 110 citations = 10 Introduction citations + 100 Methods citations). To help investigate whether this could be a practical and universal solution, this article compares 19 million citations with DOIs from six different standard sections in 799,055 PubMed Central open access articles across 21 out of 22 fields. There are apparently non-systematic differences between fields in the most citing sections and the extent to which citations from one section overlap with citations from another, with some degree of overlap in most cases. Thus, at a science-wide level, section headings are partly unreliable indicators of citation context, even if they are more standard within individual fields. They may still be used within fields to help identify individual highly cited articles that have had one type of impact, especially methodological (Methods) or context setting (Introduction), but expert judgement is needed to validate the results.
  • The rhetorical structure of science? A multidisciplinary analysis of article headings

    Thelwall, Mike (Elsevier, 2019-03-19)
    An effective structure helps an article to convey its core message. The optimal structure depends on the information to be conveyed and the expectations of the audience. In the current increasingly interdisciplinary era, structural norms can be confusing to the authors, reviewers and audiences of scientific articles. Despite this, no prior study has attempted to assess variations in the structure of academic papers across all disciplines. This article reports on the headings commonly used by over 1 million research articles from the PubMed Central Open Access collection, spanning 22 broad categories covering all academia and 172 out of 176 narrow categories. The results suggest that no headings are close to ubiquitous in any broad field and that there are substantial differences in the extent to which most headings are used. In the humanities, headings may be avoided altogether. Researchers should therefore be aware of unfamiliar structures that are nevertheless legitimate when reading, writing and reviewing articles.
  • FGFR1 expression and role in migration in low and high grade pediatric gliomas

    Egbivwie, Naomi; Cockle, Julia V.; Humphries, Matthew; Ismail, Azzam; Esteves, Filomena; Taylor, Claire; Karakoula, Katherine; Morton, Ruth; Warr, Tracy; Short, Susan C.; Brüning-Richardson, Anke (Frontiers Media, 2019-03-13)
    The heterogeneous and invasive nature of pediatric gliomas poses significant treatment challenges, highlighting the importance of identifying novel chemotherapeutic targets. Recently, recurrent Fibroblast growth factor receptor 1 (FGFR1) mutations in pediatric gliomas have been reported. Here, we explored the clinical relevance of FGFR1 expression, cell migration in low and high grade pediatric gliomas and the role of FGFR1 in cell migration/invasion as a potential chemotherapeutic target. A high density tissue microarray (TMA) was used to investigate associations between FGFR1 and activated phosphorylated FGFR1 (pFGFR1) expression and various clinicopathologic parameters. Expression of FGFR1 and pFGFR1 were measured by immunofluorescence and by immunohistochemistry (IHC) in 3D spheroids in five rare patient-derived pediatric low-grade glioma (pLGG) and two established high-grade glioma (pHGG) cell lines. Two-dimensional (2D) and three-dimensional (3D) migration assays were performed for migration and inhibitor studies with three FGFR1 inhibitors. High FGFR1 expression was associated with age, malignancy, tumor location and tumor grade among astrocytomas. Membranous pFGFR1 was associated with malignancy and tumor grade. All glioma cell lines exhibited varying levels of FGFR1 and pFGFR1 expression and migratory phenotypes. There were significant anti-migratory effects on the pHGG cell lines with inhibitor treatment and anti-migratory or pro-migratory responses to FGFR1 inhibition in the pLGGs. Our findings support further research to target FGFR1 signaling in pediatric gliomas.
  • The way to analyse ‘way’: A case study in word-specific local grammar

    Hanks, Patrick; Može, Sara (Oxford Academic, 2019-02-11)
    Traditionally, dictionaries are meaning-driven—that is, they list different senses (or supposed senses) of each word, but do not say much about the phraseology that distinguishes one sense from another. Grammars, on the other hand, are structure-driven: they attempt to describe all possible structures of a language, but say little about meaning, phraseology, or collocation. In both disciplines during the 20th century, the practice of inventing evidence rather than discovering it led to intermittent and unpredictable distortions of fact. Since 1987, attempts have been made in both lexicography (Cobuild) and syntactic theory (pattern grammar, construction grammar) to integrate meaning and phraseology. Corpora now provide empirical evidence on a large scale for lexicosyntactic description, but there is still a long way to go. Many cherished beliefs must be abandoned before a synthesis between empirical lexical analysis and grammatical theory can be achieved. In this paper, by empirical analysis of just one word (the noun way), we show how corpus evidence can be used to tackle the complexities of lexical and constructional meaning, providing new insights into the lexis-grammar interface.
  • Effects of lexical properties on viewing time per word in autistic and neurotypical readers

    Štajner, Sanja; Yaneva, Victoria; Mitkov, Ruslan; Ponzetto, Simone Paolo (Association for Computational Linguistics, 2017-09-08)
    Eye tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare and ambiguous are more difficult to read. However, online processing techniques have been scarcely applied to investigating the reading difficulties of people with autism and what vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers and show that the former required higher cognitive effort to comprehend the texts as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively-based measures (word concreteness, familiarity, age of acquisition and imageability).
  • Classifying referential and non-referential it using gaze

    Yaneva, Victoria; Ha, Le An; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics, 2018-10-31)
    When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper, we use eye-tracking data to learn how humans perform this disambiguation. We use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguistic-based approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
  • The reading background of Goodreads book club members: A female fiction canon?

    Thelwall, Mike; Bourrier, Karen (Emerald, 2019-12-31)
    Purpose – Despite the social, educational and therapeutic benefits of book clubs, little is known about which books participants are likely to have read. In response, this article investigates the public bookshelves of those who have joined a group within the Goodreads social network site. Design/methodology/approach – Books listed as read by members of fifty large English language Goodreads groups - with a genre focus or other theme - were compiled by author and title. Findings – Recent and youth-oriented fiction dominate the fifty books most read by book club members, while almost half are works of literature frequently taught at the secondary and postsecondary level (literary classics). Whilst JK Rowling is almost ubiquitous (at least 63% as frequently listed as other authors in any group, including groups for other genres), most authors, including Shakespeare (15%), Golding (6%) and Hemingway (9%), are little read by some groups. Nor are individual recent literary prize-winners or works in languages other than English frequently read. Research limitations/implications – Although these results are derived from a single popular website, knowing more about what book club members are likely to have read should help participants, organisers and moderators. For example, recent literary prize winners might be a good choice, given that few members may have read them. Originality/value – This is the first large scale study of book group members’ reading patterns. Whilst typical reading is likely to vary by group theme and average age, there seems to be a mainly female canon of about 14 authors and 19 books that Goodreads book club members are likely to have read.
  • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

    Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
    Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
  • Six good predictors of autistic reading comprehension

    Yaneva, Victoria; Evans, Richard (INCOMA Ltd, 2015-09-07)
    This paper presents our investigation of the ability of 33 readability indices to account for the reading comprehension difficulty posed by texts for people with autism. The evaluation by autistic readers of 16 text passages is described, a process which led to the production of the first text collection for which readability has been evaluated by people with autism. We present the findings of a study to determine which of the 33 indices can successfully discriminate between the difficulty levels of the text passages, as determined by our reading experiment involving autistic participants. The discriminatory power of the indices is further assessed through their application to the FIRST corpus which consists of 25 texts presented in their original form and in a manually simplified form (50 texts in total), produced specifically for readers with autism.
  • Web users with autism: eye tracking evidence for differences

    Eraslan, Sukru; Yaneva, Victoria; Yesilada, Yeliz; Harper, Simon (Taylor and Francis, 2018-12-11)
    Anecdotal evidence suggests that people with autism may have different processing strategies when accessing the web. However, limited empirical evidence is available to support this. This paper presents an eye tracking study with 18 participants with high-functioning autism and 18 neurotypical participants to investigate the similarities and differences between these two groups in terms of how they search for information within web pages. According to our analysis, people with autism are likely to be less successful in completing their searching tasks. They also have a tendency to look at more elements on web pages and make more transitions between the elements in comparison to neurotypical people. In addition, they tend to make shorter but more frequent fixations on elements which are not directly related to a given search task. Therefore, this paper presents the first empirical study to investigate how people with autism differ from neurotypical people when they search for information within web pages based on an in-depth statistical analysis of their gaze patterns.
  • Identification of multiword expressions: A fresh look at modelling and evaluation

    Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh (Language Science Press, 2018-10-25)
  • The influence of highly cited papers on field normalised indicators

    Thelwall, Mike (Springer, 2019-01-05)
    Field normalised average citation indicators are widely used to compare countries, universities and research groups. The most common variant, the Mean Normalised Citation Score (MNCS), is known to be sensitive to individual highly cited articles but the extent to which this is true for a log-based alternative, the Mean Normalised Log Citation Score (MNLCS), is unknown. This article investigates country-level highly cited outliers for MNLCS and MNCS for all Scopus articles from 2013 and 2012. The results show that MNLCS is influenced by outliers, as measured by kurtosis, but at a much lower level than MNCS. The largest outliers were affected by the journal classifications, with the Science-Metrix scheme producing much weaker outliers than the internal Scopus scheme. The high Scopus outliers were mainly due to uncitable articles reducing the average in some humanities categories. Although outliers have a numerically small influence on the outcome for individual countries, changing indicator or classification scheme influences the results enough to affect policy conclusions drawn from them. Future field normalised calculations should therefore explicitly address the influence of outliers in their methods and reporting.
  • Grammatical annotation of historical Portuguese: Generating a corpus-based diachronic dictionary

    Bick, Eckhard; Zampieri, Marcos (Springer, 2016-09-03)
    In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method allows the creation of tailor-made standardization dictionaries for historical Portuguese with optional period or author frequencies.
  • Linguistic features of genre and method variation in translation: A computational perspective

    Lapshinova-Koltunski, Ekaterina; Zampieri, Marcos (Mouton De Gruyter, 2018-04-09)
    In this contribution we describe the use of text classification methods to investigate genre and method variation in an English - German translation corpus. For this purpose we use linguistically motivated features representing texts using a combination of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis on the main difference between genres and methods of translation.
  • Gender and research Publishing in India: Uniformly high inequality?

    Thelwall, Mike; Bailey, Carol; Makita, Meiko; Sud, Pardeep; Madalli, Devika P. (Elsevier, 2018-12-10)
    Gender inequalities have been a persistent feature of all modern societies. Although employment-related gender discrimination in various forms is legally prohibited, prejudice and violence against females have not been eradicated. Moreover, gendered social expectations can constrain the career choices of both males and females. Within academia, continuing gender imbalances have been found in many countries (Larivière, Ni, Gingras, Cronin, & Sugimoto, 2013), and particularly at senior levels (e.g., Ucal, O'Neil, & Toktas, 2015; Weisshaar, 2017; Winchester & Browning, 2015). India was the fifth largest research producer in 2017, according to Scopus, but has the highest United Nations Development Programme (UNDP) gender inequality index of the 30 largest research producers in Scopus (/hdr.undp.org/en/data) and so is an important case for global science. Moreover, the complex web of influences that have led to women being underrepresented in science in India is not well understood (Gupta, 2015). The absence of basic information about gender inequalities is a serious limitation because gender issues in India differ from the better researched case of the USA, due to economic conditions, probably stronger family influences (Vindhya, 2007), greater female safety concerns (Vindhya, 2007), and differing cultural expectations (Chandrakar, 2014).
  • Gender differences in research areas, methods and topics: Can people and thing orientations explain the results?

    Thelwall, Mike; Bailey, Carol; Tobin, Catherine; Bradshaw, Noel-Ann (Elsevier, 2019-12-31)
    Although the gender gap in academia has narrowed, females are underrepresented within some fields in the USA. Prior research suggests that the imbalances between science, technology, engineering and mathematics fields may be partly due to greater male interest in things and greater female interest in people, or to off-putting masculine cultures in some disciplines. To seek more detailed insights across all subjects, this article compares practising US male and female researchers between and within 285 narrow Scopus fields inside 26 broad fields from their first-authored articles published in 2017. The comparison is based on publishing fields and the words used in article titles, abstracts, and keywords. The results cannot be fully explained by the people/thing dimensions. Exceptions include greater female interest in veterinary science and cell biology and greater male interest in abstraction, patients, and power/control fields, such as politics and law. These may be due to other factors, such as the ability of a career to provide status or social impact or the availability of alternative careers. As a possible side effect of the partial people/thing relationship, females are more likely to use exploratory and qualitative methods and males are more likely to use quantitative methods. The results suggest that the necessary steps of eliminating explicit and implicit gender bias in academia are insufficient and might be complemented by measures to make fields more attractive to minority genders.
  • She’s Reddit: A source of statistically significant gendered interest information

    Thelwall, Mike; Stuart, Emma (Elsevier, 2018-12-31)
    Information about gender differences in interests is necessary to disentangle the effects of discrimination and choice when gender inequalities occur, such as in employment. This article assesses gender differences in interests within the popular social news and entertainment site Reddit. A method to detect terms that are statistically significantly used more by males or females in 181 million comments in 100 subreddits shows that gender affects both the selection of subreddits and activities within most of them. The method avoids the hidden gender biases of topic modelling for this task. Although the method reveals statistically significant gender differences in interests for topics that are extensively discussed on Reddit, it cannot give definitive causes, and imitation and sharing within the site mean that additional checking is needed to verify the results. Nevertheless, with care, Reddit can serve as a useful source of insights into gender differences in interests.
  • Tweeting links to academic articles

    Thelwall, M.; Tsou, A.; Weingart, S.; Haustein, S. (2013-01-01)
    Academic articles are now frequently tweeted and so Twitter seems to be a useful tool for scholars to use to help keep up with publications and discussions in their fields. Perhaps as a result of this, tweet counts are increasingly used by digital libraries and journal websites as indicators of an article's interest or impact. Nevertheless, it is not known whether tweets are typically positive, neutral or critical, or how articles are normally tweeted. These are problems for those wishing to tweet articles effectively and for those wishing to know whether tweet counts in digital libraries should be taken seriously. In response, a pilot study content analysis was conducted of 270 tweets linking to articles in four journals, four digital libraries and two DOI URLs, collected over a period of eight months in 2012. The vast majority of the tweets echoed an article title (42%) or a brief summary (41%). One reason for summarising an article seemed to be to translate it for a general audience. Few tweets explicitly praised an article and none were critical. Most tweets did not directly refer to the article author, but some did and others were clearly self-citations. In summary, tweets containing links to scholarly articles generally provide little more than publicity, and so whilst tweet counts may provide evidence of the popularity of an article, the contents of the tweets themselves are unlikely to give deep insights into scientists' reactions to publications, except perhaps in special cases.
  • Do gendered citation advantages influence field participation? Four unusual fields in the USA 1996-2017

    Thelwall, Mike (Springer, 2018-09-29)
    Gender inequalities in science are an ongoing concern, but their current causes are not well understood. This article investigates four fields with unusual proportions of female researchers in the USA for their subject matter, according to some current theories. It assesses how their gender composition and gender differences in citation rates have changed over time. All fields increased their share of female first-authored research, but at varying rates. The results give no evidence of the importance of citations, despite their unusual gender characteristics. For example, the field with the highest share of female-authored research and the most rapid increase had the largest male citation advantage. Differing micro-specialisms seem more likely than bias to be a cause of gender differences in citation rates, when present.
