• Toponym detection in the bio-medical domain: A hybrid approach with deep learning

      Plum, Alistair; Ranasinghe, Tharindu; Orăsan, Constantin (RANLP, 2019-09-02)
      This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
    • Cross-lingual transfer learning and multitask learning for capturing multiword expressions

      Taslimipoor, Shiva; Rohanian, Omid; Ha, Le An (Association for Computational Linguistics, 2019-08-31)
      Recent developments in deep learning have prompted a surge of interest in the application of multitask and transfer learning to NLP problems. In this study, we explore, for the first time, the application of transfer learning (TRL) and multitask learning (MTL) to the identification of Multiword Expressions (MWEs). For MTL, we exploit the shared syntactic information between MWE and dependency parsing models to jointly train a single model on both tasks. We specifically predict two types of labels: MWE and dependency parse. Our neural MTL architecture utilises the supervision of dependency parsing in lower layers and predicts MWE tags in upper layers. In the TRL scenario, we overcome the scarcity of data by learning a model on a larger MWE dataset and transferring the knowledge to a resource-poor setting in another language. In both scenarios, the resulting models achieved higher performance compared to standard neural approaches.
    • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

      Ha, Le; Yaneva, Victoria; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
      Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.
    • Computing Happiness from Textual Data

      Mohamed, Emad; Mostafa, Safa (MDPI, 2019-07-03)
      In this paper, we use a corpus of about 100,000 happy moments written by people of different genders, marital statuses, parenthood statuses, and ages to explore the following questions: Are there differences between men and women, married and unmarried individuals, parents and non-parents, and people of different age groups in terms of their causes of happiness and how they express happiness? Can gender, marital status, parenthood status and/or age be predicted from textual data expressing happiness? The first question is tackled in two steps: first, we transform the happy moments into a set of topics, lemmas, part of speech sequences, and dependency relations; then, we use each set as predictors in multi-variable binary and multinomial logistic regressions to rank these predictors in terms of their influence on each outcome variable (gender, marital status, parenthood status and age). For the prediction task, we use character, lexical, grammatical, semantic, and syntactic features in a machine learning document classification approach. The classification algorithms used include logistic regression, gradient boosting, and fastText. Our results show that textual data expressing moments of happiness can be quite beneficial in understanding the “causes of happiness” for different social groups, and that social characteristics like gender, marital status, parenthood status, and, to some extent, age, can be successfully predicted from such textual data. This research aims to bring together elements from philosophy and psychology to be examined by computational corpus linguistics methods in a way that promotes the use of Natural Language Processing for the Humanities.
    • Exploiting Data-Driven Hybrid Approaches to Translation in the EXPERT Project

      Orăsan, Constantin; Escartín, Carla Parra; Torres, Lianet Sepúlveda; Barbu, Eduard; Ji, Meng; Oakes, Michael (Cambridge University Press, 2019-06-13)
      Technologies have transformed the way we work, and this is also applicable to the translation industry. In the past thirty to thirty-five years, professional translators have experienced an increased technification of their work. Barely thirty years ago, a professional translator would not have received a translation assignment attached to an e-mail or via FTP and yet, for the younger generation of professional translators, receiving an assignment by electronic means is the only reality they know. In addition, as pointed out in several works such as Folaron (2010) and Kenny (2011), professional translators now have a myriad of tools available to use in the translation process.
    • RGCL-WLV at SemEval-2019 Task 12: Toponym Detection

      Plum, Alistair; Ranasinghe, Tharindu; Calleja, Pablo; Orăsan, Constantin; Mitkov, Ruslan (ACL, 2019-06-07)
      This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit at a relatively low recall of 49%.
    • Are classic references cited first? An analysis of citation order within article sections

      Thelwall, Mike (Springer, 2019-06-07)
      Early citations within an article section may have an agenda-setting role but contribute little to the new research. To investigate whether this practice may be common, this article assesses whether the average impact of cited references is influenced by the order in which they are cited within article sections. This is tested on 1,683,299,868 citations to 41,068,375 unique journal articles from 1,470,209 research articles in the PubMed Open Access collection, split into 22 fields. The results show that the first cited articles in the Introduction and Background have much higher average citation impacts than later articles, and the same is true to a lesser extent for the Discussion and Conclusion in most fields, but not the Methods and Results. The findings do not prove that early citations are less central to the citing article but nevertheless add to previous evidence suggesting that this practice may be widespread. It may therefore be useful to distinguish between initial introductory citations when evaluating citation impact, or to use impact indicators that implicitly or explicitly give less weight to the citation counts of highly cited articles.
    • GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks

      Taslimipoor, Shiva; Rohanian, Omid; Može, Sara (Association for Computational Linguistics, 2019-06-06)
      This paper describes the system submitted to the SemEval 2019 shared task 1 ‘Cross-lingual Semantic Parsing with UCCA’. We rely on the semantic dependency parse trees provided in the shared task, which are converted from the original UCCA files, and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CoNLL-U format of the input data and is best suited for semantic dependency parsing.
    • Bridging the gap: attending to discontinuity in identification of multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Kouchaki, Samaneh; Ha, Le An; Mitkov, Ruslan (Association for Computational Linguistics, 2019-06-05)
      We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.
    • Predicting the Type and Target of Offensive Posts in Social Media

      Zampieri, Marcos; Malmasi, Shervin; Nakov, Preslav; Rosenthal, Sara; Farra, Noura; Kumar, Ritesh (Association for Computational Linguistics, 2019-06-01)
      As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.
    • Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic

      Mohamed, Emad; Sayed, Zeeshan (ACM, 2019-05-31)
      While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
    • Adults with High-functioning Autism Process Web Pages With Similar Accuracy but Higher Cognitive Effort Compared to Controls

      Yaneva, Victoria; Ha, Le; Eraslan, Sukru; Yesilada, Yeliz (ACM, 2019-05-31)
      To accommodate the needs of web users with high-functioning autism, a designer's only option at present is to rely on guidelines that: i) have not been empirically evaluated and ii) do not account for the different levels of autism severity. Before designing effective interventions, we need to obtain an empirical understanding of the aspects that specific user groups need support with. This has not yet been done for web users at the high ends of the autism spectrum, as often they appear to execute tasks effortlessly, without facing barriers related to their neurodiverse processing style. This paper investigates the accuracy and efficiency with which high-functioning web users with autism and a control group of neurotypical participants obtain information from web pages. Measures include answer correctness and a number of eye-tracking features. The results indicate similar levels of accuracy for the two groups at the expense of efficiency for the autism group, showing that the autism group invests more cognitive effort in order to achieve the same results as their neurotypical counterparts.
    • Should citations be counted separately from each originating section?

      Thelwall, Mike (Elsevier, 2019-04-03)
      Articles are cited for different purposes and differentiating between reasons when counting citations may therefore give finer-grained citation count information. Although identifying and aggregating the individual reasons for each citation may be impractical, recording the number of citations that originate from different article sections might illuminate the general reasons behind a citation count (e.g., 110 citations = 10 Introduction citations + 100 Methods citations). To help investigate whether this could be a practical and universal solution, this article compares 19 million citations with DOIs from six different standard sections in 799,055 PubMed Central open access articles across 21 out of 22 fields. There are apparently non-systematic differences between fields in the most citing sections and the extent to which citations from one section overlap with citations from another, with some degree of overlap in most cases. Thus, at a science-wide level, section headings are partly unreliable indicators of citation context, even if they are more standard within individual fields. They may still be used within fields to help identify individual highly cited articles that have had one type of impact, especially methodological (Methods) or context setting (Introduction), but expert judgement is needed to validate the results.
    • Crossing the border between postcolonial reality and the ‘outer world’: Translation and representation of the third space into a fourth space

      Fernández Ruiz, María Remedios; Corpas Pastor, Gloria; Seghiri, Míriam (Universitat Jaume I, 2019-03-31)
      Stemming from poststructuralist interpretations of space and following Bhabha’s third space enunciation, in this paper we have coined the term fourth space and used this concept as a heuristic tool to address the need to establish a coherent standpoint for the analysis of postcolonial literature reception within a society with no immediate relation to the specific decolonisation process of the author’s country. We explore this concept through the case of the Spanish reception of African postcolonial literature. In Spain, this perspective has remained under-theorised in an era when representation of hybridity is at a vital point, since such representation will provide the social scaffolding for each person’s identity construction. Under these circumstances, literature can be transformative and the role of translation as a decolonising tool can help to create unbiased knowledge through an ethical interpretation of the original texts. We will analyse how those differentiating elements affect the translational process.
    • The rhetorical structure of science? A multidisciplinary analysis of article headings

      Thelwall, Mike (Elsevier, 2019-03-19)
      An effective structure helps an article to convey its core message. The optimal structure depends on the information to be conveyed and the expectations of the audience. In the current increasingly interdisciplinary era, structural norms can be confusing to the authors, reviewers and audiences of scientific articles. Despite this, no prior study has attempted to assess variations in the structure of academic papers across all disciplines. This article reports on the headings commonly used by over 1 million research articles from the PubMed Central Open Access collection, spanning 22 broad categories covering all academia and 172 out of 176 narrow categories. The results suggest that no headings are close to ubiquitous in any broad field and that there are substantial differences in the extent to which most headings are used. In the humanities, headings may be avoided altogether. Researchers should therefore be aware of unfamiliar structures that are nevertheless legitimate when reading, writing and reviewing articles.
    • FGFR1 expression and role in migration in low and high grade pediatric gliomas

      Egbivwie, Naomi; Cockle, Julia V.; Humphries, Matthew; Ismail, Azzam; Esteves, Filomena; Taylor, Claire; Karakoula, Katherine; Morton, Ruth; Warr, Tracy; Short, Susan C.; et al. (Frontiers Media, 2019-03-13)
      The heterogeneous and invasive nature of pediatric gliomas poses significant treatment challenges, highlighting the importance of identifying novel chemotherapeutic targets. Recently, recurrent Fibroblast growth factor receptor 1 (FGFR1) mutations in pediatric gliomas have been reported. Here, we explored the clinical relevance of FGFR1 expression, cell migration in low and high grade pediatric gliomas and the role of FGFR1 in cell migration/invasion as a potential chemotherapeutic target. A high density tissue microarray (TMA) was used to investigate associations between FGFR1 and activated phosphorylated FGFR1 (pFGFR1) expression and various clinicopathologic parameters. Expression of FGFR1 and pFGFR1 were measured by immunofluorescence and by immunohistochemistry (IHC) in 3D spheroids in five rare patient-derived pediatric low-grade glioma (pLGG) and two established high-grade glioma (pHGG) cell lines. Two-dimensional (2D) and three-dimensional (3D) migration assays were performed for migration and inhibitor studies with three FGFR1 inhibitors. High FGFR1 expression was associated with age, malignancy, tumor location and tumor grade among astrocytomas. Membranous pFGFR1 was associated with malignancy and tumor grade. All glioma cell lines exhibited varying levels of FGFR1 and pFGFR1 expression and migratory phenotypes. There were significant anti-migratory effects on the pHGG cell lines with inhibitor treatment and anti-migratory or pro-migratory responses to FGFR1 inhibition in the pLGGs. Our findings support further research to target FGFR1 signaling in pediatric gliomas.
    • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

      Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
      Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
    • The way to analyse ‘way’: A case study in word-specific local grammar

      Hanks, Patrick; Može, Sara (Oxford Academic, 2019-02-11)
      Traditionally, dictionaries are meaning-driven—that is, they list different senses (or supposed senses) of each word, but do not say much about the phraseology that distinguishes one sense from another. Grammars, on the other hand, are structure-driven: they attempt to describe all possible structures of a language, but say little about meaning, phraseology, or collocation. In both disciplines during the 20th century, the practice of inventing evidence rather than discovering it led to intermittent and unpredictable distortions of fact. Since 1987, attempts have been made in both lexicography (Cobuild) and syntactic theory (pattern grammar, construction grammar) to integrate meaning and phraseology. Corpora now provide empirical evidence on a large scale for lexicosyntactic description, but there is still a long way to go. Many cherished beliefs must be abandoned before a synthesis between empirical lexical analysis and grammatical theory can be achieved. In this paper, by empirical analysis of just one word (the noun way), we show how corpus evidence can be used to tackle the complexities of lexical and constructional meaning, providing new insights into the lexis-grammar interface.
    • The influence of highly cited papers on field normalised indicators

      Thelwall, Mike (Springer, 2019-01-05)
      Field normalised average citation indicators are widely used to compare countries, universities and research groups. The most common variant, the Mean Normalised Citation Score (MNCS), is known to be sensitive to individual highly cited articles but the extent to which this is true for a log-based alternative, the Mean Normalised Log Citation Score (MNLCS), is unknown. This article investigates country-level highly cited outliers for MNLCS and MNCS for all Scopus articles from 2013 and 2012. The results show that MNLCS is influenced by outliers, as measured by kurtosis, but at a much lower level than MNCS. The largest outliers were affected by the journal classifications, with the Science-Metrix scheme producing much weaker outliers than the internal Scopus scheme. The high Scopus outliers were mainly due to uncitable articles reducing the average in some humanities categories. Although outliers have a numerically small influence on the outcome for individual countries, changing indicator or classification scheme influences the results enough to affect policy conclusions drawn from them. Future field normalised calculations should therefore explicitly address the influence of outliers in their methods and reporting.
    • New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect on the interpretation and understanding not only of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.