Recent Submissions

  • RGCL at GermEval 2019: offensive language detection with deep learning

    Plum, Alistair; Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (German Society for Computational Linguistics & Language Technology, 2019-10-08)
    This paper describes the system submitted by the RGCL team to GermEval 2019 Shared Task 2: Identification of Offensive Language. We experimented with five different neural network architectures in order to classify tweets in terms of offensive language. By means of comparative evaluation, we select the best-performing architecture for each of the three subtasks. Overall, we demonstrate that, using only minimal preprocessing, we are able to obtain competitive results.
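
    The abstract does not spell out the preprocessing used; the following is a minimal sketch of the kind of light tweet normalisation it alludes to, where the specific masking rules are assumptions for illustration:

    ```python
    import re

    def preprocess_tweet(text: str) -> str:
        """Light normalisation; the specific rules are illustrative assumptions."""
        text = text.lower()
        text = re.sub(r"https?://\S+", "<url>", text)  # mask links
        text = re.sub(r"@\w+", "<user>", text)         # mask user mentions
        text = re.sub(r"#(\w+)", r"\1", text)          # strip hashtag markers
        return re.sub(r"\s+", " ", text).strip()

    print(preprocess_tweet("@user1 Das ist #unglaublich https://t.co/xyz"))
    ```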
  • RGCL at IDAT: deep learning models for irony detection in Arabic language

    Ranasinghe, Tharindu; Saadany, Hadeel; Plum, Alistair; Mandhari, Salim; Mohamed, Emad; Orasan, Constantin; Mitkov, Ruslan (IDAT, 2019-12-12)
    This article describes the system submitted by the RGCL team to the IDAT 2019 Shared Task: Irony Detection in Arabic Tweets. The system detects irony in Arabic tweets using deep learning. The paper evaluates the performance of several deep learning models, as well as how text cleaning and text pre-processing influence the accuracy of the system. Several runs were submitted. The highest F1 score achieved by one of the submissions was 0.818, ranking team RGCL 4th out of 10 teams in the final results. Overall, we present a system that uses minimal pre-processing but is capable of achieving competitive results.
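
    A hedged sketch of how submitted runs might be compared by F1, as in the reported 0.818 best score; the gold labels and predictions below are invented toy data, not the shared-task runs:

    ```python
    from sklearn.metrics import f1_score

    gold = [1, 0, 1, 1, 0, 1]  # toy gold irony labels
    runs = {"run_a": [1, 0, 0, 1, 0, 1],
            "run_b": [1, 1, 1, 1, 0, 0]}
    for name, pred in runs.items():
        print(name, round(f1_score(gold, pred), 3))
    ```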
  • Large-scale data harvesting for biographical data

    Plum, Alistair; Zampieri, Marcos; Orasan, Constantin; Wandl-Vogt, Eveline; Mitkov, Ruslan (CEUR, 2019-09-05)
    This paper explores automatic methods to identify relevant biography candidates in large databases, and extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though many consider them an identical resource, our results show that the data from Wikipedia and Wikidata differ in some cases, and that the two can be used in a complementary way, providing more data for the compilation of biographies.
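
    As an illustration of the kind of harvesting involved, here is a hedged sketch querying the public Wikidata SPARQL endpoint for people with Austrian citizenship who died between 1951 and 2019; the properties P27 (citizenship) and P570 (date of death) are real Wikidata IDs, but the query is illustrative, not the authors' actual pipeline:

    ```python
    import requests

    QUERY = """
    SELECT ?person ?personLabel ?death WHERE {
      ?person wdt:P27 wd:Q40 ;
              wdt:P570 ?death .
      FILTER(YEAR(?death) >= 1951 && YEAR(?death) <= 2019)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "biography-harvest-sketch/0.1"},  # polite UA
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["personLabel"]["value"], row["death"]["value"])
    ```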
  • Automatic question answering for medical MCQs: Can it go further than information retrieval?

    Ha, Le An; Yaneva, Viktoriya (RANLP, 2019-09-04)
    We present a novel approach to automatic question answering that does not depend on the performance of an information retrieval (IR) system and does not require training data. We evaluate the system performance on a challenging set of university-level medical science multiple-choice questions. Best performance is achieved when combining a neural approach with an IR approach, both of which work independently. Unlike previous approaches, the system achieves statistically significant improvement over the random guess baseline even for questions that are labeled as challenging based on the performance of baseline solvers.
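
    A minimal sketch of the score combination the abstract describes, assuming a simple linear interpolation of the two independent solvers' per-option scores; the weights and scores below are invented for illustration:

    ```python
    def pick_answer(options, ir_score, neural_score, alpha=0.5):
        # Linear combination of the IR solver's and neural solver's scores.
        return max(options,
                   key=lambda o: alpha * ir_score[o] + (1 - alpha) * neural_score[o])

    opts = ["A", "B", "C", "D"]
    print(pick_answer(opts,
                      {"A": .2, "B": .7, "C": .4, "D": .1},   # toy IR scores
                      {"A": .3, "B": .5, "C": .9, "D": .2}))  # toy neural scores
    ```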
  • Automatic summarisation: 25 years on

    Orăsan, Constantin (Cambridge University Press (CUP), 2019-09-19)
    Automatic text summarisation is a topic that has been receiving attention from the research community from the early days of computational linguistics, but it really took off around 25 years ago. This article presents the main developments from the last 25 years. It starts by defining what a summary is and how its definition changed over time as a result of the interest in processing new types of documents. The article continues with a brief history of the field and highlights the main challenges posed by the evaluation of summaries. The article finishes with some thoughts about the future of the field.
  • A survey of the perceived text adaptation needs of adults with autism

    Yaneva, Viktoriya; Orasan, Constantin; Ha, Le An; Ponomareva, Natalia (RANLP, 2019-09-02)
    NLP approaches to automatic text adaptation often rely on user-need guidelines which are generic and do not account for the differences between various types of target groups. One such group is adults with high-functioning autism, who are usually able to read long sentences and comprehend difficult words, but whose comprehension may be impeded by other linguistic constructions. This is especially challenging for real-world user-generated texts such as product reviews, which cannot be controlled editorially and are thus in stronger need of automatic adaptation. To address this problem, we present a mixed-methods survey conducted with 24 adult web users diagnosed with autism and an age-matched control group of 33 neurotypical participants. The aim of the survey is to identify whether the group with autism experiences any barriers when reading online reviews, what these potential barriers are, and what NLP methods would be best suited to improve the accessibility of online reviews for people with autism. The group with autism consistently reported significantly greater difficulties with understanding online product reviews compared to the control group and identified issues related to text length, poor topic organisation, identifying the intention of the author, trustworthiness, and the use of irony, sarcasm and exaggeration.
  • Semantic textual similarity with siamese neural networks

    Orasan, Constantin; Mitkov, Ruslan; Ranasinghe, Tharindu (RANLP, 2019-09-02)
    Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural networks, which are used here to measure STS. Several variants of the architecture are compared with existing methods.
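
    A minimal PyTorch sketch of a Siamese recurrent encoder for STS; the shared LSTM and the exp(-L1) similarity follow the classic Siamese recipe and are assumptions here, not the authors' exact configuration:

    ```python
    import torch
    import torch.nn as nn

    class SiameseLSTM(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=100, hidden=50):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

        def encode(self, x):
            _, (h, _) = self.lstm(self.emb(x))  # shared weights for both inputs
            return h[-1]

        def forward(self, a, b):
            # exp(-L1 distance) maps encoding distance to a similarity in (0, 1]
            return torch.exp(-torch.norm(self.encode(a) - self.encode(b), p=1, dim=1))

    model = SiameseLSTM()
    a = torch.randint(0, 10000, (2, 7))  # two toy token-id sequences
    b = torch.randint(0, 10000, (2, 7))
    print(model(a, b))
    ```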
  • Toponym detection in the bio-medical domain: A hybrid approach with deep learning

    Plum, Alistair; Ranasinghe, Tharindu; Orăsan, Constantin (RANLP, 2019-09-02)
    This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
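
    A hedged sketch of the second stage described above, locating the character offsets of gazetteer entries within a sentence; the gazetteer and sentence are toy examples, not SemEval-2019 Task 12 data:

    ```python
    GAZETTEER = ["Geneva", "New York", "Wuhan"]  # toy gazetteer

    def find_toponyms(sentence: str):
        """Return (toponym, start, end) for each gazetteer match in the sentence."""
        hits = []
        for place in GAZETTEER:
            start = sentence.find(place)
            if start != -1:
                hits.append((place, start, start + len(place)))
        return hits

    print(find_toponyms("Samples were collected in Wuhan and shipped to Geneva."))
    ```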
  • Sentence simplification for semantic role labelling and information extraction

    Evans, Richard; Orasan, Constantin (RANLP, 2019-12-31)
    In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.
  • Enhancing unsupervised sentence similarity methods with deep contextualised word representations

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (RANLP, 2019-09-02)
    Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares them with existing supervised and unsupervised STS methods on different datasets in different languages and domains.
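
    A hedged sketch of one unsupervised STS method of this kind: mean-pool token vectors from a pretrained contextual encoder and compare sentences by cosine similarity. The model choice and pooling strategy are assumptions, not the paper's exact setup:

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentence: str) -> torch.Tensor:
        batch = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = enc(**batch).last_hidden_state  # (1, tokens, dim)
        return hidden.mean(dim=1).squeeze(0)         # mean-pool over tokens

    a = embed("A man is playing a guitar.")
    b = embed("Someone plays an instrument.")
    print(torch.cosine_similarity(a, b, dim=0).item())
    ```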
  • Do online resources give satisfactory answers to questions about meaning and phraseology?

    Hanks, Patrick; Franklin, Emma (Springer, 2019-09-18)
    In this paper we explore some aspects of the differences between printed paper dictionaries and online dictionaries in the ways in which they explain meaning and phraseology. After noting the importance of the lexicon as an inventory of linguistic items and the neglect in both linguistics and lexicography of phraseological aspects of that inventory, we investigate the treatment in online resources of phraseology – in particular, the phrasal verbs wipe out and put down – and we go on to investigate a word, dope, that has undergone some dramatic meaning changes during the 20th century. In the course of discussion, we mention the new availability of corpus evidence and the technique of Corpus Pattern Analysis, which is important for linking phraseology and meaning and distinguishing normal phraseology from rare and unusual phraseology. The online resources that we discuss include Google, the Urban Dictionary (UD), and Wiktionary.
  • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

    Ha, Le An; Yaneva, Viktoriya; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
    Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.
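
    A hedged sketch of the general recipe: represent each item by features and fit a regressor to observed difficulty. The two surface features and all data below are invented toy stand-ins, not the paper's linguistic feature set or exam items:

    ```python
    from sklearn.ensemble import RandomForestRegressor

    def features(stem: str):
        words = stem.split()
        # Toy surface features: length and mean word length.
        return [len(words), sum(len(w) for w in words) / len(words)]

    stems = ["A 45-year-old man presents with chest pain and dyspnoea.",
             "Which enzyme is deficient in phenylketonuria?"]
    difficulty = [0.62, 0.35]  # invented item-difficulty targets

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit([features(s) for s in stems], difficulty)
    print(model.predict([features("A 3-year-old girl presents with fever.")]))
    ```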
  • Profiling idioms: a sociolexical approach to the study of phraseological patterns

    Moze, Sara; Mohamed, Emad (Springer, 2019-12-31)
    This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociolexical profiling will have important implications for several areas of research, including corpus lexicography, translation, creative writing, forensic linguistics, and natural language processing.
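
    A minimal sketch of one statistic a gender profile could rest on: a chi-square test of idiom frequency by author gender. The contingency counts are invented toy data, not BAC figures:

    ```python
    from scipy.stats import chi2_contingency

    #         used idiom, did not use idiom
    table = [[120, 9880],   # female-authored posts (toy counts)
             [ 80, 9920]]   # male-authored posts (toy counts)
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p:.4f}")
    ```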
  • Análisis de necesidades documentales y terminológicas de médicos y traductores médicos como base para el diseño de un diccionario multilingüe de nueva generación

    Corpas Pastor, Gloria; Roldán Juárez, Marina (Universitat Jaume I, 2014)
    This paper proposes the design of a multilingual lexicographic resource aimed at physicians and medical translators. At present, no resource satisfies both groups equally, because their needs differ considerably. However, we start from the premise that a single modular, adaptable and flexible tool could be created to meet their diverse expectations, needs and preferences. The starting point is a needs analysis following the empirical method of online data collection by means of a trilingual survey.
  • Recursos documentales para la traducción de seguros turísticos en el par de lenguas inglés-español

    Corpas Pastor, Gloria; Seghiri Domínguez, Miriam; Postigo Pinazo, Encarnación (Universidad de Málaga, 2007-04-05)
    The following pages summarise part of the research carried out within an interdisciplinary, inter-university R&D project on Translation Technologies named TURICOR (BFF2003-04616, MCYT), whose main objectives are the virtual compilation of a multilingual corpus of tourist contracts from electronic resources and the development of a multilingual natural language generation (NLG) system. The Turicor corpus thus contains various types of documents relating to tourist contracts in the four languages involved (Spanish, English, German and Italian). Specifically, the text typology underpinning the selection of the documents that make up the various Turicor subcorpora covers the following: tourism legislation (international, EU and national legislation of the respective countries included); general conditions; forms; and tourist contracts.
  • Mutual terminology extraction using a statistical framework

    Ha, Le An; Mitkov, Ruslan; Corpas Pastor, Gloria (Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), 2008-06-16)
    In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to assess the proposition that automatic alignment can play an active role in bilingual terminology extraction, turning it into mutual bilingual terminology extraction. The results indicate that such models are valid and show that mutual bilingual terminology extraction is indeed a viable approach.
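
    A hedged sketch of the underlying intuition (not the paper's probabilistic models): score a candidate bilingual term pair by its co-occurrence across aligned sentence pairs, here with the Dice coefficient over a toy aligned corpus:

    ```python
    def dice(term_src, term_tgt, aligned_pairs):
        """Dice coefficient of two terms over indices of aligned sentence pairs."""
        src_hits = {i for i, (s, _) in enumerate(aligned_pairs) if term_src in s}
        tgt_hits = {i for i, (_, t) in enumerate(aligned_pairs) if term_tgt in t}
        both = len(src_hits & tgt_hits)
        return 2 * both / ((len(src_hits) + len(tgt_hits)) or 1)

    corpus = [("the travel insurance covers cancellation",
               "el seguro de viaje cubre la cancelación"),
              ("insurance premiums are rising",
               "las primas de seguro están subiendo")]
    print(dice("insurance", "seguro", corpus))
    ```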
  • Size Matters: A Quantitative Approach to Corpus Representativeness

    Corpas Pastor, Gloria; Seghiri Domínguez, Míriam; Rabadán, Rosa (Publicaciones Universidad de León, 2010-06-01)
    We should always bear in mind that the assumption of representativeness ‘must be regarded largely as an act of faith’ (Leech 1991: 2), as at present we have no means of ensuring it, or even evaluating it objectively. (Tognini-Bonelli 2001: 57) Corpus Linguistics (CL) has not yet come of age. It does not make any difference whether we consider it a full-fledged linguistic discipline (Tognini-Bonelli 2000: 1) or, else, a set of analytical techniques that can be applied to any discipline (McEnery et al. 2006: 7). The truth is that CL is still striving to solve thorny, central issues such as optimum size, balance and representativeness of corpora (of the language as a whole or of some subset of the language). Corpus-driven/based studies rely on the quality and representativeness of each corpus as their true foundation for producing valid results. This entails deciding on valid external and internal criteria for corpus design and compilation. A basic tenet is that corpus representativeness determines the kinds of research questions that can be addressed and the generalizability of the results obtained (cf. Biber et al. 1988: 246). Unfortunately, faith and beliefs do not seem to ensure quality. In this paper we will attempt to deal with these key questions. Firstly, we will give a brief description of the R&D projects which originally have served as the main framework for this research. Secondly, we will focus on the complex notion of corpus representativeness and ideal size, from both a theoretical and an applied perspective. Finally, we will describe a computer application which has been developed as part of the research. This software will be used to verify whether a sample bilingual comparable corpus could be deemed representative.
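
    A minimal sketch of one way such software can operationalise representativeness, assuming a type-growth criterion: track how many new word types each added document contributes; a flattening curve suggests diminishing lexical novelty. The documents are toy strings, not the bilingual corpus discussed:

    ```python
    def type_growth(documents):
        """Cumulative vocabulary size as documents are added to the corpus."""
        seen, growth = set(), []
        for doc in documents:
            seen.update(doc.lower().split())
            growth.append(len(seen))
        return growth

    docs = ["the insurance contract", "the contract covers travel",
            "travel insurance terms", "the terms of the contract"]
    print(type_growth(docs))  # [3, 5, 6, 7]: the slope shrinks as the corpus grows
    ```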
  • All that Glitters is not Gold when Translating Phraseological Units

    Corpas Pastor, Gloria; Monti, Johanna; Mitkov, Ruslan; Seretan, Violeta (European Association for Machine Translation (EAMT), 2013-09-02)
    Phraseological unit is an umbrella term which covers a wide range of multi-word units (collocations, idioms, proverbs, routine formulae, etc.). Phraseological units (PUs) are pervasive in all languages and exhibit a peculiar combinatorial nature. PUs are usually frequent, cognitively salient, syntactically frozen and/or semantically opaque. Besides, their creative manipulations in discourse can be anything but predictable, straightforward or easy to process. And when it comes to translating, problems multiply exponentially. It goes without saying that cultural differences and linguistic anisomorphisms go hand in hand with issues arising from varying degrees of equivalence at the levels of system and text. No wonder PUs have been considered a pain in the neck within the NLP community. This presentation will focus on contrastive and translational features of phraseological units. It will consist of three parts. As a convenient background, the first part will contrast two similar concepts: multi-word unit (the preferred term within the NLP community) versus phraseological unit (the preferred term in phraseology). The second part will deal with phraseological systems in general, their structure and functioning. Finally, the third part will adopt a contrastive approach, with especial reference to translators’ strategies, procedures and choices. For good or for bad, when it comes to rendering phraseological units, human translation and computer-assisted translation appear to share the same garden path.
  • Register-Specific Collocational Constructions in English and Spanish: A Usage-Based Approach

    Corpas Pastor, Gloria (Science Publications, 2015-03-01)
    Constructions are usage-based, conventionalised pairings of form and function within a cline of complexity and schematisation. Most research within Construction Grammar has focused on the monolingual description of schematic constructions: mainly in English, but to a lesser extent in other languages as well. By contrast, very few constructional analyses have been carried out across languages. In this study we focus on a type of partially substantive construction from the point of view of contrastive analysis and translation; to the best of our knowledge, this is one of the first studies of its kind. The first half of the article lays down the theoretical foundations of the study and introduces Construction Grammar as well as other formalisms used in the literature in order to provide a construal account of collocations, a pervasive phenomenon in language. The experimental part describes the case study of V NP collocations with disease/enfermedad in comparable corpora in English and Spanish, both in the general domain and in the specialised medical domain. A comparative analysis of these constructions across domains and languages is provided in terms of token-type ratio (constructional restriction-rate), lexical function, type of determiner, frequency ranking of the verbal collocate and domain specificity of collocates, among others. New measures to assess construal bondness are put forward (lexical filledness rate and individual productivity rate), and special attention is paid to register-dependent equivalent semantic-functional counterparts in English and Spanish, as well as mismatches.
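
    A minimal sketch of the token-type ratio measure mentioned above, computed over the verbal collocates of a noun; the collocate list is a toy example, not the study's corpus data:

    ```python
    # Toy verbal collocates of "disease" observed in V NP slots.
    collocates = ["contract", "develop", "have", "have", "contract",
                  "transmit", "have", "prevent", "contract"]
    ttr = len(set(collocates)) / len(collocates)
    print(f"types={len(set(collocates))}, tokens={len(collocates)}, TTR={ttr:.2f}")
    ```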
  • Identification of translationese: a machine learning approach

    Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
    This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach a success rate of up to 97.62% on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improvement in accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. These findings may therefore be considered an argument for the existence of the Simplification Universal.
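
    A hedged sketch of the classification setup, assuming a TF-IDF representation rather than the paper's simplification features; the documents and labels are invented toy data:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["he said that the meeting would be held on monday",
            "it was decided that the meeting shall take place on monday",
            "she grabbed her coat and ran for the bus",
            "it is the case that she took her coat and went to the bus"]
    labels = [0, 1, 0, 1]  # 1 = translated text (invented examples)

    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["the committee has presented the report"]))
    ```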
