• Toponym detection in the bio-medical domain: A hybrid approach with deep learning

      Plum, Alistair; Ranasinghe, Tharindu; Orăsan, Constantin (RANLP, 2019-09-02)
      This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
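The gazetteer look-up stage described above can be sketched in a few lines. This is a minimal illustration only: the gazetteer, the sentence and the longest-match policy are invented for the example, not taken from the paper's system.

```python
def find_toponyms(sentence, gazetteer):
    """Return (toponym, start_index) pairs for every gazetteer entry
    found in the sentence. Longer entries are tried first and
    overlapping shorter matches are discarded, so 'New York' wins
    over 'York'."""
    hits = []
    covered = set()
    lowered = sentence.lower()
    for place in sorted(gazetteer, key=len, reverse=True):
        start = lowered.find(place.lower())
        if start == -1:
            continue
        span = set(range(start, start + len(place)))
        if not covered & span:
            hits.append((place, start))
            covered |= span
    return hits

# Illustrative usage with a toy gazetteer
gazetteer = {"New York", "York", "Lagos"}
print(find_toponyms("Cases were first reported in New York.", gazetteer))
# → [('New York', 29)]
```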
    • Sentence simplification for semantic role labelling and information extraction

      Evans, Richard; Orasan, Constantin (RANLP, 2019-09-02)
      In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.
    • Enhancing unsupervised sentence similarity methods with deep contextualised word representations

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (RANLP, 2019-09-02)
      Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares them with existing supervised and unsupervised STS methods on datasets in different languages and domains.
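As background to the comparison above, a minimal unsupervised STS baseline averages the word vectors of each sentence and scores sentence pairs by cosine similarity. The tiny embedding table below is invented for illustration; a contextualised model (ELMo, BERT, etc.) would instead produce a different vector for each occurrence of a word.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the word vectors of a sentence; tokens missing
    from the embedding table are skipped."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-d embedding table with invented values
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
print(cosine(sentence_vector(["cat"], emb), sentence_vector(["dog"], emb)))
```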
    • Do online resources give satisfactory answers to questions about meaning and phraseology?

      Hanks, Patrick; Franklin, Emma (Springer, 2019-09-18)
      In this paper we explore some aspects of the differences between printed paper dictionaries and online dictionaries in the ways in which they explain meaning and phraseology. After noting the importance of the lexicon as an inventory of linguistic items and the neglect in both linguistics and lexicography of phraseological aspects of that inventory, we investigate the treatment in online resources of phraseology – in particular, the phrasal verbs wipe out and put down – and we go on to investigate a word, dope, that has undergone some dramatic meaning changes during the 20th century. In the course of discussion, we mention the new availability of corpus evidence and the technique of Corpus Pattern Analysis, which is important for linking phraseology and meaning and distinguishing normal phraseology from rare and unusual phraseology. The online resources that we discuss include Google, the Urban Dictionary (UD), and Wiktionary.
    • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

      Ha, Le; Yaneva, Victoria; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
      Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.
    • Profiling idioms: a sociolexical approach to the study of phraseological patterns

      Moze, Sara; Mohamed, Emad (Springer, 2019-09-18)
      This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociolexical profiling will have important implications for several areas of research, including corpus lexicography, translation, creative writing, forensic linguistics, and natural language processing.
    • Intelligent text processing to help readers with autism

      Orăsan, C; Evans, R; Mitkov, R (Springer International Publishing, 2017-11-18)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment showed that texts produced with the editor are no less readable than those produced by slow, onerous unaided conversion, and are significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Análisis de necesidades documentales y terminológicas de médicos y traductores médicos como base para el diseño de un diccionario multilingüe de nueva generación

      Corpas Pastor, Gloria; Roldán Juárez, Marina (Universitat Jaume I, 2014)
      This paper proposes the design of a multilingual lexicographic resource aimed at physicians and medical translators. At present, no resource satisfies both groups equally, since their needs differ considerably. We nevertheless start from the premise that a single modular, adaptable and flexible tool could be created to meet their diverse expectations, needs and preferences. The starting point is a needs analysis following the empirical method of online data collection by means of a trilingual survey.
    • Recursos documentales para la traducción de seguros turísticos en el par de lenguas inglés-español

      Corpas Pastor, Gloria; Seghiri Domínguez, Miriam; Postigo Pinazo, Encarnación (Universidad de Málaga, 2007-04-05)
      The following pages summarise part of the research carried out within an interdisciplinary, inter-university R&D project on Translation Technologies named TURICOR (BFF2003-04616, MCYT), whose main objectives are the virtual compilation of a multilingual corpus of tourist contracts from electronic resources and the development of a multilingual natural language generation (NLG) system. The Turicor corpus thus contains various types of documents relating to tourist contracts in the four languages involved (Spanish, English, German and Italian). Specifically, the text typology that guided the selection of the documents making up the various Turicor subcorpora covers the following: tourism legislation (international, EU and national law of the countries included); general conditions, booking forms and tourist contracts.
    • Mutual terminology extraction using a statistical framework

      Ha, Le An; Mitkov, Ruslan; Pastor, Gloria Corpas (Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), 2008-06-16)
      In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to assess the proposition that automatic alignment can play an active role in bilingual terminology extraction and translate it into mutual bilingual terminology extraction. The results indicate that such models are valid and can show that mutual bilingual terminology extraction is indeed a viable approach.
    • Size Matters: A Quantitative Approach to Corpus Representativeness

      Corpas Pastor, Gloria; Seghiri Domínguez, Míriam; Rabadán, Rosa (Publicaciones Universidad de León, 2010-06-01)
      We should always bear in mind that the assumption of representativeness ‘must be regarded largely as an act of faith’ (Leech 1991: 2), as at present we have no means of ensuring it, or even evaluating it objectively (Tognini-Bonelli 2001: 57). Corpus Linguistics (CL) has not yet come of age. It does not make any difference whether we consider it a full-fledged linguistic discipline (Tognini-Bonelli 2000: 1) or else a set of analytical techniques that can be applied to any discipline (McEnery et al. 2006: 7). The truth is that CL is still striving to solve thorny, central issues such as the optimum size, balance and representativeness of corpora (of the language as a whole or of some subset of the language). Corpus-driven and corpus-based studies rely on the quality and representativeness of each corpus as their true foundation for producing valid results. This entails deciding on valid external and internal criteria for corpus design and compilation. A basic tenet is that corpus representativeness determines the kinds of research questions that can be addressed and the generalizability of the results obtained (cf. Biber et al. 1988: 246). Unfortunately, faith and beliefs do not seem to ensure quality. In this paper we will attempt to deal with these key questions. Firstly, we will give a brief description of the R&D projects which originally served as the main framework for this research. Secondly, we will focus on the complex notion of corpus representativeness and ideal size, from both a theoretical and an applied perspective. Finally, we will describe a computer application which has been developed as part of the research. This software will be used to verify whether a sample bilingual comparable corpus can be deemed representative.
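The intuition behind measuring representativeness quantitatively can be pictured as a type-growth curve: when adding further documents stops contributing new word types, the corpus is approaching saturation. The sketch below only illustrates that general idea with invented toy documents; it is not the application described in the paper.

```python
def type_growth(documents):
    """For each document added to the corpus, record how many word
    types it contributes that were not seen before. When the
    increments flatten towards zero, further documents change the
    corpus vocabulary little."""
    seen = set()
    increments = []
    for doc in documents:
        types = set(doc.lower().split())
        increments.append(len(types - seen))
        seen |= types
    return increments

# Toy corpus: the fourth document adds no new types at all
docs = ["the cat sat", "the dog sat", "the cat ran", "the dog ran"]
print(type_growth(docs))
# → [3, 1, 1, 0]
```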
    • Identification of translationese: a machine learning approach

      Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
      This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.
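A rough illustration of the kind of surface "simplification" features such a classifier can be trained on follows; the three features below are illustrative choices of this general kind, not the paper's exact feature set.

```python
def simplification_features(text):
    """Surface features associated with the Simplification
    Universal: lexical richness (type/token ratio), average
    sentence length and average word length."""
    sentences = [s for s in text.split(".") if s.strip()]
    tokens = text.lower().replace(".", " ").split()
    types = set(tokens)
    return {
        "type_token_ratio": len(types) / len(tokens),
        "avg_sentence_len": len(tokens) / len(sentences),
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
    }

# Illustrative usage on a toy text
print(simplification_features("The cat sat. The cat ran."))
```

Feature vectors of this shape, extracted from translated and non-translated texts, would then be fed to a standard classifier such as an SVM.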
    • All that Glitters is not Gold when Translating Phraseological Units

      Corpas Pastor, Gloria; Monti, Johanna; Mitkov, Ruslan; Seretan, Violeta (European Association for Machine Translation (EAMT), 2013-09-02)
      Phraseological unit is an umbrella term which covers a wide range of multi-word units (collocations, idioms, proverbs, routine formulae, etc.). Phraseological units (PUs) are pervasive in all languages and exhibit a peculiar combinatorial nature. PUs are usually frequent, cognitively salient, syntactically frozen and/or semantically opaque. Besides, their creative manipulations in discourse can be anything but predictable, straightforward or easy to process. And when it comes to translating, problems multiply exponentially. It goes without saying that cultural differences and linguistic anisomorphisms go hand in hand with issues arising from varying degrees of equivalence at the levels of system and text. No wonder PUs have been considered a pain in the neck within the NLP community. This presentation will focus on contrastive and translational features of phraseological units. It will consist of three parts. As a convenient background, the first part will contrast two similar concepts: multi-word unit (the preferred term within the NLP community) versus phraseological unit (the preferred term in phraseology). The second part will deal with phraseological systems in general, their structure and functioning. Finally, the third part will adopt a contrastive approach, with special reference to translators’ strategies, procedures and choices. For better or for worse, when it comes to rendering phraseological units, human translation and computer-assisted translation appear to share the same garden path.
    • Using semi-automatic compiled corpora for medical terminology and vocabulary building in the healthcare domain

      Gutiérrez Florido, Rut; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (Université Paris 13, 2013-10-28)
      English, Spanish and German are amongst the most widely spoken languages in Europe. It is therefore likely that patients from one EU member state seeking medical treatment in another will speak or understand one of these languages. However, there is a lack of resources for teaching efficient communication between patients and medical staff. To address this, the TELL-ME project will provide a fully targeted package, including learning materials for medical English, Spanish and German aimed at medical staff already working in other countries or undertaking cross-border mobility. The learning process will be supported by computer-aided tools based on corpora. In this workshop we therefore present the semi-automatic compilation of the TELL-ME corpus, whose function is to support the e-learning platform of the TELL-ME project, together with its self-assessment exercises, emphasising the importance of specialised terminology in the acquisition of communicative and language skills.
    • Register-Specific Collocational Constructions in English and Spanish: A Usage-Based Approach

      Pastor, Gloria Corpas (Science Publications, 2015-03-01)
      Constructions are usage-based, conventionalised pairings of form and function within a cline of complexity and schematisation. Most research within Construction Grammar has focused on the monolingual description of schematic constructions: mainly in English, but to a lesser extent in other languages as well. By contrast, very few constructional analyses have been carried out across languages. In this study we focus on a type of partially substantive construction from the point of view of contrastive analysis and translation; to the best of our knowledge, this is one of the first studies of its kind. The first half of the article lays down the theoretical foundations of the study and introduces Construction Grammar as well as other formalisms used in the literature in order to provide a construal account of collocations, a pervasive phenomenon in language. The experimental part describes the case study of V NP collocations with disease/enfermedad in comparable corpora in English and Spanish, both in the general domain and in the specialised medical domain. A comparative analysis of these constructions across domains and languages is provided in terms of token-type ratio (constructional restriction-rate), lexical function, type of determiner, frequency ranking of the verbal collocate and domain specificity of collocates, among others. New measures to assess construal bondness are put forward (lexical filledness rate and individual productivity rate), and special attention is paid to register-dependent equivalent semantic-functional counterparts in English and Spanish, as well as to mismatches.
    • BDAFRICA: diseño e implementación de una base de datos de la literatura poscolonial africana publicada en España

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Universidad de Valladolid, 2016-01-10)
      This paper shows that no repository exists covering the postcolonial African authors published to date in Spain, and therefore none that allows quantitative and qualitative studies of the impact of this literature to be carried out with the desired precision. This is a shortcoming both for academic research and for the publishing sector when analysing selection and reception trends in the market. Given this situation, the main objective of this work is to design and implement a database, built on MySQL and delimited by very specific parameters, that gathers all works by African authors published in Spanish in Spain between 1972 (the year Spain joined the ISBN system) and 2014. After establishing design criteria and specific compilation protocols, the methodology was divided into four phases: data collection, storage, processing and dissemination. The BDÁFRICA database thus achieves a twofold objective: on the one hand, it provides researchers with reliable data on which to base their studies and, on the other, it makes it possible for the first time to offer statistical data on the evolution of the publication of works by African authors in Spain over the last 42 years.
    • Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

      Costa, Hernani; Muñoz, Isabel Dúran; Pastor, Gloria Corpas; Mitkov, Ruslan (Universidade de Vigo & Universidade do Minho, 2016-07-22)
      Decisions taken prior to the compilation of a comparable corpus have a great impact on how the corpus is subsequently built and analysed. Several external variables and criteria are usually followed in corpus construction, but little research has been done on the internal distribution of textual similarity within a corpus or on its qualitative advantages for research. In an attempt to fill this gap, this article presents a simple yet efficient methodology capable of measuring the internal degree of similarity of a corpus. To this end, the proposed methodology uses several natural language processing techniques and various statistical methods in a successful attempt to evaluate the degree of similarity between documents. Our results show that a list of common entities and a set of distributional similarity measures are sufficient not only to describe and assess the degree of similarity between the documents in a comparable corpus, but also to rank them according to their degree of similarity and, consequently, to improve the quality of the corpus by eliminating irrelevant documents.
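One simple distributional measure of the kind this abstract describes is cosine similarity over term-frequency vectors, scoring each document against the rest of the corpus so that outliers can be flagged for removal. This is a sketch of the general idea under assumptions; the actual methodology combines several NLP techniques and statistical measures.

```python
from collections import Counter
import math

def tf_cosine(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two
    documents (whitespace tokenisation, case-folded)."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_similarity(docs):
    """Score each document by its mean similarity to the rest of the
    corpus; low scorers are candidates for elimination."""
    scores = []
    for i, d in enumerate(docs):
        others = [tf_cosine(d, o) for j, o in enumerate(docs) if j != i]
        scores.append(sum(others) / len(others))
    return scores

# Toy corpus: the third document is the obvious outlier
docs = ["corpus linguistics methods", "corpus linguistics tools", "recipe for pancakes"]
print(rank_by_similarity(docs))
```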
    • Translating English verbal collocations into Spanish: On distribution and other relevant differences related to diatopic variation

      Corpas Pastor, Gloria (John Benjamins Publishing Company, 2015-12-21)
      Language varieties should be taken into account in order to enhance fluency and naturalness of translated texts. In this paper we will examine the collocational verbal range for prima-facie translation equivalents of words like decision and dilemma, which in both languages denote the act or process of reaching a resolution after consideration, resolving a question or deciding something. We will be mainly concerned with diatopic variation in Spanish. To this end, we set out to develop a giga-token corpus-based protocol which includes a detailed and reproducible methodology sufficient to detect collocational peculiarities of transnational languages. To our knowledge, this is one of the first observational studies of this kind. The paper is organised as follows. Section 1 introduces some basic issues about the translation of collocations against the background of languages’ anisomorphism. Section 2 provides a feature characterisation of collocations. Section 3 deals with the choice of corpora, corpus tools, nodes and patterns. Section 4 covers the automatic retrieval of the selected verb + noun (object) collocations in general Spanish and the co-existing national varieties. Special attention is paid to comparative results in terms of similarities and mismatches. Section 5 presents conclusions and outlines avenues of further research.
    • Recepción en España de la literatura africana en lengua inglesa: generación de datos estadísticos con la base de datos bibliográfica especializada BDÁFRICA

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Fundacio per la Universitat Oberta de Catalunya, 2018-11-20)
      This article examines the reception in Spain of African literature written in English on the basis of BDÁFRICA, a bibliographic database that gathers works by authors born in Africa and published in Spanish in Spain between 1972 and 2014. It offers a critical reflection on the difficulties of defining African literature as an object of study, given its complexity and heterogeneity. It also proposes a concise historiographical overview of how the canon of this literature has been shaped from the West. Likewise, it demonstrates the lack of statistical studies on the reception of African literature in English in Spain. In response to this need, the aim of the article is to detail and analyse the previously unpublished statistical data provided by the database, adopting a descriptive methodology. The results of this study, which contributes reliable and novel quantitative and qualitative data, are original insofar as they reflect and highlight the problems of translating African literature in English in Spain. BDÁFRICA, which is free and available online, aims to be a resource that stimulates the development of research on postcolonial literature in Spain. This specialised bibliographic database is undoubtedly a very valuable tool, especially for researchers, translators and publishers interested in African literature.
    • El EEES y la competencia tecnológica: los nuevos grados en Traducción

      Corpas Pastor, Gloria; Muñoz, María (Universidad de Las Palmas de Gran Canaria, Servicio de Publicaciones y Difusión Científica, 2015-04-23)
      This paper takes as its starting point the research described in Muñoz Ramos (2012). It offers a brief overview of the origin and evolution of the European Higher Education Area (EHEA) up to the present day and its impact on Translation studies. We give an account of the interweaving of the founding principles of the Bologna Process and Information and Communication Technologies (ICT), which stand out as the ideal companions for achieving the objectives of the Bologna Declaration. Finally, we show how these two strands converge in the new Spanish undergraduate degrees in Translation, which conform to the EHEA and find in translation technology modules the cornerstone of their raison d'être.