    • Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh; Markantonatou, Stella; Ramisch, Carlos; Savary, Agata; Vincze, Veronika (Language Science Press, 2018-10-25)
    • Identification of translationese: a machine learning approach

      Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
      This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.
    • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

      Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018-10-31)
      This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different components reveals acceptable performance in rewriting sentences containing compound clauses but less accuracy when rewriting sentences containing nominally bound relative clauses. A detailed error analysis revealed that the major sources of error include inaccurate sign tagging, the relatively limited coverage of the rules used to rewrite sentences, and an inability to discriminate between various subtypes of clause coordination. Despite this, the system performed well in comparison with two baselines. This finding was reinforced by automatic estimations of the readability of system output and by surveys of readers’ opinions about the accuracy, accessibility, and meaning of this output.
    • Improving translation memory matching and retrieval using paraphrases

      Gupta, Rohit; Orasan, Constantin; Zampieri, Marcos; Vela, Mihaela; van Genabith, Josef; Mitkov, Ruslan (Springer Nature, 2016-11-02)
      Most current Translation Memory (TM) systems work at the string level (character or word level) and lack semantic knowledge while matching. They use simple edit distance calculated on the surface form or some variation of it (stem, lemma), which does not take any semantic aspects into consideration when matching. This paper presents a novel and efficient approach to incorporating semantic information, in the form of paraphrasing, into the edit-distance metric. The approach computes edit distance while efficiently considering paraphrases using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics such as BLEU and METEOR, we carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER and HMETEOR, and ran three rounds of subjective evaluations. Our results show that paraphrasing substantially improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
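      The idea of letting paraphrases influence edit distance can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a hypothetical table of single-word paraphrase pairs and treats a paraphrase substitution as free, whereas the abstract describes handling paraphrases with dynamic programming plus a greedy approximation. The function names and the normalisation into a match score are illustrative assumptions.

```python
def paraphrase_edit_distance(source, candidate, paraphrases):
    """Word-level Levenshtein distance in which paraphrase pairs substitute at no cost.

    `paraphrases` is a hypothetical set of (word, word) tuples; it is an
    assumption of this sketch, not a resource described in the abstract.
    """
    src, cand = source.split(), candidate.split()
    # prev[j] is the distance between the processed prefix of src and cand[:j].
    prev = list(range(len(cand) + 1))
    for i, s_word in enumerate(src, start=1):
        curr = [i]
        for j, c_word in enumerate(cand, start=1):
            equivalent = (
                s_word == c_word
                or (s_word, c_word) in paraphrases
                or (c_word, s_word) in paraphrases
            )
            sub_cost = 0 if equivalent else 1
            curr.append(min(prev[j] + 1,              # delete s_word
                            curr[j - 1] + 1,          # insert c_word
                            prev[j - 1] + sub_cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(source, candidate, paraphrases):
    """Normalise the distance into a 0..1 similarity (1.0 means identical)."""
    longest = max(len(source.split()), len(candidate.split()))
    if longest == 0:
        return 1.0
    return 1 - paraphrase_edit_distance(source, candidate, paraphrases) / longest


table = {("car", "automobile")}
print(fuzzy_match_score("the car is fast", "the automobile is quick", table))
```

      With the one-entry table above, only "fast" versus "quick" counts as an edit, so the segment scores higher than it would under plain word-level edit distance.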
    • Incorporating word embeddings in unsupervised morphological segmentation

      Üstün, Ahmet; Can, Burcu (Cambridge University Press (CUP), 2020-07-10)
      We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP), incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training purposes. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
    • Incremental adaptation using translation informations and post-editing analysis

      Blain, Frederic; Schwenk, Holger; Senellart, Jean (IWSLT, 2012-12-06)
      It is well known that statistical machine translation systems perform best when they are adapted to the task. In this paper we propose new methods to quickly perform incremental adaptation without the need to obtain word-by-word alignments from GIZA or similar tools. The main idea is to use an automatic translation as a pivot to infer alignments between the source sentence and the reference translation, or user correction. We compared our approach to the standard method of performing incremental re-training. We achieve similar BLEU scores using fewer computational resources. Fast retraining is particularly interesting when we want to integrate user feedback almost instantly, for instance in a post-editing context or in a machine-translation-assisted CAT tool. We also explore several methods to combine the translation models.
    • Inteliterm: in search of efficient terminology lookup tools for translators

      Corpas Pastor, G.; Durán-Muñoz, Isabel; Domínguez Vázquez, María José; Mirazo Balsa, Mónica; Valcárcel Riveiro, Carlos (De Gruyter, 2019-12-16)
    • Intelligent Natural Language Processing: Trends and Applications

      Orăsan, Constantin; Evans, Richard; Mitkov, Ruslan (Springer, 2017)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Intelligent text processing to help readers with autism

      Orăsan, C; Evans, R; Mitkov, R (Springer International Publishing, 2017-11-18)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Interpreting correlations between citation counts and other indicators

      Thelwall, Mike (Springer, 2016-05-09)
      Altmetrics or other indicators for the impact of academic outputs are often correlated with citation counts in order to help assess their value. Nevertheless, there are no guidelines about how to assess the strengths of the correlations found. This is a problem because the strength of a correlation affects the conclusions that should be drawn. In response, this article uses experimental simulations to assess the correlation strengths to be expected under a range of conditions. The results show that the correlation strength reflects not only the underlying degree of association but also the average magnitude of the numbers involved. Overall, the results suggest that, due to the number of assumptions that must be made in practice, it will rarely be possible to make a realistic interpretation of the strength of a correlation coefficient.
    • Interpreting social science link analysis research: A theoretical framework

      Thelwall, Mike (Wiley, 2006)
      Link analysis in various forms is now an established technique in many different subjects, reflecting the perceived importance of links and of the Web. A critical but very difficult issue is how to interpret the results of social science link analyses. It is argued that the dynamic nature of the Web, its lack of quality control, and the online proliferation of copying and imitation mean that methodologies operating within a highly positivist, quantitative framework are ineffective. Conversely, the sheer variety of the Web makes the application of qualitative methodologies and pure reason very problematic for large-scale studies. Methodology triangulation is consequently advocated, in combination with a warning that the Web is incapable of giving definitive answers to large-scale link analysis research questions concerning social factors underlying link creation. Finally, it is claimed that although theoretical frameworks are appropriate for guiding research, a Theory of Link Analysis is not possible.
    • Introduction

      Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
    • Is Medical Research Informing Professional Practice More Highly Cited? Evidence from AHFS DI Essentials in Drugs.com

      Thelwall, Mike; Kousha, Kayvan; Abdoli, Mahshid (Springer, 2017-02-21)
      Citation-based indicators are often used to help evaluate the impact of published medical studies, even though the research has the ultimate goal of improving human wellbeing. One direct way of influencing health outcomes is by guiding physicians and other medical professionals about which drugs to prescribe. A high profile source of this guidance is the AHFS DI Essentials product of the American Society of Health-System Pharmacists, which gives systematic information for drug prescribers. AHFS DI Essentials documents, which are also indexed by Drugs.com, include references to academic studies and the referenced work is therefore helping patients by guiding drug prescribing. This article extracts AHFS DI Essentials documents from Drugs.com and assesses whether articles referenced in these information sheets have their value recognised by higher Scopus citation counts. A comparison of mean log-transformed citation counts between articles that are and are not referenced in AHFS DI Essentials shows that AHFS DI Essentials references are more highly cited than average for the publishing journal. This suggests that medical research influencing drug prescribing is more cited than average.
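      The comparison of mean log-transformed citation counts described above can be sketched as follows. The abstract does not state the exact transformation, so the use of ln(1 + c), which keeps uncited articles defined, is an assumption of this sketch. The function name and the citation counts are made up for illustration and are not data from the study.

```python
import math


def mean_log_citations(counts):
    """Arithmetic mean of ln(1 + c) over a list of citation counts.

    The ln(1 + c) form is an assumption; the abstract only says the counts
    are log-transformed.
    """
    if not counts:
        raise ValueError("need at least one citation count")
    return sum(math.log1p(c) for c in counts) / len(counts)


# Made-up illustrative counts, not data from the study.
referenced = [12, 30, 7, 55]     # articles referenced in a guidance document
journal_others = [3, 0, 9, 14]   # other articles from the same journal

gap = mean_log_citations(referenced) - mean_log_citations(journal_others)
print(f"difference in mean log citations: {gap:.3f}")
```

      Averaging on the log scale limits the influence of a few very highly cited articles, which is why a transformation of this kind is commonly applied to skewed citation data before means are compared.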
    • “Keep it simple!”: an eye-tracking study for exploring complexity and distinguishability of web pages for people with autism

      Eraslan, Sukru; Yesilada, Yeliz; Yaneva, Victoria; Ha, Le An (Springer Science and Business Media LLC, 2020-02-03)
      A major limitation of the internationally well-known standard web accessibility guidelines for people with cognitive disabilities is that they have not been empirically evaluated with relevant user groups. Instead, they aim to anticipate issues that may arise following the diagnostic criteria. In this paper, we address this problem by empirically evaluating two of the most popular guidelines related to the visual complexity of web pages and the distinguishability of web-page elements. We conducted a comparative eye-tracking study with 19 verbal and highly independent people with autism and 19 neurotypical people on eight web pages with varying levels of visual complexity and distinguishability, with synthesis and browsing tasks. Our results show that people with autism have a higher number of fixations and make more transitions with synthesis tasks. When we consider the number of elements which are not related to given tasks, our analysis shows that they look at more irrelevant elements while completing the synthesis task on visually complex pages or on pages whose elements are not easily distinguishable. To the best of our knowledge, this is the first empirical behavioural study which evaluates these guidelines by showing that the high visual complexity of pages or the low distinguishability of page elements causes a non-equivalent experience for people with autism.
    • Knowledge distillation for quality estimation

      Gajbhiye, Amit; Fomicheva, Marina; Alva-Manchego, Fernando; Blain, Frederic; Obamuyide, Abiola; Aletras, Nikolaos; Specia, Lucia (Association for Computational Linguistics, 2021-08-01)
      Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
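      The core idea of transferring knowledge from a teacher to a smaller student can be shown with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the `teacher` function stands in for a large QE model, the single numeric feature stands in for a sentence pair, and the student is a one-feature linear model fitted by least squares. The paper's actual teacher, student architecture and data augmentation are not reproduced here.

```python
def fit_student(features, teacher_scores):
    """Least-squares fit of score = slope * feature + intercept to teacher outputs."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(teacher_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(features, teacher_scores))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def teacher(feature):
    """Hypothetical stand-in for a large QE model's quality prediction."""
    return 0.5 * feature + 1.0


# Unlabelled inputs: no human quality scores are needed, because the
# student is trained on the teacher's predictions instead.
unlabelled = [0.0, 1.0, 2.0, 3.0, 4.0]
pseudo_labels = [teacher(x) for x in unlabelled]

slope, intercept = fit_student(unlabelled, pseudo_labels)
print(slope, intercept)
```

      Because the student only needs teacher predictions as targets, additional unlabelled or augmented inputs can be scored by the teacher and added to the student's training set at no annotation cost.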
    • La tecnología habla-texto como herramienta de documentación para intérpretes: Nuevo método para compilar un corpus ad hoc y extraer terminología a partir de discursos orales en vídeo

      Gaber, Mahmoud; Corpas Pastor, Gloria; Omer, Ahmed (Malaga University, 2020-12-22)
      Although interpreting has not yet benefited from technology as much as its sister field, translation, interest in developing tailor-made solutions for interpreters has risen sharply in recent years. In particular, Automatic Speech Recognition (ASR) is being used as a central component of Computer-Assisted Interpreting (CAI) tools, either bundled or standalone. This study pursues three main aims: (i) to establish the most suitable ASR application for building ad hoc corpora by comparing several ASR tools and assessing their performance; (ii) to use ASR in order to extract terminology from the transcriptions obtained from video-recorded speeches, in this case talks on climate change and adaptation; and (iii) to promote the adoption of ASR as a new documentation tool among interpreters. To the best of our knowledge, this is one of the first studies to explore the possibility of Speech-to-Text (S2T) technology for meeting the preparatory needs of interpreters as regards terminology and background/domain knowledge.
    • Language evolution and the spread of ideas on the Web: A procedure for identifying emergent hybrid word family members

      Thelwall, Mike; Price, Liz (Wiley, 2006)
      Word usage is of interest to linguists for its own sake as well as to social scientists and others who seek to track the spread of ideas, for example, in public debates over political decisions. The historical evolution of language can be analyzed with the tools of corpus linguistics through evolving corpora and the Web. But word usage statistics can only be gathered for known words. In this article, techniques are described and tested for identifying new words from the Web, focusing on the case when the words are related to a topic and have a hybrid form with a common sequence of letters. The results highlight the need to employ a combination of search techniques and show the wide potential of hybrid word family investigations in linguistics and social science.
    • Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions

      Taslimipoor, Shiva; Desantis, Anna; Cherchi, Manuela; Mitkov, Ruslan; Monti, Johanna (ceur-ws, 2016-12-05)
      This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.
    • Large-scale data harvesting for biographical data

      Plum, Alistair; Zampieri, Marcos; Orasan, Constantin; Wandl-Vogt, Eveline; Mitkov, R (CEUR, 2019-09-05)
      This paper explores automatic methods to identify relevant biography candidates in large databases, and extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though they are considered by many as an identical resource, our results show that the data from Wikipedia and Wikidata differs in some cases and they can be used in a complementary way providing more data for the compilation of biographies.
    • Las tecnologías de interpretación a distancia en los servicios públicos: uso e impacto

      Gaber, Mahmoud; Corpas Pastor, Gloria; Postigo Pinazo, Encarnación (Peter Lang, 2020-02-27)
      This chapter deals with the use of distance interpreting technologies and their impact on public service interpreters. Remote (or distance) interpreting offers a wide range of solutions to satisfy the pressing need for language services in both the public and private sectors. This study focuses on telephone-mediated and video-mediated interpreting, presenting their advantages and disadvantages. We have designed a survey to gather data about the psychological and physiological impact that remote interpreting technologies have on community interpreters. Our main aim is to ascertain interpreters’ general view of technology, so as to detect deficiencies and suggest improvements. This study is a first contribution towards optimising distance interpreting technologies. Current demand reveals the enormous potential of distance interpreting, its rapid evolution, and the widespread presence that this modality will have in the future.