• The Portrait of Dorian Gray: A corpus-based analysis of translated verb + noun (object) collocations in Peninsular and Colombian Spanish

      Valencia Giraldo, M. Victoria; Corpas Pastor, Gloria (Springer, 2019-09-18)
      Corpus-based Translation Studies have promoted research on the features of translated language by focusing on the process and product of translation from a descriptive perspective. Some of these features have been proposed by Toury [31] under the term laws of translation, namely the law of growing standardisation and the law of interference. The law of standardisation appears to be particularly at play in diatopy, and more specifically in the case of transnational languages (e.g. English, Spanish, French, German). In fact, some studies have revealed a tendency to standardise the diatopic varieties of Spanish in translated language [8, 9, 11, 12]. This paper focuses on verb + noun (object) collocations in Spanish translations of The Portrait of Dorian Gray by Oscar Wilde. Two different varieties have been chosen (Peninsular and Colombian Spanish). Our main aim is to establish whether the Colombian Spanish translation actually matches the variety spoken in Colombia or whether it is closer to general or standard Spanish. For this purpose, the techniques used to translate this type of collocation in both Spanish translations will be analysed. Furthermore, the diatopic distribution of these collocations will be studied by means of large corpora.
    • Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada

      Corpas Pastor, Gloria (University of Malaga, 2001-12-31)
      This paper explores the present and future possibilities that corpus linguistics offers to Translation Studies, with particular reference to its pedagogical dimension. At present, corpus-based research is an essential component of machine translation systems, terminology and concept extraction programs, contrastive studies, and the characterisation of translated language. The two types of corpora most widely used for such purposes are comparable and parallel corpora. This article, however, starts from an ad hoc corpus of comparable original texts as a macro-source of documentation for the teaching and professional practice of specialised inverse translation.
    • Translating the discourse of medical tourism: A catalogue of resources and corpus for translators and researchers

      Davoust, E; Corpas Pastor, Gloria; Seghiri Domínguez, Miriam (The Slovak Association for the Study of English, 2018-12-18)
      The recent increase in medical tourism in Europe also means that more written content is translated on the web to reach potential clients. Translating cross-border care language is somewhat challenging because it involves different agents and linguistic fields, making it difficult for translators and researchers to apprehend fully. We present a catalogue of possible informative resources on medical tourism and an ad hoc corpus based on Spanish medical websites, focused on aesthetics and cosmetics, that were translated into English.
    • La variación fraseológica: análisis del rendimiento de los corpus monolingües como recursos de traducción

      Hidalgo-Ternero, Carlos Manuel; Corpas Pastor, Gloria (Faculty of Arts, Masaryk University, 2021-06-30)
      Idioms tend to vary significantly in discourse (variation, grammatical inflection, discontinuity…). This makes it especially difficult to create appropriate query patterns that retrieve these units in all their shapes and forms while avoiding excessive noise. In this context, this paper analyses the performance of different corpus management systems available for Spanish when searching for phraseological variants such as tener entre manos, traer entre manos and llevar entre manos, as well as ir al pelo and venir al pelo. More specifically, we examine two corpora created by the Real Academia Española (CREA, in its original and annotated versions, and CORPES XXI), the Corpus del Español by Mark Davies (BYU), and Sketch Engine. The results of our study shed some light on which corpus management system offers the best performance for translators facing the challenge of idiom variation.
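The query-design difficulty described above can be illustrated with a minimal sketch. Everything here (the pattern, the function name, the example sentences) is our own illustration, not part of the study, which queried CREA, CORPES XXI, the BYU Corpus del Español and Sketch Engine rather than raw regular expressions:

```python
import re

# A single pattern attempting to capture the variant family
# "tener/traer/llevar entre manos" despite verb inflection and
# discontinuity (intervening objects). Note that loose stems such
# as "tra\w+" also over-generate (e.g. "trabajo"), which is exactly
# the precision/recall trade-off the paper discusses.
PATTERN = re.compile(
    r"\b(ten\w+|tien\w+|tuv\w+|tra\w+|llev\w+)"  # inflected verb forms
    r"(?:\s+\S+){0,4}?"                          # up to 4 intervening tokens
    r"\s+entre\s+manos\b",
    re.IGNORECASE,
)

def find_variants(text):
    """Return all matched spans of the idiom family in `text`."""
    return [m.group(0) for m in PATTERN.finditer(text)]

examples = [
    "No sé qué se trae entre manos el ministro.",
    "Tenía un asunto delicado entre manos.",
    "Llevan algo raro entre manos desde hace meses.",
]
for s in examples:
    print(find_variants(s))
```

The bounded gap `{0,4}?` is what allows discontinuous realisations ("tenía un asunto delicado entre manos") to be retrieved at the cost of extra noise, mirroring the trade-off corpus query languages face.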
    • Technology solutions for interpreters: the VIP system

      Corpas Pastor, Gloria (Universidad de Valladolid, 2022-01-07)
      Interpreting technologies have abruptly entered the profession in recent years. However, technology still remains a relatively marginal topic of academic debate, although interest in developing tailor-made solutions for interpreters has risen sharply. This paper presents the VIP system, one of the research outputs of the homonymous project VIP - Voice-text Integrated system for interPreters, and its continuation (VIP II). More specifically, a technology-based terminology workflow for simultaneous interpretation is presented.
    • Interpreting and technology: Is the sky really the limit?

      Corpas Pastor, Gloria (INCOMA Ltd., 2021-07-05)
      Nowadays there is a pressing need to develop interpreting-related technologies, with practitioners and other end-users increasingly calling for tools tailored to their needs and their new interpreting scenarios. At the same time, however, interpreting as a human activity has resisted complete automation for various reasons, such as fear, unawareness, communication complexities and a lack of dedicated tools. Several computer-assisted interpreting tools and resources for interpreters have been developed, although they are rather modest in terms of the support they provide. In the same vein, and despite the pressing need for aid in multilingual mediation, machine interpreting is still under development, with the exception of a few success stories. This paper will present the results of VIP, an R&D project on language technologies applied to interpreting. It is the ‘seed’ of a family of projects on interpreting technologies which are currently being developed, or have just been completed, at the Research Institute of Multilingual Language Technologies (IUITLM), University of Malaga.
    • Joint learning of morphology and syntax with cross-level contextual information flow

      Can Buglalilar, Burcu; Aleçakır, Hüseyin; Manandhar, Suresh; Bozşahin, Cem (Cambridge University Press, 2022-01-20)
      We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and syntactic parsing onto dependencies, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism at horizontal flow. Our model extends the work of Nguyen and Verspoor (2018) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. Primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for individual tasks. Being one of the earlier efforts in joint modeling of syntax and morphology along with dependencies, we discuss prospective guidelines for future comparison.
    • A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English

      Sivakumar, Jasivan; Muga, Jake; Spadavecchia, Flavio; White, Daniel; Can Buglalilar, Burcu (IEEE, 2022-06-30)
      In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features in unformatted English text: word and sentence boundaries, periods, commas, and capitalisation. We approach feature restoration as a binary classification task in which the model learns to predict whether a feature should be restored or not. A pipeline approach is proposed, in which only one feature (word boundary, sentence boundary, punctuation, capitalisation) is restored in each component of the pipeline model. To optimise the model, we conducted a grid search on the parameters. The effect of changing the order of the pipeline was also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specification points with optimisation potential to be targeted in follow-up research.
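The framing of restoration as binary classification can be made concrete with a small sketch. This is our own illustration of the labelling scheme (the function name and alignment logic are ours, not the authors' GRU code): for space restoration, each position in the unformatted stream receives a 0/1 label meaning "insert the feature after this character".

```python
# Derive per-character "insert a space after this character?" labels
# by aligning an unformatted stream with its formatted reference.
# These 0/1 labels are what a sequence model (such as a GRU) would
# be trained to predict.
def boundary_labels(unformatted, formatted):
    labels, j = [], 0
    for ch in unformatted:
        j += 1  # advance past the character we just consumed
        insert = j < len(formatted) and formatted[j] == " "
        labels.append(1 if insert else 0)
        if insert:
            j += 1  # skip the space in the reference
    return labels

print(boundary_labels("thecat", "the cat"))  # [0, 0, 1, 0, 0, 0]
```

The same scheme generalises to the other pipeline components (periods, commas, casing) by changing which feature the label encodes.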
    • Parsing AUC result-figures in machine learning specific scholarly documents for semantically-enriched summarization

      Safder, Iqra; Batool, Hafsa; Sarwar, Raheem; Zaman, Farooq; Aljohani, Naif Radi; Nawaz, Raheel; Gaber, Mohamed; Hassan, Saeed-Ul (Taylor & Francis, 2021-11-14)
      Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC) and their associated captions from full-text documents. First, we classify the extracted figures and analyze them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result-figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of specialized summaries using ROUGE, Edit distance, and Jaccard Similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.
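As a minimal illustration of the trapezoidal rule mentioned above (our own sketch, not the authors' implementation), the area under a curve recovered as a series of (x, y) points is the sum of the areas of the trapezoids formed by consecutive points:

```python
# Approximate the area under a curve (AUC) from sampled (x, y) points
# using the trapezoidal rule: each interval contributes width * mean height.
def trapezoidal_auc(xs, ys):
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area

# Example: points sampled from y = x on [0, 1]; the exact area is 0.5,
# and the trapezoidal rule is exact for piecewise-linear curves.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.0, 0.25, 0.5, 0.75, 1.0]
print(trapezoidal_auc(xs, ys))  # 0.5
```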
    • Author verification of Nahj Al-Balagha

      Sarwar, Raheem; Mohamed, Emad (Oxford University Press, 2022-01-20)
      The primary purpose of this paper is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and over which Sunni and Shi’i Muslims are proposing different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words, and that regardless of the feature sets, the three sub-corpora cluster into three distinct groups using Principal Component Analysis, Hierarchical Clustering, Multi-dimensional Scaling and Bootstrap Consensus Trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging and Decision Trees confirm the same results, which is a clear indication that (a) the book is internally consistent and can thus be attributed to a single person, and (b) it was not authored by either of the suspected authors.
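The stylometric feature extraction behind such studies can be sketched in a few lines. Representing each text by the relative frequencies of its most frequent words (or morphological segments) is standard practice in authorship analysis, but the function names and toy texts below are our own illustration, not the paper's code:

```python
from collections import Counter

def feature_vector(text, vocab):
    """Relative frequencies of `vocab` items in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

corpus = {
    "text_a": "the cat sat on the mat and the cat slept",
    "text_b": "a dog ran and a dog barked at a cat",
}
# Shared vocabulary: the most frequent words across the whole corpus.
all_tokens = Counter(" ".join(corpus.values()).lower().split())
vocab = [w for w, _ in all_tokens.most_common(5)]

# These vectors are what PCA, hierarchical clustering, or a supervised
# classifier would then operate on to compare candidate authors.
vectors = {name: feature_vector(t, vocab) for name, t in corpus.items()}
print(vocab, vectors)
```

With morphological segmentation, the tokens would simply be segments instead of whole words, which is the comparison the paper reports.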
    • Extracción de fraseología para intérpretes a partir de corpus comparables compilados mediante reconocimiento automático del habla

      Corpas Pastor, Gloria; Gaber, Mahmoud; Bautista Zambrana, María Rosario; Hidalgo Ternero, Carlos Manuel (Editorial Comares, 2021-10-04)
      Today, automatic speech recognition is beginning to emerge strongly in the field of interpreting. Recent studies point to this technology as one of the main documentation resources for interpreters, among other possible uses. In this paper we present a novel documentation methodology that involves semi-automatic compilation of comparable corpora (transcriptions of oral speeches) and automatic corpus compilation of written documents on the same topic with a view to preparing an interpreting assignment. As a convenient background, we provide a brief overview of the use of automatic speech recognition in the context of interpreting technologies. Next, we will detail the protocol for designing and compiling our comparable corpora that we will exploit for analysis. In the last part of the paper, we will cover phraseology extraction and study some collocational patterns in both corpora. Mastering the specific phraseology of the specific subject matter of the assignment is one of the main stumbling blocks that interpreters face in their daily work. Our ultimate aim is to establish whether oral corpora could be of further benefit to the interpreter in the preliminary preparation phase.
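Phraseology extraction of the kind described is commonly scored with association measures such as pointwise mutual information (PMI) over co-occurring words. The following is an illustrative stand-in with our own names and toy data, not the authors' tooling:

```python
import math
from collections import Counter

# Score adjacent word pairs by PMI: how much more often two words
# co-occur than chance would predict given their individual frequencies.
def pmi_collocations(tokens, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:  # frequency threshold filters out noise
            scores[(w1, w2)] = math.log2(
                (c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n))
            )
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = ("tomar una decisión importante y tomar una decisión rápida "
          "para decidir algo").split()
print(pmi_collocations(tokens))
```

Run over the transcribed (oral) and written corpora separately, the same scoring makes it possible to compare which collocational patterns each corpus surfaces, as the paper does.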
    • Translationese and register variation in English-to-Russian professional translation

      Kunilovskaya, Maria; Corpas Pastor, Gloria; Wang, Vincent; Lim, Lily; Li, Defeng (Springer Singapore, 2021-10-12)
      This study explores the impact of register on the properties of translations. We compare sources, translations and non-translated reference texts to describe the linguistic specificity of translations, both common and unique across four registers. Our approach includes bottom-up identification of translationese effects that can be used to define translations in relation to the contrastive properties of each register. The analysis is based on an extended set of features that reflect morphological, syntactic and text-level characteristics of translations. We also experiment with lexis-based features from n-gram language models estimated on large bodies of originally-authored texts from the included registers. Our parallel corpora are built from published English-to-Russian professional translations of general-domain mass-media texts, popular-scientific books, fiction and analytical texts on political and economic news. The number of observations and the data sizes for parallel and reference components are comparable within each register and range from 166 (fiction) to 525 (media) text pairs, and from 300,000 to 1 million tokens. Methodologically, the research relies on a series of supervised and unsupervised machine learning techniques, including those that facilitate visual data exploration. We learn a number of text classification models and study their performance to assess our hypotheses. Further on, we analyse the usefulness of the features for these classifications to detect the best translationese indicators in each register. The multivariate analysis via text classification is complemented by univariate statistical analysis, which helps to explain the observed deviation of translated registers through a number of translationese effects and to detect the features that contribute to them. Our results demonstrate that each register generates a unique form of translationese that can be only partially explained by cross-linguistic factors. Translated registers differ in the amount and type of prevalent translationese. The same translationese tendencies in different registers are manifested through different features. In particular, the notorious shining-through effect is more noticeable in general media texts and news commentary and is less prominent in fiction.
    • Source language difficulties in learner translation: Evidence from an error-annotated corpus

      Kunilovskaia, Mariia; Ilyushchenya, Tatyana; Morgoun, Natalia; Mitkov, Ruslan (John Benjamins Publishing, 2022-06-30)
      This study uses an error-annotated, mass-media subset of a sentence-aligned, multi-parallel learner translator corpus to reveal source-language items that are challenging in English-to-Russian translation. Our data include multiple translations of the most challenging source sentences, distilled from a large collection of student translations on the basis of error statistics. This sample was subjected to manual contrastive-comparative analysis, which resulted in a list of English items that were difficult for students. The outcome of the analysis was compared to the topics discussed in dozens of translation textbooks that are recommended to BA and specialist-degree students in Russia at the initial stage of professional education. We discuss items that deserve more prominence in training as well as items that call for improvements to traditional learning activities. This study presents evidence that a more empirically motivated design of practical translation syllabi is required as part of translator education.
    • Findings of the WMT 2021 shared task on quality estimation

      Specia, Lucia; Blain, Frederic; Fomicheva, Marina; Zerva, Chrysoula; Li, Zhenhao; Chaudhary, Vishrav; Martins, André (Association for Computational Linguistics, 2021-11-10)
      We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.
    • Using linguistic features to predict the response process complexity associated with answering clinical MCQs

      Yaneva, Victoria; Jurich, Daniel; Ha, Le An; Baldwin, Peter (Association for Computational Linguistics, 2021-04-30)
      This study examines the relationship between the linguistic characteristics of a test item and the complexity of the response process required to answer it correctly. Using data from a large-scale medical licensing exam, clustering methods identified items that were similar with respect to their relative difficulty and relative response-time intensiveness to create low response process complexity and high response process complexity item classes. Interpretable models were used to investigate the linguistic features that best differentiated between these classes from a descriptive and predictive framework. Results suggest that nuanced features such as the number of ambiguous medical terms help explain response process complexity beyond superficial item characteristics such as word count. Yet, although linguistic features carry signal relevant to response process complexity, the classification of individual items remains challenging.
    • An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2021-08-31)
      Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that these QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.
    • deepQuest-py: large and distilled models for quality estimation

      Alva-Manchego, Fernando; Obamuyide, Abiola; Gajbhiye, Amit; Blain, Frederic; Fomicheva, Marina; Specia, Lucia; Adel, Heike; Shi, Shuming (Association for Computational Linguistics, 2021-11-01)
      We introduce deepQuest-py, a framework for training and evaluating large and lightweight models for Quality Estimation (QE). deepQuest-py provides access to (1) state-of-the-art models based on pre-trained Transformers for sentence-level and word-level QE; (2) lightweight and efficient sentence-level models implemented via knowledge distillation; and (3) a web interface for testing models and visualising their predictions. deepQuest-py is available at https://github.com/sheffieldnlp/deepQuest-py under a CC BY-NC-SA licence.
    • Pushing the right buttons: adversarial evaluation of quality estimation

      Kanojia, Diptesh; Fomicheva, Marina; Ranasinghe, Tharindu; Blain, Frederic; Orasan, Constantin; Specia, Lucia; Barrault, Loïc; Bojar, Ondrej; Bougares, Fethi; Chatterjee, Rajen; et al. (Association for Computational Linguistics, 2022-01-11)
      Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
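The adversarial-testing criterion can be expressed as a simple check: a useful QE system should score a meaning-altering perturbation of a translation lower than the original. The sketch below is our own illustration of that check, with a deliberately naive toy scorer standing in for a real QE model:

```python
# A QE system "detects" a perturbation if it ranks the perturbed
# output below the original one.
def detects_perturbation(qe_score, original, perturbed):
    return qe_score(perturbed) < qe_score(original)

# Toy stand-in scorer (NOT a real QE model): fraction of reference
# words preserved in the hypothesis.
def toy_qe(reference):
    ref = set(reference.split())
    def score(hypothesis):
        return len(ref & set(hypothesis.split())) / len(ref)
    return score

score = toy_qe("the cat is on the mat")
print(detects_perturbation(score, "the cat is on the mat",
                           "the dog is on the mat"))  # True
```

Aggregating this boolean over a suite of meaning-altering and meaning-preserving perturbations gives the discrimination rate that the paper shows to be predictive of overall QE performance.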
    • Robust fragment-based framework for cross-lingual sentence retrieval

      Trijakwanich, Nattapol; Limkonchotiwat, Peerat; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Chuangsuwanich, Ekapol; Nutanong, Sarana; Moens, Marie-Francine; Huang, Xuanjing; Specia, Lucia; Yih, Scott Wen-tau (Association for Computational Linguistics, 2021-11-01)
      Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.
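The shift from sentence-level to fragment-level granularity can be sketched with a toy monolingual example. Everything below (function names, window size, the overlap-based score) is our own illustration of the idea; RFR itself builds fragment representations on top of multilingual encoders such as LaBSE:

```python
# Split a sentence into overlapping word-window fragments, then score
# a candidate by how many of the query's fragments it shares. Matching
# at the fragment level is more robust to partial divergence than
# comparing whole-sentence representations.
def fragments(sentence, size=3):
    words = sentence.split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def fragment_score(query, candidate, size=3):
    """Fraction of query fragments that also occur in the candidate."""
    q = set(fragments(query, size))
    c = set(fragments(candidate, size))
    return len(q & c) / len(q)

query = "the quick brown fox jumps over the lazy dog"
candidates = [
    "a quick brown fox jumps over a sleeping cat",
    "completely unrelated sentence about corpora",
]
best = max(candidates, key=lambda s: fragment_score(query, s))
print(best)  # the partially matching sentence wins
```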