• Multilingual offensive language identification for low-resource languages

      Ranasinghe, Tharindu; Zampieri, Marcos (Association for Computing Machinery, 2021-11-10)
      Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
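      The scores in this abstract are macro-averaged F1 values. As a reading aid, here is a minimal sketch of how macro F1 is usually computed: the F1 of each class is calculated separately and the results are averaged without weighting by class frequency. The function name and the `OFF`/`NOT` labels are illustrative assumptions, not taken from the paper.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: OFF = offensive, NOT = not offensive
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
print(round(macro_f1(gold, pred), 4))
```

      Because every class contributes equally, a system that ignores the minority (offensive) class is penalised more heavily than it would be under plain accuracy.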
    • Robust fragment-based framework for cross-lingual sentence retrieval

      Trijakwanich, Nattapol; Limkonchotiwat, Peerat; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Chuangsuwanich, Ekapol; Nutanong, Sarana; Moens, Marie-Francine; Huan, Xuanjing; Specia, Lucia; Yih, Scott Wen-tau (Association for Computational Linguistics, 2021-11-01)
      Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve the sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance for more than 85% of the cases.
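      The idea of moving retrieval from the sentence level to the fragment level can be illustrated with a small sketch. This is not the RFR implementation: the abstract does not say how fragments are formed or how fragment scores are combined, so the sliding-window fragments, the "mean of best-matching fragment similarity" score, and the toy encoder below are all assumptions. In practice `encode` would wrap a multilingual encoder such as those named in the abstract.

```python
import math

def fragments(tokens, size=2):
    """Overlapping token windows; short sentences yield a single fragment."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_score(query, candidate, encode, size=2):
    """Assumed aggregation: average, over query fragments, of the best candidate fragment match."""
    q_vecs = [encode(f) for f in fragments(query.split(), size)]
    c_vecs = [encode(f) for f in fragments(candidate.split(), size)]
    return sum(max(cosine(q, c) for c in c_vecs) for q in q_vecs) / len(q_vecs)

def retrieve(query, candidates, encode, size=2):
    """Index of the candidate sentence with the highest fragment-level score."""
    return max(range(len(candidates)),
               key=lambda i: sentence_score(query, candidates[i], encode, size))

# Hypothetical stand-in for a multilingual encoder: hand-made shared word vectors.
TOY_VECTORS = {
    "cat": (1, 0, 0), "gato": (1, 0, 0),
    "dog": (0, 1, 0), "perro": (0, 1, 0),
    "sleeps": (0, 0, 1), "duerme": (0, 0, 1),
}

def toy_encode(tokens):
    vec = [0.0, 0.0, 0.0]
    for tok in tokens:
        for i, x in enumerate(TOY_VECTORS.get(tok, (0, 0, 0))):
            vec[i] += x
    return vec

print(retrieve("el gato duerme", ["the dog sleeps", "the cat sleeps"], toy_encode))
```

      Scoring fragments rather than whole sentences means a partial match still contributes to the score, which is one plausible reading of why the abstract links fragment-level representations to robustness.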
    • deepQuest-py: large and distilled models for quality estimation

      Alva-Manchego, Fernando; Obamuyide, Abiola; Gajbhiye, Amit; Blain, Frederic; Fomicheva, Marina; Specia, Lucia; Adel, Heike; Shi, Shuming (Association for Computational Linguistics, 2021-11-01)
      We introduce deepQuest-py, a framework for training and evaluation of large and lightweight models for Quality Estimation (QE). deepQuest-py provides access to (1) state-of-the-art models based on pre-trained Transformers for sentence-level and word-level QE; (2) lightweight and efficient sentence-level models implemented via knowledge distillation; and (3) a web interface for testing models and visualising their predictions. deepQuest-py is available at https://github.com/sheffieldnlp/deepQuest-py under a CC BY-NC-SA licence.
    • Urdu AI: writeprints for Urdu authorship identification

      Sarwar, Raheem; Hassan, Saeed-Ul (Association for Computing Machinery, 2021-10-31)
      The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than existing studies and conduct several experimental studies which show that our solution can overcome the limitations of existing studies and report an accuracy level of 94.03%, which is higher than all previous authorship identification works.
    • Translationese and register variation in English-to-Russian professional translation

      Kunilovskaya, Maria; Corpas Pastor, Gloria; Wang, Vincent; Lim, Lily; Li, Defeng (Springer Singapore, 2021-10-12)
      This study explores the impact of register on the properties of translations. We compare sources, translations and non-translated reference texts to describe the linguistic specificity of translations common and unique between four registers. Our approach includes bottom-up identification of translationese effects that can be used to define translations in relation to contrastive properties of each register. The analysis is based on an extended set of features that reflect morphological, syntactic and text-level characteristics of translations. We also experiment with lexis-based features from n-gram language models estimated on large bodies of originally-authored texts from the included registers. Our parallel corpora are built from published English-to-Russian professional translations of general domain mass-media texts, popular-scientific books, fiction and analytical texts on political and economic news. The number of observations and the data sizes for parallel and reference components are comparable within each register and range from 166 (fiction) to 525 (media) text pairs; from 300,000 to 1 million tokens. Methodologically, the research relies on a series of supervised and unsupervised machine learning techniques, including those that facilitate visual data exploration. We learn a number of text classification models and study their performance to assess our hypotheses. Further on, we analyse the usefulness of the features for these classifications to detect the best translationese indicators in each register. The multivariate analysis via text classification is complemented by univariate statistical analysis which helps to explain the observed deviation of translated registers through a number of translationese effects and detect the features that contribute to them. Our results demonstrate that each register generates a unique form of translationese that can be only partially explained by cross-linguistic factors. Translated registers differ in the amount and type of prevalent translationese. The same translationese tendencies in different registers are manifested through different features. In particular, the notorious shining-through effect is more noticeable in general media texts and news commentary and is less prominent in fiction.
    • Extracción de fraseología para intérpretes a partir de corpus comparables compilados mediante reconocimiento automático del habla

      Corpas Pastor, Gloria; Gaber, Mahmoud; Corpas Pastor, Gloria; Bautista Zambrana, María Rosario; Hidalgo Ternero, Carlos Manuel (Editorial Comares, 2021-10-04)
      Today, automatic speech recognition is beginning to emerge strongly in the field of interpreting. Recent studies point to this technology as one of the main documentation resources for interpreters, among other possible uses. In this paper we present a novel documentation methodology that involves semi-automatic compilation of comparable corpora (transcriptions of oral speeches) and automatic corpus compilation of written documents on the same topic with a view to preparing an interpreting assignment. As background, we provide a brief overview of the use of automatic speech recognition in the context of interpreting technologies. Next, we detail the protocol for designing and compiling the comparable corpora that we exploit for analysis. In the last part of the paper, we cover phraseology extraction and study some collocational patterns in both corpora. Mastering the phraseology specific to the subject matter of the assignment is one of the main stumbling blocks that interpreters face in their daily work. Our ultimate aim is to establish whether oral corpora could be of further benefit to the interpreter in the preliminary preparation phase.
    • A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases

      Parodi, Giovanni; Evans, Richard; Ha, Le An; Mitkov, Ruslan; Julio, Cristóbal; Olivares-López, Raúl Ignacio (Springer, 2021-09-04)
      Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. Our method identifies, for each occurrence, the antecedent of the pronoun (including its grammatical type), the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment, and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to fine-tune an automatic analyser based on Bidirectional Encoder Representations from Transformers (BERT). On testing our method, we find that it performs with greater accuracy than three baselines (0.76 for the resolution task), and sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
    • An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2021-08-31)
      Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that these QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.
    • Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
    • Knowledge distillation for quality estimation

      Gajbhiye, Amit; Fomicheva, Marina; Alva-Manchego, Fernando; Blain, Frederic; Obamuyide, Abiola; Aletras, Nikolaos; Specia, Lucia (Association for Computational Linguistics, 2021-08-01)
      Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
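      The teacher-to-student transfer described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the paper's models: `teacher` is a made-up function standing in for a large QE model, the student is a three-parameter linear regressor, and the student is trained with mean squared error on the teacher's predictions rather than on human labels.

```python
import math

def teacher(x):
    """Hypothetical stand-in for a large QE model that returns a quality score."""
    return math.tanh(0.5 * x[0] - 0.2 * x[1] + 0.1)

def distil(inputs, teacher_fn, epochs=2000, lr=0.05):
    """Fit a small linear student to the teacher's outputs by gradient descent on MSE."""
    targets = [teacher_fn(x) for x in inputs]  # soft labels from the teacher
    dim, n = len(inputs[0]), len(inputs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(inputs, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - t
            for i in range(dim):
                grad_w[i] += 2 * err * x[i] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def student(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Unlabelled inputs (a small feature grid); no gold quality scores are needed.
grid = [(a / 2, c / 2) for a in range(3) for c in range(3)]
w, b = distil(grid, teacher)
worst = max(abs(student(x, w, b) - teacher(x)) for x in grid)
print(worst < 0.05)
```

      Because the student learns from teacher predictions, any additional unlabelled or augmented inputs can be scored by the teacher and added to the training set, which is one way to read the abstract's pairing of distillation with data augmentation.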
    • SemEval-2021 task 1: Lexical complexity prediction

      Shardlow, Matthew; Evans, Richard; Paetzold, Gustavo Henrique; Zampieri, Marcos (Association for Computational Linguistics, 2021-08-01)
      This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.
    • Combining text and images for film age appropriateness classification

      Ha, Le; Mohamed, Emad (Elsevier, 2021-07-14)
      We combine textual information from a corpus of film scripts and the images of important scenes from IMDB that correspond to these films to create a bimodal dataset (the dataset and scripts can be obtained from https://tinyurl.com/se9tlmr) for film age appropriateness classification with the objective of improving the prediction of age appropriateness for parents and children. We use state-of-the-art Deep Learning image feature extraction, including DenseNet, ResNet, Inception, and NASNet. We have tested several Machine Learning algorithms and have found XGBoost to yield the best results. Previously reported classification accuracies, using only textual features, were 79.1% and 65.3% for American MPAA and British BBFC classification respectively. Using images alone, we achieve 64.8% and 56.7% classification accuracy. The most consistent combination of textual features and image features achieves 81.1% and 66.8%, both statistically significant improvements over the use of text only.
    • Interpreting and technology: Is the sky really the limit?

      Corpas Pastor, Gloria (INCOMA Ltd., 2021-07-05)
      Nowadays there is a pressing need to develop interpreting-related technologies, with practitioners and other end-users increasingly calling for tools tailored to their needs and their new interpreting scenarios. But, at the same time, interpreting as a human activity has resisted complete automation for various reasons, such as fear, unawareness, communication complexities, lack of dedicated tools, etc. Several computer-assisted interpreting tools and resources for interpreters have been developed, although they are rather modest in terms of the support they provide. In the same vein, and despite the pressing need to aid multilingual mediation, machine interpreting is still under development, with the exception of a few success stories. This paper will present the results of VIP, an R&D project on language technologies applied to interpreting. It is the ‘seed’ of a family of projects on interpreting technologies which are currently being developed or have just been completed at the Research Institute of Multilingual Language Technologies (IUITLM), University of Malaga.
    • La variación fraseológica: análisis del rendimiento de los corpus monolingües como recursos de traducción

      Hidalgo-Ternero, Carlos Manuel; Corpas Pastor, Gloria (Faculty of Arts, Masaryk University, 2021-06-30)
      Idioms tend to vary significantly in discourse (variation, grammatical inflection, discontinuity…). This makes it especially difficult to create appropriate query patterns that retrieve these units in all shapes and forms while avoiding excessive noise. In this context, this paper analyses the performance of different corpus management systems available for Spanish when searching for phraseological variants such as tener entre manos, traer entre manos and llevar entre manos, as well as ir al pelo and venir al pelo. More specifically, we examine two corpora created by the Real Academia Española (CREA, in its original and annotated versions, and CORPES XXI), the Corpus del Español by Mark Davies (BYU), and Sketch Engine. The results of our study shed some light on which corpus management system can offer better performance for translators facing the challenge of idiom variation.
    • Sentiment analysis for Urdu online reviews using deep learning models

      Safder, Iqra; Mehmood, Zainab; Sarwar, Raheem; Hassan, Saeed-Ul; Zaman, Farooq; Adeel Nawab, Rao Muhammad; Bukhari, Faisal; Ayaz Abbasi, Rabeeh; Alelyani, Salem; Radi Aljohani, Naif; et al. (Wiley, 2021-06-28)
      Most existing studies focus on popular languages like English, Spanish, Chinese, Japanese, and others. However, limited attention has been paid to Urdu despite its having more than 60 million native speakers. In this paper, we develop a deep learning model for the sentiments expressed in this under-resourced language. We develop an open-source corpus of 10,008 reviews from 566 online threads on the topics of sports, food, software, politics, and entertainment. The objectives of this work are twofold: (1) the creation of a human-annotated corpus for the research of sentiment analysis in Urdu; and (2) measurement of up-to-date model performance using this corpus. For the assessment, we performed binary and ternary classification studies utilizing several models, namely LSTM, RCNN, Rule-Based, N-gram, SVM, and CNN. The RCNN model surpasses standard models with 84.98% accuracy for binary classification and 68.56% accuracy for ternary classification. To facilitate other researchers working in the same domain, we have open-sourced the corpus and code developed for this research.
    • Decálogo de características de la literatura poscolonial: propuesta de una taxonomía para la crítica literaria y los estudios de literatura comparada

      Fernández Ruiz, María Remedios; Corpas Pastor, Gloria; Seghiri, Míriam (Editorial CSIC, 2021-06-22)
      The aim of this article is to propose a classification of the features present, to a greater or lesser extent, in postcolonial literature in any language. Although this taxonomy takes as its starting point previous theoretical definitions of the key concepts related to postcolonial literature (Edwards 2008, Nayar 2008 and Ramone 2011), it appears to be the first formal classification produced on the subject. The article analyses well-established concepts while introducing the new notion of plasticity of literary genres and exploring current trends in intersectionality research. As a result, we provide a decalogue of characteristics of postcolonial literature that will support literary criticism and comparative literature studies.
    • Management of 201 individuals with emotionally unstable personality disorders: A naturalistic observational study in real-world inpatient setting

      Shahpesandy, Homayun; Mohammed-Ali, Rosemary; Oakes, Michael; Al-Kubaisy, Tarik; Cheetham, Anna; Anene, Moses; The Hartsholme Centre, Long Leys Road, Lincoln, LN1 1FS, Lincolnshire NHS Foundation Trust, UK. (Maghira & Maas Publications, 2021-06-03)
      BACKGROUND: Emotionally unstable personality disorder (EUPD) is a challenging condition with a prevalence of 20% in inpatient services. Psychotherapy is the preferred treatment; nevertheless, off-license medications are widely used. OBJECTIVES: To identify socio-demographic, clinical and service-delivery characteristics of people with EUPD admitted to inpatient services between 1st January 2017 and 31st December 2018. METHODS: A retrospective review using data from patients' records. Individuals aged 18-65 were included. Statistical analysis was conducted using the Mann-Whitney-Wilcoxon test and the Chi-squared test with Yates's continuity correction. RESULTS: Of 1646 inpatients, 201 (12.2%) had a diagnosis of EUPD: 133 (66.0%) women and 68 (44.0%) men. EUPD was significantly (P < .001) more prevalent in women (18.2%) than men (7.4%). EUPD patients were significantly (P < .001) younger (32.2 years) than patients without EUPD (46 years), and had significantly (P < .001) more admissions (1.74) than patients without EUPD (1.2 admissions). 70.5% of patients had one and 17.0% two Axis-I psychiatric co-morbidities. Substance use was significantly (P < .001) more frequent in men (57.3%) than in women (28.5%). Significantly (P = 0.047) more women (68.4%) than men (53.0%) reported sexual abuse. 87.5% used polypharmacy. Antidepressants were significantly (P = 0.02) more often prescribed to women (76.6%) than men (69.1%). Significantly (P = 0.02) more women (83.5%) than men (67.6%) were on antipsychotics. 57.2% of the patients were on anxiolytics, 40.0% on hypnotics and 25.8% on mood stabilisers. CONCLUSION: EUPD is a complex condition with widespread comorbidity. The term EUPD, or Borderline Personality Disorder, is unsuitable, stigmatising and too simplistic to reflect the nature, gravity and psychopathology of this syndrome.
    • Backtranslation feedback improves user confidence in MT, not quality

      Zouhar, Vilém; Novák, Michal; Žilinec, Matúš; Bojar, Ondřej; Obregón, Mateo; Hill, Robin L; Blain, Frédéric; Fomicheva, Marina; Specia, Lucia; Yankovskaya, Lisa; et al. (Association for Computational Linguistics, 2021-06-01)
      Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      This article analyses a case study carried out with students of the course Traducción General «BA-AB» (II) - Inglés-Español / Español-Inglés, taught in the second semester of the second year of the Degree in Translation and Interpreting at the University of Malaga. In a first phase, the students were taught how to make the most of different electronic documentation resources and tools (linguistic corpora, lexicographic resources or the web, among others) to create textual equivalences in those cases where, as a result of interlingual phraseological anisomorphism, the creative modification of phraseological units (PUs) in the source text and the absence of one-to-one correspondences pose serious difficulties for the translation process. Thus, a first training activity on the translation of creative uses of phraseological units was followed by a practical session in which the students had to deal with different cases of manipulation in the source text. The analysis of these results will show to what extent the different documentation resources help trainee translators to overcome the challenge of phraseological manipulation.
    • Using linguistic features to predict the response process complexity associated with answering clinical MCQs

      Yaneva, Victoria; Jurich, Daniel; Ha, Le An; Baldwin, Peter (Association for Computational Linguistics, 2021-04-30)
      This study examines the relationship between the linguistic characteristics of a test item and the complexity of the response process required to answer it correctly. Using data from a large-scale medical licensing exam, clustering methods identified items that were similar with respect to their relative difficulty and relative response-time intensiveness to create low response process complexity and high response process complexity item classes. Interpretable models were used to investigate the linguistic features that best differentiated between these classes from a descriptive and predictive framework. Results suggest that nuanced features such as the number of ambiguous medical terms help explain response process complexity beyond superficial item characteristics such as word count. Yet, although linguistic features carry signal relevant to response process complexity, the classification of individual items remains challenging.