• Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-12-31)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
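      The core ensemble idea can be illustrated with a minimal sketch: several domain-specific segmenters each propose word boundaries, and the ensemble keeps the boundaries a majority agrees on. All names, the voting rule, and the toy models below are illustrative assumptions; the paper's MDE framework is considerably more elaborate.

```python
def segment_by_majority(text, models, threshold=0.5):
    """Combine boundary predictions from several domain-specific models.

    `models` is a list of callables mapping text -> set of boundary indices
    (a boundary at index b means a split before text[b]). A boundary is kept
    if at least `threshold` of the models propose it.
    """
    votes = {}
    for model in models:
        for b in model(text):
            votes[b] = votes.get(b, 0) + 1
    keep = sorted(b for b, v in votes.items() if v / len(models) >= threshold)
    # Convert the surviving boundary indices into word spans.
    words, prev = [], 0
    for b in keep + [len(text)]:
        if b > prev:
            words.append(text[prev:b])
        prev = b
    return words

# Toy "domain models" that segment the string "thisisatest" differently.
m1 = lambda t: {4, 6, 7}   # this|is|a|test
m2 = lambda t: {4, 6, 7}   # this|is|a|test
m3 = lambda t: {4, 6}      # this|is|atest
print(segment_by_majority("thisisatest", [m1, m2, m3]))
```

Here the boundary at index 7 is proposed by two of three models, so it survives the 0.5 vote threshold and the majority segmentation wins.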
    • Sentiment analysis for Urdu online reviews using deep learning models

      Safder, Iqra; Mehmood, Zainab; Sarwar, Raheem; Hassan, Saeed-Ul; Zaman, Farooq; Adeel Nawab, Rao Muhammad; Bukhari, Faisal; Ayaz Abbasi, Rabeeh; Alelyani, Salem; Radi Aljohani, Naif; et al. (Wiley, 2021-12-31)
      Most existing studies focus on popular languages like English, Spanish, Chinese, and Japanese; limited attention has been paid to Urdu despite it having more than 60 million native speakers. In this paper, we develop a deep learning model for the sentiments expressed in this under-resourced language. We develop an open-source corpus of 10,008 reviews from 566 online threads on the topics of sports, food, software, politics, and entertainment. The objectives of this work are twofold: (1) the creation of a human-annotated corpus for research on sentiment analysis in Urdu; and (2) measurement of up-to-date model performance using this corpus. For the assessment, we performed binary and ternary classification studies utilizing several models, namely LSTM, RCNN, rule-based, N-gram, SVM, and CNN. The RCNN model surpasses standard models with 84.98% accuracy for binary classification and 68.56% accuracy for ternary classification. To facilitate other researchers working in the same domain, we have open-sourced the corpus and code developed for this research.
    • Las tecnologías de interpretación a distancia en los servicios públicos: uso e impacto

      Gaber, Mahmoud; Corpas Pastor, Gloria; Postigo Pinazo, Encarnación (Peter Lang, 2020-02-27)
      This chapter deals with the use of distance interpreting technologies and their impact on public service interpreters. Remote (or distance) interpreting offers a wide range of solutions to successfully satisfy the pressing need for language services in both the public and private sectors. This study focuses on telephone-mediated and video-mediated interpreting, presenting their advantages and disadvantages. We have designed a survey to gather data about the psychological and physiological impact that remote interpreting technologies have on community interpreters. Our main aim is to ascertain interpreters’ general view of the technology, so as to detect deficiencies and suggest ways of improvement. This study is a first contribution towards optimising distance interpreting technologies. Current demand reveals the enormous potential of distance interpreting, its rapid evolution, and the widespread presence this modality will have in the future.
    • Introduction

      Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
    • El hablar y el discurso repetido: la fraseología

      Mellado, Carmen; Corpas, Gloria; Berty, Katrin; Loureda, Óscar; Schrott, Angela (De Gruyter, 2021-01-18)
      This chapter shows the interrelation between fixedness and variability in phraseological units from different points of view. First, we carry out a detailed analysis of Coseriu’s concept of “repeated discourse”, which already incorporated the idea of creative change at its origin, and then offer an overview of the evolution of phraseology in relation to text linguistics. Second, we present a classification of the typology of phraseological variation, illustrated with examples from linguistic corpora and centred on the levels of system and speech, as well as on speaker intentionality. Third, we address the issue of phraseological variability and the turn that the notion of “fixedness” has taken since massive corpus data became available. In this context, the measures of absolute frequency, normalised frequency and statistical significance play a fundamental role in determining the degree of fixedness.
    • Estrategias heurísticas con corpus para la enseñanza de la fraseología orientada a la traducción

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Seghiri, Miriam (Peter Lang, 2020)
      This work presents a didactic proposal carried out in the subject Lengua y cultura “B” aplicadas a la Traducción e Interpretación (II) – inglés, taught in the first year of the Bachelor’s Degree in Translation and Interpreting at the University of Malaga. The main objective of this proposal is to teach the possibilities that both monolingual and bilingual corpora can provide for the correct identification and interpretation of phraseological units with regard to their translation, paying special attention to those cases where the ambiguity of phraseological sequences may lead to multiple interpretations. We will focus on somatisms and will mainly use two Spanish monolingual corpora (CORPES XXI and esEuTenTen), an English monolingual corpus (enTenTen) and two parallel corpora (Europarl and Linguee, more specifically its English-Spanish subcorpus). Against this background, this proposal is divided into several learning activities. After a first seminar where the concepts of corpus, phraseology and translation are introduced, in learning activity 2 we will use parallel corpora to find translation pairings that contain translation mistakes caused by problems with phraseological ambiguity. Then, in the third learning activity, we will teach some disambiguating elements that will facilitate a correct identification and interpretation of the phraseological unit, in order to be able to convey its pragmatic and semantic weight in the target text. It is in this step where corpora can play a decisive role as documentation tools. Nevertheless, the localisation and interpretation of phraseological units is not problem-free.
Given the necessity to develop techniques that enable a more effective detection of phraseological units, in the fourth learning activity students will learn an array of heuristic strategies to refine their searches in the consulted corpora, as well as to select adequate equivalences after a correct interpretation of the results produced by these corpora.
    • Teaching idioms for translation purposes: a trilingual corpus-based glossary applied to phraseodidactics (ES/EN/DE)

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Bautista Zambrada, María Rosario; Martínez, Florentina Mena; Strohschen, Carola (Peter Lang, 2020)
      Phraseology plays a pivotal role in the development of translation competence as well as in translation quality assessment. Thus far, however, there remains a paucity of research on how to best teach idioms for translation purposes. Against such a background, this study aims to shed some light on the multiple applications of phraseodidactics to translation training. We will follow a corpus-based methodology and, for the sake of the argument, the focus will be on somatisms in Spanish, English and German. The overall structure of this paper takes the form of four sections. Section one begins by laying out the theoretical dimensions of phraseology and its convergence with translation. In section two we examine the main components of a corpus-based glossary of somatisms, named Glossomatic, and how it can be employed to establish ad hoc phraseological equivalences in those cases (analysed in section three) where the manipulation of idioms and the absence of one-to-one phraseological correspondence may pose some problems to translation. In this regard, given the importance of accurately conveying the pragmatic, semantic and discursive load of an idiom into a TT and, concomitantly, conveying the manipulation depicted in the ST, section four presents a teaching proposal in which students are prompted with a set of strategies and steps to be implemented with the aid of the glossary in order to solve these issues. Overall, the insights gained from this research will prove useful not only in developing trainees’ phraseological competence but also in giving centre stage to phraseodidactics in Translation Studies.
    • Knowledge distillation for quality estimation

      Gajbhiye, Amit; Fomicheva, Marina; Alva-Manchego, Fernando; Blain, Frederic; Obamuyide, Abiola; Aletras, Nikolaos; Specia, Lucia (Association for Computational Linguistics, 2021-12-31)
      Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
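      The distillation setup the abstract describes can be sketched in miniature: a small student model is fitted to reproduce the continuous quality scores of a teacher, rather than gold labels. The toy features, the linear student, and the training loop below are all illustrative assumptions, not the paper's architecture.

```python
# Toy sentence features and a "teacher" scoring function standing in for a
# large pre-trained QE model (both hypothetical).
X = [(1.0, 0.2), (0.5, 1.0), (0.0, 0.7), (1.0, 1.0), (0.3, 0.1)]
teacher = lambda x: 0.8 * x[0] - 0.5 * x[1]          # teacher's QE score
y = [teacher(x) for x in X]                          # distillation targets

# Student: a much smaller (here, linear) model trained with MSE against the
# teacher's outputs instead of human quality labels.
w = [0.0, 0.0]
lr = 0.1
for _ in range(2000):
    for j in range(2):
        grad = sum(2 * (w[0] * x[0] + w[1] * x[1] - t) * x[j]
                   for x, t in zip(X, y)) / len(X)
        w[j] -= lr * grad

mse = sum((w[0] * x[0] + w[1] * x[1] - t) ** 2 for x, t in zip(X, y)) / len(X)
print(f"student-teacher MSE: {mse:.2e}")
```

Because the student only needs to mimic the teacher's scores, any amount of unlabelled text the teacher can score becomes training data, which is what makes the data augmentation mentioned in the abstract possible.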
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      This article analyses a case study carried out with students of the course Traducción General «BA-AB» (II) - Inglés-Español / Español-Inglés, taught in the second semester of the second year of the Bachelor’s Degree in Translation and Interpreting at the University of Malaga. In a first phase, students were taught how to make the most of different electronic documentation resources and tools (linguistic corpora, lexicographic resources and the web, among others) to create textual equivalences in those cases where, as a result of interlingual phraseological anisomorphism, the creative modification of phraseological units (PUs) in the source text and the absence of one-to-one correspondences pose serious difficulties for the translation process. Accordingly, a first training activity on the translation of creative uses of phraseological units was followed by a practical session in which students had to tackle different cases of manipulation in the source text. The analysis of the results shows to what extent the various documentation resources help trainee translators overcome the challenge of phraseological manipulation.
    • Backtranslation feedback improves user confidence in MT, not quality

      Zouhar, Vilém; Novák, Michal; Žilinec, Matúš; Bojar, Ondřej; Obregón, Mateo; Hill, Robin L; Blain, Frédéric; Fomicheva, Marina; Specia, Lucia; Yankovskaya, Lisa; et al. (Association for Computational Linguistics, 2021-06-01)
      Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of the machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
    • Constructional idioms of ‘insanity’ in English and Spanish: A corpus-based study

      Corpas Pastor, Gloria (Elsevier, 2021-02-10)
      This paper presents a corpus-based study of constructions in English and Spanish, with a special emphasis on equivalent semantic-functional counterparts, and potential mismatches. Although usage/corpus-based Construction Grammar (CxG) has attracted much attention in recent years, most studies have dealt exclusively with monolingual constructions. In this paper we will focus on two constructions that represent conventional ways to express ‘insanity’ in both languages. The analysis will cover grammatical, semantic and informative aspects in order to establish a multi-linguistic prototype of the constructions. To that end, data from several giga-token corpora of contemporary spoken English and Spanish (parallel and comparable) have been selected. This study advances the explanatory potential of constructional idioms for the study of idiomaticity, variability and cross-language analysis. In addition, relevant findings on the dialectal distribution of certain idiom features across both languages and their national varieties are also reported.
    • Attention: there is an inconsistency between android permissions and application metadata!

      Alecakir, Huseyin; Can, Burcu; Sen, Sevil (Springer Science and Business Media LLC, 2021-01-07)
      Since mobile applications make our lives easier, there is a large number of mobile applications customized for our needs in the application markets. While the application markets provide us a platform for downloading applications, they are also used by malware developers to distribute their malicious applications. In Android, permissions are used to prevent users from installing applications that might violate the users’ privacy by raising their awareness. From the privacy and security point of view, if the functionality of applications is given in sufficient detail in their descriptions, then the requirement of the requested permissions can be well understood. This is defined as description-to-permission fidelity in the literature. In this study, we propose two novel models that address the inconsistencies between application descriptions and the requested permissions. The proposed models are based on the current state-of-the-art neural architectures called attention mechanisms. Here, we aim to find the permission statement words or sentences in app descriptions by using the attention mechanism along with recurrent neural networks. The lack of such permission statements in an application description creates a suspicion. Hence, the proposed approach could assist static analysis techniques in finding suspicious apps and prioritizing apps for more resource-intensive analysis techniques. The experimental results show that the proposed approach achieves high accuracy.
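      The attention idea behind the models can be illustrated in a few lines: each description word is scored against a query representing a permission, and the highest attention weights point at the words that "justify" that permission. The vocabulary and similarity scores below are made up for illustration; the paper's models compute them with recurrent networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# A toy app description and hypothetical similarity scores between each
# word and a query vector for the CAMERA permission.
words = ["scan", "documents", "with", "your", "camera"]
scores = [1.2, 0.4, -0.5, -0.6, 2.1]

weights = softmax(scores)
best = words[max(range(len(words)), key=lambda i: weights[i])]
print(best)  # the word the attention distribution concentrates on
```

If no description word attracts a high weight for a requested permission, that permission has no textual justification, which is exactly the inconsistency signal the paper exploits.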
    • Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

      Hidalgo-Ternero, Carlos Manuel; Pastor, Gloria Corpas (Walter de Gruyter GmbH, 2020-11-25)
      The present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) will be carried out in order to evaluate to what extent gApp can optimise the performance of the two main free open-source NMT systems —Google Translate and DeepL— under the challenge of MWE discontinuity in the Spanish into English directionality. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
    • La tecnología habla-texto como herramienta de documentación para intérpretes: Nuevo método para compilar un corpus ad hoc y extraer terminología a partir de discursos orales en vídeo

      Gaber, Mahmoud; Corpas Pastor, Gloria; Omer, Ahmed (Malaga University, 2020-12-22)
      Although interpreting has not yet benefited from technology as much as its sister field, translation, interest in developing tailor-made solutions for interpreters has risen sharply in recent years. In particular, Automatic Speech Recognition (ASR) is being used as a central component of Computer-Assisted Interpreting (CAI) tools, either bundled or standalone. This study pursues three main aims: (i) to establish the most suitable ASR application for building ad hoc corpora by comparing several ASR tools and assessing their performance; (ii) to use ASR in order to extract terminology from the transcriptions obtained from video-recorded speeches, in this case talks on climate change and adaptation; and (iii) to promote the adoption of ASR as a new documentation tool among interpreters. To the best of our knowledge, this is one of the first studies to explore the possibility of Speech-to-Text (S2T) technology for meeting the preparatory needs of interpreters as regards terminology and background/domain knowledge.
    • BERGAMOT-LATTE submissions for the WMT20 quality estimation shared task

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Chaudhary, Vishrav; Fishel, Mark; Guzmán, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-11-30)
      This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2, focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multilingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.
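      A minimal example of a glass-box indicator of the kind the abstract mentions: with access to the NMT model's token probabilities, the average log-probability of the output tokens serves as an unsupervised proxy for translation quality. The probability values below are toy numbers; the paper's actual indicator set is richer.

```python
import math

def mean_log_prob(token_probs):
    """Average log-probability over output tokens; higher means the NMT
    system was more confident in its own output."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities from two translations.
confident = [0.90, 0.80, 0.95, 0.85]   # fluent, likely-good output
uncertain = [0.30, 0.20, 0.50, 0.10]   # hesitant, likely-poor output

print(mean_log_prob(confident), mean_log_prob(uncertain))
```

No reference translation or trained QE model is needed to rank the second output below the first, which is what makes such indicators attractive as a light-weight, fully unsupervised baseline.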
    • Findings of the WMT 2020 shared task on quality estimation

      Specia, Lucia; Blain, Frédéric; Fomicheva, Marina; Fonseca, Erick; Chaudhary, Vishrav; Guzmán, Francisco; Martins, André FT (Association for Computational Linguistics, 2020-11-30)
      We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.
    • Webometrics: evolution of social media presence of universities

      Sarwar, Raheem; Zia, Afifa; Nawaz, Raheel; Fayoumi, Ayman; Aljohani, Naif Radi; Hassan, Saeed-Ul (Springer Science and Business Media LLC, 2021-01-03)
      This paper addresses the important task of computing the webometrics university ranking and investigating whether there exists a correlation between the webometrics university ranking and the rankings provided by prominent world university rankers such as the QS world university ranking, for the period 2005–2016. However, the webometrics portal provides the required data only for recent years, starting from 2012, which is insufficient for such an investigation. The rest of the required data can be obtained from the Internet Archive. However, existing data extraction tools are incapable of extracting the required data from the Internet Archive, due to an unusual link structure that consists of the web archive link, year, date, and target links. We developed an Internet Archive scraper and extracted the required data for the period 2012–2016. After extracting the data, the webometrics indicators were quantified, and the universities were ranked accordingly. We used correlation coefficients to identify the relationship between the webometrics university ranking computed by us and the original webometrics university ranking, using the Spearman and Pearson correlation measures. Our findings indicate a strong correlation between our ranking and the webometrics university ranking, which shows that the applied methodology can be used to compute the webometrics university ranking for those years for which the ranking is not available, i.e., from 2005 to 2011. We computed the webometrics ranking of the top 30 universities of North America, Europe and Asia for the period 2005–2016. Our findings indicate a positive correlation for North American and European universities, but a weak correlation for Asian universities. This can be explained by the fact that Asian universities did not pay as much attention to their websites as the North American and European universities.
The overall results reveal that North American and European universities rank higher than Asian universities. To the best of our knowledge, such an investigation has been executed here for the first time, and no comparable work has been recorded before.
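      The correlation step the abstract describes can be reproduced in miniature: Spearman correlation between two rankings is just Pearson correlation computed on the rank values. The two five-university rankings below are hypothetical; in the study the inputs are the computed and official webometrics rankings.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson on the rank positions (no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Two hypothetical rankings of five universities (1 = best).
computed = [1, 2, 3, 4, 5]
official = [1, 3, 2, 4, 5]
print(round(spearman(computed, official), 3))
```

A value near 1 means the two rankings order the universities almost identically, which is the evidence the paper uses to validate its reconstruction of the missing 2005–2011 rankings.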
    • Detecting semantic difference: a new model based on knowledge and collocational association

      Taslimipoor, Shiva; Corpas Pastor, Gloria; Rohanian, Omid; Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
      Semantic discrimination among concepts is a daily exercise for humans when using natural languages. For example, given the words airplane and car, the word flying can easily be thought of and used as an attribute to differentiate them. In this study, we propose a novel automatic approach to detect whether an attribute word represents the difference between two given words. We exploit a combination of knowledge-based and co-occurrence features (collocations) to capture the semantic difference between two words in relation to an attribute. The features are scores defined for each pair of words and an attribute, based on association measures, n-gram counts, word similarity, and ConceptNet relations. Based on these features, we designed a system and ran several experiments on a SemEval-2018 dataset. The experimental results indicate that the proposed model performs better than, or at least comparably with, other systems evaluated on the same data for this task.
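      One of the feature families the abstract names, a collocational association measure, can be sketched with pointwise mutual information (PMI): how strongly does the attribute co-occur with each word of the pair? The counts below are toy numbers invented for the airplane/car/flying example; the paper combines several such scores with knowledge-based features.

```python
import math

def pmi(pair_count, w_count, attr_count, total):
    """Pointwise mutual information from raw co-occurrence counts."""
    p_joint = pair_count / total
    p_w, p_a = w_count / total, attr_count / total
    return math.log2(p_joint / (p_w * p_a))

total = 1_000_000
# Hypothetical counts: "flying" co-occurs often with "airplane", rarely with "car".
score_airplane = pmi(pair_count=800, w_count=5_000,  attr_count=9_000, total=total)
score_car      = pmi(pair_count=20,  w_count=20_000, attr_count=9_000, total=total)
print(score_airplane > score_car)
```

A positive PMI for one word and a negative PMI for the other is exactly the asymmetry that signals the attribute discriminates the pair.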
    • Domain adaptation of Thai word segmentation models using stacked ensemble

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2020-11-12)
      Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only the input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and performs comparably to transfer learning.
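      The filter-and-refine idea can be sketched loosely: keep the black-box model's confident boundary decisions and let a small domain-trained refiner re-decide only the uncertain ones. The threshold, the label scheme, and the toy refiner below are illustrative assumptions; the paper's stacked ensemble is more sophisticated.

```python
def filter_and_refine(decisions, refiner, tau=0.9):
    """decisions: list of (position, label, confidence) from the black-box
    base model. Confident predictions pass the filter unchanged; the rest
    are deferred to the domain-specific refiner."""
    out = []
    for pos, label, conf in decisions:
        if conf >= tau:                 # filter: trust confident predictions
            out.append((pos, label))
        else:                           # refine: defer to the domain model
            out.append((pos, refiner(pos)))
    return out

# Toy base-model output (B = word-beginning, I = word-internal) and a toy
# domain refiner that always predicts a word boundary.
base = [(0, "B", 0.97), (1, "I", 0.55), (2, "B", 0.99)]
refiner = lambda pos: "B"
print(filter_and_refine(base, refiner))
```

Only the base model's inputs and outputs are used, which is why the scheme still works when the underlying segmenter is a black box.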
    • Sarcasm target identification with LSTM networks

      Bölücü, Necva; Can, Burcu (IEEE, 2021-01-07)
      The earlier work on sarcastic texts mainly concentrated on detecting whether a given text contains sarcasm. With the spread of cyber-bullying through social media, it has also become essential to identify the target of the sarcasm, in addition to detecting it. In this study, we propose a deep learning model for target identification in sarcastic texts and compare it with other work on English. The results show that our model outperforms the related work on sarcasm target identification.