Now showing items 1-20 of 308

    • Robust fragment-based framework for cross-lingual sentence retrieval

      Trijakwanich, Nattapol; Limkonchotiwat, Peerat; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-12-31)
      Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) CLSR framework to address Out-of- Domain (OOD) CLSR problems. In particular, we improve the sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three base well-known sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance for more than 85% of the cases.
    • Linguistic features evaluation for hadith authenticity through automatic machine learning

      Mohamed, Emad; Sarwar, Raheem (Oxford University Press, 2021-12-31)
      There has not been any research that provides an evaluation of the linguistic features extracted from the matn (text) of a Hadith. Moreover, none of the fairly large corpora are publicly available as a benchmark corpus for Hadith authenticity, and there is a need to build a “gold standard” corpus for good practices in Hadith authentication. We write a scraper in Python programming language and collect a corpus of 3651 authentic prophetic traditions and 3593 fake ones. We process the corpora with morphological segmentation and perform extensive experimental studies using a variety of machine learning algorithms, mainly through Automatic Machine Learning, to distinguish between these two categories. With a feature set including words, morphological segments, characters, top N words, top N segments, function words and several vocabulary richness features, we analyse the results in terms of both prediction and interpretability to explain which features are more characteristic of each class. Many experiments have produced good results and the highest accuracy (i.e., 78.28%) is achieved using word n-grams as features using the Multinomial Naive Bayes classifier. Our extensive experimental studies conclude that, at least for Digital Humanities, feature engineering may still be desirable due to the high interpretability of the features. The corpus and software (scripts) will be made publicly available to other researchers in an effort to promote progress and replicability.
    • A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases

      Parodi, Giovanni; Evans, Richard; Ha, Le An; Mitkov, Ruslan; Julio, Cristóbal; Olivares-López, Raúl Ignacio (Springer, 2021-09-04)
      Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. Our method identifies, for each occurrence, the antecedent of the pronoun (including its grammatical type), the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment, and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to finetune an automatic analyser based on bidirectional encoder representation transformers (BERT). On testing our method, we find that it performs with greater accuracy than three baselines (0.76 for the resolution task), and sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
    • Exploiting tweet sentiments in altmetrics large-scale data

      Hassan, Saeed-Ul; Aljohani, Naif Radi; Iqbal Tarar, Usman; Safder, Iqra; Sarwar, Raheem; Alelyani, Salem; Nawaz, Raheel (SAGE, 2021-12-31)
      This article aims to exploit social exchanges on scientific literature, specifically tweets, to analyse social media users' sentiments towards publications within a research field. First, we employ the SentiStrength tool, extended with newly created lexicon terms, to classify the sentiments of 6,482,260 tweets associated with 1,083,535 publications provided by Altmetric.com. Then, we propose harmonic means-based statistical measures to generate a specialized lexicon, using positive and negative sentiment scores and frequency metrics. Next, we adopt a novel article-level summarization approach to domain-level sentiment analysis to gauge the opinion of social media users on Twitter about the scientific literature. Last, we propose and employ an aspect-based analytical approach to mine users' expressions relating to various aspects of the article, such as tweets on its title, abstract, methodology, conclusion, or results section. We show that research communities exhibit dissimilar sentiments towards their respective fields. The analysis of the field-wise distribution of article aspects shows that in Medicine, Economics, Business & Decision Sciences, tweet aspects are focused on the results section. In contrast, Physics & Astronomy, Materials Sciences, and Computer Science these aspects are focused on the methodology section. Overall, the study helps us to understand the sentiments of online social exchanges of the scientific community on scientific literature. Specifically, such a fine-grained analysis may help research communities in improving their social media exchanges about the scientific articles to disseminate their scientific findings effectively and to further increase their societal impact.
    • SemEval-2021 task 1: Lexical complexity prediction

      Shardlow, Matthew; Evans, Richard; Paetzold, Gustavo Henrique; Zampieri, Marcos (Association for Computational Linguistics, 2021-08-01)
      This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.
    • Natural language processing for mental disorders: an overview

      Calixto, Iacer; Yaneva, Viktoriya; Cardoso, Raphael (CRC Press, 2021-12-31)
    • Remote interpreting in public service settings: Technology, perceptions and practice

      Corpas Pastor, Gloria; Gaber, Mahmoud (The Slovak Association for the Study of English, 2020-12-31)
      Remote interpretation technology is developing extremely fast, enabling affordable and instant access to interpreting services worldwide. This paper focuses on the subjective perceptions of public service interpreters about the psychological and physical impact of using remote interpreting, and the effects on their own performance. To this end, a survey study has been conducted by means of an on-line questionnaire. Both structured and unstructured questions have been used to tap into interpreters’ view on technology, elicit information about perceived effects, and identify pitfalls and prospects.
    • Management of 201 individuals with emotionally unstable personality disorders: A naturalistic observational study in real-world inpatient setting

      Shahpesandy, Homayun; Mohammed-Ali, Rosemary; Oakes, Michael; Al-Kubaisy, Tarik; Cheetham, Anna; Anene, Moses; The Hartsholme Centre, Long Leys Road, Lincoln, LN1 1FS, Lincolnshire NHS Foundation Trust, UK. (Maghira & Maas Publications, 2021-06-03)
      BACKGROUND: Emotionally unstable personality disorder (EUPD) is a challenging condition with a prevalence of 20% in inpatient services. Psychotherapy is the preferred treatment; nevertheless, off-license medications are widely used. OBJECTIVES: To identify socio-demographics, clinical and service-delivery characteristics of people with EUPD admitted to inpatient services between 1st January 2017 and 31st December 2018. METHODS: A retrospective review using data from patients' records. Individuals, age 18-65 were included. Statistical analysis was conducted by the Mann-Whitney-Wilcoxon test and Chi-squared test with Yates's continuity correction. RESULTS: Of 1646 inpatients, 201 (12.2%); had the diagnosis of EUPD; 133 (66.0%) women, 68 (44.0%). EUPD was significantly (P < .001) more prevalent in women (18.2%) than men (7.4%). EUPD patients were significantly (P < .001) younger (32.2 years) than patients without EUPD (46 years), and had significantly (P < .001) more admissions (1.74) than patients without EUPD (1.2 admission). 70.5% of patients had one and 17.0% two Axis-I psychiatric co-morbidities. Substance use was significantly (P < .001) more often in men (57.3%) than in women (28.5%). Significantly (P = 0.047) more women (68.4%) than men (53.0%) reported sexual abuse. 87.5% used polypharmacy. Antidepressants were significantly (P = 0.02) often prescribed to women (76.6%) than men (69.1%). Significantly (P = 0.02) more women (83.5%) than men (67.6%) were on antipsychotics. 57.2% of the patients were on anxiolytics, 40.0% on hypnotics and 25.8% on mood stabilisers. CONCLUSION: EUPD is a complex condition with widespread comorbidity. The term EUPD, Borderline Personality Disorder is unsuitable, stigmatising and too simplistic to reflect the nature, gravity and psychopathology of this syndrome.
    • Urdu AI: writeprints for Urdu authorship identification

      Sarwar, Raheem; Hassan, Saeed-Ul (Association for Computing Machinery, 2021-12-31)
      The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most of the authorship identification studies are focused on English and limited attention has been paid to Urdu. On the other hand, existing Urdu authorship identification solutions drop accuracy as the number of training samples per candidate author reduces, and when the number of candidate author increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space we use an authorship identification solution that transforms each text sample into point set, retrieves candidate text samples, and relies the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than existing studies and conduct several experimental studies which show that our solution can overcome the limitations of existing studies and report an accuracy level of 94.03%, which is higher than all previous authorship identification works.
    • Combining text and images for film age appropriateness classification

      Ha, Le; Mohamed, Emad (Elsevier, 2021-07-14)
      We combine textual information from a corpus of film scripts and the images of important scenes from IMDB that correspond to these films to create a bimodal dataset (the dataset and scripts can be obtained from https://tinyurl.com/se9tlmr) for film age appropriateness classification with the objective of improving the prediction of age appropriateness for parents and children. We use state-of-the art Deep Learning image feature extraction, including DENSENet, ResNet, Inception, and NASNet. We have tested several Machine learning algorithms and have found xgboost to yield the best results. Previously reported classification accuracy, using only textual features, were 79.1% and 65.3% for American MPAA and British BBFC classification respectively. Using images alone, we achieve 64.8% and 56.7% classification accuracy. The most consistent combination of textual features and images’ features achieves 81.1% and 66.8%, both statistically significant improvements over the use of text only.
    • Decálogo de características de la literatura poscolonial: propuesta de una taxonomía para la crítica literaria y los estudios de literatura comparada

      Fernández Ruiz, María Remedios; Corpas Pastor, Gloria; Seghiri, Míriam (Editorial CSIC, 2021-06-22)
      El objetivo de este artículo es ofrecer una propuesta de clasificación de los rasgos presentes, en mayor o menor medida, en la literatura poscolonial en cualquier idioma. A pesar de que esta taxonomía toma como punto de partida definiciones teóricas previas de los conceptos clave relacionados con la literatura poscolonial (Edwards 2008, Nayar 2008 y Ramone 2011), parece ser la primera clasificación formal que se ha elaborado al respecto. De este modo, se analizan conceptos consolidados a la par que presenta la nueva noción de plasticidad de géneros literarios y explora las corrientes actuales en la investigación de la interseccionalidad. Como resultado, proporcionaremos un decálogo de características de la literatura poscolonial que favorecerá la crítica literaria y los estudios de literatura comparada.
    • Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domaingeneric model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called MultiDomain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multiple task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
    • Sentiment analysis for Urdu online reviews using deep learning models

      Safder, Iqra; Mehmood, Zainab; Sarwar, Raheem; Hassan, Saeed-Ul; Zaman, Farooq; Adeel Nawab, Rao Muhammad; Bukhari, Faisal; Ayaz Abbasi, Rabeeh; Alelyani, Salem; Radi Aljohani, Naif; et al. (Wiley, 2021-06-28)
      Most existing studies are focused on popular languages like English, Spanish, Chinese, Japanese, and others, however, limited attention has been paid to Urdu despite having more than 60 million native speakers. In this paper, we develop a deep learning model for the sentiments expressed in this under-resourced language. We develop an open-source corpus of 10,008 reviews from 566 online threads on the topics of sports, food, software, politics, and entertainment. The objectives of this work are bi-fold (1) the creation of a human-annotated corpus for the research of sentiment analysis in Urdu; and (2) measurement of up-to-date model performance using a corpus. For their assessment, we performed binary and ternary classification studies utilizing another model, namely LSTM, RCNN Rule-Based, N-gram, SVM, CNN, and LSTM. The RCNN model surpasses standard models with 84.98 % accuracy for binary classification and 68.56 % accuracy for ternary classification. To facilitate other researchers working in the same domain, we have open-sourced the corpus and code developed for this research.
    • Las tecnologías de interpretación a distancia en los servicios públicos: uso e impacto

      Gaber, Mahmoud; Corpas Pastor, Gloria; Postigo Pinazo, Encarnación (Peter Lang, 2020-02-27)
      This chapter deals with the use of distance interpreting technologies and their impact on public services interpreters. Remote (or distance) interpreting offers a wide range of solutions in order to successfully satisfy the pressing need for languages services in both the public and private sectors. This study focuses on telephone-mediated and video-mediated interpreting, presenting their advantages and disadvantages. We have designed a survey to gather data about the psychological and physiological impact that remote interpreting technologies generate in community interpreters. Our main aim is to ascertain interpreters’ general view on technology, so as to detect deficiencies and suggest ways of improvement. This study is a first contribution in the direction of optimising distance interpreting technologies. Current demand reveals the enormous potential of distance interpreting, its rapid evolution and generalised presence that this modality will have in the future.
    • Introduction

      Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
    • El hablar y el discurso repetido: la fraseología

      Mellado, Carmen; Corpas, Gloria; Berty, Katrin; Loureda, Óscar; Schrott, Angela (De Gruyter, 2021-01-18)
      Este capitulo muestra la interrelacion entre fijacion y variabilidad en las unidades fraseologicas desde distintos puntos de vista. En primer lugar, realizamos un analisis detallado del concepto de «discurso repetido» de Coseriu, que ya considera en su origen la idea de cambio creativo, para despues ofrecer una panoramica de la evolucion de la fraseologia en relacion a la lingilistica textual. En segundo lugar, se presenta una clasificacion de la tipologia de la variacion fraseologica, ilustrada con ejemplos de corpus lingiiisticos y centrada en los niveles del sistema y habla, asi como en la intencionalidad del hablante. En tercer lugar, tratamos el tema de la variabilidad fraseologica y el giro que ha tornado la nocion de «fijacion» desde que se dispone de datos masivos de corpus. En este contexto, las magnitudes de frecuencia absoluta, normalizada y de significacion estadistica desempeiian un papel fundamental para el grado de fijacion.
    • Estrategias heurísticas con corpus para la enseñanza de la fraseología orientada a la traducción

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Seghiri, Miriam (Peter Lang, 2020)
      This work presents a didactic proposal carried out in the subject Lengua y cultura “B” aplicadas a la Traducción e Interpretación (II) – inglés, taught in the first year of the Bache-lor’s Degree in Translation and Interpreting, at the University of Malaga. The main objec-tive of this proposal is to teach the possibilities that both monolingual and bilingual corpora can provide for the correct identification and interpretation of phraseological units with regard to their translation, paying special attention to those cases where the ambiguity of phraseological sequences may lead to multiple interpretations. We will focus on somatisms and will mainly use two Spanish monolingual corpora (CORPES XXI and esEuTenTen), an English monolingual corpus (enTenTen) and two parallel corpora (Europarl and Linguee, more specifically its English-Spanish subcorpus). Against this background, this proposal is divided into several learning activities. After a first seminar where the concepts of corpus, phraseology and translation are introduced, in the learning activity 2 we will use parallel corpora to find translation pairings that contain translation mistakes caused by problems with phraseological ambiguity. Then, in the third learning activity, we will teach some disambiguating elements that will facilitate a correct identification and interpretation of the phraseological unit, in order to be able to convey its pragmatic and semantic weight in the target text. It is in this step where corpora can play a decisive role as documentation tools. Nevertheless, the localisation and interpretation of phraseological units is not problem-free. Given the necessity to develop some techniques that will enable a more effective detection of phraseological units, in the fourth learning activity students will learn an array of heuris-tic strategies to refine their searches in the consulted corpora as well as to select adequate equivalences after a correct interpretation of the results produced by these corpora.
    • Teaching idioms for translation purposes: a trilingual corpus-based glossary applied to phraseodidactics (ES/EN/DE)

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Bautista Zambrada, María Rosario; Martínez, Florentina Mena; Strohschen, Carola (Peter Lang, 2020)
      Phraseology plays a pivotal role in the development of translation competence as well as in translation quality assessment. Thus far, however, there remains a paucity of research on how to best teach idioms for translation purposes. Against such a background, this study aims to shed some light on the multiple applications of phraseodidactics to translation training. We will follow a corpus-based methodology and, for the sake of the argument, the focus will be on somatisms in Spanish, English and German. The overall structure of this paper takes the form of four sections. Section One begins by laying out the theoretical dimensions of phraseology and its convergence with translation. In section two we examine the main components of a corpus-based glossary of somatisms, named Glossomatic, and how it can be employed to establish ad hoc phraseological equivalences in those cases (analysed in section three) where the manipulation of idioms and the absence of one-to-one phraseological correspondence may pose some problems to translation. In this regard, given the importance of accurately conveying the pragmatic, semantic and discursive load of an idiom into a TT and, concomitantly, conveying the manipulation depicted in the ST, section four presents a teaching proposal in which students are prompted with a set of strategies and steps to be implemented with the aid of the glossary in order to solve these issues. Overall, the insights gained from this research will prove useful not only in developing trainees’ phraseological competence but also in giving centre stage to phraseodidactics in Translation Studies.
    • Knowledge distillation for quality estimation

      Gajbhiye, Amit; Fomicheva, Marina; Alva-Manchego, Fernando; Blain, Frederic; Obamuyide, Abiola; Aletras, Nikolaos; Specia, Lucia (Association for Computational Linguistics, 2021-08-01)
      Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      En el presente artículo se analiza un estudio de caso llevado a cabo con estudiantes de la asignatura Traducción General «BA-AB» (II) - Inglés-Español / EspañolInglés, impartida en el segundo semestre del segundo curso del Grado en Traducción e Interpretación de la Universidad de Málaga. En él, en una primera fase, se les enseñó a los estudiantes cómo sacar el máximo partido de diferentes recursos y herramientas documentales electrónicos (corpus lingüísticos, recursos lexicográficos o la web, entre otros) para la creación de equivalencias textuales en aquellos casos en los que, fruto del anisomorfismo fraseológico interlingüe, la modificación creativa de unidades fraseológicas (UF) en el texto origen y la ausencia de correspondencias biunívocas presentan serias dificultades para el proceso traslaticio. De esta manera, a una primera actividad formativa sobre la traducción de usos creativos de unidades fraseológicas le sucede una sesión práctica en la que los alumnos tuvieron que enfrentarse a distintos casos de manipulación en el texto origen. Con el análisis de dichos resultados se podrá vislumbrar en qué medida los distintos recursos documentales ayudan a los traductores en formación a superar el desafío de la manipulación fraseológica