Recent Submissions

  • “I don’t think education is the answer”: a corpus-assisted ecolinguistic analysis of plastics discourses in the UK

    Franklin, Emma; Gavins, Joanna; Mehl, Seth (De Gruyter Mouton, 2022-08-15)
    Ecosystems around the world are becoming engulfed in single-use plastics, the majority of which come from plastic packaging. Reusable plastic packaging systems have been proposed in response to this plastic waste crisis, but uptake of such systems in the UK is still very low. This article draws on a thematic corpus of 5.6 million words of UK English around plastics, packaging, reuse, and recycling to examine consumer attitudes towards plastic (re)use. Utilizing methods and insights from ecolinguistics, corpus linguistics, and cognitive linguistics, this article assesses to what degree consumer language differs from that of public-facing bodies such as supermarkets and government entities. A predefined ecosophy, prioritizing protection, rights, systems thinking, and fairness, is used to not only critically evaluate narratives in plastics discourse but also to recommend strategies for more effective and ecologically beneficial communications around plastics and reuse. This article recommends the adoption of ecosophy in multidisciplinary project teams, and argues that ecosophies are conducive to transparent and reproducible discourse analysis. The analysis also suggests that in order to make meaningful change in packaging reuse behaviors, it is highly likely that deeply ingrained cultural stories around power, rights, and responsibilities will need to be directly challenged.
  • The USMLE® Step 2 clinical skills patient note corpus

    Yaneva, Victoria; Mee, Janet; Ha, Le An; Harik, Polina; Jodoin, Michael; Mechaber, Alex (Association for Computational Linguistics, 2022-07-31)
    This paper presents a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients - people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs were annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at
  • Author gender identification for Urdu

    Sarwar, Raheem (Springer, 2022-09-30)
    In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, researchers have paid little attention to this task for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) concatenating the 600 most frequent multi-word expressions with the 300 most frequent words as features improves the accuracy of the author gender identification task, and (ii) support vector machines outperform the other classifiers, as well as fine-tuned pre-trained language models and CNN. The code base and the corpus can be found at:
  • TurkishDelightNLP: A neural Turkish NLP toolkit

    Aleçakır, Hüseyin; Bölücü, Necva; Can, Burcu (ACL, 2022-07-01)
    We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from the morphological level to the semantic level, covering tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the open-source Turkish NLP toolkit through a web interface that allows an input text to be analysed in real time, as well as the open-source implementation of the components provided in the toolkit, an API, and several annotated datasets, such as a word similarity test set for evaluating word embeddings and a UCCA-based semantic annotation in Turkish. This will be the first open-source Turkish NLP toolkit that covers a range of NLP tasks at all levels. We believe that it will be useful for other researchers in Turkish NLP and will also be beneficial for other high-level NLP tasks in Turkish.
  • Turkish universal conceptual cognitive annotation

    Bölücü, Necva; Can, Burcu; Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; et al. (European Language Resources Association, 2022-06-01)
    Universal Conceptual Cognitive Annotation (UCCA) is a cross-lingual semantic annotation framework that enables straightforward annotation without requiring a linguistic background. UCCA-annotated datasets have already been released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset, which currently contains 50 sentences obtained from the METU-Sabanci Turkish Treebank. We followed a semi-automatic annotation approach, in which an external semantic parser is used to produce an initial annotation of the dataset that is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that are not in line with the UCCA rules that we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments with both zero-shot and few-shot learning. This is the initial version of the annotated dataset and we are currently extending the dataset. We are releasing the current Turkish UCCA annotation guideline along with the annotated dataset.
  • Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech

    Mandl, Thomas; Modha, Sandip; Shahi, Gautam Kishore; Madhu, Hiren; Satapara, Shrey; Majumder, Prasenjit; Schäfer, Johannes; Ranasinghe, Tharindu; Zampieri, Marcos; Nandini, Durgesh; et al. (Association for Computing Machinery, 2021-12-13)
    The HASOC track is dedicated to the evaluation of technology for finding Offensive Language and Hate Speech. HASOC is creating a multilingual data corpus mainly for English and under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized the classification task for English, Hindi, and Marathi. The first task consists of two classification subtasks: Subtask 1A is a binary classification into offensive and non-offensive tweets, and Subtask 1B is a fine-grained classification of tweets into Hate, Profane, and Offensive. Task 2 consists of identifying tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data, and the results obtained from the participants' submissions.
  • Predicting lexical complexity in English texts: the CompLex 2.0 dataset

    Shardlow, Matthew; Evans, Richard; Zampieri, Marcos (Springer, 2022-03-23)
    Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
  • TransQuest: Translation quality estimation with cross-lingual transformers

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (International Committee on Computational Linguistics, 2020-12-31)
    Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.
  • RGCL at SemEval-2020 task 6: Neural approaches to definition extraction

    Ranasinghe, Tharindu; Plum, Alistair; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-12-31)
    This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
  • You are driving me up the wall! A corpus-based study of a special class of resultative constructions

    Corpas Pastor, Gloria (Université Jean Moulin - Lyon 3, 2022-03-26)
    This paper focuses on resultative constructions from a computational and corpus-based approach. We claim that the array of expressions (traditionally classed as idioms, collocations, free word combinations, etc.) that are used to convey a person’s change of mental state (typically negative) are basically instances of the same resultative construction. The first part of the study will introduce basic tenets of Construction Grammar and resultatives. Then, our corpus-based methodology will be spelled out, including a description of the two giga-token corpora used and a detailed account of our protocolised heuristic strategies and tasks. Distributional analysis of matrix slot fillers will be presented next, together with a discussion on restrictions, novel instances, and productivity. A final section will round up our study, with special attention to notions like “idiomaticity”, “productivity” and “variability” of the pairings of form and meaning analysed. To the best of our knowledge, this is one of the first studies based on giga-token corpora that explores idioms as integral parts of higher-order resultative constructions.
  • Multilingual offensive language identification for low-resource languages

    Ranasinghe, Tharindu; Zampieri, Marcos (Association for Computing Machinery, 2021-11-10)
    Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
  • Intelligent translation memory matching and retrieval with sentence encoders

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-11-30)
    Matching and retrieving previously translated segments from a Translation Memory is the key functionality of Translation Memory systems. However, this matching and retrieval process is still limited to algorithms based on edit distance, which we have identified as a major drawback of Translation Memory systems. In this paper we introduce sentence encoders to improve the matching and retrieval process in Translation Memory systems, as an effective and efficient replacement for edit-distance-based algorithms.
  • TransQuest at WMT2020: Sentence-Level direct assessment

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-11-30)
    This paper presents team TransQuest's participation in the Sentence-Level Direct Assessment shared task at WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results, surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine-tune the QE framework by performing ensemble and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.
  • Tuning language representation models for classification of Turkish news

    Tokgöz, Meltem; Turhan, Fatmanur; Bölücü, Necva; Can, Burcu (ACM, 2021-02-19)
    Pre-trained language representation models are very efficient at learning language representations independently of the natural language processing tasks to be performed. Language representation models such as BERT and DistilBERT have achieved strong results in many language understanding tasks. Studies on text classification problems in the literature are generally carried out for the English language. This study aims to classify news in the Turkish language using pre-trained language representation models. We utilize BERT and DistilBERT by tuning both models for the text classification task to learn the categories of Turkish news with different tokenization methods. We provide a quantitative analysis of the performance of BERT and DistilBERT on the Turkish news dataset by comparing the models in terms of their representation capability in the text classification task. The highest performance is obtained with DistilBERT, with an accuracy of 97.4%.
  • Turkish stemming with LSTM networks (LSTM Ağları ile Türkçe Kök Bulma)

    Can, Burcu (Gazi Üniversitesi, 2019-07-31)
    Turkish is an agglutinative language in which words are formed by successively attaching units called morphemes. Building words by combining different parts leads to a sparsity problem in many natural language processing applications, such as machine translation, sentiment analysis, and information extraction, because each different form of a word is treated as a distinct word. In this article, we propose a method for automatically finding the stems of words by stripping their derivational and inflectional suffixes. The method is based on an encoder-decoder approach built with recurrent neural networks. Any given word is first encoded by the neural network architecture we constructed and then decoded to obtain its stem. Until now, this approach has been used for problems such as tagging and machine translation. Compared with other Turkish stemming models, the results are observed to be quite good. As in other models, stemming can be carried out with the proposed model using only a training dataset of word and stem pairs, without any hand-defined set of rules.
  • MLQE-PE: A multilingual quality estimation and post-editing dataset

    Fomicheva, Marina; Sun, Shuo; Fonseca, Erick; Zerva, Chrysoula; Blain, Frédéric; Chaudhary, Vishrav; Guzmán, Francisco; Lopatina, Nina; Specia, Lucia; Martins, André FT (arXiv, 2020-10-11)
    We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains eleven language pairs, with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, the titles of the articles from which the sentences were extracted, and the neural MT models used to translate the text.
  • The Portrait of Dorian Gray: A corpus-based analysis of translated verb + noun (object) collocations in Peninsular and Colombian Spanish

    Valencia Giraldo, M. Victoria; Corpas Pastor, Gloria (Springer, 2019-09-18)
    Corpus-based Translation Studies have promoted research on the features of translated language by focusing on the process and product of translation from a descriptive perspective. Some of these features have been proposed by Toury [31] under the term laws of translation, namely the law of growing standardisation and the law of interference. The law of standardisation appears to be particularly at play in diatopy, and more specifically in the case of transnational languages (e.g. English, Spanish, French, German). In fact, some studies have revealed the tendency to standardise the diatopic varieties of Spanish in translated language [8, 9, 11, 12]. This paper focuses on verb + noun (object) collocations in Spanish translations of The Portrait of Dorian Gray by Oscar Wilde. Two different varieties have been chosen (Peninsular and Colombian Spanish). Our main aim is to establish whether the Colombian Spanish translation actually matches the variety spoken in Colombia or whether it is closer to general or standard Spanish. For this purpose, the techniques used to translate this type of collocation in both Spanish translations will be analysed. Furthermore, the diatopic distribution of these collocations will be studied by means of large corpora.
  • Translating the discourse of medical tourism: A catalogue of resources and corpus for translators and researchers

    Davoust, E; Corpas Pastor, Gloria; Seghiri Domínguez, Miriam (The Slovak Association for the Study of English, 2018-12-18)
    The recent increase in medical tourism in Europe also means that more written content is translated on the web to reach potential clients. Translating the language of cross-border care is challenging because it involves different agents and linguistic fields, which makes it difficult for translators and researchers to grasp fully. We present a catalogue of possible informative resources on medical tourism and an ad hoc corpus based on Spanish medical websites, focused on aesthetics and cosmetics, that were translated into English.
  • Compiling an ad hoc corpus for teaching specialised inverse translation (Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada)

    Corpas Pastor, Gloria (University of Malaga, 2001-12-31)
    This paper explores the present and future possibilities that corpus linguistics offers for Translation Studies, with special reference to teaching. At present, corpus-based research is an essential component of machine translation systems, terminology and concept extraction programs, contrastive studies, and the characterisation of translated language. The two types of corpus most commonly used for these purposes are comparable and parallel corpora. This article, however, takes as its starting point an ad hoc corpus of comparable original texts, serving as a macro-source of documentation for the teaching and professional practice of specialised inverse translation.
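  • Phraseological variation: analysing the performance of monolingual corpora as translation resources (La variación fraseológica: análisis del rendimiento de los corpus monolingües como recursos de traducción)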

    Hidalgo-Ternero, Carlos Manuel; Corpas Pastor, Gloria (Faculty of Arts, Masaryk University, 2021-06-30)
    Idioms tend to vary significantly in discourse (variation, grammatical inflection, discontinuity…). This makes it especially difficult to create appropriate query patterns that retrieve these units in all their shapes and forms while avoiding excessive noise. In this context, this paper analyses the performance of different corpus management systems available for Spanish when searching for phraseological variants such as tener entre manos, traer entre manos and llevar entre manos, as well as ir al pelo and venir al pelo. More specifically, we will examine two corpora created by the Real Academia Española (CREA, in its original and annotated versions, and CORPES XXI), the Corpus del Español by Mark Davies (BYU), and Sketch Engine. The results of our study will shed some light on which corpus management system offers better performance for translators facing the challenge of idiom variation.
