• Teaching idioms for translation purposes: a trilingual corpus-based glossary applied to phraseodidactics (ES/EN/DE)

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Bautista Zambrada, María Rosario; Martínez, Florentina Mena; Strohschen, Carola (Peter Lang, 2020)
      Phraseology plays a pivotal role in the development of translation competence as well as in translation quality assessment. Thus far, however, there remains a paucity of research on how to best teach idioms for translation purposes. Against such a background, this study aims to shed some light on the multiple applications of phraseodidactics to translation training. We will follow a corpus-based methodology and, for the sake of the argument, the focus will be on somatisms in Spanish, English and German. The overall structure of this paper takes the form of four sections. Section One begins by laying out the theoretical dimensions of phraseology and its convergence with translation. In section two we examine the main components of a corpus-based glossary of somatisms, named Glossomatic, and how it can be employed to establish ad hoc phraseological equivalences in those cases (analysed in section three) where the manipulation of idioms and the absence of one-to-one phraseological correspondence may pose some problems to translation. In this regard, given the importance of accurately conveying the pragmatic, semantic and discursive load of an idiom into a TT and, concomitantly, conveying the manipulation depicted in the ST, section four presents a teaching proposal in which students are prompted with a set of strategies and steps to be implemented with the aid of the glossary in order to solve these issues. Overall, the insights gained from this research will prove useful not only in developing trainees’ phraseological competence but also in giving centre stage to phraseodidactics in Translation Studies.
    • Technology solutions for interpreters: the VIP system

      Corpas Pastor, Gloria (Universidad de Valladolid, 2022-01-07)
      Interpreting technologies have abruptly entered the profession in recent years. However, technology still remains a relatively marginal topic of academic debate, although interest in developing tailor-made solutions for interpreters has risen sharply. This paper presents the VIP system, one of the research outputs of the homonymous project VIP - Voice-text Integrated system for interPreters, and its continuation (VIP II). More specifically, a technology-based terminology workflow for simultaneous interpretation is presented.
    • The first Automatic Translation Memory Cleaning Shared Task

      Barbu, Eduard; Parra Escartín, Carla; Bentivogli, Luisa; Negri, Matteo; Turchi, Marco; Orasan, Constantin; Federico, Marcello (Springer, 2017-01-21)
      This paper reports on the organization and results of the first Automatic Translation Memory Cleaning Shared Task. This shared task is aimed at finding automatic ways of cleaning translation memories (TMs) that have not been properly curated and thus include incorrect translations. As a follow up of the shared task, we also conducted two surveys, one targeting the teams participating in the shared task, and the other one targeting professional translators. While the researchers-oriented survey aimed at gathering information about the opinion of participants on the shared task, the translators-oriented survey aimed to better understand what constitutes a good TM unit and inform decisions that will be taken in future editions of the task. In this paper, we report on the process of data preparation and the evaluation of the automatic systems submitted, as well as on the results of the collected surveys.
    • The influence of highly cited papers on field normalised indicators

      Thelwall, Mike (Springer, 2019-01-05)
      Field normalised average citation indicators are widely used to compare countries, universities and research groups. The most common variant, the Mean Normalised Citation Score (MNCS), is known to be sensitive to individual highly cited articles but the extent to which this is true for a log-based alternative, the Mean Normalised Log Citation Score (MNLCS), is unknown. This article investigates country-level highly cited outliers for MNLCS and MNCS for all Scopus articles from 2013 and 2012. The results show that MNLCS is influenced by outliers, as measured by kurtosis, but at a much lower level than MNCS. The largest outliers were affected by the journal classifications, with the Science-Metrix scheme producing much weaker outliers than the internal Scopus scheme. The high Scopus outliers were mainly due to uncitable articles reducing the average in some humanities categories. Although outliers have a numerically small influence on the outcome for individual countries, changing indicator or classification scheme influences the results enough to affect policy conclusions drawn from them. Future field normalised calculations should therefore explicitly address the influence of outliers in their methods and reporting.
    • The research production of nations and departments: A statistical model for the share of publications

      Thelwall, Mike; Fairclough, Ruth (Elsevier, 2017-11-04)
      Policy makers and managers sometimes assess the share of research produced by a group (country, department, institution). This takes the form of the percentage of publications in a journal, field or broad area that has been published by the group. This quantity is affected by essentially random influences that obscure underlying changes over time and differences between groups. A model of research production is needed to help identify whether differences between two shares indicate underlying differences. This article introduces a simple production model for indicators that report the share of the world’s output in a journal or subject category, assuming that every new article has the same probability to be authored by a given group. With this assumption, confidence limits can be calculated for the underlying production capability (i.e., probability to publish). The results of a time series analysis of national contributions to 36 large monodisciplinary journals 1996-2016 are broadly consistent with this hypothesis. Follow up tests of countries and institutions in 26 Scopus subject categories support the conclusions but highlight the importance of ensuring consistent subject category coverage.
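      The model above treats every new article as an independent trial with the same probability of being authored by the group, so an observed share behaves like a binomial proportion. A minimal sketch of confidence limits under that assumption follows. The Wilson score interval is an illustrative choice (the abstract does not say which interval the authors use), and the function name and counts are hypothetical.

```python
import math

def share_confidence_limits(group_papers, total_papers, z=1.96):
    """Wilson score interval for a binomial proportion (illustrative choice)."""
    if total_papers <= 0:
        raise ValueError("total_papers must be positive")
    n = total_papers
    p = group_papers / n                      # observed share
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom     # shrunk towards 0.5 for small n
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: 120 of the 1000 articles in a journal come from one country.
low, high = share_confidence_limits(120, 1000)
print(f"share = {120 / 1000:.3f}, 95% limits = ({low:.3f}, {high:.3f})")
```

      Two shares whose intervals overlap heavily would not, on this reading, be good evidence of an underlying difference in production capability.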
    • Three kinds of semantic resonance

      Hanks, Patrick (Ivane Javakhishvili Tbilisi University Press, 2016-09-06)
      This presentation suggests some reasons why lexicographers of the future will need to pay more attention to phraseology and non-literal meaning. It argues that not only do words have literal meaning, but also that much meaning is non-literal, being lexical, i.e. metaphorical or figurative, experiential, or intertextual.
    • Three practical field normalised alternative indicator formulae for research evaluation

      Thelwall, Mike (Elsevier, 2017-01-04)
      Although altmetrics and other web-based alternative indicators are now commonplace in publishers’ websites, they can be difficult for research evaluators to use because of the time or expense of the data, the need to benchmark in order to assess their values, the high proportion of zeros in some alternative indicators, and the time taken to calculate multiple complex indicators. These problems are addressed here by (a) a field normalisation formula, the Mean Normalised Log-transformed Citation Score (MNLCS) that allows simple confidence limits to be calculated and is similar to a proposal of Lundberg, (b) field normalisation formulae for the proportion of cited articles in a set, the Equalised Mean-based Normalised Proportion Cited (EMNPC) and the Mean-based Normalised Proportion Cited (MNPC), to deal with mostly uncited data sets, (c) a sampling strategy to minimise data collection costs, and (d) free unified software to gather the raw data, implement the sampling strategy, and calculate the indicator formulae and confidence limits. The approach is demonstrated (but not fully tested) by comparing the Scopus citations, Mendeley readers and Wikipedia mentions of research funded by Wellcome, NIH, and MRC in three large fields for 2013–2016. Within the results, statistically significant differences in both citation counts and Mendeley reader counts were found even for sets of articles that were less than six months old. Mendeley reader counts were more precise than Scopus citations for the most recent articles and all three funders could be demonstrated to have an impact in Wikipedia that was significantly above the world average.
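      A rough sketch of a log-transformed field normalisation in the spirit of formula (a): each article's ln(1 + citations) is divided by the world mean of ln(1 + citations) for the same field and year, and the group's score is the average of those ratios. This reading of MNLCS is an assumption based on the indicator's name and the abstract; the data layout and function name are hypothetical, and confidence limits are omitted.

```python
import math
from collections import defaultdict

def mnlcs(group_articles, world_articles):
    """Both arguments are iterables of (field, year, citations) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for field, year, cites in world_articles:
        sums[(field, year)] += math.log(1 + cites)
        counts[(field, year)] += 1
    # World mean of ln(1 + c) per field and year; a mean of zero (all
    # articles uncited) would make the ratio below undefined.
    world_mean = {key: sums[key] / counts[key] for key in sums}

    ratios = [math.log(1 + cites) / world_mean[(field, year)]
              for field, year, cites in group_articles]
    return sum(ratios) / len(ratios)

# Hypothetical data: a score above 1 means above the world average.
world = [("biology", 2015, c) for c in (0, 1, 3, 8, 20)]
group = [("biology", 2015, c) for c in (8, 20)]
print(round(mnlcs(group, world), 3))
```

      The log transformation dampens the effect of individual highly cited articles, which is why a score of this kind is less sensitive to outliers than a plain mean of citation ratios.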
    • Three target document range metrics for university web sites

      Thelwall, Mike; Wilkinson, David (Wiley, 2003)
      Three new metrics are introduced that measure the range of use of a university Web site by its peers through different heuristics for counting links targeted at its pages. All three give results that correlate significantly with the research productivity of the target institution. The directory range model, which is based upon summing the number of distinct directories targeted by each other university, produces the most promising results of any link metric yet. Based upon an analysis of changes between models, it is suggested that range models measure essentially the same quantity as their predecessors but are less susceptible to spurious causes of multiple links and are therefore more robust.
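      One possible reading of the directory range model, as a sketch: for each source university, count the distinct directories on the target site that its links point to, then sum those counts over all source universities. How a "directory" is derived from a URL (here, host plus the path up to the last slash) is an assumption, as are the function name and the sample links.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

def directory_range(links):
    """links: iterable of (source_university, target_url) pairs."""
    dirs_by_source = defaultdict(set)
    for source, url in links:
        parts = urlsplit(url)
        path = parts.path or "/"
        # Strip the file name, keeping only the directory part of the path.
        directory = path if path.endswith("/") else posixpath.dirname(path)
        directory = directory.rstrip("/") or "/"
        dirs_by_source[source].add((parts.netloc.lower(), directory))
    # Repeated links from one source into the same directory count once.
    return sum(len(dirs) for dirs in dirs_by_source.values())

sample_links = [
    ("uni_a", "http://www.target.example/research/a.html"),
    ("uni_a", "http://www.target.example/research/b.html"),  # same directory
    ("uni_a", "http://www.target.example/teaching/"),
    ("uni_b", "http://www.target.example/research/a.html"),
]
print(directory_range(sample_links))
```

      Collapsing links to directories is what makes a count of this kind less sensitive to one source linking many times into the same part of a site.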
    • Toponym detection in the bio-medical domain: A hybrid approach with deep learning

      Plum, Alistair; Ranasinghe, Tharindu; Orăsan, Constantin (RANLP, 2019-09-02)
      This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
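      The second step described above, locating the exact position of a toponym once a sentence has been flagged as containing a location, could look roughly like this. The tiny gazetteer, the longest-match-first policy and the function name are illustrative assumptions, not the authors' implementation.

```python
import re

def find_toponyms(sentence, gazetteer):
    """Return (start, end, text) character spans for gazetteer entries."""
    if not gazetteer:
        return []
    # Try longer names first so "New York City" wins over "New York".
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(sentence)]

gazetteer = {"New York", "New York City", "Lagos"}
sentence = "Samples were collected in New York City and Lagos."
for start, end, text in find_toponyms(sentence, gazetteer):
    print(start, end, text)
```

      Running the matcher only on sentences that the classifier has accepted is one way to limit false positives from ambiguous gazetteer entries.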
    • Transfer learning for Turkish named entity recognition on noisy text

      Kagan Akkaya, E; Can, Burcu (Cambridge University Press (CUP), 2020-01-28)
      In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this regard, we utilize character-level, character n-gram-level, morpheme-level, and orthographic character-level word representations. Since noisy data with NER annotation are scarce for Turkish, we introduce a transfer learning model in order to learn infrequent entity types as an extension to the Bi-LSTM-CRF architecture by incorporating an additional conditional random field (CRF) layer that is trained on a larger (but formal) text and a noisy text simultaneously. This allows us to learn from both formal and informal/noisy text, thus improving the performance of our model further for rarely seen entity types. We experimented on Turkish as a morphologically rich language and English as a relatively morphologically poor language. We obtained an entity-level F1 score of 67.39% on Turkish noisy data and 45.30% on English noisy data, which outperforms the current state-of-the-art models on noisy text. The English scores are lower compared to Turkish scores because of the intense sparsity in the data introduced by the user writing styles. The results prove that using subword information significantly contributes to learning latent features for morphologically rich languages.
    • Translating English verbal collocations into Spanish: On distribution and other relevant differences related to diatopic variation

      Corpas Pastor, Gloria (John Benjamins Publishing Company, 2015-12-21)
      Language varieties should be taken into account in order to enhance fluency and naturalness of translated texts. In this paper we will examine the collocational verbal range for prima-facie translation equivalents of words like decision and dilemma, which in both languages denote the act or process of reaching a resolution after consideration, resolving a question or deciding something. We will be mainly concerned with diatopic variation in Spanish. To this end, we set out to develop a giga-token corpus-based protocol which includes a detailed and reproducible methodology sufficient to detect collocational peculiarities of transnational languages. To our knowledge, this is one of the first observational studies of this kind. The paper is organised as follows. Section 1 introduces some basic issues about the translation of collocations against the background of languages’ anisomorphism. Section 2 provides a feature characterisation of collocations. Section 3 deals with the choice of corpora, corpus tools, nodes and patterns. Section 4 covers the automatic retrieval of the selected verb + noun (object) collocations in general Spanish and the co-existing national varieties. Special attention is paid to comparative results in terms of similarities and mismatches. Section 5 presents conclusions and outlines avenues of further research.
    • Translating the discourse of medical tourism: A catalogue of resources and corpus for translators and researchers

      Davoust, E; Corpas Pastor, Gloria; Seghiri Domínguez, Miriam (The Slovak Association for the Study of English, 2018-12-18)
      The recent increase in medical tourism in Europe also means that more written content is translated on the web to reach potential clients. Translating the language of cross-border care is somewhat challenging because it involves different agents and linguistic fields, which makes it difficult for translators and researchers to grasp fully. We hereby present a catalogue of possible informative resources on medical tourism and an ad hoc corpus based on Spanish medical websites (focused on aesthetics and cosmetics) that were translated into English.
    • Translation quality and productivity: a study on rich morphology languages

      Specia, Lucia; Harris, Kim; Burchardt, Aljoscha; Turchi, Marco; Negri, Matteo; Skadina, Inguna (Asia-Pacific Association for Machine Translation, 2017)
      Specia, L., Blain, F., Harris, K., Burchardt, A. et al. (2017) Translation quality and productivity: a study on rich morphology languages. In, Machine Translation Summit XVI, Vol 1. MT Research Track, Kurohashi, S., and Fung, P., Nagoya, Aichi, Japan: Asia-Pacific Association for Machine Translation, pp. 55-71.
    • Translationese and register variation in English-to-Russian professional translation

      Kunilovskaya, Maria; Corpas Pastor, Gloria; Wang, Vincent; Lim, Lily; Li, Defeng (Springer Singapore, 2021-10-12)
      This study explores the impact of register on the properties of translations. We compare sources, translations and non-translated reference texts to describe the linguistic specificity of translations common and unique between four registers. Our approach includes bottom-up identification of translationese effects that can be used to define translations in relation to contrastive properties of each register. The analysis is based on an extended set of features that reflect morphological, syntactic and text-level characteristics of translations. We also experiment with lexis-based features from n-gram language models estimated on large bodies of originally-authored texts from the included registers. Our parallel corpora are built from published English-to-Russian professional translations of general domain mass-media texts, popular-scientific books, fiction and analytical texts on political and economic news. The number of observations and the data sizes for parallel and reference components are comparable within each register and range from 166 (fiction) to 525 (media) text pairs; from 300,000 to 1 million tokens. Methodologically, the research relies on a series of supervised and unsupervised machine learning techniques, including those that facilitate visual data exploration. We learn a number of text classification models and study their performance to assess our hypotheses. Further on, we analyse the usefulness of the features for these classifications to detect the best translationese indicators in each register. The multivariate analysis via text classification is complemented by univariate statistical analysis which helps to explain the observed deviation of translated registers through a number of translationese effects and detect the features that contribute to them. Our results demonstrate that each register generates a unique form of translationese that can be only partially explained by cross-linguistic factors. Translated registers differ in the amount and type of prevalent translationese. The same translationese tendencies in different registers are manifested through different features. In particular, the notorious shining-through effect is more noticeable in general media texts and news commentary and is less prominent in fiction.
    • TransQuest at WMT2020: Sentence-Level direct assessment

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-11-30)
      This paper presents the team TransQuest's participation in the Sentence-Level Direct Assessment shared task at WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results, surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine-tune the QE framework through ensembling and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.
    • TransQuest: Translation quality estimation with cross-lingual transformers

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (International Committee on Computational Linguistics, 2020-12-31)
      Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.
    • Tree structured Dirichlet processes for hierarchical morphological segmentation

      Can, Burcu; Manandhar, Suresh (MIT Press, 2018-06-25)
      This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies are learned along with the corresponding morphological paradigms simultaneously. Our model is evaluated on Morpho Challenge and shows competitive performance when compared to state-of-the-art unsupervised morphological segmentation systems. Although we apply this model for morphological segmentation, the model itself can also be used for hierarchical clustering of other types of data.
    • Trouble on the road: Finding reasons for commuter stress from tweets

      Gopalakrishna Pillai, Reshmi; Thelwall, Mike; Orasan, Constantin (Association for Computational Linguistics, 2018-11-30)
      Intelligent Transportation Systems could benefit from harnessing social media content to get continuous feedback. In this work, we implement a system to identify reasons for stress in tweets related to traffic using a word vector strategy to select a reason from a predefined list generated by topic modeling and clustering. The proposed system, which performs better than standard machine learning algorithms, could provide inputs to warning systems for commuters in the area and feedback for the authorities.
    • Tuning language representation models for classification of Turkish news

      Tokgöz, Meltem; Turhan, Fatmanur; Bölücü, Necva; Can, Burcu (ACM, 2021-02-19)
      Pre-trained language representation models are very efficient in learning language representation independent from natural language processing tasks to be performed. The language representation models such as BERT and DistilBERT have achieved amazing results in many language understanding tasks. Studies on text classification problems in the literature are generally carried out for the English language. This study aims to classify the news in the Turkish language using pre-trained language representation models. In this study, we utilize BERT and DistilBERT by tuning both models for the text classification task to learn the categories of Turkish news with different tokenization methods. We provide a quantitative analysis of the performance of BERT and DistilBERT on the Turkish news dataset by comparing the models in terms of their representation capability in the text classification task. The highest performance is obtained with DistilBERT with an accuracy of 97.4%.
    • Turkish lexicon expansion by using finite state automata

      Öztürk, Burak; Can, Burcu (Scientific and Technological Research Council of Turkey, 2019-03-22)
      Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts-of-speech (i.e. noun, verb, or adjective) and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms by using only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments are performed on the Turkish language, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
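      To make the idea of generating word forms with a finite-state automaton concrete, here is a toy sketch in which each state is a stem or suffix category and each transition attaches a suffix whose surface form depends on the word so far. It covers only the noun plural and locative suffixes, with two-way vowel harmony and consonant voicing assimilation, and ignores exceptions. The state names, rule set and functions are illustrative assumptions and are far simpler than the model described above.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("pçtkfhsş")

def last_vowel(word):
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")

def plural(word):
    # -lar after a back vowel, -ler after a front vowel.
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")

def locative(word):
    # -da/-de, with d -> t after a voiceless consonant.
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel

# Morphotactics: state -> list of (suffix function, next state).
TRANSITIONS = {
    "NOUN_STEM": [(plural, "PLURAL"), (locative, "CASE")],
    "PLURAL": [(locative, "CASE")],
    "CASE": [],
}

def generate(form, state="NOUN_STEM"):
    """Yield the current form and every form reachable from `state`."""
    yield form
    for apply_suffix, next_state in TRANSITIONS[state]:
        yield from generate(apply_suffix(form), next_state)

for stem in ("ev", "kitap"):
    print(list(generate(stem)))
```

      Even this two-suffix automaton turns each stem into four forms, which hints at how a few thousand stems can expand into a very large lexicon once more suffix categories are added.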