• All that Glitters is not Gold when Translating Phraseological Units

      Corpas Pastor, Gloria; Monti, Johanna; Mitkov, Ruslan; Corpas Pastor, Gloria; Seretan, Violeta (European Association for Machine Translation (EAMT), 2013-09-02)
      Phraseological unit is an umbrella term which covers a wide range of multi-word units (collocations, idioms, proverbs, routine formulae, etc.). Phraseological units (PUs) are pervasive in all languages and exhibit a peculiar combinatorial nature. PUs are usually frequent, cognitively salient, syntactically frozen and/or semantically opaque. Besides, their creative manipulations in discourse can be anything but predictable, straightforward or easy to process. And when it comes to translating, problems multiply exponentially. It goes without saying that cultural differences and linguistic anisomorphisms go hand in hand with issues arising from varying degrees of equivalence at the levels of system and text. No wonder PUs have been considered a pain in the neck within the NLP community. This presentation will focus on contrastive and translational features of phraseological units. It will consist of three parts. As a convenient background, the first part will contrast two similar concepts: multi-word unit (the preferred term within the NLP community) versus phraseological unit (the preferred term in phraseology). The second part will deal with phraseological systems in general, their structure and functioning. Finally, the third part will adopt a contrastive approach, with especial reference to translators’ strategies, procedures and choices. For good or for bad, when it comes to rendering phraseological units, human translation and computer-assisted translation appear to share the same garden path.
    • Computational Phraseology light: automatic translation of multiword expressions without translation resources

      Mitkov, Ruslan (De Gruyter Mouton, 2016-10-01)
      This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language-independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted.
Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise which stipulates that translation equivalents share common words in their contexts and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them as long as the comparable corpora used are of minimal similarity.
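      As a rough illustration of the extraction stage described in this abstract, the sketch below scores verb-noun pairs with two of the named measures, T-score and logDice, using their commonly cited textbook definitions. The function names, the input format and the threshold are assumptions made for this example; the abstract does not state which formula variants or cut-off values the project used, and log-likelihood ratio and Salience are not shown.

```python
import math
from collections import Counter


def association_scores(pairs):
    """Score verb-noun pairs with T-score and logDice.

    `pairs` is a list of (verb, noun) co-occurrences already extracted
    from a corpus. This input format is hypothetical; the extraction
    step itself is not shown.
    """
    pair_freq = Counter(pairs)
    verb_freq = Counter(verb for verb, _ in pairs)
    noun_freq = Counter(noun for _, noun in pairs)
    total = len(pairs)

    scores = {}
    for (verb, noun), observed in pair_freq.items():
        # Expected co-occurrence count if verb and noun were independent.
        expected = verb_freq[verb] * noun_freq[noun] / total
        # T-score: (O - E) / sqrt(O).
        t_score = (observed - expected) / math.sqrt(observed)
        # logDice: 14 + log2(2 * f(v, n) / (f(v) + f(n))).
        log_dice = 14 + math.log2(
            2 * observed / (verb_freq[verb] + noun_freq[noun])
        )
        scores[(verb, noun)] = {"t_score": t_score, "log_dice": log_dice}
    return scores


def select_candidates(scores, measure, threshold):
    """Return, in sorted order, the pairs whose score exceeds the threshold."""
    return sorted(
        pair for pair, values in scores.items() if values[measure] > threshold
    )
```

      Calling `select_candidates(association_scores(pairs), "log_dice", threshold)` would return the pairs treated as MWE candidates under that measure; a suitable threshold has to be tuned per corpus and per measure.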
    • Do online resources give satisfactory answers to questions about meaning and phraseology?

      Hanks, Patrick; Franklin, Emma (Springer, 2019-09-18)
      In this paper we explore some aspects of the differences between printed paper dictionaries and online dictionaries in the ways in which they explain meaning and phraseology. After noting the importance of the lexicon as an inventory of linguistic items and the neglect in both linguistics and lexicography of phraseological aspects of that inventory, we investigate the treatment in online resources of phraseology – in particular, the phrasal verbs wipe out and put down – and we go on to investigate a word, dope, that has undergone some dramatic meaning changes during the 20th century. In the course of discussion, we mention the new availability of corpus evidence and the technique of Corpus Pattern Analysis, which is important for linking phraseology and meaning and distinguishing normal phraseology from rare and unusual phraseology. The online resources that we discuss include Google, the Urban Dictionary (UD), and Wiktionary.
    • Laughing one's head off in Spanish subtitles: a corpus-based study on diatopic variation and its consequences for translation

      Corpas Pastor, Gloria; Mogorrón, Pedro; Martines, Vicent (John Benjamins, 2018-11-08)
      Looking for phraseological information is common practice among translators. When rendering idioms, information is mostly needed to find the appropriate equivalent, but, also, to check usage and diasystemic restrictions. One of the most complex issues in this respect is diatopic variation. English and Spanish are transnational languages that are spoken in several countries around the globe. Cross-variety differences as regards idiomaticity range from the actual choice of phraseological units, to different lexical or grammatical variants, usage preferences and differential distribution. In this respect, translators are severely underequipped as regards information found in dictionaries. While some diatopic marks are generally used to indicate geographical restrictions, not all idioms are clearly identified and very little information is provided about preferences and/or crucial differences that occur when the same idiom is used in various national varieties. In translation, source language textemes usually turn into target language repertoremes, i.e. established units within the target system. Toury’s law of growing standardisation helps explain why translated texts tend to be simpler, more conventional and more prototypical than non-translated texts, among other characteristic features. Provided a substantial part of translational Spanish is composed of textual repertoremes, any source textemes are bound to be ‘dissolved’ into typical ways of expressing in ‘standard’ Spanish. This means filtering source idiomatic diatopy through the ‘neutral, standard sieve’. This paper delves into the rendering into Spanish of the English idiom to laugh one’s head off. After a cursory look at the notions of transnational and translational Spanish(es) in Section 2, Section 3 analyses the translation strategies deployed in a giga-token parallel subcorpus of Spanish-English subtitles.
In Section 4, dictionary and textual equivalents retrieved from the parallel corpus are studied against the background of two sets of synonymous idioms for ‘laughing out loud’ in 19 giga-token comparable subcorpora of Spanish national varieties. Corpas Pastor’s (2015) corpus-based research protocol will be adopted in order to uncover varietal differences, detect diatopic configurations and derive consequences for contrastive studies and translation, as summarised in Section 5. This is the first study, to the best of our knowledge, investigating the translation of to laugh one’s head off and also analysing the Spanish equivalent idioms in national and transnational corpora.
    • Profiling idioms: a sociolexical approach to the study of phraseological patterns

      Moze, Sara; Mohamed, Emad (Springer, 2019-09-18)
      This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociolexical profiling will have important implications for several areas of research, including corpus lexicography, translation, creative writing, forensic linguistics, and natural language processing.