Computational Phraseology light: automatic translation of multiword expressions without translation resources

Hdl Handle:
http://hdl.handle.net/2436/620324
Title:
Computational Phraseology light: automatic translation of multiword expressions without translation resources
Authors:
Mitkov, Ruslan
Abstract:
This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted.
Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise which stipulates that translation equivalents share common words in their contexts and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them as long as the comparable corpora used are of minimal similarity.
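The extraction stage described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract names the association measures but gives no formulas, so the code below uses the commonly cited definitions of T-Score and logDice. The function names, the toy frequency counts and the threshold value are all hypothetical.

```python
import math


def t_score(pair_freq, verb_freq, noun_freq, total_pairs):
    """T-Score: (observed - expected) / sqrt(observed)."""
    expected = verb_freq * noun_freq / total_pairs
    return (pair_freq - expected) / math.sqrt(pair_freq)


def log_dice(pair_freq, verb_freq, noun_freq):
    """logDice: 14 + log2(2 * f(v, n) / (f(v) + f(n)))."""
    return 14 + math.log2(2 * pair_freq / (verb_freq + noun_freq))


def candidate_mwes(pair_counts, verb_freqs, noun_freqs, min_log_dice):
    """Return verb-noun pairs scoring at or above the threshold, best first."""
    scored = []
    for (verb, noun), freq in pair_counts.items():
        score = log_dice(freq, verb_freqs[verb], noun_freqs[noun])
        if score >= min_log_dice:
            scored.append((verb, noun, score))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy counts, invented purely for illustration.
pair_counts = {("take", "advantage"): 40, ("take", "car"): 2}
verb_freqs = {"take": 200}
noun_freqs = {"advantage": 50, "car": 300}

# The threshold of 10.0 is an assumption, not a value from the paper.
candidates = candidate_mwes(pair_counts, verb_freqs, noun_freqs, min_log_dice=10.0)
for verb, noun, score in candidates:
    print(f"{verb} {noun}: logDice = {score:.2f}")
```

With these toy counts only "take advantage" passes the threshold, because "take" and "car" co-occur rarely relative to their individual frequencies.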
Citation:
Yearbook of Phraseology. Volume 7, Issue 1, Pages 149–166, ISSN (Online) 1868-6338, ISSN (Print) 1868-632X, DOI: https://doi.org/10.1515/phras-2016-0008, November 2016
Publisher:
De Gruyter Mouton
Journal:
Yearbook of Phraseology
Issue Date:
Nov-2016
URI:
http://hdl.handle.net/2436/620324
Additional Links:
https://www.degruyter.com/view/j/yop.2016.7.issue-1/phras-2016-0008/phras-2016-0008.xml?format=INT
Type:
Article
Language:
en
ISSN:
1868-6338
Appears in Collections:
Computational Linguistics Group

Full metadata record

DC Field | Value | Language
dc.contributor.author | Mitkov, Ruslan | en
dc.date.accessioned | 2017-01-05T09:37:04Z | -
dc.date.available | 2017-01-05T09:37:04Z | -
dc.date.issued | 2016-11 | -
dc.identifier.citation | Yearbook of Phraseology. Volume 7, Issue 1, Pages 149–166, ISSN (Online) 1868-6338, ISSN (Print) 1868-632X, DOI: https://doi.org/10.1515/phras-2016-0008, November 2016 | en
dc.identifier.issn | 1868-6338 | en
dc.identifier.uri | http://hdl.handle.net/2436/620324 | -
dc.description.abstract | This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted. Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise which stipulates that translation equivalents share common words in their contexts and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them as long as the comparable corpora used are of minimal similarity. | en
dc.language.iso | en | en
dc.publisher | De Gruyter Mouton | en
dc.relation.url | https://www.degruyter.com/view/j/yop.2016.7.issue-1/phras-2016-0008/phras-2016-0008.xml?format=INT | en
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | *
dc.subject | Computational Linguistics | en
dc.subject | Phraseology | en
dc.subject | Multiword Expressions | en
dc.title | Computational Phraseology light: automatic translation of multiword expressions without translation resources | en
dc.type | Article | en
dc.identifier.journal | Yearbook of Phraseology | en
dc.date.accepted | 2016-11 | -
rioxxterms.funder | Internal | en
rioxxterms.identifier.project | UoW051217RM | en
rioxxterms.version | VoR | en
rioxxterms.licenseref.uri | https://creativecommons.org/CC BY-NC-ND 4.0 | en
rioxxterms.licenseref.startdate | 2017-11-07 | en
This item is licensed under a Creative Commons License
All Items in WIRE are protected by copyright, with all rights reserved, unless otherwise indicated.