On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks

Handle:
http://hdl.handle.net/2436/620980
Title:
On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks
Authors:
Vilares, Jesús; Vilares, Manuel; Alonso, Miguel A.; Oakes, Michael P.
Abstract:
The field of Cross-Language Information Retrieval relates techniques close to both the Machine Translation and Information Retrieval fields, although in a context involving characteristics of its own. The present study looks to widen our knowledge about the effectiveness and applicability to that field of non-classical translation mechanisms that work at character n-gram level. For the purpose of this study, an n-gram based system of this type has been developed. This system requires only a bilingual machine-readable dictionary of n-grams, automatically generated from parallel corpora, which serves to translate queries previously n-grammed in the source language. n-Gramming is then used as an approximate string matching technique to perform monolingual text retrieval on the set of n-grammed documents in the target language. The tests for this work have been performed on CLEF collections for seven European languages, taking English as the target language. After an initial tuning phase in order to analyze the most effective way for its application, the results obtained, close to the upper baseline, not only confirm the consistency across languages of this kind of character n-gram based approaches, but also constitute a further proof of their validity and applicability, these not being tied to a given implementation.
Citation:
On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks 2016, 36:136 Computer Speech & Language
Publisher:
Elsevier
Journal:
Computer Speech & Language
Issue Date:
Mar-2016
URI:
http://hdl.handle.net/2436/620980
DOI:
10.1016/j.csl.2015.09.004
Additional Links:
http://linkinghub.elsevier.com/retrieve/pii/S0885230815000935
Type:
Article
Language:
en
ISSN:
08852308
Appears in Collections:
FOSS

Full metadata record

DC Field | Value | Language
dc.contributor.author | Vilares, Jesús | en
dc.contributor.author | Vilares, Manuel | en
dc.contributor.author | Alonso, Miguel A. | en
dc.contributor.author | Oakes, Michael P. | en
dc.date.accessioned | 2017-12-11T12:51:18Z | -
dc.date.available | 2017-12-11T12:51:18Z | -
dc.date.issued | 2016-03 | -
dc.identifier.citation | On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks 2016, 36:136 Computer Speech & Language | en
dc.identifier.issn | 08852308 | -
dc.identifier.doi | 10.1016/j.csl.2015.09.004 | -
dc.identifier.uri | http://hdl.handle.net/2436/620980 | -
dc.description.abstract | The field of Cross-Language Information Retrieval relates techniques close to both the Machine Translation and Information Retrieval fields, although in a context involving characteristics of its own. The present study looks to widen our knowledge about the effectiveness and applicability to that field of non-classical translation mechanisms that work at character n-gram level. For the purpose of this study, an n-gram based system of this type has been developed. This system requires only a bilingual machine-readable dictionary of n-grams, automatically generated from parallel corpora, which serves to translate queries previously n-grammed in the source language. n-Gramming is then used as an approximate string matching technique to perform monolingual text retrieval on the set of n-grammed documents in the target language. The tests for this work have been performed on CLEF collections for seven European languages, taking English as the target language. After an initial tuning phase in order to analyze the most effective way for its application, the results obtained, close to the upper baseline, not only confirm the consistency across languages of this kind of character n-gram based approaches, but also constitute a further proof of their validity and applicability, these not being tied to a given implementation. | en
dc.language.iso | en | en
dc.publisher | Elsevier | en
dc.relation.url | http://linkinghub.elsevier.com/retrieve/pii/S0885230815000935 | en
dc.rights | Archived with thanks to Computer Speech & Language | en
dc.subject | Cross-Language Information Retrieval | en
dc.subject | Character n-grams | en
dc.subject | Alignment algorithms for machine translation | en
dc.title | On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks | en
dc.type | Article | en
dc.identifier.journal | Computer Speech & Language | en
All Items in WIRE are protected by copyright, with all rights reserved, unless otherwise indicated.