Robust fragment-based framework for cross-lingual sentence retrieval
Yih, Scott Wen-tau
Abstract: Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence level to the fragment level. We performed CLSR experiments on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders' performance in more than 85% of the cases.
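The idea of moving retrieval from the sentence level to the fragment level can be illustrated with a minimal sketch. It is not the authors' RFR implementation: the word-window fragmenting rule, the character-trigram `embed` stand-in, and the "average of best fragment matches" scoring are all assumptions made for illustration. The stand-in encoder is monolingual, so a cross-lingual setup would need to swap in a multilingual sentence encoder such as those named in the abstract.

```python
import math
from collections import Counter


def fragments(sentence, size=3, stride=1):
    """Split a sentence into overlapping word windows (hypothetical fragmenting rule)."""
    words = sentence.lower().split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, stride)]


def embed(text):
    """Stand-in encoder: sparse bag of character trigrams (assumption, not a real model)."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def fragment_score(query, candidate):
    """Average, over query fragments, of the best-matching candidate fragment."""
    q_vecs = [embed(f) for f in fragments(query)]
    c_vecs = [embed(f) for f in fragments(candidate)]
    return sum(max(cosine(q, c) for c in c_vecs) for q in q_vecs) / len(q_vecs)


def retrieve(query, candidates):
    """Return the candidate sentence with the highest fragment-level score."""
    return max(candidates, key=lambda s: fragment_score(query, s))


if __name__ == "__main__":
    pool = ["the cat sat on a mat today",
            "stock prices fell sharply on monday"]
    print(retrieve("the cat sat on the mat", pool))
```

Scoring by the best match per fragment means a candidate that shares only part of the query's content can still rank highly, which is one plausible reason fragment-level matching might be more tolerant of noisy or out-of-domain sentences than a single sentence-level vector.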
Citation: Trijakwanich, N., Limkonchotiwat, P., Sarwar, R., Phatthiyaphaibun, W., Chuangsuwanich, E. and Nutanong, S. (2021) Robust fragment-based framework for cross-lingual sentence retrieval. Findings of the Association for Computational Linguistics: EMNLP 2021. Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih (Editors). Association for Computational Linguistics. Pp. 935–944.
Description: © 2021 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher's website: https://aclanthology.org/2021.findings-emnlp.80
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by/4.0/