Handling cross and out-of-domain samples in Thai word segmentation

Limkonchotiwat, Peerat
Phatthiyaphaibun, Wannaphong
Sarwar, Raheem
Chuangsuwanich, Ekapol
Nutanong, Sarana
Abstract
While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
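The abstract reports scores at both the character level and the word level. As a rough illustration of why the two can differ, the sketch below scores a predicted segmentation against a reference using one common convention: character-level F1 over word-end positions and word-level F1 over exact word spans. This is an assumption for illustration only; the paper's exact metric definitions may differ, and the function names here are hypothetical.

```python
from typing import List, Set, Tuple


def word_ends(words: List[str]) -> Set[int]:
    """Character offsets at which each word ends (assumed boundary convention)."""
    ends, pos = set(), 0
    for w in words:
        pos += len(w)
        ends.add(pos)
    return ends


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """(start, end) character offsets of each word."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# A classic ambiguous Thai string: "ตา|กลม" (round eyes) vs "ตาก|ลม" (exposed to wind).
gold = ["ตา", "กลม"]
pred = ["ตาก", "ลม"]

char_f1 = f1(word_ends(pred), word_ends(gold))    # only the final boundary matches
word_f1 = f1(word_spans(pred), word_spans(gold))  # no word span matches exactly
print(f"character-level F1 = {char_f1:.2f}, word-level F1 = {word_f1:.2f}")
```

Under this convention a single misplaced boundary costs one boundary at the character level but invalidates both neighbouring words at the word level, which is in line with the word-level figures in the abstract being lower than the character-level ones.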
Citation
Limkonchotiwat, P., Phatthiyaphaibun, W., Sarwar, R., Chuangsuwanich, E. and Nutanong, S. (2021) Handling cross and out-of-domain samples in Thai word segmentation. In: Zong, C., Xia, F., Li, W. and Navigli, R., (eds.) Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 01-06 Aug 2021, Bangkok, Thailand (virtual conference). Association for Computational Linguistics (ACL), pp. 1003–1016.
Type
Conference contribution
Language
en
Description
© 2021 The Authors. Published by ACL. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://aclanthology.org/2021.findings-acl.86