dc.contributor.author: Limkonchotiwat, Peerat
dc.contributor.author: Phatthiyaphaibun, Wannaphong
dc.contributor.author: Sarwar, Raheem
dc.contributor.author: Chuangsuwanich, Ekapol
dc.contributor.author: Nutanong, Sarana
dc.date.accessioned: 2021-06-23T13:57:58Z
dc.date.available: 2021-06-23T13:57:58Z
dc.date.issued: 2021-08-01
dc.identifier.citation: Limkonchotiwat, P., Phatthiyaphaibun, W., Sarwar, R., Chuangsuwanich, E. and Nutanong, S. (2021) Handling cross and out-of-domain samples in Thai word segmentation. In: Zong, C., Xia, F., Li, W. and Navigli, R., (eds.) Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 01-06 Aug 2021, Bangkok, Thailand (virtual conference). Association for Computational Linguistics (ACL), pp. 1003–1016. [en]
dc.identifier.doi: 10.18653/v1/2021.findings-acl.86
dc.identifier.uri: http://hdl.handle.net/2436/624145
dc.description: © 2021 The Authors. Published by ACL. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://aclanthology.org/2021.findings-acl.86 [en]
dc.description.abstract: While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multiple task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s. [en]
dc.format: application/pdf [en]
dc.language.iso: en [en]
dc.publisher: Association for Computational Linguistics [en]
dc.relation.url: https://2021.aclweb.org/ [en]
dc.subject: Thai [en]
dc.subject: word segmentation [en]
dc.subject: low-resource NLP [en]
dc.title: Handling cross and out-of-domain samples in Thai word segmentation [en]
dc.type: Conference contribution [en]
dc.date.updated: 2021-06-22T14:25:01Z
dc.conference.name: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
dc.conference.location: Bangkok, Thailand
pubs.finish-date: 2021-08-06
pubs.start-date: 2021-08-01
dc.date.accepted: 2021-05-06
rioxxterms.funder: University of Wolverhampton [en]
rioxxterms.identifier.project: UOW23062021RS [en]
rioxxterms.version: VoR [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by/4.0/ [en]
rioxxterms.licenseref.startdate: 2021-08-01 [en]
dc.source.beginpage: 1003
dc.source.endpage: 1016
refterms.dateFCD: 2021-06-23T13:57:04Z
refterms.versionFCD: VoR
refterms.dateFOA: 2021-08-01T00:00:00Z


Files in this item

Name: 2021.findings-acl.86.pdf
Size: 902.5Kb
Format: PDF

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by/4.0/