Incorporating word embeddings in unsupervised morphological segmentation
Issue Date
2020-07-10
Abstract
© The Author(s), 2020. Published by Cambridge University Press. We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
Citation
Üstün, A., & Can, B. (2020). Incorporating word embeddings in unsupervised morphological segmentation. Natural Language Engineering, 1-21. doi:10.1017/S1351324920000406
Publisher
Cambridge University Press (CUP)
Journal
Natural Language Engineering
Type
Journal article
Language
en
Description
This is an accepted manuscript of an article published by Cambridge University Press in Natural Language Engineering on 10/07/2020, available online: https://doi.org/10.1017/S1351324920000406 The accepted version of the publication may differ from the final published version.
ISSN
1351-3249
EISSN
1469-8110
Sponsors
This research was supported by TUBITAK (The Scientific and Technological Research Council of Turkey) with grant number 115E464.
DOI
10.1017/S1351324920000406
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/