dc.contributor.author: Bolucu, Necva
dc.contributor.author: Can, Burcu
dc.date.accessioned: 2020-09-03T15:34:50Z
dc.date.available: 2020-09-03T15:34:50Z
dc.date.issued: 2019-01-25
dc.identifier.citation: Bölücü, N. and Can, B. (2019) Unsupervised joint PoS tagging and stemming for agglutinative languages. ACM Transactions on Asian and Low-Resource Language Information Processing 18 (3): 25. DOI: 10.1145/3292398 [en]
dc.identifier.doi: 10.1145/3292398 [en]
dc.identifier.uri: http://hdl.handle.net/2436/623587
dc.description: This is an accepted manuscript of an article published by Association for Computing Machinery (ACM) in ACM Transactions on Asian and Low-Resource Language Information Processing on 25/01/2019, available online: https://doi.org/10.1145/3292398 The accepted version of the publication may differ from the final published version. [en]
dc.description.abstract: The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages. [en]
dc.description.sponsorship: This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) with the project number EEEAG-115E464. [en]
dc.format: application/pdf [en]
dc.language: en
dc.language.iso: en [en]
dc.publisher: Association for Computing Machinery (ACM) [en]
dc.relation.url: https://dl.acm.org/doi/10.1145/3292398 [en]
dc.subject: unsupervised learning [en]
dc.subject: part-of-speech (PoS) tagging [en]
dc.subject: stemming [en]
dc.subject: joint learning [en]
dc.subject: neural word embeddings [en]
dc.subject: hidden Markov models (HMM) [en]
dc.title: Unsupervised joint PoS tagging and stemming for agglutinative languages [en]
dc.type: Journal article [en]
dc.identifier.eissn: 2375-4702
dc.identifier.journal: ACM Transactions on Asian and Low-Resource Language Information Processing [en]
dc.date.updated: 2020-08-26T08:33:33Z
dc.identifier.articlenumber: 25
dc.date.accepted: 2018-11-01
rioxxterms.funder: TUBITAK [en]
rioxxterms.identifier.project: EEEAG-115E464 [en]
rioxxterms.version: AM [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
rioxxterms.licenseref.startdate: 2020-09-03 [en]
dc.source.volume: 18
dc.source.issue: 3
dc.description.version: Published version
refterms.dateFCD: 2020-09-03T11:08:56Z
refterms.versionFCD: AM
refterms.dateFOA: 2020-09-03T00:00:00Z


Files in this item

Name: Buglalilar_Unsupervised_Joint_ ...
Size: 1.551Mb
Format: PDF

This item appears in the following Collection(s)

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/