Abstract: Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing (NLP): it assigns a syntactic category (such as noun, verb, or adjective) to each word in a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g. dependency parsing) and thereby extract the meaning of the sentence (e.g. semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting, without using any annotated corpora. One of the widely used methods for the tagging problem is the log-linear model. Initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another, fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thereby transfer knowledge between two different unsupervised models to improve PoS tagging results, where the log-linear model benefits from the Bayesian model's expertise. We present results in a fully unsupervised framework for Turkish, a morphologically rich language, and for English, a comparatively morphologically poor language. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
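The cascading idea above can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows one plausible way to seed a log-linear model's emission weights from the tag assignments produced by a separate unsupervised Bayesian model. The sample tag assignments, the smoothing constant `alpha`, and the log-relative-frequency initialization are all illustrative assumptions.

```python
from collections import Counter
import math

# Hypothetical output of stage 1 (the Bayesian model): a list of
# (word, induced tag id) pairs. In the real pipeline these would come
# from running the unsupervised Bayesian tagger over a corpus.
bayesian_tags = [
    ("the", 0), ("dog", 1), ("barks", 2),
    ("the", 0), ("cat", 1), ("sleeps", 2),
]

# Count each (word, tag) pair and each tag under the induced tagging.
pair_counts = Counter(bayesian_tags)
tag_counts = Counter(tag for _, tag in bayesian_tags)
vocab = {word for word, _ in bayesian_tags}
alpha = 0.1  # additive smoothing constant; an assumption, not from the paper

def initial_weight(word, tag):
    """Seed weight: smoothed log relative frequency of word given tag."""
    num = pair_counts[(word, tag)] + alpha
    den = tag_counts[tag] + alpha * len(vocab)
    return math.log(num / den)

# Stage 2 (the log-linear model) starts from these weights instead of a
# random or uniform initialization, then continues its own training.
weights = {(w, t): initial_weight(w, t) for w in vocab for t in tag_counts}
```

Under this initialization, word-tag pairs the Bayesian model assigned frequently start with higher weights than unseen pairs, so the log-linear learner begins its search near the Bayesian model's solution rather than from scratch.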
Citation: Bölücü, N. and Can, B. (2021) A Cascaded Unsupervised Model for PoS Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing. 20(1), Article 17
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Description: This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in March 2021. The accepted version of the publication may differ from the final published version.
License: Except where otherwise noted, this item's license is described at https://creativecommons.org/licenses/by-nc-nd/4.0/