Abstract: With its worldwide use, social media has become a rich data source for natural language processing (NLP) tasks; however, social media data is hard to process due to its informal nature. Text normalization is the task of transforming noisy text into its canonical form, and it generally serves as a preprocessing step for other NLP tasks applied to noisy text. In this study, we apply two approaches to Turkish text normalization: a contextual normalization approach using distributed representations of words, and a sequence-to-sequence normalization approach using neural encoder-decoder models. Since the approaches previously applied to Turkish, as well as to other languages, are mostly rule-based, new rules must be added to the normalization model to detect new error patterns arising from changing language use in social media. In contrast to rule-based approaches, the proposed approaches can normalize different error patterns that change over time by training on a new dataset and updating the normalization model. The proposed methods thus address the language-change dependency of social media text by updating the normalization model without defining new rules.
Citation: Göker, S. and Can, B. (2018). Neural text normalization for Turkish social media. 2018 3rd International Conference on Computer Science and Engineering (UBMK), 20-23 September 2018, Sarajevo, Bosnia-Herzegovina.
Description: This is an accepted manuscript of an article published by IEEE in the 2018 3rd International Conference on Computer Science and Engineering (UBMK) on 10/12/2018, available online: https://ieeexplore.ieee.org/document/8566406. The accepted version of the publication may differ from the final published version.
License: Except where otherwise noted, this item's license is described at https://creativecommons.org/licenses/by-nc-nd/4.0/