Turkish lexicon expansion by using finite state automata

Öztürk, Burak
Can, Burcu
Issue Date
2019-03-22
Abstract
Turkish is an agglutinative language with rich morphology: a single Turkish verb can have thousands of different word forms. As a result, data sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We expand the lexicon by reversing a morphological segmentation system, turning the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted the orthographic features by capturing the phonological operations that apply to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e., noun, verb, or adjective), and suffixes are clustered by their allomorphic features. Using only a few thousand Turkish stems, we generated approximately 1 million word forms with an accuracy of 82.36%, which should help reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
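To make the idea of generating word forms from an automaton concrete, the following is a minimal Python sketch, not the authors' model. It assumes a toy automaton with three hypothetical states (NOUN, PLURAL, CASE), two suffix templates (plural `lAr` and locative `DA`), and two orthographic rules: two-way vowel harmony for `A` and voicing assimilation for `D`. The state names, the suffix inventory, and the functions `realize` and `generate` are all illustrative assumptions.

```python
# Toy sketch: generate Turkish noun forms by walking a small automaton.
# States, suffix templates, and rules are illustrative assumptions only.

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def realize(word, template):
    """Attach a suffix template to word, resolving the abstract symbols.

    'A' becomes 'a' or 'e' according to the last vowel of the word (vowel harmony);
    'D' becomes 't' after a voiceless consonant, otherwise 'd'.
    """
    vowels = BACK_VOWELS | FRONT_VOWELS
    last_vowel = next(ch for ch in reversed(word) if ch in vowels)
    surface = []
    for ch in template:
        if ch == "A":
            surface.append("a" if last_vowel in BACK_VOWELS else "e")
        elif ch == "D":
            surface.append("t" if word[-1] in VOICELESS else "d")
        else:
            surface.append(ch)
    return word + "".join(surface)


# Morphotactics: state -> list of (suffix template, next state).
TRANSITIONS = {
    "NOUN": [("lAr", "PLURAL"), ("DA", "CASE")],
    "PLURAL": [("DA", "CASE")],
    "CASE": [],
}


def generate(word, state="NOUN"):
    """Return the word plus every form reachable from the given state."""
    forms = [word]
    for template, next_state in TRANSITIONS[state]:
        forms.extend(generate(realize(word, template), next_state))
    return forms


print(generate("ev"))     # ['ev', 'evler', 'evlerde', 'evde']
print(generate("kitap"))  # ['kitap', 'kitaplar', 'kitaplarda', 'kitapta']
```

Even this toy automaton produces four forms per stem from two suffixes. A realistic suffix inventory multiplies this quickly, which is consistent with the abstract's point that a few thousand stems can yield a very large lexicon.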
Citation
Öztürk, M. and Can, B. (2019) Turkish lexicon expansion by using finite state automata, Turkish Journal of Electrical Engineering & Computer Sciences, 27, pp. 1012–1027.
Type
Journal article
Language
en
Description
© 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf
ISSN
1300-0632
EISSN
1303-6203