Authors: Öztürk, Burak; Can, Burcu
Dates: 2020-10-12; 2020-10-12; 2019-03-22
Citation: Öztürk, M. and Can, B. (2019) Turkish lexicon expansion by using finite state automata, Turkish Journal of Electrical Engineering & Computer Sciences, 27, pp. 1012–1027.
ISSN: 1300-0632
DOI: 10.3906/elk-1804-10
Handle: http://hdl.handle.net/2436/623708
Rights: © 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf

Abstract: Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon with a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts of speech (i.e. noun, verb, or adjective) and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms by using only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments are performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.

Format: application/pdf
Language: en
Keywords: morphology; lexicon expansion; morphological generation; finite-state automata
Title: Turkish lexicon expansion by using finite state automata
Type: Journal article
ISSN: 1303-6203
Journal: Turkish Journal of Electrical Engineering & Computer Sciences
Date: 2020-10-09
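
Illustrative note: the abstract describes an automaton whose states are stem or suffix categories and whose suffixes surface as different allomorphs. The Python sketch below is one minimal way to picture that idea and is not the authors' implementation. The state names (NOUN, PLURAL, LOCATIVE), the transition table, the suffix templates "lAr" and "DA", and the helper names are all assumptions chosen for illustration; only two-way vowel harmony and d/t voicing assimilation are modelled.

```python
# Toy finite-state generator for Turkish noun forms (illustrative assumption,
# not the automaton from the article).

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")

# Morphotactics: each state lists the (next_state, suffix_template) pairs
# that may follow it. "A" and "D" are placeholders resolved by `realize`.
TRANSITIONS = {
    "NOUN": [("PLURAL", "lAr"), ("LOCATIVE", "DA")],
    "PLURAL": [("LOCATIVE", "DA")],
    "LOCATIVE": [],
}


def last_vowel(word):
    """Return the last vowel of `word`, which drives vowel harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")


def realize(word, template):
    """Attach a suffix template, choosing the allomorph from the word so far."""
    a = "a" if last_vowel(word) in BACK_VOWELS else "e"  # two-way harmony
    d = "t" if word[-1] in VOICELESS else "d"            # voicing assimilation
    return word + template.replace("A", a).replace("D", d)


def generate(word, state="NOUN"):
    """Depth-first walk of the automaton; every state is treated as accepting."""
    forms = [word]
    for next_state, template in TRANSITIONS[state]:
        forms.extend(generate(realize(word, template), next_state))
    return forms


if __name__ == "__main__":
    for stem in ["ev", "kitap", "okul", "göz"]:
        print(stem, "->", generate(stem))
```

Each stem yields four forms in this toy grammar (for example ev, evler, evlerde, evde). Adding more suffix categories and transitions multiplies the number of paths through the automaton, which is how a small stem list can expand into a very large set of word forms.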