Turkish lexicon expansion by using finite state automata
dc.contributor.author | Öztürk, Burak | |
dc.contributor.author | Can, Burcu | |
dc.date.accessioned | 2020-10-12T11:50:18Z | |
dc.date.available | 2020-10-12T11:50:18Z | |
dc.date.issued | 2019-03-22 | |
dc.identifier.citation | Öztürk, B. and Can, B. (2019) Turkish lexicon expansion by using finite state automata, Turkish Journal of Electrical Engineering & Computer Sciences, 27(2), pp. 1012–1027. | en |
dc.identifier.issn | 1300-0632 | en |
dc.identifier.doi | 10.3906/elk-1804-10 | en |
dc.identifier.uri | http://hdl.handle.net/2436/623708 | |
dc.description | © 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf | en |
dc.description.abstract | Turkish is an agglutinative language with rich morphology. A single Turkish verb can have thousands of different word forms. Therefore, data sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We expand the lexicon with a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e. noun, verb, or adjective), and suffixes are clustered by their allomorphic features. Using only a few thousand Turkish stems, we generated approximately 1 million word forms with an accuracy of 82.36%, which will help to reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish. | en |
dc.format | application/pdf | en |
dc.language | en | |
dc.language.iso | en | en |
dc.publisher | Scientific and Technological Research Council of Turkey | en |
dc.relation.url | https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf | en |
dc.subject | morphology | en |
dc.subject | lexicon expansion | en |
dc.subject | morphological generation | en |
dc.subject | finite-state automata | en |
dc.title | Turkish lexicon expansion by using finite state automata | en |
dc.type | Journal article | en |
dc.identifier.eissn | 1303-6203 | |
dc.identifier.journal | Turkish Journal of Electrical Engineering & Computer Sciences | en |
dc.date.updated | 2020-10-09T11:04:11Z | |
dc.date.accepted | 2018-12-10 | |
rioxxterms.funder | Hacettepe University, Ankara | en |
rioxxterms.identifier.project | UOW12102020BC | en |
rioxxterms.version | VoR | en |
rioxxterms.licenseref.uri | https://creativecommons.org/licenses/by/4.0/ | en |
rioxxterms.licenseref.startdate | 2020-10-12 | en |
dc.source.volume | 27 | |
dc.source.issue | 2 | |
dc.source.beginpage | 1012 | |
dc.source.endpage | 1027 | |
dc.description.version | Published version | |
refterms.dateFCD | 2020-10-12T11:49:02Z | |
refterms.versionFCD | VoR | |
refterms.dateFOA | 2020-10-12T11:50:19Z |
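
Illustrative note on the abstract: the idea of generating word forms by walking an automaton whose states are stem or suffix categories can be sketched in a few lines of Python. This is a toy sketch, not the authors' model. The state names (`NOUN`, `PLURAL`, `CASE`), the two abstract suffixes (`lAr`, `DA`), and the helper names are assumptions made for illustration, and only two well-known Turkish phonological operations are covered: front/back vowel harmony and d/t voicing assimilation.

```python
# Toy finite-state generator for Turkish noun forms (illustrative assumption,
# not the model described in the article).

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def last_vowel(word):
    """Return the last vowel of a word; it decides vowel harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")


def realize(word, suffix):
    """Attach an abstract suffix, choosing the allomorph from the word.

    'A' becomes a/e by front/back vowel harmony;
    'D' becomes t after a voiceless consonant, otherwise d.
    """
    a = "a" if last_vowel(word) in BACK_VOWELS else "e"
    d = "t" if word[-1] in VOICELESS else "d"
    return word + suffix.replace("A", a).replace("D", d)


# Morphotactics: state -> list of (abstract suffix, next state).
# Hypothetical, heavily reduced: plural -lAr and locative -DA only.
TRANSITIONS = {
    "NOUN": [("lAr", "PLURAL"), ("DA", "CASE")],
    "PLURAL": [("DA", "CASE")],
    "CASE": [],
}


def generate(stem, state="NOUN"):
    """Return the stem plus every form reachable from the given state."""
    forms = [stem]
    for suffix, next_state in TRANSITIONS[state]:
        forms.extend(generate(realize(stem, suffix), next_state))
    return forms


print(generate("ev"))     # house
print(generate("kitap"))  # book
```

Each path through `TRANSITIONS` yields one word form, so a small set of stems expands into many surface forms once more suffix categories are added to the table.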