Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic
Abstract: While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
Citation: Mohamed, E. and Sayyed, Z. A. (2019) Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic. In Proceedings of Digital Access to Textual Cultural Heritage (DATeCH ’19). ACM, New York, NY, USA, 6 pages.
Description: This is an accepted manuscript of an article published by ACM in DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage in May 2019, available online at https://doi.org/10.1145/3322905.3322927. The accepted version of the publication may differ from the final published version.
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/