Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic
Issue Date
2019-05-31
Abstract
While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
Citation
Mohamed, E. and Sayyed, Z. A. (2019) Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic. In Proceedings of Digital Access to Textual Cultural Heritage (DATeCH ’19). ACM, New York, NY, USA, 6 pages.
Publisher
ACM
Additional Links
https://dl.acm.org/doi/10.1145/3322905.3322927
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by ACM in DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage in May 2019, available online: https://doi.org/10.1145/3322905.3322927. The accepted version of the publication may differ from the final published version.
ISBN
9781450371940
DOI
10.1145/3322905.3322927
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/