Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic
Authors
Mohamed, Emad ; Sayed, Zeeshan
Issue Date
2019-05-31
Abstract
While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
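The abstract reports both an F-macro and an F-micro score for orthography standardization. As a minimal sketch of how those two averages differ, the following computes both from scratch. The label names ("keep", "fix") and the toy data are hypothetical and are not taken from the paper; they only illustrate why macro-averaging, which weights every class equally, can fall below micro-averaging when a rare class is predicted less accurately.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: "keep" = token left as is, "fix" = token standardized.
gold = ["keep"] * 8 + ["fix"] * 2
pred = ["keep"] * 8 + ["keep", "fix"]
macro, micro = f1_scores(gold, pred)
print(f"F-macro = {macro:.3f}, F-micro = {micro:.3f}")
```

In this toy case one of the two rare "fix" items is missed, which pulls the macro average down more than the micro average.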
Citation
Mohamed, E. and Sayyed, Z. A. (2019) Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic. In Proceedings of Digital Access to Textual Cultural Heritage (DATeCH ’19). ACM, New York, NY, USA, 6 pages.
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by ACM in DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage in May 2019, available online: https://doi.org/10.1145/3322905.3322927
The accepted version of the publication may differ from the final published version.
ISBN
9781450371940