
Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic

Mohamed, Emad
Sayyed, Zeeshan
Issue Date
2019-05-31
Abstract
While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
Citation
Mohamed, E. and Sayyed, Z. A. (2019) Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic. In Proceedings of Digital Access to Textual Cultural Heritage (DATeCH ’19). ACM, New York, NY, USA, 6 pages.
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by ACM in DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage in May 2019, available online: https://doi.org/10.1145/3322905.3322927 The accepted version of the publication may differ from the final published version.
ISBN
9781450371940