dc.contributor.author: Mohamed, Emad
dc.contributor.author: Sayyed, Zeeshan
dc.date.accessioned: 2020-03-25T11:41:46Z
dc.date.available: 2020-03-25T11:41:46Z
dc.date.issued: 2019-05-31
dc.identifier.citation: Mohamed, E. and Sayyed, Z. A. (2019) Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic. In Proceedings of Digital Access to Textual Cultural Heritage (DATeCH ’19). ACM, New York, NY, USA, 6 pages. [en]
dc.identifier.isbn: 9781450371940 [en]
dc.identifier.doi: 10.1145/3322905.3322927 [en]
dc.identifier.uri: http://hdl.handle.net/2436/623162
dc.description: This is an accepted manuscript of an article published by ACM in DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage in May 2019, available online: https://doi.org/10.1145/3322905.3322927. The accepted version of the publication may differ from the final published version. [en]
dc.description.abstract: While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation. [en]
dc.format: application/pdf [en]
dc.language.iso: en [en]
dc.publisher: ACM [en]
dc.relation.url: https://dl.acm.org/doi/10.1145/3322905.3322927 [en]
dc.subject: NLP [en]
dc.subject: Machine Learning [en]
dc.subject: Boosting [en]
dc.subject: Morphological analysis [en]
dc.subject: Arabic [en]
dc.title: Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic [en]
dc.type: Conference contribution [en]
dc.date.updated: 2019-06-04T13:30:00Z
dc.conference.name: DATeCH 2019
dc.conference.location: Belgium
pubs.finish-date: 2019-05-10
pubs.start-date: 2019-05-08
dc.date.accepted: 2019-05-10
rioxxterms.funder: University of Wolverhampton
rioxxterms.identifier.project: UOW25032020EM [en]
rioxxterms.version: AM [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
rioxxterms.licenseref.startdate: 2020-05-31 [en]
refterms.dateFCD: 2020-03-25T11:40:35Z
refterms.versionFCD: AM


Files in this item

Name: EmadZeehsanDatech2019.pdf
Embargo: 2020-05-31
Size: 785.4 KB
Format: PDF

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/