Abstract
The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most of the authorship identification studies are focused on English and limited attention has been paid to Urdu. On the other hand, existing Urdu authorship identification solutions drop accuracy as the number of training samples per candidate author reduces, and when the number of candidate author increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space we use an authorship identification solution that transforms each text sample into point set, retrieves candidate text samples, and relies the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than existing studies and conduct several experimental studies which show that our solution can overcome the limitations of existing studies and report an accuracy level of 94.03%, which is higher than all previous authorship identification works.Citation
Sarwar, R., Hassan, S. (2021) UrduAI: writeprints for Urdu authorship identification. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(2), 34, pp. 1–18. https://doi.org/10.1145/3476467Publisher
Association for Computing MachineryJournal
ACM Transactions on Asian and Low-Resource Language Information ProcessingDOI
10.1145/3476467Additional Links
https://dl.acm.org/doi/10.1145/3476467Type
Journal articleLanguage
enDescription
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing on 31/10/2021, available online at: https://doi.org/10.1145/3476467 The accepted version of the publication may differ from the final published version.ISSN
2375-4699ae974a485f413a2113503eed53cd6c53
10.1145/3476467
Scopus Count
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/