Editors: Corpas Pastor, Gloria
Abstract: In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 different well-known classifiers to these features to perform the author gender identification task and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). I conducted extensive experiments, which show that (i) using the 600 most frequent multi-word expressions as features and concatenating them with the 300 most frequent words improves the accuracy of the author gender identification task, and (ii) support vector machines outperform the other classifiers, as well as the fine-tuned pre-trained language models and CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
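The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes whitespace tokenization, uses adjacent word pairs (bigrams) as a simple stand-in for multi-word expressions, and uses relative frequencies as feature values. The function names (`tokenize`, `bigrams`, `top_k`, `vectorize`, `build_features`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization.
    return text.split()


def bigrams(tokens):
    # Assumption: adjacent word pairs stand in for multi-word expressions.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def top_k(items_per_doc, k):
    # The k most frequent items counted over the whole corpus.
    counts = Counter()
    for items in items_per_doc:
        counts.update(items)
    return [item for item, _ in counts.most_common(k)]


def vectorize(items, vocabulary):
    # Relative frequency of each vocabulary item within one document.
    counts = Counter(items)
    total = max(len(items), 1)
    return [counts[v] / total for v in vocabulary]


def build_features(docs, n_mwe=600, n_words=300):
    # Concatenate the multi-word block and the single-word block
    # into one vector per document.
    token_lists = [tokenize(d) for d in docs]
    bigram_lists = [bigrams(t) for t in token_lists]
    mwe_vocab = top_k(bigram_lists, n_mwe)
    word_vocab = top_k(token_lists, n_words)
    features = [
        vectorize(b, mwe_vocab) + vectorize(t, word_vocab)
        for t, b in zip(token_lists, bigram_lists)
    ]
    return features, mwe_vocab, word_vocab


# Toy usage with small vocabularies so the vectors stay readable.
toy_docs = ["a b c a b", "b c d"]
toy_features, toy_mwe_vocab, toy_word_vocab = build_features(
    toy_docs, n_mwe=2, n_words=3
)
```

With the default sizes and a large enough corpus, each vector has 600 + 300 = 900 entries; on a small corpus the vocabularies, and therefore the vectors, are shorter. The resulting vectors could then be passed to a classifier such as a support vector machine.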
Citation: Sarwar, R. (2022) Author gender identification for Urdu articles. Lecture Notes in Computer Science, 13528, pp. 221–235.
Journal: Lecture Notes in Computer Science
Description: This is an accepted manuscript of an article published by Springer in Lecture Notes in Computer Science on 21/09/2022. The accepted version of the publication may differ from the final published version.
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/