Authors
Sarwar, Raheem
Editors
Corpas Pastor, Gloria
Mitkov, Ruslan
Issue Date
2022-09-21
Abstract
In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) using the 600 most frequent multi-word expressions as features and concatenating them with the 300 most frequent words improves the accuracy of author gender identification, and (ii) support vector machines outperform the other classifiers, as well as the fine-tuned pre-trained language models and the CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
Citation
Sarwar, R. (2022) Author gender identification for Urdu articles. Lecture Notes in Computer Science, 13528, pp. 221–235.
Publisher
Springer
Journal
Lecture Notes in Computer Science
Additional Links
https://link.springer.com/chapter/10.1007/978-3-031-15925-1_16
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by Springer in Lecture Notes in Computer Science on 21/09/2022. The accepted version of the publication may differ from the final published version.
ISSN
0302-9743
ISBN
9783031159244
DOI
10.1007/978-3-031-15925-1_16
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/
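
Illustrative sketch
The feature construction described in the abstract (the most frequent words concatenated with the most frequent multi-word expressions, giving one fixed-length vector per article) might look roughly like the following. This is a minimal sketch and not the author's code, which is in the linked repository. Several details are assumptions made only for illustration: whitespace tokenization, adjacent word pairs standing in for multi-word expressions, length-normalized frequencies as feature values, and the hypothetical helper names top_k, bigrams and build_features.

```python
from collections import Counter


def top_k(items_per_doc, k):
    """Return the k most frequent items across all documents."""
    counts = Counter()
    for items in items_per_doc:
        counts.update(items)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def bigrams(tokens):
    """Adjacent word pairs; an assumed stand-in for multi-word expressions."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_features(docs, n_words=300, n_mwes=600):
    """Represent each document as word features followed by MWE features."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    doc_bigrams = [bigrams(tokens) for tokens in tokenized]

    word_vocab = top_k(tokenized, n_words)
    mwe_vocab = top_k(doc_bigrams, n_mwes)

    vectors = []
    for tokens, pairs in zip(tokenized, doc_bigrams):
        word_counts, pair_counts = Counter(tokens), Counter(pairs)
        total = max(len(tokens), 1)  # avoid division by zero on empty docs
        # Assumption: features are frequencies normalized by document length.
        word_part = [word_counts[w] / total for w in word_vocab]
        mwe_part = [pair_counts[m] / total for m in mwe_vocab]
        vectors.append(word_part + mwe_part)  # feature concatenation
    return vectors, word_vocab, mwe_vocab
```

With the default sizes and a corpus large enough to fill both vocabularies, each vector has 300 + 600 = 900 entries, matching the 900-dimensional space mentioned in the abstract. The resulting vectors could then be passed to any standard classifier, such as a support vector machine.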