
Author gender identification for Urdu articles

Sarwar, Raheem
Issue Date
2022-09-21
Abstract
In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) concatenating the 600 most frequent multi-word expressions with the 300 most frequent words as features improves the accuracy of author gender identification, and (ii) support vector machines outperform the other classifiers, as well as the fine-tuned pre-trained language models and CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
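The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation (see the linked repository for that). It assumes whitespace tokenization, uses word bigrams as a stand-in for multi-word expressions, and uses relative frequencies as feature values. The function names `tokenize`, `bigrams`, `top_k` and `build_features` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's tokenizer may differ.
    return text.split()


def bigrams(tokens):
    # Assumption: adjacent word pairs stand in for multi-word expressions.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def top_k(items_per_doc, k):
    # Count items over the whole corpus and keep the k most frequent.
    counts = Counter()
    for items in items_per_doc:
        counts.update(items)
    return [item for item, _ in counts.most_common(k)]


def build_features(docs, n_words=300, n_mwes=600):
    # Returns one vector per document: word features followed by
    # multi-word features (at most n_words + n_mwes dimensions).
    word_docs = [tokenize(d) for d in docs]
    mwe_docs = [bigrams(tokens) for tokens in word_docs]
    vocab_words = top_k(word_docs, n_words)
    vocab_mwes = top_k(mwe_docs, n_mwes)

    vectors = []
    for words, mwes in zip(word_docs, mwe_docs):
        word_counts, mwe_counts = Counter(words), Counter(mwes)
        total_words = max(len(words), 1)  # guard against empty documents
        total_mwes = max(len(mwes), 1)
        word_part = [word_counts[w] / total_words for w in vocab_words]
        mwe_part = [mwe_counts[m] / total_mwes for m in vocab_mwes]
        vectors.append(word_part + mwe_part)  # feature concatenation
    return vectors, vocab_words, vocab_mwes
```

With the default sizes and a sufficiently large corpus, each article maps to a 900-dimensional vector. These vectors could then be passed to any standard classifier, for example a support vector machine such as scikit-learn's `LinearSVC`.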
Citation
Sarwar, R. (2022) Author gender identification for Urdu articles. Lecture Notes in Computer Science, 13528, pp. 221–235.
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by Springer in Lecture Notes in Computer Science on 21/09/2022. The accepted version of the publication may differ from the final published version.
ISSN
0302-9743
ISBN
9783031159244