Native language identification of fluent and advanced non-native writers

Sarwar, Raheem
Rutherford, Attapol T
Hassan, Saeed-Ul
Rakthanmanon, Thanawin
Nutanong, Sarana
Abstract
Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture variations in language-usage patterns within a text sample, and (iii) lack any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages as a result of economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture variations in language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
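The pipeline outlined in the abstract (point-set representation, top-k stylistically similar text samples, probabilistic k-nearest-neighbors vote) can be illustrated with a minimal sketch. It is not the authors' implementation: the function-word list, the chunk size, the modified Hausdorff distance used to compare point sets, and the inverse-distance vote weighting are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumption: a tiny function-word list stands in for a topic-independent
# feature space; the paper's actual features are not given in the abstract.
FUNCTION_WORDS = ["the", "a", "of", "in", "to", "and", "that", "is", "on", "for"]


def to_point_set(text, chunk_size=20):
    """Split a text into word chunks and map each chunk to one point
    (relative function-word frequencies), so a text becomes a point set."""
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    points = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        counts = Counter(chunk)
        points.append([counts[w] / len(chunk) for w in FUNCTION_WORDS])
    return points


def mean_min_distance(a, b):
    """Average, over points in a, of the distance to the closest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def point_set_distance(a, b):
    """Modified Hausdorff distance (an assumed choice): averaging the
    nearest-point distances dampens the influence of outlier chunks."""
    return max(mean_min_distance(a, b), mean_min_distance(b, a))


def predict_native_language(query_text, corpus, k=3):
    """corpus is a list of (text, native_language) pairs.
    Returns a dict mapping each language among the top-k neighbors
    to a probability derived from inverse-distance-weighted votes."""
    query = to_point_set(query_text)
    scored = sorted(
        ((point_set_distance(query, to_point_set(text)), label)
         for text, label in corpus),
        key=lambda pair: pair[0],
    )
    weights = defaultdict(float)
    for distance, label in scored[:k]:
        weights[label] += 1.0 / (distance + 1e-9)  # closer samples count more
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}
```

Calling `predict_native_language(sample, corpus, k=3)` with a list of labeled texts returns a probability per candidate native language among the k closest samples; the language with the highest probability would be taken as the prediction.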
Citation
Sarwar, R., Rutherford, A.T., Hassan, S.U., Rakthanmanon, T. and Nutanong, S. (2020) Native language identification of fluent and advanced non-native writers, ACM Transactions on Asian and Low-Resource Language Information Processing, 19(4), 55. https://doi.org/10.1145/3383202
Type
Journal article
Language
en
Description
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202 The accepted version of the publication may differ from the final published version.
ISSN
2375-4699
EISSN
2375-4702
Sponsors
Research funded by the Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (#MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175).