Native language identification of fluent and advanced non-native writers
dc.contributor.author | Sarwar, Raheem | |
dc.contributor.author | Rutherford, Attapol T | |
dc.contributor.author | Hassan, Saeed-Ul | |
dc.contributor.author | Rakthanmanon, Thanawin | |
dc.contributor.author | Nutanong, Sarana | |
dc.date.accessioned | 2020-10-13T08:49:28Z | |
dc.date.available | 2020-10-13T08:49:28Z | |
dc.date.issued | 2020-04-30 | |
dc.identifier.citation | Sarwar, R., Rutherford, A.T., Hassan, S.U., Rakthanmanon, T. and Nutanong, S. (2020) Native language identification of fluent and advanced non-native writers, ACM Transactions on Asian and Low-Resource Language Information Processing, 19(4), 55. https://doi.org/10.1145/3383202 | en |
dc.identifier.issn | 2375-4699 | en |
dc.identifier.doi | 10.1145/3383202 | en |
dc.identifier.uri | http://hdl.handle.net/2436/623710 | |
dc.description | This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202. The accepted version of the publication may differ from the final published version. | en |
dc.description.abstract | Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language usage patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages. | en |
dc.description.sponsorship | Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University; Digital Economy Promotion Agency (#MP-62-0003); Thailand Research Funds (MRG6180266 and MRG6280175). | en |
dc.format | application/pdf | en |
dc.language | en | |
dc.language.iso | en | en |
dc.publisher | Association for Computing Machinery (ACM) | en |
dc.relation.url | https://dl.acm.org/doi/10.1145/3383202 | en |
dc.subject | Stylometry | en |
dc.subject | forensic investigation | en |
dc.subject | native language identification | en |
dc.subject | author profiling | en |
dc.subject | text classification | en |
dc.title | Native language identification of fluent and advanced non-native writers | en |
dc.type | Journal article | en |
dc.identifier.eissn | 2375-4702 | |
dc.identifier.journal | ACM Transactions on Asian and Low-Resource Language Information Processing | en |
dc.date.updated | 2020-10-07T17:39:23Z | |
dc.date.accepted | 2020-02-10 | |
rioxxterms.funder | DEPA | en |
rioxxterms.identifier.project | MP-62-0003 | en |
rioxxterms.version | AM | en |
rioxxterms.licenseref.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | en |
rioxxterms.licenseref.startdate | 2020-10-13 | en |
dc.source.volume | 19 | |
dc.source.issue | 4 | |
dc.source.beginpage | 1 | |
dc.source.endpage | 19 | |
dc.description.version | Published version | |
refterms.dateFCD | 2020-10-13T08:39:41Z | |
refterms.versionFCD | AM | |
refterms.dateFOA | 2020-10-13T08:49:29Z |