
dc.contributor.author: Sarwar, Raheem
dc.contributor.author: Rutherford, Attapol T
dc.contributor.author: Hassan, Saeed-Ul
dc.contributor.author: Rakthanmanon, Thanawin
dc.contributor.author: Nutanong, Sarana
dc.date.accessioned: 2020-10-13T08:49:28Z
dc.date.available: 2020-10-13T08:49:28Z
dc.date.issued: 2020-04-30
dc.identifier.citation: Sarwar, R., Rutherford, A.T., Hassan, S.U., Rakthanmanon, T. and Nutanong, S. (2020) Native language identification of fluent and advanced non-native writers, ACM Transactions on Asian and Low-Resource Language Information Processing, 19(4), 55. https://doi.org/10.1145/3383202 [en]
dc.identifier.issn: 2375-4699 [en]
dc.identifier.doi: 10.1145/3383202 [en]
dc.identifier.uri: http://hdl.handle.net/2436/623710
dc.description: This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202 The accepted version of the publication may differ from the final published version. [en]
dc.description.abstract: Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require the learner corpora. This article performs NLI in a challenging context of the user-generated-content (UGC) where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on the content-specific/social-network features and may not be generalizable to other domains and datasets, (ii) are unable to capture the variations of the language-usage-patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since there is a sizable number of people who have acquired non-English second languages due to the economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of the language-usage-patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k nearest neighbors’ classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora where each corpus is written in a different language, namely, English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages. [en]
dc.description.sponsorship: Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (# MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175). [en]
dc.format: application/pdf [en]
dc.language: en
dc.language.iso: en [en]
dc.publisher: Association for Computing Machinery (ACM) [en]
dc.relation.url: https://dl.acm.org/doi/10.1145/3383202 [en]
dc.subject: Stylometry [en]
dc.subject: forensic investigation [en]
dc.subject: native language identification [en]
dc.subject: author profiling [en]
dc.subject: stylometry [en]
dc.subject: text classification [en]
dc.title: Native language identification of fluent and advanced non-native writers [en]
dc.type: Journal article [en]
dc.identifier.eissn: 2375-4702
dc.identifier.journal: ACM Transactions on Asian and Low-Resource Language Information Processing [en]
dc.date.updated: 2020-10-07T17:39:23Z
dc.date.accepted: 2020-02-10
rioxxterms.funder: DEPA [en]
rioxxterms.identifier.project: MP-62-0003 [en]
rioxxterms.version: AM [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
rioxxterms.licenseref.startdate: 2020-10-13 [en]
dc.source.volume: 19
dc.source.issue: 4
dc.source.beginpage: 1
dc.source.endpage: 19
dc.description.version: Published version
refterms.dateFCD: 2020-10-13T08:39:41Z
refterms.versionFCD: AM
refterms.dateFOA: 2020-10-13T08:49:29Z


Files in this item

Name: Sarwar_et_al_Native_language_i ...
Size: 1.302Mb
Format: PDF


Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/