Show simple item record

dc.contributor.author: Sarwar, R
dc.contributor.author: Li, Q
dc.contributor.author: Rakthanmanon, T
dc.contributor.author: Nutanong, S
dc.date.accessioned: 2020-10-09T11:13:33Z
dc.date.available: 2020-10-09T11:13:33Z
dc.date.issued: 2018-07-10
dc.identifier.citation: Sarwar, R., Li, Q., Rakthanmanon, T. and Nutanong, S. (2018) A scalable framework for cross-lingual authorship identification, Information Sciences, 465, pp. 323-339. [en]
dc.identifier.issn: 0020-0255 [en]
dc.identifier.doi: 10.1016/j.ins.2018.07.009 [en]
dc.identifier.uri: http://hdl.handle.net/2436/623702
dc.description: This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009 The accepted version of the publication may differ from the final published version. [en]
dc.description.abstract: © 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments where each fragment is further decomposed into fixed size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources. [en]
dc.format: application/pdf [en]
dc.language: en
dc.language.iso: en [en]
dc.publisher: Elsevier [en]
dc.relation.url: https://www.sciencedirect.com/science/article/pii/S0020025518305231?via%3Dihub#! [en]
dc.subject: similarity search [en]
dc.subject: authorship identification [en]
dc.subject: Writeprint [en]
dc.subject: stylometric features [en]
dc.subject: cyber forensic [en]
dc.subject: cross-lingual [en]
dc.title: A scalable framework for cross-lingual authorship identification [en]
dc.type: Journal article [en]
dc.identifier.eissn: 1872-6291
dc.identifier.journal: Information Sciences [en]
dc.date.updated: 2020-10-07T18:12:33Z
dc.date.accepted: 2018-07-07
rioxxterms.funder: City University of Hong Kong [en]
rioxxterms.identifier.project: UOW09102020RS [en]
rioxxterms.version: AM [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
rioxxterms.licenseref.startdate: 2020-10-09 [en]
dc.source.volume: 465
dc.source.beginpage: 323
dc.source.endpage: 339
dc.description.version: Published version
refterms.dateFCD: 2020-10-09T11:11:00Z
refterms.versionFCD: AM
refterms.dateFOA: 2020-10-09T11:13:33Z
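
The abstract states that documents are partitioned into fragments, and each fragment is further decomposed into fixed size chunks. The following is a minimal illustrative sketch of such a two-level partition, not the authors' implementation. It assumes whitespace tokenization and sizes measured in tokens; the abstract specifies neither the unit nor the sizes, and the names `split_fixed`, `fragment_and_chunk`, `fragment_size` and `chunk_size` are hypothetical.

```python
from typing import List


def split_fixed(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive pieces of `size` items.

    The last piece may be shorter when the length is not a multiple of `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def fragment_and_chunk(text: str, fragment_size: int, chunk_size: int) -> List[List[List[str]]]:
    """Partition a document into fragments, then each fragment into chunks.

    Assumption: tokens are whitespace-separated words, and both sizes
    are counted in tokens. The source abstract does not fix these choices.
    """
    tokens = text.split()
    fragments = split_fixed(tokens, fragment_size)
    return [split_fixed(fragment, chunk_size) for fragment in fragments]


# Ten tokens, fragments of four tokens, chunks of two tokens.
parts = fragment_and_chunk("a b c d e f g h i j", fragment_size=4, chunk_size=2)
for index, fragment in enumerate(parts):
    print(index, fragment)
```

In this sketch a trailing fragment or chunk shorter than the requested size is kept rather than discarded; whether short remainders should be kept, padded, or dropped is a design choice the abstract does not address.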


Files in this item

Name: Publisher version
Name: Sarwar_et_al_A_scalable_framew ...
Size: 1.216Mb
Format: PDF

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/