• Sarcasm target identification with LSTM networks

      Bölücü, Necva; Can, Burcu (IEEE, 2021-01-07)
      The earlier work on sarcastic texts mainly concentrated on detecting whether a given text is sarcastic. With the spread of cyber-bullying through the use of social media, it has also become essential to identify the target of the sarcasm in addition to detecting the sarcasm itself. In this study, we propose a deep learning model for target identification in sarcastic texts and compare it with similar work on English. The results show that our model outperforms the related work on sarcasm target identification. (The entry also carries a Turkish-language version of this abstract, which states the same findings.)
    • A scalable framework for cross-lingual authorship identification

      Sarwar, R; Li, Q; Rakthanmanon, T; Nutanong, S (Elsevier, 2018-07-10)
      Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments where each fragment is further decomposed into fixed size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
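      The fragment-and-chunk decomposition described in the abstract can be sketched as follows. This is a minimal illustration of the partitioning step only; the function name and parameters are hypothetical, and the paper's actual feature extraction over the chunks is not reproduced here.

```python
def decompose(tokens, fragment_size, chunk_size):
    """Split a token stream into fragments, then each fragment into
    fixed-size chunks (a sketch of the partitioning scheme described
    in the abstract; names and sizes are illustrative)."""
    fragments = [tokens[i:i + fragment_size]
                 for i in range(0, len(tokens), fragment_size)]
    return [[frag[j:j + chunk_size] for j in range(0, len(frag), chunk_size)]
            for frag in fragments]
```

      Each chunk would then be represented by language-independent stylometric features before comparison against candidate authors.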
    • A scalable framework for stylometric analysis of multi-author documents

      Sarwar, Raheem; Yu, Chenyun; Nutanong, Sarana; Urailertprasert, Norawit; Vannaboot, Nattapol; Rakthanmanon, Thanawin; Pei, Jian; Manolopoulos, Yannis; Sadiq, Shazia W; Li, Jianxin (Springer, 2018-05-13)
      Stylometry is a statistical technique used to analyze variations in authors' writing styles and is typically applied to authorship attribution problems. In this investigation, we apply stylometry to the authorship identification of multi-author documents (AIMD) task. We propose an AIMD technique called Co-Authorship Graph (CAG) which can be used to collaboratively attribute different portions of documents to different authors belonging to the same community. Based on CAG, we propose a novel AIMD solution which (i) significantly outperforms the existing state-of-the-art solution; (ii) can effectively handle a larger number of co-authors; and (iii) is capable of handling the case when some of the listed co-authors have not contributed to the document as a writer. We conducted an extensive experimental study to compare the proposed solution and the best existing AIMD method using real and synthetic datasets. We show that the proposed solution significantly outperforms the existing state-of-the-art method.
    • A scalable framework for stylometric analysis query processing

      Nutanong, Sarana; Yu, Chenyun; Sarwar, Raheem; Xu, Peter; Chow, Dickson (IEEE, 2017-02-02)
      Stylometry is the statistical analysis of variations in the author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution by modeling authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with a high accuracy.
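      The abstract frames authorship attribution as a set similarity problem. As a minimal sketch of that framing, the following attributes a query document's feature set to the candidate author with the most similar profile; Jaccard similarity is used here purely as an illustrative set-similarity measure, and the function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def most_similar_author(query_features: set, author_profiles: dict) -> str:
    """Attribute the query to the author whose feature set is most similar
    (a sketch of set-similarity attribution; the paper's exact measure
    and indexing scheme are not reproduced here)."""
    return max(author_profiles,
               key=lambda name: jaccard(query_features, author_profiles[name]))
```

      In a scalable system, the pairwise comparison would be replaced by a similarity-search index rather than the linear scan shown here.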
    • Scientific web intelligence: finding relationships in university webs

      Thelwall, Mike (ACM, 2005)
      Methods for analyzing university Web sites demonstrate strong patterns that can reveal interconnections between research fields.
    • Semantic textual similarity with siamese neural networks

      Orasan, Constantin; Mitkov, Ruslan; Ranasinghe, Tharindu (RANLP, 2019-09-02)
      Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural network, which are used here to measure STS. Several variants of the architecture are compared with existing methods.
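      The defining property of a Siamese architecture is that the same encoder (shared weights) is applied to both inputs, with a similarity head on top. The sketch below illustrates only that structure: the paper's encoder is recurrent, whereas here a toy bag-of-words encoder stands in, and all names are illustrative.

```python
import math

def encode(sentence, vocab):
    """Toy shared encoder: a bag-of-words vector over a fixed vocabulary.
    In the paper this role is played by a recurrent network."""
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sts_score(s1, s2, vocab):
    """Siamese idea: one encoder applied to both inputs, then a similarity head."""
    return cosine(encode(s1, vocab), encode(s2, vocab))
```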
    • SemEval-2021 task 1: Lexical complexity prediction

      Shardlow, Matthew; Evans, Richard; Paetzold, Gustavo Henrique; Zampieri, Marcos (Association for Computational Linguistics, 2021-08-01)
      This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.
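      A common way to turn five-point Likert annotations into a continuous prediction target is to rescale each rating to [0, 1] and average across annotators; the sketch below shows that aggregation under the assumption that this is how the corpus labels are derived (the function name is illustrative).

```python
def complexity_score(ratings):
    """Map 5-point Likert ratings (1..5) to [0, 1] and average across
    annotators, giving a continuous complexity label for one target word."""
    scaled = [(r - 1) / 4 for r in ratings]
    return sum(scaled) / len(scaled)
```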
    • Sentence simplification for semantic role labelling and information extraction

      Evans, Richard; Orasan, Constantin (RANLP, 2019-09-02)
      In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.
    • Sentiment analysis for Urdu online reviews using deep learning models

      Safder, Iqra; Mehmood, Zainab; Sarwar, Raheem; Hassan, Saeed-Ul; Zaman, Farooq; Adeel Nawab, Rao Muhammad; Bukhari, Faisal; Ayaz Abbasi, Rabeeh; Alelyani, Salem; Radi Aljohani, Naif; et al. (Wiley, 2021-06-28)
      Most existing studies are focused on popular languages like English, Spanish, Chinese, and Japanese; however, limited attention has been paid to Urdu despite it having more than 60 million native speakers. In this paper, we develop a deep learning model for the sentiments expressed in this under-resourced language. We develop an open-source corpus of 10,008 reviews from 566 online threads on the topics of sports, food, software, politics, and entertainment. The objectives of this work are two-fold: (1) the creation of a human-annotated corpus for research on sentiment analysis in Urdu; and (2) measurement of up-to-date model performance using this corpus. For this assessment, we performed binary and ternary classification studies utilizing several models, namely rule-based, N-gram, SVM, CNN, LSTM, and RCNN. The RCNN model surpasses the standard models with 84.98% accuracy for binary classification and 68.56% accuracy for ternary classification. To facilitate other researchers working in the same domain, we have open-sourced the corpus and code developed for this research.
    • A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases

      Parodi, Giovanni; Evans, Richard; Ha, Le An; Mitkov, Ruslan; Julio, Cristóbal; Olivares-López, Raúl Ignacio (Springer, 2021-09-04)
      Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. Our method identifies, for each occurrence, the antecedent of the pronoun (including its grammatical type), the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment, and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to finetune an automatic analyser based on bidirectional encoder representation transformers (BERT). On testing our method, we find that it performs with greater accuracy than three baselines (0.76 for the resolution task), and sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
    • SHEF-NN: translation quality estimation with neural networks

      Shah, Kashif; Logacheva, Varvara; Paetzold, G; Blain, Frederic; Beck, Daniel; Bougares, Fethi; Specia, Lucia (Association for Computational Linguistics, 2015-09-30)
      We describe our systems for Tasks 1 and 2 of the WMT15 Shared Task on Quality Estimation. Our submissions use (i) a continuous space language model to extract additional features for Task 1 (SHEF-GP, SHEF-SVM), (ii) a continuous bag-of-words model to produce word embeddings as features for Task 2 (SHEF-W2V) and (iii) a combination of features produced by QuEst++ and a feature produced with word embedding models (SHEF-QuEst++). Our systems outperform the baseline as well as many other submissions. The results are especially encouraging for Task 2, where our best performing system (SHEF-W2V) only uses features learned in an unsupervised fashion.
    • Sheffield submissions for the WMT18 quality estimation shared task

      Ive, Julia; Scarton, Carolina; Blain, Frédéric; Specia, Lucia (Association for Computational Linguistics, 2018-10)
      In this paper we present the University of Sheffield submissions for the WMT18 Quality Estimation shared task. We discuss our submissions to all four sub-tasks, where ours is the only team to participate in all language pairs and variations (37 combinations). Our systems show competitive results and outperform the baseline in nearly all cases.
    • Sheffield systems for the English-Romanian translation task

      Blain, Frédéric; Song, Xingyi; Specia, Lucia (Association for Computational Linguistics, 2016-08)
    • She’s Reddit: A source of statistically significant gendered interest information

      Thelwall, Mike; Stuart, Emma (Elsevier, 2018-10-19)
      Information about gender differences in interests is necessary to disentangle the effects of discrimination and choice when gender inequalities occur, such as in employment. This article assesses gender differences in interests within the popular social news and entertainment site Reddit. A method to detect terms that are statistically significantly used more by males or females in 181 million comments in 100 subreddits shows that gender affects both the selection of subreddits and activities within most of them. The method avoids the hidden gender biases of topic modelling for this task. Although the method reveals statistically significant gender differences in interests for topics that are extensively discussed on Reddit, it cannot give definitive causes, and imitation and sharing within the site mean that additional checking is needed to verify the results. Nevertheless, with care, Reddit can serve as a useful source of insights into gender differences in interests.
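      The core of the method is testing whether a term's usage rate differs significantly between male and female commenters. The abstract does not name the exact test, so the two-proportion z-test below is only one standard choice for this kind of comparison; the function name and inputs are illustrative.

```python
import math

def gender_term_z(male_uses, male_total, female_uses, female_total):
    """Two-proportion z statistic for a term's usage rate by gender.
    |z| > 1.96 indicates a difference significant at the 5% level
    (before any correction for testing many terms)."""
    p1 = male_uses / male_total
    p2 = female_uses / female_total
    p = (male_uses + female_uses) / (male_total + female_total)
    se = math.sqrt(p * (1 - p) * (1 / male_total + 1 / female_total))
    return (p1 - p2) / se if se else 0.0
```

      At the scale of 181 million comments, a multiple-comparison correction across the tested vocabulary would also be needed to control false positives.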
    • Should citations be counted separately from each originating section?

      Thelwall, Mike (Elsevier, 2019-04-03)
      Articles are cited for different purposes and differentiating between reasons when counting citations may therefore give finer-grained citation count information. Although identifying and aggregating the individual reasons for each citation may be impractical, recording the number of citations that originate from different article sections might illuminate the general reasons behind a citation count (e.g., 110 citations = 10 Introduction citations + 100 Methods citations). To help investigate whether this could be a practical and universal solution, this article compares 19 million citations with DOIs from six different standard sections in 799,055 PubMed Central open access articles across 21 out of 22 fields. There are apparently non-systematic differences between fields in the most citing sections and the extent to which citations from one section overlap with citations from another, with some degree of overlap in most cases. Thus, at a science-wide level, section headings are partly unreliable indicators of citation context, even if they are more standard within individual fields. They may still be used within fields to help identify individual highly cited articles that have had one type of impact, especially methodological (Methods) or context setting (Introduction), but expert judgement is needed to validate the results.
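      The proposed bookkeeping (e.g., 110 citations = 10 Introduction citations + 100 Methods citations) amounts to aggregating each citation by its originating section, as in this minimal sketch (function and field names are illustrative):

```python
from collections import Counter

def section_citation_counts(citations):
    """Aggregate (cited_doi, citing_section) pairs into per-section
    citation counts for each cited article."""
    counts = {}
    for doi, section in citations:
        counts.setdefault(doi, Counter())[section] += 1
    return counts
```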
    • Six good predictors of autistic reading comprehension

      Yaneva, Victoria; Evans, Richard (INCOMA Ltd, 2015-09-07)
      This paper presents our investigation of the ability of 33 readability indices to account for the reading comprehension difficulty posed by texts for people with autism. The evaluation by autistic readers of 16 text passages is described, a process which led to the production of the first text collection for which readability has been evaluated by people with autism. We present the findings of a study to determine which of the 33 indices can successfully discriminate between the difficulty levels of the text passages, as determined by our reading experiment involving autistic participants. The discriminatory power of the indices is further assessed through their application to the FIRST corpus which consists of 25 texts presented in their original form and in a manually simplified form (50 texts in total), produced specifically for readers with autism.
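      As a concrete example of the kind of readability index evaluated in such studies, the classic Flesch Reading Ease formula is shown below (this is a representative index, not necessarily one of the 33 examined in the paper); higher scores indicate easier text.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease score from word, sentence and syllable counts:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
```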
    • Size Matters: A Quantitative Approach to Corpus Representativeness

      Corpas Pastor, Gloria; Seghiri Domínguez, Míriam; Rabadán, Rosa (Publicaciones Universidad de León, 2010-06-01)
      We should always bear in mind that the assumption of representativeness ‘must be regarded largely as an act of faith’ (Leech 1991: 2), as at present we have no means of ensuring it, or even evaluating it objectively. (Tognini-Bonelli 2001: 57) Corpus Linguistics (CL) has not yet come of age. It does not make any difference whether we consider it a full-fledged linguistic discipline (Tognini-Bonelli 2000: 1) or, else, a set of analytical techniques that can be applied to any discipline (McEnery et al. 2006: 7). The truth is that CL is still striving to solve thorny, central issues such as optimum size, balance and representativeness of corpora (of the language as a whole or of some subset of the language). Corpus-driven/based studies rely on the quality and representativeness of each corpus as their true foundation for producing valid results. This entails deciding on valid external and internal criteria for corpus design and compilation. A basic tenet is that corpus representativeness determines the kinds of research questions that can be addressed and the generalizability of the results obtained (cf. Biber et al. 1988: 246). Unfortunately, faith and beliefs do not seem to ensure quality. In this paper we will attempt to deal with these key questions. Firstly, we will give a brief description of the R&D projects which originally have served as the main framework for this research. Secondly, we will focus on the complex notion of corpus representativeness and ideal size, from both a theoretical and an applied perspective. Finally, we will describe a computer application which has been developed as part of the research. This software will be used to verify whether a sample bilingual comparable corpus could be deemed representative.
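      One empirical handle on representativeness, in the spirit of the software described above, is to track how the number of distinct word types grows as tokens are added: when the curve flattens, adding more text yields little new vocabulary. This sketch illustrates only that general idea, not the actual algorithm implemented in the authors' application.

```python
def type_growth(tokens, step):
    """Number of distinct word types observed after each `step` tokens.
    A flattening curve suggests the sample is approaching lexical
    saturation for the domain (an indicative, not conclusive, signal)."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append(len(seen))
    return curve
```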
    • SlideShare presentations, citations, users and trends: A professional site with academic and educational uses

      Thelwall, Mike; Kousha, Kayvan (Wiley-Blackwell, 2017-06-01)
      SlideShare is a free social web site that aims to help users to distribute and find presentations. Owned by LinkedIn since 2012, it targets a professional audience but may give value to scholarship through creating a long term record of the content of talks. This article tests this hypothesis by analysing sets of general and scholarly-related SlideShare documents using content and citation analysis and popularity statistics reported on the site. The results suggest that academics, students and teachers are a minority of SlideShare uploaders, especially since 2010, with most documents not being directly related to scholarship or teaching. About two thirds of uploaded SlideShare documents are presentation slides, with the remainder often being files associated with presentations or video recordings of talks. SlideShare is therefore a presentation-centred site with a predominantly professional user base. Although a minority of the uploaded SlideShare documents are cited by, or cite, academic publications, probably too few articles are cited by SlideShare to consider extracting SlideShare citations for research evaluation. Nevertheless, scholars should consider SlideShare to be a potential source of academic and non-academic information, particularly in library and information science, education and business.
    • Social media analytics for YouTube comments: potential and limitations

      Thelwall, Mike; School of Mathematics and Computing, University of Wolverhampton, Wolverhampton, UK (Taylor & Francis, 2017-09-21)
      The need to elicit public opinion about predefined topics is widespread in the social sciences, government and business. Traditional survey-based methods are being partly replaced by social media data mining but their potential and limitations are poorly understood. This article investigates this issue by introducing and critically evaluating a systematic social media analytics strategy to gain insights about a topic from YouTube. The results of an investigation into sets of dance style videos show that it is possible to identify plausible patterns of subtopic difference, gender and sentiment. The analysis also points to the generic limitations of social media analytics that derive from their fundamentally exploratory multi-method nature.
    • Source language difficulties in learner translation: Evidence from an error-annotated corpus

      Kunilovskaia, Mariia; Ilyushchenya, Tatyana; Morgoun, Natalia; Mitkov, Ruslan (John Benjamins Publishing, 2022-06-30)
      This study uses an error-annotated, mass-media subset of a sentence-aligned, multi-parallel learner translator corpus to reveal source language items that are challenging in English-to-Russian translation. Our data includes multiple translations of the most challenging source sentences, distilled from a large collection of student translations on the basis of error statistics. This sample was subjected to manual contrastive-comparative analysis, which resulted in a list of English items that were difficult for students. The outcome of the analysis was compared to the topics discussed in dozens of translation textbooks that are recommended to BA and specialist-degree students in Russia at the initial stage of professional education. We discuss items that deserve more prominence in training as well as items that call for improvements to traditional learning activities. This study presents evidence that a more empirically motivated design of the practical translation syllabus is required as part of translator education.