• Object and subject Heavy-NP shift in Arabic

      Mohamed, Emad (Research in Corpus Linguistics, 2014-12-31)
      In order to examine whether Arabic has Heavy Noun Phrase Shifting (HNPS), I have extracted from the Prague Arabic Dependency Treebank a data set in which a verb governs either an object NP and an Adjunct Phrase (PP or AdvP) or a subject NP and an Adjunct Phrase. I have used binary logistic regression in which the criterion variable is whether the subject/object NP shifts, and the predictor variables are heaviness (the number of tokens in the NP or adjunct), part-of-speech tag, verb disposition (i.e. whether the verb has a history of taking double objects or sentential objects), NP number, NP definiteness, and the presence of referring pronouns in either the NP or the adjunct. The results show that only object heaviness and adjunct heaviness are useful predictors of object HNPS, while subject heaviness, adjunct heaviness, subject part-of-speech tag, definiteness, and adjunct head POS tag are active predictors of subject HNPS. I also show that HNPS can in principle be predicted from sentence structure.
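The regression design described in this abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the author's actual treebank features: the variable names (`np_heaviness`, `adj_heaviness`, `definite`) and the data-generating process are invented for the example.

```python
# Sketch of a binary logistic regression for HNPS: the criterion is
# whether the NP shifts; predictors include NP and adjunct heaviness
# (token counts) and a binary definiteness feature. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
np_heaviness = rng.integers(1, 15, n)   # tokens in the NP
adj_heaviness = rng.integers(1, 10, n)  # tokens in the adjunct
definite = rng.integers(0, 2, n)        # toy definiteness feature

# Toy criterion: shifting is more likely when the NP outweighs the adjunct.
shifted = (np_heaviness - adj_heaviness + rng.normal(0, 2, n)) > 0

X = np.column_stack([np_heaviness, adj_heaviness, definite])
model = LogisticRegression().fit(X, shifted)
# Fitted coefficients indicate each predictor's direction of effect.
print(dict(zip(["np_heaviness", "adj_heaviness", "definite"], model.coef_[0])))
```

With this toy setup, NP heaviness receives a positive coefficient and adjunct heaviness a negative one, mirroring the kind of heaviness effect the study tests for.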
    • Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech

      Mandl, Thomas; Modha, Sandip; Shahi, Gautam Kishore; Madhu, Hiren; Satapara, Shrey; Majumder, Prasenjit; Schäfer, Johannes; Ranasinghe, Tharindu; Zampieri, Marcos; Nandini, Durgesh; et al. (Association for Computing Machinery, 2021-12-13)
      The HASOC track is dedicated to the evaluation of technology for finding Offensive Language and Hate Speech. HASOC is creating a multilingual data corpus, mainly for English and for under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized classification tasks for English, Hindi, and Marathi. Task 1 consists of two classification subtasks: Subtask 1A is a binary classification of tweets into offensive and non-offensive, and Subtask 1B asks for a fine-grained classification into Hate, Profane, and Offensive. Task 2 consists of identifying tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data, and the results obtained from the participants' submissions.
    • Parsing AUC result-figures in machine learning specific scholarly documents for semantically-enriched summarization

      Safder, Iqra; Batool, Hafsa; Sarwar, Raheem; Zaman, Farooq; Aljohani, Naif Radi; Nawaz, Raheel; Gaber, Mohamed; Hassan, Saeed-Ul (Taylor & Francis, 2021-11-14)
      Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC) and their associated captions from full-text documents. First, we classify the extracted figures and analyze them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result-figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1,000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of the specialized summaries using ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.
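The trapezoidal rule mentioned in the abstract can be illustrated in a few lines: the curve is split into intervals and the area of each piece is approximated as a trapezoid. The point values below are invented for illustration; in the paper, the inputs would come from the parsed data plots.

```python
# Trapezoidal-rule approximation of the area under a curve given as
# sampled (x, y) points, e.g. an ROC curve recovered from a figure.
import numpy as np

def trapezoid_auc(x, y):
    """Approximate the area under y(x) by summing trapezoid areas."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    widths = np.diff(x)                  # interval widths
    heights = (y[:-1] + y[1:]) / 2.0     # mean of the two endpoint heights
    return float(np.sum(widths * heights))

# Example: points sampled from the diagonal y = x give an AUC of 0.5.
fpr = [0.0, 0.25, 0.5, 0.75, 1.0]
tpr = [0.0, 0.25, 0.5, 0.75, 1.0]
print(trapezoid_auc(fpr, tpr))  # 0.5
```

Finer sampling of the curve (more intervals) yields a closer approximation, which is the "dividing the curve into multiple intervals" point made in the abstract.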
    • Patent citation analysis with Google

      Kousha, Kayvan; Thelwall, Mike (Wiley-Blackwell, 2015-09-23)
      Citations from patents to scientific publications provide useful evidence about the commercial impact of academic research, but automatically searchable databases are needed to exploit this connection for large-scale patent citation evaluations. Google covers multiple different international patent office databases but does not index patent citations or allow automatic searches. In response, this article introduces a semiautomatic indirect method via Bing to extract and filter patent citations from Google to academic papers with an overall precision of 98%. The method was evaluated with 322,192 science and engineering Scopus articles from every second year for the period 1996–2012. Although manual Google Patent searches give more results, especially for articles with many patent citations, the difference is not large enough to be a major problem. Within Biomedical Engineering, Biotechnology, and Pharmacology & Pharmaceutics, 7% to 10% of Scopus articles had at least one patent citation but other fields had far fewer, so patent citation analysis is only relevant for a minority of publications. Low but positive correlations between Google Patent citations and Scopus citations across all fields suggest that traditional citation counts cannot substitute for patent citations when evaluating research.
    • Phrase level segmentation and labelling of machine translation errors

      Blain, Frédéric; Logacheva, Varvara; Specia, Lucia; Calzolari, Nicoletta (Conference Chair); Choukri, Khalid; Declerck, Thierry; Grobelnik, Marko; Maegaard, Bente; Mariani, Joseph; Moreno, Asuncion; et al. (European Language Resources Association (ELRA), 2016-05)
      This paper presents our work towards a novel approach for Quality Estimation (QE) of machine translation based on sequences of adjacent words, the so-called phrases. This new level of QE aims to provide a natural balance between QE at the word and sentence levels, which are either too fine-grained or too coarse for some applications. However, phrase-level QE implies an intrinsic challenge: how to segment a machine translation into sequences of words (contiguous or not) that represent an error. We discuss three possible segmentation strategies to automatically extract erroneous phrases. We evaluate these strategies against annotations at phrase level produced by humans, using a new dataset collected for this purpose.
    • The Portrait of Dorian Gray: A corpus-based analysis of translated verb + noun (object) collocations in Peninsular and Colombian Spanish

      Valencia Giraldo, M. Victoria; Corpas Pastor, Gloria (Springer, 2019-09-18)
      Corpus-based Translation Studies have promoted research on the features of translated language, by focusing on the process and product of translation, from a descriptive perspective. Some of these features have been proposed by Toury [31] under the term of laws of translation, namely the law of growing standardisation and the law of interference. The law of standardisation appears to be particularly at play in diatopy, and more specifically in the case of transnational languages (e.g. English, Spanish, French, German). In fact, some studies have revealed the tendency to standardise the diatopic varieties of Spanish in translated language [8, 9, 11, 12]. This paper focuses on verb + noun (object) collocations in Spanish translations of The Portrait of Dorian Gray by Oscar Wilde. Two different varieties have been chosen (Peninsular and Colombian Spanish). Our main aim is to establish whether the Colombian Spanish translation actually matches the variety spoken in Colombia or whether it is closer to general or standard Spanish. For this purpose, the techniques used to translate this type of collocations in both Spanish translations will be analysed. Furthermore, the diatopic distribution of these collocations will be studied by means of large corpora.
    • Predicting lexical complexity in English texts: the CompLex 2.0 dataset

      Shardlow, Matthew; Evans, Richard; Zampieri, Marcos (Springer, 2022-03-23)
      Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
    • Predicting literature's early impact with sentiment analysis in Twitter

      Hassan, SU; Aljohani, NR; Idrees, N; Sarwar, R; Nawaz, R; Martínez-Cámara, E; Ventura, S; Herrera, F (Elsevier, 2019-12-14)
      Traditional bibliometric techniques gauge the impact of research through quantitative indices based on the citations data. However, due to the lag time involved in the citation-based indices, it may take years to comprehend the full impact of an article. This paper seeks to measure the early impact of research articles through the sentiments expressed in tweets about them. We claim that articles cited in either positive or neutral tweets have a more significant impact than those not cited at all or cited in negative tweets. We used the SentiStrength tool and improved it by incorporating new opinion-bearing words into its sentiment lexicon pertaining to scientific domains. Then, we classified the sentiment of 6,482,260 tweets linked to 1,083,535 publications covered by Altmetric.com. Using positive and negative tweets as independent variables, and the citation count as the dependent variable, linear regression analysis showed a weak positive prediction of high citation counts across 16 broad disciplines in Scopus. Introducing an additional indicator to the regression model, i.e. ‘number of unique Twitter users’, improved the adjusted R-squared value of regression analysis in several disciplines. Overall, an encouraging positive correlation between tweet sentiments and citation counts showed that Twitter-based opinion may be exploited as a complementary predictor of literature's early impact.
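The regression comparison described here, where adding a unique-users predictor improves adjusted R-squared, can be sketched on synthetic data. Everything below (the data-generating process, coefficients, and variable names) is invented for illustration; the study itself used Altmetric.com and Scopus data.

```python
# Sketch: ordinary least squares with tweet-sentiment counts as predictors
# and citation counts as the response; adjusted R-squared compares a base
# model (positive/negative tweets) against one that adds unique users.
import numpy as np

def fit_r2(X, y):
    """OLS fit with intercept; return the R-squared of the fit."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(1)
n = 500
pos = rng.poisson(5, n).astype(float)    # positive-tweet counts (synthetic)
neg = rng.poisson(1, n).astype(float)    # negative-tweet counts (synthetic)
users = rng.poisson(4, n).astype(float)  # unique tweeting users (synthetic)
citations = 2 * pos - neg + 0.5 * users + rng.normal(0, 3, n)

r2_base = fit_r2(np.column_stack([pos, neg]), citations)
r2_full = fit_r2(np.column_stack([pos, neg, users]), citations)
print(round(adjusted_r2(r2_base, n, 2), 3), round(adjusted_r2(r2_full, n, 3), 3))
```

Because the synthetic users variable genuinely contributes to the response, the three-predictor model's adjusted R-squared exceeds the base model's, analogous to the improvement the abstract reports.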
    • Predicting reading difficulty for readers with autism spectrum disorder

      Evans, Richard; Yaneva, Victoria; Temnikova, Irina (European Language Resources Association, 2016-05-23)
      People with autism experience various reading comprehension difficulties, which is one explanation for early school dropout, reduced academic achievement, and lower levels of employment in this population. To overcome this issue, content developers who want to make their textbooks, websites or social media accessible to people with autism (and thus for every other user), but who are not necessarily experts in autism, can benefit from tools which are easy to use, which can assess the accessibility of their content, and which are sensitive to the difficulties that autistic people might have when processing texts/websites. In this paper we present a preliminary machine learning readability model for English developed specifically for the needs of adults with autism. We evaluate the model on the ASD corpus, which has been developed specifically for this task and is, so far, the only corpus for which readability for people with autism has been evaluated. The results show that our model outperforms the baseline, which is the widely-used Flesch-Kincaid Grade Level formula.
    • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

      Ha, Le; Yaneva, Victoria; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
      Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.
    • Predicting the Type and Target of Offensive Posts in Social Media

      Zampieri, Marcos; Malmasi, Shervin; Nakov, Preslav; Rosenthal, Sara; Farra, Noura; Kumar, Ritesh (Association for Computational Linguistics, 2019-06-01)
      As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.
    • Profiling idioms: a sociolexical approach to the study of phraseological patterns

      Moze, Sara; Mohamed, Emad (Springer, 2019-09-18)
      This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociolexical profiling will have important implications for several areas of research, including corpus lexicography, translation, creative writing, forensic linguistics, and natural language processing.
    • Project adaptation over several days

      Blain, Frederic; Hazem, Amir; Bougares, Fethi; Barrault, Loic; Schwenk, Holger (Johannes Gutenberg University of Mainz, 2015-01-30)
    • Pushing the right buttons: adversarial evaluation of quality estimation

      Kanojia, Diptesh; Fomicheva, Marina; Ranasinghe, Tharindu; Blain, Frederic; Orasan, Constantin; Specia, Lucia; Barrault, Loïc; Bojar, Ondřej; Bougares, Fethi; Chatterjee, Rajen; et al. (Association for Computational Linguistics, 2022-01-11)
      Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite the high correlation with human judgements achieved by recent SOTA models, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
    • The QT21 combined machine translation system for English to Latvian

      Peter, Jan-Thorsten; Ney, Hermann; Bojar, Ondřej; Pham, Ngoc-Quan; Niehues, Jan; Waibel, Alex; Burlot, Franck; Yvon, François; Pinnis, Mārcis; Sics, Valters; et al. (Association for Computational Linguistics, 2017-09)
    • The QT21/HimL combined machine translation system

      Peter, Jan-Thorsten; Alkhouli, Tamer; Ney, Hermann; Huck, Matthias; Braune, Fabienne; Fraser, Alexander; Tamchyna, Aleš; Bojar, Ondřej; Haddow, Barry; Sennrich, Rico; et al. (Association for Computational Linguistics, 2016-08)
      Peter, J.-T., Alkhouli, T., Ney, H., Huck, M. et al. (2016) The QT21/HimL combined machine translation system. In, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. Stroudsburg, PA: Association for Computational Linguistics, pp. 344-355.
    • Qualitative analysis of post-editing for high quality machine translation

      Blain, Frédéric; Senellart, Jean; Schwenk, Holger; Plitt, Mirko; Roturier, Johann; AAMT, Asia-Pacific Association for Machine Translation (Asia-Pacific Association for Machine Translation, 2011-09-30)
      In the context of massive adoption of Machine Translation (MT) by human localization services in Post-Editing (PE) workflows, we analyze the activity of post-editing high quality translations through a novel PE analysis methodology. We define and introduce a new unit for evaluating post-editing effort based on the Post-Editing Action (PEA), for which we provide human evaluation guidelines and propose a process to automatically evaluate these PEAs. We applied this methodology on data sets from two technologically different MT systems. In that context, we could show that more than 35% of the remaining effort can be saved by introducing global PEAs and edit propagation.
    • Quality in, quality out: learning from actual mistakes

      Blain, Frédéric; Aletras, Nikos; Specia, Lucia (European Association for Machine Translation, 2020)
      Approaches to Quality Estimation (QE) of machine translation have shown promising results at predicting quality scores for translated sentences. However, QE models are often trained on noisy approximations of quality annotations derived from the proportion of post-edited words in translated sentences instead of direct human annotations of translation errors. The latter is a more reliable ground-truth but more expensive to obtain. In this paper, we present the first attempt to model the task of predicting the proportion of actual translation errors in a sentence while minimising the need for direct human annotation. For that purpose, we use transfer-learning to leverage large scale noisy annotations and small sets of high-fidelity human annotated translation errors to train QE models. Experiments on four language pairs and translations obtained by statistical and neural models show consistent gains over strong baselines.
    • Questing for Quality Estimation: A User Study

      Escartin, Carla Parra; Béchara, Hanna; Orăsan, Constantin (de Gruyter, 2017-06-06)
      Post-Editing of Machine Translation (MT) has become a reality in professional translation workflows. In order to optimize the management of projects that use post-editing and avoid underpayments and mistrust from professional translators, effective tools to assess the quality of Machine Translation (MT) systems need to be put in place. One field of study that could address this problem is Machine Translation Quality Estimation (MTQE), which aims to determine the quality of MT without an existing reference. Accurate and reliable MTQE can help project managers and translators alike, as it would allow estimating more precisely the cost of post-editing projects in terms of time and adequate fares by discarding those segments that are not worth post-editing (PE) and have to be translated from scratch. In this paper, we report on the results of an impact study which engages professional translators in PE tasks using MTQE. We measured translators' productivity in different scenarios: translating from scratch, post-editing without using MTQE, and post-editing using MTQE. Our results show that QE information, when accurate, improves post-editing efficiency.
    • Reader and author gender and genre in Goodreads

      Thelwall, Mike (Sage, 2017-05-01)
      There are known gender differences in book preferences in terms of both genre and author gender but their extent and causes are not well understood. It is unclear whether reader preferences for author genders occur within any or all genres and whether readers evaluate books differently based on author genders within specific genres. This article exploits a major source of informal book reviews, the Goodreads.com website, to assess the influence of reader and author genders on book evaluations within genres. It uses a quantitative analysis of 201,560 books and their reviews, focusing on the top 50 user-specified genres. The results show strong gender differences in the ratings given by reviewers to books within genres, such as female reviewers rating contemporary romance more highly, with males preferring short stories. For most common book genres, reviewers give higher ratings to books authored by their own gender, confirming that gender bias is not confined to the literary elite. The main exception is the comic book, for which male reviewers prefer female authors, despite their scarcity. A word frequency analysis suggested that authors wrote, and reviewers valued, gendered aspects of books within a genre. For example, relationships and romance were disproportionately mentioned by women in mystery and fantasy novels. These results show that, perhaps for the first time, it is possible to get large scale evidence about the reception of books by typical readers, if they post reviews online.