• Exploiting tweet sentiments in altmetrics large-scale data

      Hassan, Saeed-Ul; Aljohani, Naif Radi; Iqbal Tarar, Usman; Safder, Iqra; Sarwar, Raheem; Alelyani, Salem; Nawaz, Raheel (SAGE, 2022-12-31)
      This article aims to exploit social exchanges about scientific literature, specifically tweets, to analyse social media users' sentiments towards publications within a research field. First, we employ the SentiStrength tool, extended with newly created lexicon terms, to classify the sentiments of 6,482,260 tweets associated with 1,083,535 publications provided by Altmetric.com. Then, we propose harmonic mean-based statistical measures to generate a specialized lexicon, using positive and negative sentiment scores and frequency metrics. Next, we adopt a novel article-level summarization approach to domain-level sentiment analysis to gauge the opinion of social media users on Twitter about the scientific literature. Last, we propose and employ an aspect-based analytical approach to mine users' expressions relating to various aspects of the article, such as tweets on its title, abstract, methodology, conclusion, or results section. We show that research communities exhibit dissimilar sentiments towards their respective fields. The analysis of the field-wise distribution of article aspects shows that in Medicine, Economics, and Business & Decision Sciences, tweet aspects focus on the results section, whereas in Physics & Astronomy, Materials Sciences, and Computer Science they focus on the methodology section. Overall, the study helps us understand the sentiments expressed in online social exchanges of the scientific community on scientific literature. Specifically, such a fine-grained analysis may help research communities improve their social media exchanges about scientific articles, disseminate their scientific findings effectively, and further increase their societal impact.
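      A minimal sketch of the harmonic-mean idea described above: a candidate lexicon term is scored by combining its sentiment strength with its tweet frequency, so that only terms strong on both dimensions score highly. The normalisation, score ranges, and combination below are illustrative assumptions, not the paper's published formulas.

```python
# Hedged sketch: harmonic-mean scoring of candidate lexicon terms.
# The scales and the exact combination are illustrative assumptions.

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if either is 0."""
    return 2 * a * b / (a + b) if a and b else 0.0

def lexicon_score(pos_strength, frequency, max_freq):
    """Combine a term's positive sentiment strength (1-5, as in
    SentiStrength) with its relative tweet frequency."""
    norm_strength = pos_strength / 5.0   # scale to [0, 1]
    norm_freq = frequency / max_freq     # scale to [0, 1]
    return harmonic_mean(norm_strength, norm_freq)

# A strongly positive but rare term scores below a milder, common one,
# because the harmonic mean penalises imbalance between the two inputs.
rare = lexicon_score(pos_strength=5, frequency=3, max_freq=100)
common = lexicon_score(pos_strength=3, frequency=80, max_freq=100)
```

The harmonic mean is a natural choice here precisely because it is dominated by the smaller of its inputs, so a term must be both frequent and strongly polarised to enter the specialized lexicon.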
    • Author gender identification for Urdu

      Sarwar, Raheem (Springer, 2022-09-30)
      In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, researchers have not paid enough attention to performing this task for Urdu articles. Firstly, I created a new Urdu corpus to perform the author gender identification task. I then extracted two types of features from each article: the most frequent 600 multi-word expressions and the most frequent 300 words. After completing the corpus creation and feature extraction processes, I performed feature concatenation. As a result, each article was represented in a 900D feature space. Finally, I applied 10 different well-known classifiers to these features to perform the author gender identification task and compared their performances against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). I conducted extensive experimental studies which show that (i) using the most frequent 600 multi-word expressions as features and concatenating them with the most frequent 300 words as features improves the accuracy of the author gender identification task, and (ii) support vector machines outperform the other classifiers, as well as fine-tuned pre-trained language models and CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
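      The feature construction above can be sketched as follows. Word bigrams stand in for multi-word expressions, and raw counts stand in for whatever weighting the paper actually used; both are assumptions for illustration only.

```python
# Hedged sketch of the 900D feature space: top-600 "MWEs" (bigrams here)
# concatenated with the top-300 words, counted per article.
from collections import Counter

def top_k(items, k):
    """The k most frequent items, most common first."""
    return [item for item, _ in Counter(items).most_common(k)]

def build_vocab(corpus_tokens, n_mwe=600, n_words=300):
    words = [t for doc in corpus_tokens for t in doc]
    bigrams = [(a, b) for doc in corpus_tokens for a, b in zip(doc, doc[1:])]
    return top_k(bigrams, n_mwe) + top_k(words, n_words)

def featurize(doc_tokens, vocab):
    """Represent one article as a frequency vector over the vocabulary."""
    words = Counter(doc_tokens)
    bigrams = Counter(zip(doc_tokens, doc_tokens[1:]))
    return [bigrams[f] if isinstance(f, tuple) else words[f] for f in vocab]

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab = build_vocab(docs)
vec = featurize(docs[0], vocab)
```

On a real corpus the vocabulary would reach the full 600 + 300 = 900 dimensions; the resulting vectors would then be fed to an SVM or any of the other classifiers compared in the paper.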
    • Natural language processing for mental disorders: an overview

      Calixto, Iacer; Yaneva, Viktoriya; Cardoso, Raphael; Bojar, Ondrej; Dash, Satya Ranjan; Parida, Shantipriya; Tello, Esaú Villatoro; Acharya, Biswaranjan (CRC Press, 2022-09-13)
    • “I don’t think education is the answer”: a corpus-assisted ecolinguistic analysis of plastics discourses in the UK

      Franklin, Emma; Gavins, Joanna; Mehl, Seth (De Gruyter Mouton, 2022-08-15)
      Ecosystems around the world are becoming engulfed in single-use plastics, the majority of which come from plastic packaging. Reusable plastic packaging systems have been proposed in response to this plastic waste crisis, but uptake of such systems in the UK is still very low. This article draws on a thematic corpus of 5.6 million words of UK English around plastics, packaging, reuse, and recycling to examine consumer attitudes towards plastic (re)use. Utilizing methods and insights from ecolinguistics, corpus linguistics, and cognitive linguistics, this article assesses to what degree consumer language differs from that of public-facing bodies such as supermarkets and government entities. A predefined ecosophy, prioritizing protection, rights, systems thinking, and fairness, is used to not only critically evaluate narratives in plastics discourse but also to recommend strategies for more effective and ecologically beneficial communications around plastics and reuse. This article recommends the adoption of ecosophy in multidisciplinary project teams, and argues that ecosophies are conducive to transparent and reproducible discourse analysis. The analysis also suggests that in order to make meaningful change in packaging reuse behaviors, it is highly likely that deeply ingrained cultural stories around power, rights, and responsibilities will need to be directly challenged.
    • The USMLE® Step 2 clinical skills patient note corpus

      Yaneva, Victoria; Mee, Janet; Ha, Le An; Harik, Polina; Jodoin, Michael; Mechaber, Alex (Association for Computational Linguistics, 2022-07-31)
      This paper presents a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients - people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs were annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at https://www.nbme.org/services/data-sharing.
    • TurkishDelightNLP: A neural Turkish NLP toolkit

      Aleçakır, Hüseyin; Bölücü, Necva; Can, Burcu (ACL, 2022-07-01)
      We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from the morphological to the semantic level, covering tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the open-source Turkish NLP toolkit through a web interface that allows an input text to be analysed in real time, along with the open-source implementation of the components provided in the toolkit, an API, and several annotated datasets, such as a word similarity test set for evaluating word embeddings and UCCA-based semantic annotation in Turkish. This is the first open-source Turkish NLP toolkit that covers a range of NLP tasks at all levels. We believe that it will be useful to other researchers in Turkish NLP and will also be beneficial for other high-level NLP tasks in Turkish.
    • Source language difficulties in learner translation: Evidence from an error-annotated corpus

      Kunilovskaia, Mariia; Ilyushchenya, Tatyana; Morgoun, Natalia; Mitkov, Ruslan (John Benjamins Publishing, 2022-06-03)
      This study uses an error-annotated, mass-media subset of a sentence-aligned, multi-parallel learner translator corpus to reveal source language items that are challenging in English-to-Russian translation. Our data includes multiple translations of the most challenging source sentences, distilled from a large collection of student translations on the basis of error statistics. This sample was subjected to manual contrastive-comparative analysis, which resulted in a list of English items that were difficult for students. The outcome of the analysis was compared to the topics discussed in dozens of translation textbooks recommended to BA and specialist-degree students in Russia at the initial stage of professional education. We discuss items that deserve more prominence in training as well as items that call for improvements to traditional learning activities. This study presents evidence that a more empirically motivated design of the practical translation syllabus, as part of translator education, is required.
    • Turkish universal conceptual cognitive annotation

      Bölücü, Necva; Can, Burcu; Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; et al. (European Language Resources Association, 2022-06-01)
      Universal Conceptual Cognitive Annotation (UCCA) is a cross-lingual semantic annotation framework that enables easy annotation without requiring a linguistic background. UCCA-annotated datasets have already been released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset, which currently comprises 50 sentences obtained from the METU-Sabanci Turkish Treebank. We followed a semi-automatic annotation approach, in which an external semantic parser is used for an initial annotation of the dataset, which is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that were not in line with the UCCA rules we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments with both zero-shot and few-shot learning. This is the initial version of the annotated dataset, and we are currently extending it. We are releasing the current Turkish UCCA annotation guideline along with the annotated dataset.
    • You are driving me up the wall! A corpus-based study of a special class of resultative constructions

      Corpas Pastor, Gloria (Université Jean Moulin - Lyon 3, 2022-03-26)
      This paper focuses on resultative constructions from a computational and corpus-based approach. We claim that the array of expressions (traditionally classed as idioms, collocations, free word combinations, etc.) that are used to convey a person’s change of mental state (typically negative) are basically instances of the same resultative construction. The first part of the study will introduce basic tenets of Construction Grammar and resultatives. Then, our corpus-based methodology will be spelled out, including a description of the two giga-token corpora used and a detailed account of our protocolised heuristic strategies and tasks. Distributional analysis of matrix slot fillers will be presented next, together with a discussion on restrictions, novel instances, and productivity. A final section will round up our study, with special attention to notions like “idiomaticity”, “productivity” and “variability” of the pairings of form and meaning analysed. To the best of our knowledge, this is one of the first studies based on giga-token corpora that explores idioms as integral parts of higher-order resultative constructions.
    • Predicting lexical complexity in English texts: the Complex 2.0 dataset

      Shardlow, Matthew; Evans, Richard; Zampieri, Marcos (Springer, 2022-03-23)
      Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
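      The Likert-scale protocol favoured above yields a continuous complexity target rather than a binary label. A minimal sketch of one plausible aggregation, assuming each 5-point rating is mapped linearly onto [0, 1] and averaged over annotators (an illustrative assumption, not necessarily the exact CompLex 2.0 mapping):

```python
# Hedged sketch: aggregating 5-point Likert annotations into a
# continuous lexical complexity score in [0, 1].

def complexity_score(likert_ratings):
    """likert_ratings: iterable of ints in 1..5 (1 = very easy,
    5 = very difficult). Returns the mean rating mapped to [0, 1]."""
    ratings = list(likert_ratings)
    mapped = [(r - 1) / 4 for r in ratings]  # 1 -> 0.0, ..., 5 -> 1.0
    return sum(mapped) / len(mapped)

# Disagreement between annotators is preserved as a graded value
# instead of being forced into a complex/non-complex decision.
print(complexity_score([2, 3, 4]))  # -> 0.5
```

This graded target is what turns binary complex word identification into the regression-style Lexical Complexity Prediction task the paper promotes.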
    • Joint learning of morphology and syntax with cross-level contextual information flow

      Can Buglalilar, Burcu; Aleçakır, Hüseyin; Manandhar, Suresh; Bozşahin, Cem (Cambridge University Press, 2022-01-20)
      We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and dependency-based syntactic parsing, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism in the horizontal flow. Our model extends the work of Nguyen and Verspoor (2018) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. Our primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for individual tasks. As ours is one of the earlier efforts in the joint modelling of morphology and syntax along with dependencies, we also discuss prospective guidelines for future comparison.
    • A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English

      Sivakumar, Jasivan; Muga, Jake; Spadavecchia, Flavio; White, Daniel; Can Buglalilar, Burcu (IEEE, 2022-01-20)
      In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features in unformatted English text: word and sentence boundaries, periods, commas, and capitalisation. We approach feature restoration as a binary classification task in which the model learns to predict whether a feature should be restored or not. We propose a pipeline approach, in which only one feature (word boundary, sentence boundary, punctuation, capitalisation) is restored in each component of the pipeline model. To optimise the model, we conducted a grid search over the parameters. The effect of changing the order of the pipeline was also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specification points with optimisation potential to be targeted in follow-up research.
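      The pipeline design, one feature restored per stage, composed in a fixed order, can be sketched as below. The rule-based stages are toy stand-ins for the trained GRU classifiers and are assumptions for illustration only.

```python
# Hedged sketch of the pipeline idea: each stage restores one feature
# of the raw character stream. Toy rules replace the GRU classifiers.

def restore_spaces(text, lexicon):
    """Greedy longest-match word segmentation over a known lexicon."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # longest match first
            if text[i:j] in lexicon:
                out.append(text[i:j])
                i = j
                break
        else:                               # unknown character: keep it
            out.append(text[i])
            i += 1
    return " ".join(out)

def restore_casing(text):
    """Capitalise the first letter of each sentence."""
    sentences = text.split(". ")
    return ". ".join(s[:1].upper() + s[1:] for s in sentences)

def run_pipeline(text, stages):
    """Apply the stages in order, as in PERIODS > COMMAS > SPACES > CASING."""
    for stage in stages:
        text = stage(text)
    return text

lexicon = {"hello", "world"}
stages = [lambda t: restore_spaces(t, lexicon), restore_casing]
print(run_pipeline("helloworld", stages))  # -> "Hello world"
```

Because each stage sees the output of the previous one, the ordering matters, which is exactly what the grid search over pipeline orders in the paper measures.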
    • Author verification of Nahj Al-Balagha

      Sarwar, Raheem; Mohamed, Emad (Oxford University Press, 2022-01-20)
      The primary purpose of this paper is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and over which Sunni and Shi’i Muslims are proposing different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words, and that regardless of the feature sets, the three sub-corpora cluster into three distinct groups using Principal Component Analysis, Hierarchical Clustering, Multi-dimensional Scaling and Bootstrap Consensus Trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging and Decision Trees confirm the same results, which is a clear indication that (a) the book is internally consistent and can thus be attributed to a single person, and (b) it was not authored by either of the suspected authors.
    • Pushing the right buttons: adversarial evaluation of quality estimation

      Kanojia, Diptesh; Fomicheva, Marina; Ranasinghe, Tharindu; Blain, Frederic; Orasan, Constantin; Specia, Lucia; Barrault, Loric; Bojar, Ondrej; Bougaris, Fethi; Chatterjee, Rajen; et al. (Association for Computational Linguistics, 2022-01-11)
      Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
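      In the spirit of the methodology above, meaning-altering perturbations can be generated mechanically while leaving the output fluent. The two perturbation functions below are illustrative assumptions, not the paper's exact test suite.

```python
# Hedged sketch: fluent, meaning-altering perturbations of an MT output.
# A robust QE system should score the perturbed versions markedly lower.
import re

def negate(translation):
    """Insert a negation after the first copula/auxiliary, if any."""
    return re.sub(r"\b(is|are|was|were)\b", r"\1 not", translation, count=1)

def swap_number(translation):
    """Replace the first digit sequence with a different number."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1),
                  translation, count=1)

mt_output = "The meeting is planned for 3 pm."
print(negate(mt_output))       # The meeting is not planned for 3 pm.
print(swap_number(mt_output))  # The meeting is planned for 4 pm.
```

Both perturbed sentences remain grammatical, which is precisely why correlation with human judgements alone does not reveal whether a QE model can detect them.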
    • Technology solutions for interpreters: the VIP system

      Corpas Pastor, Gloria (Universidad de Valladolid, 2022-01-07)
      Interpreting technologies have abruptly entered the profession in recent years. However, technology still remains a relatively marginal topic of academic debate, although interest in developing tailor-made solutions for interpreters has risen sharply. This paper presents the VIP system, one of the research outputs of the homonymous project VIP - Voice-text Integrated system for interPreters, and its continuation (VIP II). More specifically, a technology-based terminology workflow for simultaneous interpretation is presented.
    • Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech

      Mandl, Thomas; Modha, Sandip; Shahi, Gautam Kishore; Madhu, Hiren; Satapara, Shrey; Majumder, Prasenjit; Schäfer, Johannes; Ranasinghe, Tharindu; Zampieri, Marcos; Nandini, Durgesh; et al. (Association for Computing Machinery, 2021-12-13)
      The HASOC track is dedicated to the evaluation of technology for finding offensive language and hate speech. HASOC is creating a multilingual data corpus, mainly for English and under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized the classification task for English, Hindi, and Marathi. Task 1 consists of two classification subtasks: Subtask 1A is a binary classification of tweets into offensive and non-offensive, while Subtask 1B asks for a fine-grained classification of the tweets into hate, profane, and offensive. Task 2 consists of identifying tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data, and the results obtained from the participants' submissions.
    • Parsing AUC result-figures in machine learning specific scholarly documents for semantically-enriched summarization

      Safder, Iqra; Batool, Hafsa; Sarwar, Raheem; Zaman, Farooq; Aljohani, Naif Radi; Nawaz, Raheel; Gaber, Mohamed; Hassan, Saeed-Ul (Taylor & Francis, 2021-11-14)
      Machine learning-specific scholarly full-text documents contain a number of result figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result figures specific to the evaluation metric of the area under the curve (AUC) and their associated captions from full-text documents. First, we classify the extracted figures and analyze them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of the specialized summaries using ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.
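      The trapezoidal rule named above is straightforward to state in code: the region under the extracted (x, y) points is split into intervals, and each interval is approximated as a trapezoid. The sketch below shows the rule itself, not the paper's figure-parsing stack.

```python
# Minimal sketch of the trapezoidal rule for recovering AUC from
# (x, y) points extracted from a plotted curve.

def trapezoidal_auc(points):
    """points: list of (x, y) pairs sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid over one interval
    return area

# Points on the diagonal y = x (a random classifier's ROC) give 0.5.
diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(trapezoidal_auc(diagonal))  # -> 0.5
```

More extracted points mean narrower intervals and a tighter approximation of the true area, which is why dividing the curve into multiple intervals improves the recovered AUC value.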
    • Linguistic features evaluation for hadith authenticity through automatic machine learning

      Mohamed, Emad; Sarwar, Raheem (Oxford University Press, 2021-11-13)
      There has not been any research that provides an evaluation of the linguistic features extracted from the matn (text) of a Hadith. Moreover, none of the fairly large corpora are publicly available as a benchmark corpus for Hadith authenticity, and there is a need to build a "gold standard" corpus for good practices in Hadith authentication. We wrote a scraper in the Python programming language and collected a corpus of 3651 authentic prophetic traditions and 3593 fake ones. We process the corpora with morphological segmentation and perform extensive experimental studies using a variety of machine learning algorithms, mainly through Automatic Machine Learning, to distinguish between these two categories. With a feature set including words, morphological segments, characters, top N words, top N segments, function words, and several vocabulary richness features, we analyse the results in terms of both prediction and interpretability to explain which features are more characteristic of each class. Many experiments produced good results, and the highest accuracy (78.28%) was achieved using word n-grams as features with the Multinomial Naive Bayes classifier. Our extensive experimental studies conclude that, at least for Digital Humanities, feature engineering may still be desirable due to the high interpretability of the features. The corpus and software (scripts) will be made publicly available to other researchers in an effort to promote progress and replicability.
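      The best-performing setup above, word n-grams with Multinomial Naive Bayes, can be sketched in a few lines. This toy implementation uses unigram counts and invented two-document training data purely for illustration; the paper's experiments used richer n-gram features, real corpora, and mature library implementations.

```python
# Hedged sketch: a minimal Multinomial Naive Bayes over word counts
# with Laplace smoothing. Training data below is entirely invented.
import math
from collections import Counter

class MultinomialNB:
    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.counts[c].values()) + self.alpha * len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.counts[c][w] + self.alpha) / total)
                for w in doc)
        return max(self.classes, key=log_prob)

docs = [["narrated", "by", "trusted", "chain"],
        ["weak", "unknown", "chain"]]
clf = MultinomialNB().fit(docs, ["authentic", "fake"])
print(clf.predict(["trusted", "chain"]))  # -> "authentic"
```

One reason such a model suits the interpretability goal stated above: the per-class word likelihoods it learns directly expose which terms are characteristic of each class.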
    • Findings of the WMT 2021 shared task on quality estimation

      Specia, Lucia; Blain, Frederic; Fomicheva, Marina; Zerva, Chrysoula; Li, Zhenhao; Chaudhary, Vishrav; Martins, André (Association for Computational Linguistics, 2021-11-10)
      We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.