• Exploiting tweet sentiments in altmetrics large-scale data

      Hassan, Saeed-Ul; Aljohani, Naif Radi; Iqbal Tarar, Usman; Safder, Iqra; Sarwar, Raheem; Alelyani, Salem; Nawaz, Raheel (SAGE, 2022-12-31)
      This article aims to exploit social exchanges about scientific literature, specifically tweets, to analyse social media users' sentiments towards publications within a research field. First, we employ the SentiStrength tool, extended with newly created lexicon terms, to classify the sentiments of 6,482,260 tweets associated with 1,083,535 publications provided by Altmetric.com. Then, we propose harmonic mean-based statistical measures to generate a specialized lexicon, using positive and negative sentiment scores and frequency metrics. Next, we adopt a novel article-level summarization approach to domain-level sentiment analysis to gauge the opinion of social media users on Twitter about the scientific literature. Last, we propose and employ an aspect-based analytical approach to mine users' expressions relating to various aspects of an article, such as tweets on its title, abstract, methodology, conclusion, or results section. We show that research communities exhibit dissimilar sentiments towards their respective fields. The analysis of the field-wise distribution of article aspects shows that in Medicine, Economics, and Business & Decision Sciences, tweet aspects focus on the results section, whereas in Physics & Astronomy, Materials Science, and Computer Science they focus on the methodology section. Overall, the study helps us understand the sentiments expressed in the scientific community's online social exchanges about scientific literature. Specifically, such fine-grained analysis may help research communities improve their social media exchanges about scientific articles, disseminate their findings more effectively, and further increase their societal impact.
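The harmonic-mean construction behind the specialized lexicon can be sketched as follows (a minimal illustration, assuming the score combines a term's normalised sentiment strength with its relative frequency; the function names and example values are hypothetical, and the paper's exact formula may differ):

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative quantities."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def lexicon_score(sentiment_strength: float, relative_freq: float) -> float:
    # The harmonic mean rewards terms that are strong on BOTH dimensions,
    # unlike the arithmetic mean, which a single large value can dominate.
    return harmonic_mean(sentiment_strength, relative_freq)

# Rank candidate terms for the specialized lexicon (values are invented):
candidates = {
    "groundbreaking": (0.90, 0.40),  # (sentiment strength, relative frequency)
    "nice":           (0.50, 0.90),
    "rare-term":      (0.95, 0.01),
}
ranked = sorted(candidates, key=lambda t: lexicon_score(*candidates[t]), reverse=True)
```

Here "nice" outranks "rare-term" because the harmonic mean penalises terms that score high on only one of the two dimensions.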
    • Natural language processing for mental disorders: an overview

      Calixto, Iacer; Yaneva, Viktoriya; Cardoso, Raphael (CRC Press, 2022-12-31)
    • Source language difficulties in learner translation: Evidence from an error-annotated corpus

      Kunilovskaia, Mariia; Ilyushchenya, Tatyana; Morgoun, Natalia; Mitkov, Ruslan (John Benjamins Publishing, 2022-06-30)
      This study uses an error-annotated, mass-media subset of a sentence-aligned, multi-parallel learner translator corpus to reveal source language items that are challenging in English-to-Russian translation. Our data include multiple translations of the most challenging source sentences, distilled from a large collection of student translations on the basis of error statistics. This sample was subjected to manual contrastive-comparative analysis, which yielded a list of English items that students find difficult. The outcome of the analysis was compared to the topics discussed in dozens of translation textbooks recommended to BA and specialist-degree students in Russia at the initial stage of professional education. We discuss items that deserve more prominence in training as well as items that call for improvements to traditional learning activities. This study presents evidence that a more empirically motivated design of the practical translation syllabus is required as part of translator education.
    • A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English

      Sivakumar, Jasivan; Muga, Jake; Spadavecchia, Flavio; White, Daniel; Can Buglalilar, Burcu (IEEE, 2022-06-30)
      In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features in unformatted English text: word and sentence boundaries, periods, commas, and capitalisation. We approach feature restoration as a binary classification task in which the model learns to predict whether a feature should be restored or not. A pipeline approach is proposed, in which only one feature (word boundary, sentence boundary, punctuation, capitalisation) is restored by each component of the pipeline model. To optimise the model, we conducted a grid search on the parameters. The effect of changing the order of the pipeline is also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specific points with optimisation potential to be targeted in follow-up research.
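The pipeline composition can be sketched as below. This is a toy illustration: the stage shown is a simple rule-based stand-in, not the paper's GRU classifier, and `restore_casing` is a hypothetical helper.

```python
from typing import Callable, List

Stage = Callable[[str], str]

def make_pipeline(stages: List[Stage]) -> Stage:
    """Compose restoration stages so each one sees the previous stage's output."""
    def run(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return run

def restore_casing(text: str) -> str:
    # Rule-based stand-in: capitalise the first letter after each period.
    out, capitalise = [], True
    for ch in text:
        out.append(ch.upper() if capitalise and ch.isalpha() else ch)
        if ch.isalpha():
            capitalise = False
        elif ch == ".":
            capitalise = True
    return "".join(out)

# In the paper's best ordering the casing component runs last, after periods,
# commas, and spaces have been restored by their own components.
pipeline = make_pipeline([restore_casing])
restored = pipeline("hello world. goodbye.")
```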
    • Joint learning of morphology and syntax with cross-level contextual information flow

      Can Buglalilar, Burcu; Aleçakır, Hüseyin; Manandhar, Suresh; Bozşahin, Cem (Cambridge University Press, 2022-01-20)
      We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and dependency parsing, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism in the horizontal flow. Our model extends the work of Nguyen and Verspoor (2018) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. Our primary focus is agglutinative morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for the individual tasks. As one of the earlier efforts in jointly modeling morphology and syntax along with dependencies, we discuss prospective guidelines for future comparison.
    • Author verification of Nahj Al-Balagha

      Sarwar, Raheem; Mohamed, Emad (Oxford University Press, 2022-01-20)
      The primary purpose of this paper is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and about whose authorship Sunni and Shi’i Muslims propose different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and to works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words, and that regardless of the feature set, the three sub-corpora cluster into three distinct groups under Principal Component Analysis, Hierarchical Clustering, Multi-dimensional Scaling, and Bootstrap Consensus Trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging, and Decision Trees confirm the same results, which is a clear indication that (a) the book is internally consistent and can thus be attributed to a single person, and (b) it was not authored by either of the suspected authors.
    • Pushing the right buttons: adversarial evaluation of quality estimation

      Kanojia, Diptesh; Fomicheva, Marina; Ranasinghe, Tharindu; Blain, Frederic; Orasan, Constantin; Specia, Lucia; Barrault, Loric; Bojar, Ondrej; Bougaris, Fethi; Chatterjee, Rajen; et al. (Association for Computational Linguistics, 2022-01-11)
      Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
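The adversarial-testing logic can be illustrated with a toy check. Note that real QE is reference-free; the overlap proxy, the example sentences, and both function names below are illustrative assumptions, not the paper's method.

```python
def overlap_score(reference, hypothesis):
    # Toy quality proxy: unigram Jaccard overlap with a reference translation.
    # (A real QE system scores the MT output without any reference.)
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return len(ref & hyp) / len(ref | hyp) if ref | hyp else 0.0

def detects_perturbation(score, reference, mt_output, perturbed_output):
    """An adversarially sound metric should rank a meaning-altering
    perturbation strictly below the original MT output."""
    return score(reference, perturbed_output) < score(reference, mt_output)

reference = "the talks were not cancelled"
mt_output = "the talks were not cancelled"
perturbed = "the talks were cancelled"  # negation deleted: fluent but wrong
flagged = detects_perturbation(overlap_score, reference, mt_output, perturbed)
```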
    • Technology solutions for interpreters: the VIP system

      Corpas Pastor, Gloria (Universidad de Valladolid, 2022-01-07)
      Interpreting technologies have abruptly entered the profession in recent years. However, technology still remains a relatively marginal topic of academic debate, although interest in developing tailor-made solutions for interpreters has risen sharply. This paper presents the VIP system, one of the research outputs of the homonymous project VIP - Voice-text Integrated system for interPreters, and its continuation (VIP II). More specifically, a technology-based terminology workflow for simultaneous interpretation is presented.
    • Parsing AUC result-figures in machine learning specific scholarly documents for semantically-enriched summarization

      Safder, Iqra; Batool, Hafsa; Sarwar, Raheem; Zaman, Farooq; Aljohani, Naif Radi; Nawaz, Raheel; Gaber, Mohamed; Hassan, Saeed-Ul (Taylor & Francis, 2021-11-14)
      Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC), and their associated captions, from full-text documents. First, we classify the extracted figures and analyse them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result-figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1,000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of specialized summaries using ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.
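The trapezoidal rule applied to points recovered from a figure can be sketched in a few lines (the sample points below are invented for illustration):

```python
def trapezoidal_auc(xs, ys):
    """Approximate the area under a curve sampled at points (xs, ys)
    by summing trapezoid areas over consecutive intervals."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Three points recovered from a ROC-like result-figure (invented values):
auc = trapezoidal_auc([0.0, 0.5, 1.0], [0.0, 0.8, 1.0])
```

Dividing the curve into more intervals (more recovered points) tightens the approximation, which is what the paper exploits when parsing data plots.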
    • Linguistic features evaluation for hadith authenticity through automatic machine learning

      Mohamed, Emad; Sarwar, Raheem (Oxford University Press, 2021-11-13)
      There has not been any research that provides an evaluation of the linguistic features extracted from the matn (text) of a Hadith. Moreover, none of the fairly large corpora are publicly available as a benchmark corpus for Hadith authenticity, and there is a need to build a “gold standard” corpus for good practices in Hadith authentication. We write a scraper in the Python programming language and collect a corpus of 3,651 authentic prophetic traditions and 3,593 fake ones. We process the corpora with morphological segmentation and perform extensive experimental studies using a variety of machine learning algorithms, mainly through Automatic Machine Learning, to distinguish between these two categories. With a feature set including words, morphological segments, characters, top-N words, top-N segments, function words, and several vocabulary richness features, we analyse the results in terms of both prediction and interpretability to explain which features are more characteristic of each class. Many experiments produced good results, and the highest accuracy (78.28%) was achieved using word n-grams as features with the Multinomial Naive Bayes classifier. Our extensive experimental studies conclude that, at least for Digital Humanities, feature engineering may still be desirable due to the high interpretability of the features. The corpus and software (scripts) will be made publicly available to other researchers in an effort to promote progress and replicability.
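The best-performing configuration (word n-grams with a Multinomial Naive Bayes classifier) can be sketched with a minimal, standard-library-only implementation. The toy documents and the class below are illustrative stand-ins, not the paper's corpus or pipeline:

```python
import math
from collections import Counter

def word_ngrams(text, n=1):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over word n-gram counts,
    with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, n=1):
        self.n = n
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / len(labels))
                          for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(word_ngrams(doc, n))
        self.vocab_size = len(set().union(*self.counts.values()))
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        def log_likelihood(c):
            ll = self.log_prior[c]
            for g in word_ngrams(doc, self.n):
                ll += math.log((self.counts[c][g] + 1)
                               / (self.totals[c] + self.vocab_size))
            return ll
        return max(self.classes, key=log_likelihood)

# Toy training data (invented, NOT the paper's corpus):
docs = ["narrated by trusted chain", "trusted chain of narrators",
        "fabricated weak report", "weak fabricated chain"]
labels = ["authentic", "authentic", "fake", "fake"]
clf = TinyMultinomialNB().fit(docs, labels, n=1)
prediction = clf.predict("trusted narrators")
```

Because every feature is a visible word n-gram, the per-class counts can be inspected directly, which is the interpretability advantage the abstract argues for.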
    • Findings of the WMT 2021 shared task on quality estimation

      Specia, Lucia; Blain, Frederic; Fomicheva, Marina; Zerva, Chrysoula; Li, Zhenhao; Chaudhary, Vishrav; Martins, André (Association for Computational Linguistics, 2021-11-10)
      We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.
    • deepQuest-py: large and distilled models for quality estimation

      Alva-Manchego, Fernando; Obamuyide, Abiola; Gajbhiye, Amit; Blain, Frederic; Fomicheva, Marina; Specia, Lucia; Adel, Heike; Shi, Shuming (Association for Computational Linguistics, 2021-11-01)
      We introduce deepQuest-py, a framework for training and evaluation of large and lightweight models for Quality Estimation (QE). deepQuest-py provides access to (1) state-of-the-art models based on pre-trained Transformers for sentence-level and word-level QE; (2) light-weight and efficient sentence-level models implemented via knowledge distillation; and (3) a web interface for testing models and visualising their predictions. deepQuest-py is available at https://github.com/sheffieldnlp/deepQuest-py under a CC BY-NC-SA licence.
    • Robust fragment-based framework for cross-lingual sentence retrieval

      Trijakwanich, Nattapol; Limkonchotiwat, Peerat; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Chuangsuwanich, Ekapol; Nutanong, Sarana; Moens, Marie-Francine; Huan, Xuanjing; Specia, Lucia; Yih, Scott Wen-tau (Association for Computational Linguistics, 2021-11-01)
      Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence level to the fragment level. We performed CLSR experiments on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.
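The fragment-level retrieval granularity can be sketched as follows. The character-count "encoder" is a toy stand-in for a multilingual encoder such as m-USE, LASER, or LaBSE, and all function names are hypothetical:

```python
import math
from collections import Counter

def fragments(sentence, size=3):
    """Split a sentence into overlapping word chunks (the fragment level)."""
    toks = sentence.split()
    return [" ".join(toks[i:i + size]) for i in range(max(1, len(toks) - size + 1))]

def embed(text):
    # Toy stand-in for a sentence encoder: a character-count vector.
    return Counter(text.lower())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fragment_score(src, tgt):
    # Each source fragment matches its best target fragment; averaging the
    # best matches makes the score robust to locally noisy regions, which
    # is the intuition behind fragment-level (rather than sentence-level)
    # retrieval granularity.
    fs = [embed(f) for f in fragments(src)]
    ft = [embed(f) for f in fragments(tgt)]
    return sum(max(cosine(a, b) for b in ft) for a in fs) / len(fs)

sentence = "the committee approved the new budget proposal"
self_score = fragment_score(sentence, sentence)
```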
    • Urdu AI: writeprints for Urdu authorship identification

      Sarwar, Raheem; Hassan, Saeed-Ul (Association for Computing Machinery, 2021-10-31)
      The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a corpus significantly larger than those of existing studies and conduct several experimental studies, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, higher than all previous authorship identification works.
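The nearest-neighbour idea over a stylometric feature space can be sketched as below. The features and sample texts are invented for illustration; the paper's feature space and point-set representation are richer than this single-vector sketch:

```python
import math
from collections import Counter

def style_vector(text):
    # Hypothetical stylometric features: average word length, type-token
    # ratio, and relative frequencies of a few function words.
    toks = text.lower().split()
    counts = Counter(toks)
    feats = [sum(map(len, toks)) / len(toks), len(counts) / len(toks)]
    feats += [counts[w] / len(toks) for w in ("the", "of", "and")]
    return feats

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_author(anonymous_text, samples):
    """samples: (author, text) pairs; return the author of the stylistically
    nearest candidate sample (1-nearest-neighbour)."""
    query = style_vector(anonymous_text)
    return min(samples, key=lambda s: distance(query, style_vector(s[1])))[0]

samples = [("author_A", "the cat sat on the mat"),
           ("author_B", "quick brown foxes jump gracefully today")]
predicted = nearest_author("the dog lay on the rug", samples)
```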
    • Translationese and register variation in English-to-Russian professional translation

      Kunilovskaya, Maria; Corpas Pastor, Gloria; Wang, Vincent; Lim, Lily; Li, Defeng (Springer Singapore, 2021-10-12)
      This study explores the impact of register on the properties of translations. We compare sources, translations and non-translated reference texts to describe the linguistic specificity of translations common and unique between four registers. Our approach includes bottom-up identification of translationese effects that can be used to define translations in relation to contrastive properties of each register. The analysis is based on an extended set of features that reflect morphological, syntactic and text-level characteristics of translations. We also experiment with lexis-based features from n-gram language models estimated on large bodies of originally-authored texts from the included registers. Our parallel corpora are built from published English-to-Russian professional translations of general domain mass-media texts, popular-scientific books, fiction and analytical texts on political and economic news. The number of observations and the data sizes for parallel and reference components are comparable within each register and range from 166 (fiction) to 525 (media) text pairs; from 300,000 to 1 million tokens. Methodologically, the research relies on a series of supervised and unsupervised machine learning techniques, including those that facilitate visual data exploration. We learn a number of text classification models and study their performance to assess our hypotheses. Further on, we analyse the usefulness of the features for these classifications to detect the best translationese indicators in each register. The multivariate analysis via text classification is complemented by univariate statistical analysis which helps to explain the observed deviation of translated registers through a number of translationese effects and detect the features that contribute to them. Our results demonstrate that each register generates a unique form of translationese that can be only partially explained by cross-linguistic factors.
Translated registers differ in the amount and type of prevalent translationese. The same translationese tendencies in different registers are manifested through different features. In particular, the notorious shining-through effect is more noticeable in general media texts and news commentary and is less prominent in fiction.
    • Extracción de fraseología para intérpretes a partir de corpus comparables compilados mediante reconocimiento automático del habla

      Corpas Pastor, Gloria; Gaber, Mahmoud; Bautista Zambrana, María Rosario; Hidalgo Ternero, Carlos Manuel (Editorial Comares, 2021-10-04)
      Today, automatic speech recognition is beginning to emerge strongly in the field of interpreting. Recent studies point to this technology as one of the interpreter's main documentation resources, among other possible uses. In this paper we present a novel documentation methodology that involves semi-automatic compilation of comparable corpora (transcriptions of oral speeches) and automatic compilation of a corpus of written documents on the same topic, with a view to preparing an interpreting assignment. By way of background, we provide a brief overview of the use of automatic speech recognition in the context of interpreting technologies. Next, we detail the protocol for designing and compiling the comparable corpora that we exploit for analysis. In the last part of the paper, we cover phraseology extraction and study some collocational patterns in both corpora. Mastering the phraseology of the specific subject matter of the assignment is one of the main stumbling blocks that interpreters face in their daily work. Our ultimate aim is to establish whether oral corpora could be of further benefit to the interpreter in the preliminary preparation phase.
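Collocation extraction of the kind described can be sketched with a pointwise mutual information (PMI) ranking over adjacent word pairs. PMI is a generic association measure; the paper does not specify this exact measure, and the toy corpus below is invented:

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI):
    log of the bigram probability over the product of unigram probabilities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def pmi(pair):
        w1, w2 = pair
        return math.log((bigrams[pair] / (n - 1))
                        / ((unigrams[w1] / n) * (unigrams[w2] / n)))
    return sorted((p for p, c in bigrams.items() if c >= min_count),
                  key=pmi, reverse=True)

# Toy token stream standing in for a transcribed comparable corpus:
tokens = "strong coffee please and strong coffee again plus some random filler words".split()
top_pairs = collocations(tokens)
```

The `min_count` filter discards pairs seen too rarely for the PMI estimate to be reliable, a common precaution in corpus work.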
    • A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases

      Parodi, Giovanni; Evans, Richard; Ha, Le An; Mitkov, Ruslan; Julio, Cristóbal; Olivares-López, Raúl Ignacio (Springer, 2021-09-04)
      Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. Our method identifies, for each occurrence, the antecedent of the pronoun (including its grammatical type), the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment, and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to fine-tune an automatic analyser based on Bidirectional Encoder Representations from Transformers (BERT). On testing our method, we find that it performs with greater accuracy than three baselines (0.76 for the resolution task) and sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
    • An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2021-08-31)
      Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that these QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.
    • Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu, since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
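The multi-domain ensemble idea (run every domain-specific model and keep the most confident output) can be sketched as follows. The segmenters, the lexicon-based confidence function, and the example are toy stand-ins; a real system would use trained models and a learned confidence score:

```python
def multi_domain_ensemble(text, models, confidence):
    """Run every domain-specific segmenter and keep the most confident output.
    models: dict mapping domain name -> segmenter(text) -> list of words;
    confidence: (text, segmentation) -> float, e.g. a language-model score."""
    outputs = {domain: segment(text) for domain, segment in models.items()}
    best = max(outputs, key=lambda domain: confidence(text, outputs[domain]))
    return best, outputs[best]

# Toy stand-ins: two "domain-specific" segmenters and a lexicon-based
# confidence function over an unspaced input.
lexicon = {"hand", "writing", "handwriting"}
models = {
    "news":   lambda t: ["hand", "writing"],
    "social": lambda t: ["handw", "riting"],
}
confidence = lambda t, seg: sum(w in lexicon for w in seg) / len(seg)
domain, segmentation = multi_domain_ensemble("handwriting", models, confidence)
```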