• Exploiting tweet sentiments in altmetrics large-scale data

      Hassan, Saeed-Ul; Aljohani, Naif Radi; Iqbal Tarar, Usman; Safder, Iqra; Sarwar, Raheem; Alelyani, Salem; Nawaz, Raheel (SAGE, 2021-12-31)
      This article aims to exploit social exchanges on scientific literature, specifically tweets, to analyse social media users' sentiments towards publications within a research field. First, we employ the SentiStrength tool, extended with newly created lexicon terms, to classify the sentiments of 6,482,260 tweets associated with 1,083,535 publications provided by Altmetric.com. Then, we propose harmonic mean-based statistical measures to generate a specialized lexicon, using positive and negative sentiment scores and frequency metrics. Next, we adopt a novel article-level summarization approach to domain-level sentiment analysis to gauge the opinion of social media users on Twitter about the scientific literature. Last, we propose and employ an aspect-based analytical approach to mine users' expressions relating to various aspects of the article, such as tweets on its title, abstract, methodology, conclusion, or results section. We show that research communities exhibit dissimilar sentiments towards their respective fields. The analysis of the field-wise distribution of article aspects shows that in Medicine, Economics, and Business & Decision Sciences, tweet aspects are focused on the results section. In contrast, in Physics & Astronomy, Materials Sciences, and Computer Science, these aspects are focused on the methodology section. Overall, the study helps us to understand the sentiments of online social exchanges of the scientific community on scientific literature. Specifically, such a fine-grained analysis may help research communities to improve their social media exchanges about scientific articles, to disseminate their findings effectively, and to further increase their societal impact.
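One possible reading of the harmonic-mean step is sketched below: a candidate lexicon term is scored by the harmonic mean of its relative frequency and its mean sentiment strength, so a term ranks highly only when it is both frequent and strongly polarised. The function names, the two inputs, their scaling to [0, 1] and the sample values are all assumptions made for illustration; the abstract does not specify the exact measures.

```python
from statistics import harmonic_mean


def lexicon_score(rel_frequency: float, mean_strength: float) -> float:
    """Harmonic mean of two values in [0, 1]; 0.0 if either is not positive.

    Both arguments are hypothetical inputs: a term's relative frequency in
    the tweet corpus and its mean absolute sentiment strength.
    """
    if rel_frequency <= 0 or mean_strength <= 0:
        return 0.0
    return harmonic_mean([rel_frequency, mean_strength])


def rank_terms(stats: dict) -> list:
    """Sort terms by descending score; stats maps term -> (frequency, strength)."""
    return sorted(stats, key=lambda term: lexicon_score(*stats[term]), reverse=True)


# Hypothetical candidate terms, not taken from the study.
candidates = {
    "groundbreaking": (0.8, 0.9),  # frequent and strongly polarised
    "meh": (0.9, 0.1),             # frequent but weakly polarised
    "flawed": (0.2, 0.2),          # rare and weakly polarised
}
ranking = rank_terms(candidates)
```

Because the harmonic mean is dominated by the smaller of its inputs, a very frequent but nearly neutral term scores low, which is the usual reason for preferring it over the arithmetic mean in this kind of filter.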
    • Natural language processing for mental disorders: an overview

      Calixto, Iacer; Yaneva, Viktoriya; Cardoso, Raphael (CRC Press, 2021-12-31)
    • Urdu AI: writeprints for Urdu authorship identification

      Sarwar, Raheem; Hassan, Saeed-Ul (Association for Computing Machinery, 2021-12-31)
      The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, the accuracy of existing Urdu authorship identification solutions drops as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than those used in existing studies and conduct several experimental studies, which show that our solution can overcome the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
    • Linguistic features evaluation for hadith authenticity through automatic machine learning

      Mohamed, Emad; Sarwar, Raheem (Oxford University Press, 2021-12-31)
      There has not been any research that provides an evaluation of the linguistic features extracted from the matn (text) of a Hadith. Moreover, none of the fairly large corpora are publicly available as a benchmark corpus for Hadith authenticity, and there is a need to build a “gold standard” corpus for good practices in Hadith authentication. We write a scraper in Python and collect a corpus of 3651 authentic prophetic traditions and 3593 fake ones. We process the corpora with morphological segmentation and perform extensive experimental studies using a variety of machine learning algorithms, mainly through Automatic Machine Learning, to distinguish between these two categories. With a feature set including words, morphological segments, characters, top N words, top N segments, function words and several vocabulary richness features, we analyse the results in terms of both prediction and interpretability to explain which features are more characteristic of each class. Many experiments produced good results, and the highest accuracy (78.28%) is achieved with word n-grams as features and the Multinomial Naive Bayes classifier. Our extensive experimental studies conclude that, at least for Digital Humanities, feature engineering may still be desirable due to the high interpretability of the features. The corpus and software (scripts) will be made publicly available to other researchers in an effort to promote progress and replicability.
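The best-performing configuration named above, word n-grams fed to a Multinomial Naive Bayes classifier, can be illustrated with a small standard-library sketch. This is not the authors' code: the class and function names, the use of unigrams plus bigrams, the add-one smoothing and the toy English sentences with labels "a" and "b" are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text: str, n_max: int = 2) -> list:
    """Return all word n-grams of length 1..n_max from whitespace tokens."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class MultinomialNB:
    """Multinomial Naive Bayes over n-gram counts with additive smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(word_ngrams(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def log_scores(self, text: str) -> dict:
        """Log prior plus summed log likelihoods of known n-grams, per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in word_ngrams(text):
                if feat in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[feat] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data (hypothetical placeholder sentences and labels).
train_texts = [
    "the river runs north",
    "the river is wide",
    "a market sells bread",
    "a market opens early",
]
train_labels = ["a", "a", "b", "b"]
model = MultinomialNB().fit(train_texts, train_labels)
```

The per-feature log likelihoods are what make this family of models easy to inspect: comparing them across classes shows which n-grams pull a text towards one label.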
    • Robust fragment-based framework for cross-lingual sentence retrieval

      Trijakwanich, Nattapol; Limkonchotiwat, Peerat; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-12-31)
      Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework, called Robust Fragment-level Representation (RFR), to address Out-of-Domain (OOD) CLSR problems. In particular, we improve sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence level to the fragment level. We performed CLSR experiments on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.
    • A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases

      Parodi, Giovanni; Evans, Richard; Ha, Le An; Mitkov, Ruslan; Julio, Cristóbal; Olivares-López, Raúl Ignacio (Springer, 2021-09-04)
      Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. Our method identifies, for each occurrence, the antecedent of the pronoun (including its grammatical type), the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment, and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to fine-tune an automatic analyser based on Bidirectional Encoder Representations from Transformers (BERT). On testing our method, we find that it performs with greater accuracy than three baselines (0.76 for the resolution task), and sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
    • SemEval-2021 task 1: Lexical complexity prediction

      Shardlow, Matthew; Evans, Richard; Paetzold, Gustavo Henrique; Zampieri, Marcos (Association for Computational Linguistics, 2021-08-01)
      This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.
    • Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
    • Knowledge distillation for quality estimation

      Gajbhiye, Amit; Fomicheva, Marina; Alva-Manchego, Fernando; Blain, Frederic; Obamuyide, Abiola; Aletras, Nikolaos; Specia, Lucia (Association for Computational Linguistics, 2021-08-01)
      Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
    • Combining text and images for film age appropriateness classification

      Ha, Le; Mohamed, Emad (Elsevier, 2021-07-14)
      We combine textual information from a corpus of film scripts and the images of important scenes from IMDB that correspond to these films to create a bimodal dataset (the dataset and scripts can be obtained from https://tinyurl.com/se9tlmr) for film age appropriateness classification, with the objective of improving the prediction of age appropriateness for parents and children. We use state-of-the-art Deep Learning image feature extraction, including DenseNet, ResNet, Inception, and NASNet. We have tested several Machine Learning algorithms and have found XGBoost to yield the best results. Previously reported classification accuracies, using only textual features, were 79.1% and 65.3% for American MPAA and British BBFC classification respectively. Using images alone, we achieve 64.8% and 56.7% classification accuracy. The most consistent combination of textual and image features achieves 81.1% and 66.8%, both statistically significant improvements over the use of text only.
    • Sentiment analysis for Urdu online reviews using deep learning models

      Safder, Iqra; Mehmood, Zainab; Sarwar, Raheem; Hassan, Saeed-Ul; Zaman, Farooq; Adeel Nawab, Rao Muhammad; Bukhari, Faisal; Ayaz Abbasi, Rabeeh; Alelyani, Salem; Radi Aljohani, Naif; et al. (Wiley, 2021-06-28)
      Most existing studies focus on popular languages such as English, Spanish, Chinese, and Japanese; limited attention has been paid to Urdu, despite its more than 60 million native speakers. In this paper, we develop a deep learning model for the sentiments expressed in this under-resourced language. We develop an open-source corpus of 10,008 reviews from 566 online threads on the topics of sports, food, software, politics, and entertainment. The objectives of this work are twofold: (1) the creation of a human-annotated corpus for research on sentiment analysis in Urdu; and (2) the measurement of up-to-date model performance using the corpus. For this assessment, we performed binary and ternary classification studies using several models, namely Rule-Based, N-gram, SVM, CNN, LSTM, and RCNN. The RCNN model surpasses the standard models with 84.98% accuracy for binary classification and 68.56% accuracy for ternary classification. To facilitate other researchers working in the same domain, we have open-sourced the corpus and code developed for this research.
    • Decálogo de características de la literatura poscolonial: propuesta de una taxonomía para la crítica literaria y los estudios de literatura comparada

      Fernández Ruiz, María Remedios; Corpas Pastor, Gloria; Seghiri, Míriam (Editorial CSIC, 2021-06-22)
      The aim of this article is to propose a classification of the features present, to a greater or lesser extent, in postcolonial literature in any language. Although this taxonomy takes as its starting point previous theoretical definitions of the key concepts related to postcolonial literature (Edwards 2008, Nayar 2008 and Ramone 2011), it appears to be the first formal classification developed on the subject. Established concepts are thus analysed, while the new notion of the plasticity of literary genres is introduced and current trends in research on intersectionality are explored. As a result, we provide a decalogue of characteristics of postcolonial literature that will support literary criticism and comparative literature studies.
    • Management of 201 individuals with emotionally unstable personality disorders: A naturalistic observational study in real-world inpatient setting

      Shahpesandy, Homayun; Mohammed-Ali, Rosemary; Oakes, Michael; Al-Kubaisy, Tarik; Cheetham, Anna; Anene, Moses; The Hartsholme Centre, Long Leys Road, Lincoln, LN1 1FS, Lincolnshire NHS Foundation Trust, UK. (Maghira & Maas Publications, 2021-06-03)
      BACKGROUND: Emotionally unstable personality disorder (EUPD) is a challenging condition with a prevalence of 20% in inpatient services. Psychotherapy is the preferred treatment; nevertheless, off-license medications are widely used. OBJECTIVES: To identify socio-demographic, clinical and service-delivery characteristics of people with EUPD admitted to inpatient services between 1st January 2017 and 31st December 2018. METHODS: A retrospective review using data from patients' records. Individuals aged 18-65 were included. Statistical analysis was conducted using the Mann-Whitney-Wilcoxon test and the Chi-squared test with Yates's continuity correction. RESULTS: Of 1646 inpatients, 201 (12.2%) had a diagnosis of EUPD: 133 (66.0%) women and 68 (44.0%) men. EUPD was significantly (P < .001) more prevalent in women (18.2%) than men (7.4%). EUPD patients were significantly (P < .001) younger (32.2 years) than patients without EUPD (46 years), and had significantly (P < .001) more admissions (1.74) than patients without EUPD (1.2 admissions). 70.5% of patients had one and 17.0% two Axis-I psychiatric co-morbidities. Substance use was significantly (P < .001) more frequent in men (57.3%) than in women (28.5%). Significantly (P = 0.047) more women (68.4%) than men (53.0%) reported sexual abuse. 87.5% used polypharmacy. Antidepressants were significantly (P = 0.02) more often prescribed to women (76.6%) than men (69.1%). Significantly (P = 0.02) more women (83.5%) than men (67.6%) were on antipsychotics. 57.2% of the patients were on anxiolytics, 40.0% on hypnotics and 25.8% on mood stabilisers. CONCLUSION: EUPD is a complex condition with widespread comorbidity. The term EUPD, or Borderline Personality Disorder, is unsuitable, stigmatising and too simplistic to reflect the nature, gravity and psychopathology of this syndrome.
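For readers unfamiliar with the Chi-squared test with Yates's continuity correction mentioned in the methods, the computation for a 2x2 table can be sketched as follows. The function name and the example counts are hypothetical and are not the study's data; the p-value uses the closed form for one degree of freedom.

```python
import math


def chi2_yates(a: int, b: int, c: int, d: int):
    """Chi-squared test with Yates's correction for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # Shrink |ad - bc| by n/2 (the continuity correction), never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 dof.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are two groups, columns are outcome yes / no.
stat, p = chi2_yates(10, 20, 20, 10)
```

The correction makes the test more conservative for small counts; without it the same table would give a statistic of about 6.67 rather than 5.4.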
    • Backtranslation feedback improves user confidence in MT, not quality

      Zouhar, Vilém; Novák, Michal; Žilinec, Matúš; Bojar, Ondřej; Obregón, Mateo; Hill, Robin L; Blain, Frédéric; Fomicheva, Marina; Specia, Lucia; Yankovskaya, Lisa; et al. (Association for Computational Linguistics, 2021-06-01)
      Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      This article analyses a case study carried out with students of the course Traducción General «BA-AB» (II) - Inglés-Español / Español-Inglés, taught in the second semester of the second year of the Degree in Translation and Interpreting at the Universidad de Málaga. In a first phase, the students were taught how to make the most of different electronic documentation resources and tools (linguistic corpora, lexicographic resources or the web, among others) to create textual equivalences in cases where, as a result of interlingual phraseological anisomorphism, the creative modification of phraseological units (PUs) in the source text and the absence of one-to-one correspondences pose serious difficulties for the translation process. Thus, an initial training activity on the translation of creative uses of phraseological units was followed by a practical session in which the students had to deal with different cases of manipulation in the source text. The analysis of these results makes it possible to gauge the extent to which the different documentation resources help trainee translators to overcome the challenge of phraseological manipulation.
    • A cascaded unsupervised model for PoS tagging

      Bölücü, Necva; Can, Burcu (ACM, 2021-03-31)
      Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing (NLP): it assigns a syntactic category (such as noun, verb or adjective) to each word within a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g. dependency parsing) and thereby extract the meaning of the sentence (e.g. semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. Log-linear models are among the most widely used methods for the tagging problem. The initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thus transfer knowledge between two different unsupervised models to improve the PoS tagging results, where the log-linear model benefits from the Bayesian model’s expertise. We present results for Turkish as a morphologically rich language and for English as a comparatively morphologically poor language in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
    • Constructional idioms of ‘insanity’ in English and Spanish: A corpus-based study

      Corpas Pastor, Gloria (Elsevier, 2021-02-10)
      This paper presents a corpus-based study of constructions in English and Spanish, with a special emphasis on equivalent semantic-functional counterparts, and potential mismatches. Although usage/corpus-based Construction Grammar (CxG) has attracted much attention in recent years, most studies have dealt exclusively with monolingual constructions. In this paper we will focus on two constructions that represent conventional ways to express ‘insanity’ in both languages. The analysis will cover grammatical, semantic and informative aspects in order to establish a multi-linguistic prototype of the constructions. To that end, data from several giga-token corpora of contemporary spoken English and Spanish (parallel and comparable) have been selected. This study advances the explanatory potential of constructional idioms for the study of idiomaticity, variability and cross-language analysis. In addition, relevant findings on the dialectal distribution of certain idiom features across both languages and their national varieties are also reported.
    • El hablar y el discurso repetido: la fraseología

      Mellado, Carmen; Corpas, Gloria; Berty, Katrin; Loureda, Óscar; Schrott, Angela (De Gruyter, 2021-01-18)
      This chapter shows the interrelation between fixedness and variability in phraseological units from several points of view. First, we carry out a detailed analysis of Coseriu's concept of «discurso repetido» (repeated discourse), which from its origin already considers the idea of creative change, and then offer an overview of the evolution of phraseology in relation to text linguistics. Second, we present a classification of the typology of phraseological variation, illustrated with examples from linguistic corpora and centred on the levels of system and speech, as well as on the speaker's intentionality. Third, we address phraseological variability and the turn that the notion of «fijación» (fixedness) has taken since massive corpus data became available. In this context, the measures of absolute frequency, normalised frequency and statistical significance play a fundamental role in determining the degree of fixedness.
    • Attention: there is an inconsistency between android permissions and application metadata!

      Alecakir, Huseyin; Can, Burcu; Sen, Sevil (Springer Science and Business Media LLC, 2021-01-07)
      Since mobile applications make our lives easier, the application markets offer a large number of mobile applications customized for our needs. While the application markets provide us with a platform for downloading applications, they are also used by malware developers to distribute their malicious applications. In Android, permissions are intended to raise users' awareness and so prevent them from installing applications that might violate their privacy. From the privacy and security point of view, if the functionality of applications is given in sufficient detail in their descriptions, then the need for the requested permissions can be well understood. This is defined as description-to-permission fidelity in the literature. In this study, we propose two novel models that address the inconsistencies between application descriptions and requested permissions. The proposed models are based on current state-of-the-art neural architectures called attention mechanisms. Here, we aim to find the permission statement words or sentences in app descriptions by using the attention mechanism along with recurrent neural networks. The lack of such permission statements in application descriptions raises suspicion. Hence, the proposed approach could assist static analysis techniques in finding suspicious apps and in prioritizing apps for more resource-intensive analysis techniques. The experimental results show that the proposed approach achieves high accuracy.
    • Turkish music generation using deep learning

      Aydıngün, Anıl; Bağdatlıoğlu, Denizcan; Canbaz, Burak; Kökbıyık, Abdullah; Yavuz, M Furkan; Bölücü, Necva; Can, Burcu (IEEE, 2021-01-07)
      In this work, a new model is introduced for Turkish song generation using deep learning. It is the first work on Turkish song generation in which the lyrics are generated automatically by a language model based on Recurrent Neural Networks, the notes forming the melody are generated analogously by a neural language model, and singing synthesis is performed by combining the lyrics with the melody.