Recent Submissions

  • Terms in journal articles associating with high quality: can qualitative research be world-leading?

    Thelwall, Mike; Kousha, Kayvan; Abdoli, Mahshid; Stuart, Emma; Makita, Meiko; Wilson, Paul; Levitt, Jonathan (Emerald, 2023-12-31)
    Purpose: Scholars often aim to conduct high quality research and their success is judged primarily by peer reviewers. Research quality is difficult for either group to identify, however, and misunderstandings can reduce the efficiency of the scientific enterprise. In response, we use a novel term association strategy to seek quantitative evidence of aspects of research that associate with high or low quality. Design/methodology/approach: We extracted the words and 2–5-word phrases most strongly associating with different quality scores in each of 34 Units of Assessment (UoAs) in the Research Excellence Framework (REF) 2021. We extracted the terms from 122,331 journal articles 2014-2020 with individual REF2021 quality scores. Findings: The terms associating with high- or low-quality scores vary between fields but relate to writing styles, methods, and topics. We show that the first-person writing style strongly associates with higher quality research in many areas because it is the norm for a set of large prestigious journals. We found methods and topics that associate with both high- and low-quality scores. Worryingly, terms associating with educational and qualitative research attract lower quality scores in multiple areas. REF experts may rarely give high scores to qualitative or educational research because the authors tend to be less competent, because it is harder to make world leading research with these themes, or because they do not value them. Originality: This is the first investigation of journal article terms associating with research quality.
  • Data sharing and reuse practices: Disciplinary differences and improvements needed

    Khan, Nushrat; Thelwall, Mike; Kousha, Kayvan (Emerald, 2023-12-31)
    Purpose: This study investigates differences and commonalities in data production, sharing and reuse across the widest range of disciplines yet, and identifies types of improvements needed to promote data sharing and reuse. Design: The first authors of randomly selected publications from 2018 and 2019 in 20 Scopus disciplines were surveyed for their beliefs and experiences about data sharing and reuse. Findings: From the 3,257 survey responses, data sharing and reuse are still increasing but not ubiquitous in any subject area and are more common among experienced researchers. Researchers with previous data reuse experience were more likely to share data than others. Types of data produced and systematic online data sharing varied substantially between subject areas. Although the use of institutional and journal-supported repositories for sharing data is increasing, personal websites are still frequently used. Combining multiple existing datasets to answer new research questions was the most common use. Proper documentation, openness, and information on the usability of data continue to be important when searching for existing datasets. However, researchers in most disciplines struggled to find datasets to reuse. Researcher feedback suggested 23 recommendations to promote data sharing and reuse, including improved data access and usability, formal data citations, new search features, and cultural and policy-related disciplinary changes to increase awareness and acceptance. Originality: This study is the first to explore data sharing and reuse practices across the full range of academic discipline types. It expands and updates previous data sharing surveys and suggests new areas of improvement in terms of policy, guidance, and training programs.
  • Digital footprints of Kashmiri pandit migration on Twitter

    Gulzar, Farzana; Gul, Sumeer; Mehraj, Midhat; Bano, Shohar; Thelwall, Mike (Ediciones Profesionales de la Informacion SL, 2022-11-16)
    The paper investigates changing levels of online concern about the Kashmiri Pandit migration of the 1990s on Twitter. Although decades old, this movement of people is an ongoing issue in India, with no current resolution. Analysing changing reactions to it on social media may shed light on trends in public attitudes to the event. Tweets were downloaded from Twitter using the academic version of its application programming interface (API) with the aid of the free social media analytics software Mozdeh. A set of 1000 tweets was selected for content analysis with a random number generator in Mozdeh. The results show that the number of tweets about the issue has increased over time, mainly from India, and predominantly driven by the release of films like Shikara and The Kashmir Files. The tweets show apparent universal support for the Pandits but often express strong emotions or criticize the actions of politicians, showing that the migration is an ongoing source of anguish and frustration that needs resolution. The results also show that social media analysis can give insights even into primarily offline political issues that predate the popularity of the web, and can easily incorporate international perspectives necessary to understand complex migration issues.
  • Why are medical research articles tweeted? The news value perspective

    Htoo, Tint Hla Hla; Na, Jin-Cheon; Thelwall, Mike (Springer, 2022-11-14)
    Counts of tweets mentioning research articles are potentially useful as social impact altmetric indicators, especially for health-related topics. One way to help understand what tweet counts indicate is to find factors that associate with the number of tweets received by articles. Using news value theory, this study examined six characteristics of research papers that may cause some articles to be more tweeted than others. For this, we manually coded 300 medical journal articles about COVID-19. A statistical analysis showed that all six factors that make articles more newsworthy according to news value theory (importance, controversy, elite nations, elite persons, scale, news prominence) associated with higher tweet counts. Since these factors are hypothesised to be general human news selection criteria, the results give new evidence that tweet counts may be indicators of general interest to members of society rather than measures of societal impact. This study also provides a new understanding of the strong positive relationship between news mentions and tweet counts for articles. Instead of news coverage attracting tweets or the other way round (journalists noticing highly tweeted articles and writing about them), the results are consistent with newsworthy characteristics of articles attracting both tweets and news mentions.
  • “I don’t think education is the answer”: a corpus-assisted ecolinguistic analysis of plastics discourses in the UK

    Franklin, Emma; Gavins, Joanna; Mehl, Seth (De Gruyter Mouton, 2022-08-15)
    Ecosystems around the world are becoming engulfed in single-use plastics, the majority of which come from plastic packaging. Reusable plastic packaging systems have been proposed in response to this plastic waste crisis, but uptake of such systems in the UK is still very low. This article draws on a thematic corpus of 5.6 million words of UK English around plastics, packaging, reuse, and recycling to examine consumer attitudes towards plastic (re)use. Utilizing methods and insights from ecolinguistics, corpus linguistics, and cognitive linguistics, this article assesses to what degree consumer language differs from that of public-facing bodies such as supermarkets and government entities. A predefined ecosophy, prioritizing protection, rights, systems thinking, and fairness, is used to not only critically evaluate narratives in plastics discourse but also to recommend strategies for more effective and ecologically beneficial communications around plastics and reuse. This article recommends the adoption of ecosophy in multidisciplinary project teams, and argues that ecosophies are conducive to transparent and reproducible discourse analysis. The analysis also suggests that in order to make meaningful change in packaging reuse behaviors, it is highly likely that deeply ingrained cultural stories around power, rights, and responsibilities will need to be directly challenged.
  • The USMLE® Step 2 clinical skills patient note corpus

    Yaneva, Victoria; Mee, Janet; Ha, Le An; Harik, Polina; Jodoin, Michael; Mechaber, Alex (Association for Computational Linguistics, 2022-07-31)
    This paper presents a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients - people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs were annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at https://www.nbme.org/services/data-sharing.
  • Author gender identification for Urdu articles

    Sarwar, Raheem; Corpas Pastor, Gloria; Mitkov, Ruslan (Springer, 2022-09-21)
    In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, researchers have not paid enough attention to performing this task for Urdu articles. Firstly, I created a new Urdu corpus to perform the author gender identification task. I then extracted two types of features from each article including the most frequent 600 multi-word expressions and the most frequent 300 words. After I completed the corpus creation and features extraction processes, I performed the features concatenation process. As a result, each article was represented in a 900D feature space. Finally, I applied 10 different well-known classifiers to these features to perform the author gender identification task and compared their performances against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). I conducted extensive experimental studies which show that (i) using the most frequent 600 multi-word expressions as features and concatenating them with the most frequent 300 words as features improves the accuracy of the author gender identification task, and (ii) support vector machines outperform other classifiers, as well as fine-tuned pre-trained language models and CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
  • TurkishDelightNLP: A neural Turkish NLP toolkit

    Alecakir, Huseyin; Bölücü, Necva; Can, Burcu (ACL, 2022-07-01)
    We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from the morphological level to the semantic level, covering tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the open-source Turkish NLP toolkit through a web interface that allows an input text to be analysed in real-time, as well as the open-source implementation of the components provided in the toolkit, an API, and several annotated datasets such as a word similarity test set to evaluate word embeddings and UCCA-based semantic annotation in Turkish. This will be the first open-source Turkish NLP toolkit that involves a range of NLP tasks at all levels. We believe that it will be useful for other researchers in Turkish NLP and will also be beneficial for other high-level NLP tasks in Turkish.
  • Turkish universal conceptual cognitive annotation

    Bölücü, Necva; Can, Burcu; Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; et al. (European Language Resources Association, 2022-06-01)
    Universal Conceptual Cognitive Annotation (UCCA) is a cross-lingual semantic annotation framework that provides an easy annotation without any requirement for linguistic background. UCCA-annotated datasets have been already released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset that currently involves 50 sentences obtained from the METU-Sabanci Turkish Treebank. We followed a semi-automatic annotation approach, where an external semantic parser is utilised for an initial annotation of the dataset, which is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that are not in line with the UCCA rules that we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments with both zero-shot and few-shot learning. This is the initial version of the annotated dataset and we are currently extending the dataset. We are releasing the current Turkish UCCA annotation guideline along with the annotated dataset.
  • Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech

    Mandl, Thomas; Modha, Sandip; Shahi, Gautam Kishore; Madhu, Hiren; Satapara, Shrey; Majumder, Prasenjit; Schäfer, Johannes; Ranasinghe, Tharindu; Zampieri, Marcos; Nandini, Durgesh; et al. (Association for Computing Machinery, 2021-12-13)
    The HASOC track is dedicated to the evaluation of technology for finding Offensive Language and Hate Speech. HASOC is creating a multilingual data corpus mainly for English and under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized the classification task for English, Hindi, and Marathi. The first task consists of two classification tasks; Subtask 1A consists of a binary and fine-grained classification into offensive and non-offensive tweets. Subtask 1B asks to classify the tweets into Hate, Profane and offensive. Task 2 consists of identifying tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data and the results obtained from the participants' submissions.
  • Predicting lexical complexity in English texts: the Complex 2.0 dataset

    Shardlow, Matthew; Evans, Richard; Zampieri, Marcos (Springer, 2022-03-23)
    Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
  • TransQuest: Translation quality estimation with cross-lingual transformers

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (International Committee on Computational Linguistics, 2020-12-31)
    Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.
  • RGCL at SemEval-2020 task 6: Neural approaches to definition extraction

    Ranasinghe, Tharindu; Plum, Alistair; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-12-31)
    This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
  • You are driving me up the wall! A corpus-based study of a special class of resultative constructions

    Corpas Pastor, Gloria (Université Jean Moulin - Lyon 3, 2022-03-26)
    This paper focuses on resultative constructions from a computational and corpus-based approach. We claim that the array of expressions (traditionally classed as idioms, collocations, free word combinations, etc.) that are used to convey a person’s change of mental state (typically negative) are basically instances of the same resultative construction. The first part of the study will introduce basic tenets of Construction Grammar and resultatives. Then, our corpus-based methodology will be spelled out, including a description of the two giga-token corpora used and a detailed account of our protocolised heuristic strategies and tasks. Distributional analysis of matrix slot fillers will be presented next, together with a discussion on restrictions, novel instances, and productivity. A final section will round up our study, with special attention to notions like “idiomaticity”, “productivity” and “variability” of the pairings of form and meaning analysed. To the best of our knowledge, this is one of the first studies based on giga-token corpora that explores idioms as integral parts of higher-order resultative constructions.
  • Multilingual offensive language identification for low-resource languages

    Ranasinghe, Tharindu; Zampieri, Marcos (Association for Computing Machinery, 2021-11-10)
    Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
  • Intelligent translation memory matching and retrieval with sentence encoders

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-11-30)
    Matching and retrieving previously translated segments from a Translation Memory is the key functionality in Translation Memories systems. However, this matching and retrieving process is still limited to algorithms based on edit distance, which we have identified as a major drawback in Translation Memories systems. In this paper we introduce sentence encoders to improve the matching and retrieving process in Translation Memories systems - an effective and efficient solution to replace edit distance based algorithms.
  • TransQuest at WMT2020: Sentence-Level direct assessment

    Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2020-11-30)
    This paper presents team TransQuest's participation in the Sentence-Level Direct Assessment shared task in WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine-tune the QE framework by performing ensemble and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.
  • Tuning language representation models for classification of Turkish news

    Tokgöz, Meltem; Turhan, Fatmanur; Bölücü, Necva; Can, Burcu (ACM, 2021-02-19)
    Pre-trained language representation models are very efficient in learning language representation independent from natural language processing tasks to be performed. The language representation models such as BERT and DistilBERT have achieved amazing results in many language understanding tasks. Studies on text classification problems in the literature are generally carried out for the English language. This study aims to classify the news in the Turkish language using pre-trained language representation models. In this study, we utilize BERT and DistilBERT by tuning both models for the text classification task to learn the categories of Turkish news with different tokenization methods. We provide a quantitative analysis of the performance of BERT and DistilBERT on the Turkish news dataset by comparing the models in terms of their representation capability in the text classification task. The highest performance is obtained with DistilBERT with an accuracy of 97.4%.
  • Turkish stemming with LSTM networks (LSTM Ağları ile Türkçe Kök Bulma)

    Can, Burcu (Gazi Üniversitesi, 2019-07-31)
    Turkish is an agglutinative language in which words are formed by attaching units called morphemes one after another. Building words by combining different parts causes a sparsity problem in many natural language processing applications, such as machine translation, sentiment analysis, and information extraction, because each distinct form of a word is treated as a different word. In this article, we propose a method for automatically finding the stems of words by stripping their derivational and inflectional suffixes. The method is based on an encoder-decoder approach built with recurrent neural networks. Any given word is first encoded by the neural network architecture we constructed and then decoded to obtain its stem. This method has so far been used for problems such as tagging and machine translation. Compared with other Turkish stemming models, the results were observed to be quite good. As with other models, the proposed model performs stemming using only a training dataset of word and stem pairs, without any manually defined rule set.
  • MLQE-PE: A multilingual quality estimation and post-editing dataset

    Fomicheva, Marina; Sun, Shuo; Fonseca, Erick; Zerva, Chrysoula; Blain, Frédéric; Chaudhary, Vishrav; Guzmán, Francisco; Lopatina, Nina; Specia, Lucia; Martins, André FT (arXiv, 2020-10-11)
    We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains eleven language pairs, with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.
