Research Institute in Information and Language Processing
Recent Submissions
-
Advances in automatic terminology processing: methodology and applications in focus
The information and knowledge era in which we are living creates challenges in many fields, and terminology is no exception. The challenges include an exponential growth in the number of specialised documents in which terms are presented, and in the number of newly introduced concepts and terms, which is already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures that enable individuals and/or small groups to efficiently build high-quality terminologies from their own resources, closely reflecting their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often treat terms as independent lexical units satisfying some criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and the review of existing approaches in ATP presented in this thesis. In order to overcome the aforementioned weakness, we propose a novel ATP methodology which is able to extract a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we consider to be valuable but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations and descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system. The study also discusses applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiple-choice test item generation. The successful integration of the system shows that ATP is a viable technology that should be exploited more by other NLP applications.
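To make the knowledge-pattern idea concrete, here is a minimal, hypothetical Python sketch of pattern-based extraction: two invented glossary-style patterns (an "is-a" definitional pattern and an "also known as" synonymy pattern) are applied to sentences to pull out terms and their relations. The patterns and group names are illustrative assumptions, not the inventory mined in the thesis.

```python
import re

# Two illustrative knowledge patterns of the kind that could be mined from
# glossary definitions (hypothetical examples, not the thesis's inventory).
PATTERNS = [
    # "X is a/an Y that/which ..." -> term X with hypernym (genus) Y
    re.compile(r"(?P<term>[A-Z][\w -]+?) is an? (?P<genus>[\w -]+?)(?: that| which| used)"),
    # "X, also known as Y" -> synonymy between two term variants
    re.compile(r"(?P<term>[\w -]+?), also known as (?P<synonym>[\w -]+)"),
]

def extract(sentence):
    """Apply each knowledge pattern and collect terms with their relations."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hits.append(match.groupdict())
    return hits

print(extract("A lexeme is a unit of lexical meaning that underlies a set of words."))
print(extract("dyspnea, also known as shortness of breath"))
```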
-
Do altmetric scores reflect article quality? Evidence from the UK Research Excellence Framework 2021
Altmetrics are web-based quantitative impact or attention indicators for academic articles that have been proposed to supplement citation counts. This article reports the first assessment of the extent to which mature altmetrics from Altmetric.com and Mendeley associate with individual article quality scores. It exploits expert norm-referenced peer review scores from the UK Research Excellence Framework 2021 for 67,030+ journal articles in all fields 2014-17/18, split into 34 broadly field-based Units of Assessment (UoAs). Altmetrics correlated more strongly with research quality than previously found, although less strongly than raw and field-normalised Scopus citation counts. Surprisingly, field-normalising citation counts can reduce their strength as a quality indicator for articles in a single field. For most UoAs, Mendeley reader counts are the best altmetric (e.g., three Spearman correlations with quality scores above 0.5); tweet counts are also a moderate-strength indicator in eight UoAs (Spearman correlations with quality scores above 0.3), ahead of news (8 correlations above 0.3, but generally weaker), blog (5 correlations above 0.3), and Facebook (3 correlations above 0.3) citations, at least in the UK. In general, altmetrics are the strongest indicators of research quality in the health and physical sciences and weakest in the arts and humanities.
Keywords: Altmetrics, Research Excellence Framework, REF2021, alternative indicators, scientometrics, bibliometrics, field normalised citations.
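For readers unfamiliar with the statistic reported here, this toy Python sketch shows how a per-UoA Spearman correlation between peer-review quality scores and an altmetric count would be computed; the numbers are invented, since article-level REF scores are not public.

```python
from scipy.stats import spearmanr

# Invented toy data: REF-style quality scores (1*-4*) and Mendeley reader
# counts for eight hypothetical articles in one Unit of Assessment.
quality_scores = [4, 3, 3, 2, 1, 4, 2, 3]
mendeley_readers = [120, 60, 75, 20, 5, 200, 18, 90]

rho, p_value = spearmanr(quality_scores, mendeley_readers)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```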
-
Are successful co-authors more important than first authors for publishing academic journal articles?
Academic research often involves teams of experts, and it seems reasonable to believe that successful main authors or co-authors would tend to help produce better research. This article investigates an aspect of this across science with an indirect method: the extent to which the publishing record of an article’s authors associates with the citation impact of the publishing journal (as a proxy for the quality of the article). The data are based on author career publishing evidence for journal articles 2014-20 and the journals of articles published in 2017. At the Scopus broad field level, international correlations and country-specific regressions for five English-speaking nations (Australia, Ireland, New Zealand, UK and USA) suggest that first-author citation impact is more important than co-author citation impact, but co-author productivity is more important than first-author productivity. Moreover, author citation impact is more important than author productivity. There are disciplinary differences in the results, with first-author productivity surprisingly tending to be a disadvantage in the physical sciences and life sciences, at least in the sense of associating with lower-impact journals. The results are limited by the regressions only including domestic research and by a lack of evidence-based cause-and-effect explanations. Nevertheless, the data suggest that impactful team members are more important than productive team members, and that whilst an impactful first author is a science-wide advantage, an experienced first author often is not.
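A hedged sketch of the style of country-specific regression described here: the journal's citation impact is regressed on first-author and co-author career impact and productivity. All variable names and values are illustrative placeholders, not the study's data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data: one row per article, with the publishing journal's
# citation impact and career indicators for its first author and co-authors.
df = pd.DataFrame({
    "journal_impact":      [1.2, 0.8, 2.1, 1.5, 0.6, 1.9, 1.1, 1.4],
    "first_author_impact": [1.0, 0.5, 2.0, 1.2, 0.4, 1.8, 0.9, 1.3],
    "coauthor_impact":     [1.1, 0.7, 1.9, 1.6, 0.5, 2.0, 1.0, 1.2],
    "first_author_output": [5, 2, 12, 7, 1, 15, 4, 6],
    "coauthor_output":     [8, 3, 20, 10, 2, 25, 6, 9],
})

model = smf.ols(
    "journal_impact ~ first_author_impact + coauthor_impact"
    " + first_author_output + coauthor_output",
    data=df,
).fit()
print(model.params)  # relative coefficient sizes are the quantity of interest
```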
-
Is big team research fair in national research assessments? The case of the UK Research Excellence Framework 2021
Collaborative research causes problems for research assessments because of the difficulty in fairly crediting its authors. Whilst splitting the rewards for an article amongst its authors has the greatest surface-level fairness, many important evaluations assign full credit to each author, irrespective of team size. The underlying rationales for this are labour reduction and the need to incentivise collaborative work because it is necessary to solve many important societal problems. This article assesses whether full counting changes results compared to fractional counting in the case of the UK’s Research Excellence Framework (REF) 2021. For this assessment, fractional counting reduces the number of journal articles to as little as 10% of the full counting value, depending on the Unit of Assessment (UoA). Despite this large difference, allocating an overall grade point average (GPA) based on full counting or fractional counting gives results with a median Pearson correlation within UoAs of 0.98. The largest changes are for Archaeology (r=0.84) and Physics (r=0.88). There is a weak tendency for higher-scoring institutions to lose from fractional counting, with the loss being statistically significant in 5 of the 34 UoAs. Thus, whilst the apparent over-weighting of contributions to collaboratively authored outputs does not seem too problematic from a fairness perspective overall, it may be worth examining in the few UoAs in which it makes the most difference.
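The full versus fractional counting contrast is easy to state in code. The toy sketch below computes an institutional GPA under both schemes and correlates them; the institutions, scores, and author counts are invented.

```python
from scipy.stats import pearsonr

# (institution, quality score 1-4, number of authors) for invented articles
articles = [
    ("A", 4, 1), ("A", 3, 5), ("A", 4, 20),
    ("B", 2, 1), ("B", 3, 2), ("B", 4, 50),
    ("C", 3, 1), ("C", 2, 3), ("C", 3, 10),
]

def gpa(weight):
    """Weighted grade point average per institution."""
    totals, weights = {}, {}
    for inst, score, n_authors in articles:
        w = weight(n_authors)
        totals[inst] = totals.get(inst, 0.0) + score * w
        weights[inst] = weights.get(inst, 0.0) + w
    return {inst: totals[inst] / weights[inst] for inst in totals}

full = gpa(lambda n: 1.0)      # full counting: each article counts once
frac = gpa(lambda n: 1.0 / n)  # fractional counting: 1/n credit per article

insts = sorted(full)
r, _ = pearsonr([full[i] for i in insts], [frac[i] for i in insts])
print(full, frac, f"r = {r:.2f}", sep="\n")
```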
-
Terms in journal articles associating with high quality: can qualitative research be world-leading?
Purpose: Scholars often aim to conduct high-quality research and their success is judged primarily by peer reviewers. Research quality is difficult for either group to identify, however, and misunderstandings can reduce the efficiency of the scientific enterprise. In response, we use a novel term association strategy to seek quantitative evidence of aspects of research that associate with high or low quality.
Design/methodology/approach: We extracted the words and 2–5-word phrases most strongly associating with different quality scores in each of 34 Units of Assessment (UoAs) in the Research Excellence Framework (REF) 2021. We extracted the terms from 122,331 journal articles 2014-2020 with individual REF2021 quality scores.
Findings: The terms associating with high- or low-quality scores vary between fields but relate to writing styles, methods, and topics. We show that the first-person writing style strongly associates with higher-quality research in many areas because it is the norm for a set of large prestigious journals. We found methods and topics that associate with both high- and low-quality scores. Worryingly, terms associating with educational and qualitative research attract lower quality scores in multiple areas. REF experts may rarely give high scores to qualitative or educational research because the authors tend to be less competent, because it is harder to produce world-leading research with these themes, or because they do not value them.
Originality: This is the first investigation of journal article terms associating with research quality.
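One simple way to realise a term association strategy is to compare the mean quality score of articles containing a term with that of articles without it, as in the toy sketch below; the articles, terms, and scores are invented, and the study's actual association measure may differ.

```python
from collections import defaultdict

# Invented articles: (set of terms in the article, REF-style quality score)
articles = [
    ({"we", "randomised", "trial"}, 4),
    ({"we", "cohort"}, 3),
    ({"questionnaire", "students"}, 2),
    ({"questionnaire", "interviews"}, 1),
]

with_term, without_term = defaultdict(list), defaultdict(list)
vocab = set().union(*(terms for terms, _ in articles))
for terms, score in articles:
    for term in vocab:
        (with_term if term in terms else without_term)[term].append(score)

# Positive differences: the term associates with higher quality scores.
for term in sorted(vocab):
    diff = (sum(with_term[term]) / len(with_term[term])
            - sum(without_term[term]) / len(without_term[term]))
    print(f"{term:15s} mean-score difference: {diff:+.2f}")
```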
-
Data sharing and reuse practices: Disciplinary differences and improvements needed
Purpose: This study investigates differences and commonalities in data production, sharing and reuse across the widest range of disciplines yet, and identifies types of improvements needed to promote data sharing and reuse.
Design: The first authors of randomly selected publications from 2018 and 2019 in 20 Scopus disciplines were surveyed for their beliefs and experiences about data sharing and reuse.
Findings: From the 3,257 survey responses, data sharing and reuse are still increasing but not ubiquitous in any subject area, and are more common among experienced researchers. Researchers with previous data reuse experience were more likely to share data than others. Types of data produced and systematic online data sharing varied substantially between subject areas. Although the use of institutional and journal-supported repositories for sharing data is increasing, personal websites are still frequently used. Combining multiple existing datasets to answer new research questions was the most common use. Proper documentation, openness, and information on the usability of data continue to be important when searching for existing datasets. However, researchers in most disciplines struggled to find datasets to reuse. Researcher feedback suggested 23 recommendations to promote data sharing and reuse, including improved data access and usability, formal data citations, new search features, and cultural and policy-related disciplinary changes to increase awareness and acceptance.
Originality: This study is the first to explore data sharing and reuse practices across the full range of academic discipline types. It expands and updates previous data sharing surveys and suggests new areas of improvement in terms of policy, guidance, and training programs.
-
Digital footprints of Kashmiri Pandit migration on Twitter
The paper investigates changing levels of online concern about the Kashmiri Pandit migration of the 1990s on Twitter. Although decades old, this movement of people is an ongoing issue in India, with no current resolution. Analysing changing reactions to it on social media may shed light on trends in public attitudes to the event. Tweets were downloaded from Twitter using the academic version of its application programming interface (API) with the aid of the free social media analytics software Mozdeh. A set of 1000 tweets was selected for content analysis with a random number generator in Mozdeh. The results show that the number of tweets about the issue has increased over time, mainly from India, and predominantly driven by the release of films like Shikara and The Kashmir Files. The tweets show apparent universal support for the Pandits but often express strong emotions or criticize the actions of politicians, showing that the migration is an ongoing source of anguish and frustration that needs resolution. The results also show that social media analysis can give insights even into primarily offline political issues that predate the popularity of the web, and can easily incorporate the international perspectives necessary to understand complex migration issues.
-
Why are medical research articles tweeted? The news value perspective
Counts of tweets mentioning research articles are potentially useful as social impact altmetric indicators, especially for health-related topics. One way to help understand what tweet counts indicate is to find factors that associate with the number of tweets received by articles. Using news value theory, this study examined six characteristics of research papers that may cause some articles to be more tweeted than others. For this, we manually coded 300 medical journal articles about COVID-19. A statistical analysis showed that all six factors that make articles more newsworthy according to news value theory (importance, controversy, elite nations, elite persons, scale, news prominence) associated with higher tweet counts. Since these factors are hypothesised to be general human news selection criteria, the results give new evidence that tweet counts may be indicators of general interest to members of society rather than measures of societal impact. This study also provides a new understanding of the strong positive relationship between news mentions and tweet counts for articles. Instead of news coverage attracting tweets or the other way round (journalists noticing highly tweeted articles and writing about them), the results are consistent with newsworthy characteristics of articles attracting both tweets and news mentions.
-
“I don’t think education is the answer”: a corpus-assisted ecolinguistic analysis of plastics discourses in the UK
Ecosystems around the world are becoming engulfed in single-use plastics, the majority of which come from plastic packaging. Reusable plastic packaging systems have been proposed in response to this plastic waste crisis, but uptake of such systems in the UK is still very low. This article draws on a thematic corpus of 5.6 million words of UK English around plastics, packaging, reuse, and recycling to examine consumer attitudes towards plastic (re)use. Utilizing methods and insights from ecolinguistics, corpus linguistics, and cognitive linguistics, this article assesses to what degree consumer language differs from that of public-facing bodies such as supermarkets and government entities. A predefined ecosophy, prioritizing protection, rights, systems thinking, and fairness, is used not only to critically evaluate narratives in plastics discourse but also to recommend strategies for more effective and ecologically beneficial communications around plastics and reuse. This article recommends the adoption of ecosophy in multidisciplinary project teams, and argues that ecosophies are conducive to transparent and reproducible discourse analysis. The analysis also suggests that in order to make meaningful change in packaging reuse behaviors, it is highly likely that deeply ingrained cultural stories around power, rights, and responsibilities will need to be directly challenged.
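Corpus-assisted comparisons of this kind typically rest on keyness statistics. The sketch below implements Dunning's log-likelihood for a single word's frequency in a consumer subcorpus versus an institutional one; the word, frequencies, and corpus sizes are invented.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning's log-likelihood keyness for a word in corpus A vs. corpus B."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    for freq, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if freq > 0:
            ll += freq * math.log(freq / expected)
    return 2 * ll

# Invented counts: "recycling" in a consumer subcorpus vs. an institutional one
print(log_likelihood(freq_a=420, size_a=2_000_000, freq_b=95, size_b=3_600_000))
```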
-
The USMLE® Step 2 Clinical Skills patient note corpus
This paper presents a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients, people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs was annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at https://www.nbme.org/services/data-sharing.
-
Author gender identification for Urdu articles
In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, researchers have paid little attention to this task for Urdu articles. Firstly, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing the corpus creation and feature extraction processes, I concatenated the features, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 different well-known classifiers to these features to perform the author gender identification task and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). I conducted extensive experimental studies which show that (i) using the 600 most frequent multi-word expressions as features and concatenating them with the 300 most frequent words as features improves the accuracy of the author gender identification task, and (ii) support vector machines outperform the other classifiers, as well as fine-tuned pre-trained language models and the CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
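A simplified scikit-learn sketch of the described pipeline: frequency features for the 300 most frequent words and the 600 most frequent multi-word expressions (approximated here as 2-3-grams) are concatenated into a 900-dimensional space and fed to a linear SVM. The texts and labels are placeholders, and the n-gram approximation of MWEs is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

texts = ["placeholder Urdu article one ...", "placeholder Urdu article two ..."]
labels = ["female", "male"]  # author gender labels

# Concatenate the two feature spaces: 300 words + 600 MWEs = 900 dimensions.
features = make_union(
    CountVectorizer(max_features=300, ngram_range=(1, 1)),  # most frequent words
    CountVectorizer(max_features=600, ngram_range=(2, 3)),  # MWEs as n-grams
)
classifier = make_pipeline(features, LinearSVC())
classifier.fit(texts, labels)
print(classifier.predict(["placeholder unseen Urdu article ..."]))
```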
-
TurkishDelightNLP: A neural Turkish NLP toolkit
We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from the morphological to the semantic level, covering tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the open-source Turkish NLP toolkit through a web interface that allows an input text to be analysed in real time, as well as the open-source implementation of the components provided in the toolkit, an API, and several annotated datasets, such as a word similarity test set for evaluating word embeddings and a UCCA-based semantic annotation in Turkish. This is the first open-source Turkish NLP toolkit that covers a range of NLP tasks at all these levels. We believe that it will be useful for other researchers in Turkish NLP and will also be beneficial for other high-level NLP tasks in Turkish.
-
Turkish universal conceptual cognitive annotation
Universal Conceptual Cognitive Annotation (UCCA) is a cross-lingual semantic annotation framework that enables easy annotation without requiring a linguistic background. UCCA-annotated datasets have already been released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset, which currently comprises 50 sentences obtained from the METU-Sabanci Turkish Treebank. We followed a semi-automatic annotation approach, in which an external semantic parser is used for an initial annotation of the dataset, which is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that were not in line with the UCCA rules we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments in both zero-shot and few-shot learning settings. This is the initial version of the annotated dataset, which we are currently extending. We are releasing the current Turkish UCCA annotation guideline along with the annotated dataset.
-
Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech
The HASOC track is dedicated to the evaluation of technology for finding offensive language and hate speech. HASOC is creating a multilingual data corpus, mainly for English and under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized the classification task for English, Hindi, and Marathi. Task 1 consists of two classification subtasks: Subtask 1A is a binary classification of tweets into offensive and non-offensive, while Subtask 1B asks for a fine-grained classification into hate, profane, and offensive. Task 2 consists of identifying tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data, and the results obtained from the participants’ submissions.
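As a point of reference for Subtask 1A, a minimal binary baseline can be written in a few lines of scikit-learn; the tweets and labels below are placeholders rather than HASOC data, and participating systems were typically far more sophisticated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data with HASOC-style tags: HOF (hate/offensive) vs. NOT
train_tweets = ["placeholder offensive tweet", "placeholder harmless tweet"]
train_labels = ["HOF", "NOT"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(train_tweets, train_labels)
print(baseline.predict(["placeholder tweet to classify"]))
```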
-
Predicting lexical complexity in English texts: the CompLex 2.0 dataset
Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution, and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled for complexity are required. In this paper we analyze previous work on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides a more objective setting for identifying the complexity of words than a binary annotation protocol. We release a new dataset using our new protocol to promote the task of lexical complexity prediction.
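Treating lexical complexity as a continuous (Likert-derived) value turns CWI into a regression problem. The sketch below regresses invented complexity judgements on two common illustrative features, word length and log frequency; it is an assumption-laden illustration, not the paper's model.

```python
import math
from sklearn.linear_model import LinearRegression

# Invented data: (word, corpus frequency, mean Likert complexity scaled to [0, 1])
training = [
    ("cat", 500_000, 0.05),
    ("observation", 40_000, 0.30),
    ("myocardial", 900, 0.75),
    ("the", 6_000_000, 0.01),
]

X = [[len(word), math.log(freq)] for word, freq, _ in training]
y = [complexity for _, _, complexity in training]

model = LinearRegression().fit(X, y)
print(model.predict([[len("infarction"), math.log(1_200)]]))
```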
-
TransQuest: Translation quality estimation with cross-lingual transformers
Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.
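The general architecture can be illustrated with the Hugging Face transformers library: the source sentence and its translation are encoded jointly by XLM-R, and a single regression output predicts a quality score. This is a generic sketch of the idea, not the TransQuest implementation itself, and the untrained head below would need fine-tuning on WMT quality estimation data.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1)  # one output = regression on quality

# Encode the (source, translation) pair as a single sentence-pair input,
# as in the cross-lingual architecture described above.
inputs = tokenizer("Das ist ein Beispiel.", "This is an example.",
                   return_tensors="pt", truncation=True)
score = model(**inputs).logits.item()  # meaningless until fine-tuned
print(score)
```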
-
RGCL at SemEval-2020 task 6: Neural approaches to definition extraction
This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
-
You are driving me up the wall! A corpus-based study of a special class of resultative constructions
This paper focuses on resultative constructions from a computational and corpus-based approach. We claim that the array of expressions (traditionally classed as idioms, collocations, free word combinations, etc.) that are used to convey a person’s change of mental state (typically negative) are basically instances of the same resultative construction. The first part of the study will introduce basic tenets of Construction Grammar and resultatives. Then, our corpus-based methodology will be spelled out, including a description of the two giga-token corpora used and a detailed account of our protocolised heuristic strategies and tasks. Distributional analysis of matrix slot fillers will be presented next, together with a discussion on restrictions, novel instances, and productivity. A final section will round up our study, with special attention to notions like “idiomaticity”, “productivity” and “variability” of the pairings of form and meaning analysed. To the best of our knowledge, this is one of the first studies based on giga-token corpora that explores idioms as integral parts of higher-order resultative constructions.
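Retrieval of candidate instances from large corpora can be approximated with a pattern over the construction's matrix slot, as in this small Python sketch; the verb and result-phrase inventories are invented simplifications of the protocolised queries used in the study.

```python
import re

# "drive X up the wall / crazy / mad ..." with an open slot for the patient
pattern = re.compile(
    r"\b(driv(?:e|es|ing)|drove|driven)\s+(\w+(?:\s\w+)?)\s+"
    r"(up the wall|crazy|mad|insane|to despair)\b",
    re.IGNORECASE,
)

corpus = [
    "You are driving me up the wall!",
    "That noise drove the neighbours crazy.",
    "Deadlines drive some people to despair.",
]

for sentence in corpus:
    for verb, filler, result in pattern.findall(sentence):
        print(f"verb={verb!r:12} slot-filler={filler!r:18} result={result!r}")
```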
-
Multilingual offensive language identification for low-resource languages
Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
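The transfer recipe reduces to: fine-tune a cross-lingual encoder on English labels, then apply it unchanged to the target language. The condensed sketch below marks where English fine-tuning would occur; the model choice and the placeholder tweet are assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # offensive vs. not offensive

# ... fine-tune here on an English offensive-language dataset ...

# Zero-shot prediction on a target-language tweet: the shared multilingual
# representation carries the English supervision across languages.
inputs = tokenizer("placeholder tweet in a low-resource language",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)
```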
-
Intelligent translation memory matching and retrieval with sentence encoders
Matching and retrieving previously translated segments from a Translation Memory (TM) is the key functionality in TM systems. However, this matching and retrieval process is still limited to algorithms based on edit distance, which we have identified as a major drawback of TM systems. In this paper we introduce sentence encoders to improve the matching and retrieval process in TM systems, an effective and efficient replacement for edit distance based algorithms.
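A minimal sketch of what sentence-encoder-based TM retrieval looks like with the sentence-transformers library: embed the TM segments once, then return each query's nearest neighbour by cosine similarity instead of edit distance. The pretrained model name is an illustrative choice, not necessarily the one used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Source-side segments already stored in the Translation Memory
translation_memory = [
    "The committee approved the annual budget.",
    "Press the power button for five seconds.",
    "The warranty does not cover water damage.",
]
tm_embeddings = encoder.encode(translation_memory, convert_to_tensor=True)

# New segment to translate: retrieve its closest TM match semantically
query = "Hold the power button down for 5 seconds."
query_embedding = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, tm_embeddings)[0]
best = int(scores.argmax())
print(translation_memory[best], float(scores[best]))
```

Unlike an edit distance match, this retrieves the second TM segment despite its different wording, which is the behaviour the paper's approach aims for.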