• Early Mendeley readers correlate with later citation counts

      Thelwall, Mike (Springer, 2018-03-26)
      Counts of the number of readers registered in the social reference manager Mendeley have been proposed as an early impact indicator for journal articles. Although previous research has shown that Mendeley reader counts for articles tend to have a strong positive correlation with synchronous citation counts after a few years, no previous studies have compared early Mendeley reader counts with later citation counts. In response, this first diachronic analysis compares reader counts within a month of publication with citation counts after 20 months for ten fields. There were moderate or strong correlations in eight out of ten fields, with the two exceptions being the smallest categories (n=18, 36) with wide confidence intervals. The correlations are higher than the correlations between later citations and early citations, showing that Mendeley reader counts are more useful early impact indicators than citation counts.
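      The analysis described above comes down to rank-correlating two per-article count vectors within each field. A minimal sketch of that calculation, assuming scipy is available and using made-up counts rather than the paper's data:

        # Illustrative sketch (not the paper's code): correlating early Mendeley
        # reader counts with later citation counts for the articles in one field.
        from scipy.stats import spearmanr

        # Hypothetical data: one value per article, aligned by position.
        early_readers = [4, 0, 7, 2, 12, 1, 5, 3]      # readers ~1 month after publication
        later_citations = [9, 1, 15, 3, 30, 0, 11, 6]  # citations after ~20 months

        rho, p_value = spearmanr(early_readers, later_citations)
        print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")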
    • An effective and scalable framework for authorship attribution query processing

      Sarwar, R; Yu, C; Tungare, N; Chitavisutthivong, K; Sriratanawilai, S; Xu, Y; Chow, D; Rakthanmanon, T; Nutanong, S (Institute of Electrical and Electronics Engineers (IEEE), 2018-09-10)
      Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have only a small number of text samples, e.g., 5-10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios where the number of candidate authors is limited to 50, and they generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3,000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors: for both the closed-set and open-set authorship attribution problems, it achieves accuracy above 95%.
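      To illustrate the general idea of recasting authorship attribution as similarity search (not the authors' exact representation model or outlier-aware query processing), here is a minimal sketch using character n-gram TF-IDF vectors and a nearest-neighbour vote over hypothetical text samples:

        # Minimal sketch: authorship attribution as similarity search.
        from collections import Counter
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import NearestNeighbors

        # Hypothetical training samples: (author, text) pairs.
        samples = [
            ("austen", "It is a truth universally acknowledged ..."),
            ("austen", "Emma Woodhouse, handsome, clever, and rich ..."),
            ("dickens", "It was the best of times, it was the worst of times ..."),
            ("dickens", "Marley was dead: to begin with ..."),
        ]
        authors, texts = zip(*samples)

        # Character n-grams capture stylistic rather than topical signal.
        vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 4))
        X = vectorizer.fit_transform(texts)
        index = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)

        def attribute(query_text):
            """Return the author whose samples dominate the query's nearest neighbours."""
            q = vectorizer.transform([query_text])
            _, idx = index.kneighbors(q)
            votes = Counter(authors[i] for i in idx[0])
            return votes.most_common(1)[0][0]

        print(attribute("A single man in possession of a good fortune ..."))

      With realistic data one would index many stylistic fragments per author, which is what makes the nearest-neighbour vote (and outlier handling) meaningful.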
    • Effective websites for small and medium-sized enterprises

      Thelwall, Mike (MCB UP Ltd, 2000)
      In the UK, millions are now online and many are prepared to use the Internet to make and influence purchasing decisions. Businesses should, therefore, consider whether the Internet could provide them with a new marketing opportunity. Although increasing numbers of businesses now have a website, there seems to be a quality problem that is leading to missed opportunities, particularly for smaller enterprises. This belief is backed up by an automated survey of 3,802 predominantly small UK business sites, believed to be by far the largest of its kind to date. Analysis of the results reveals widespread problems in relation to search engines. Most Internet users find new sites through search engines, yet over half of the sites checked were not registered in the largest one, Yahoo!, and could therefore be missing a sizeable percentage of potential customers. The underlying problem with business sites is the lack of maturity of the medium as evidenced by the focus on technological issues amongst designers and the inevitable lack of Web-business experience of managers. Designers need to take seriously the usability of the site, its design and its ability to meet the business goals of the client. These issues are perhaps being taken up less than in the related discipline of software engineering, probably owing to the relative ease of website creation. Managers need to dictate the objectives of their site, but also, in the current climate, cannot rely even on professional website design companies and must be capable of evaluating the quality of their site themselves. Finally, educators need to ensure that these issues are emphasised to the next generation of designers and managers in order that the full potential of the Internet for business can be realised.
    • Effects of lexical properties on viewing time per word in autistic and neurotypical readers

      Štajner, Sanja; Yaneva, Victoria; Mitkov, Ruslan; Ponzetto, Simone Paolo (Association for Computational Linguistics, 2017-09-08)
      Eye tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare and ambiguous are more difficult to read. However, online processing techniques have been scarcely applied to investigating the reading difficulties of people with autism and what vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers and show that the former required higher cognitive effort to comprehend the texts as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively-based measures (word concreteness, familiarity, age of acquisition and imageability).
    • El EEES y la competencia tecnológica: los nuevos grados en Traducción

      Corpas Pastor, Gloria; Muñoz, María (Universidad de Las Palmas de Gran Canaria, Servicio de Publicaciones y Difusión Científica, 2015-04-23)
      This paper takes as its starting point the research described in Muñoz Ramos (2012). It offers a brief overview of the origin and evolution of the European Higher Education Area (EHEA) up to the present day and of its impact on Translation studies. We then account for the close ties between the founding principles of the Bologna Process and Information and Communication Technologies (ICT), which stand out as the ideal companions for achieving the objectives of the Bologna Declaration. Finally, we show how these two strands converge in the new Spanish undergraduate degrees in Translation, which comply with the EHEA and find in translation technology modules the cornerstone of their raison d'être.
    • El hablar y el discurso repetido: la fraseología

      Mellado, Carmen; Corpas, Gloria; Berty, Katrin; Loureda, Óscar; Schrott, Angela (De Gruyter, 2021-01-18)
      This chapter examines the interplay between fixedness and variability in phraseological units from several points of view. First, we provide a detailed analysis of Coseriu's concept of "repeated discourse", which already incorporates the idea of creative change, and then offer an overview of the evolution of phraseology in relation to text linguistics. Second, we present a typological classification of phraseological variation, illustrated with examples from linguistic corpora and centred on the levels of system and speech as well as on speaker intention. Third, we address phraseological variability and the turn that the notion of "fixedness" has taken since massive corpus data became available. In this context, absolute frequency, normalised frequency and statistical significance measures play a fundamental role in establishing the degree of fixedness.
    • Enhancing unsupervised sentence similarity methods with deep contextualised word representations

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (RANLP, 2019-09-02)
      Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares them with existing supervised and unsupervised STS methods on datasets in different languages and domains.
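      A minimal sketch of one such unsupervised STS method, assuming the Hugging Face transformers library and a multilingual BERT checkpoint; the model name and mean-pooling choice are illustrative assumptions, not necessarily the configurations evaluated in the paper:

        # Unsupervised sentence similarity from contextualised word embeddings:
        # mean-pool the token vectors of each sentence, then take cosine similarity.
        import torch
        from transformers import AutoTokenizer, AutoModel

        tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
        model = AutoModel.from_pretrained("bert-base-multilingual-cased")

        def embed(sentence):
            """Mean-pool the last hidden layer over the sentence's tokens."""
            inputs = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
            return hidden.mean(dim=1).squeeze(0)

        def similarity(s1, s2):
            return torch.cosine_similarity(embed(s1), embed(s2), dim=0).item()

        print(similarity("A man is playing a guitar.", "Someone plays an instrument."))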
    • Estrategias heurísticas con corpus para la enseñanza de la fraseología orientada a la traducción

      Corpas Pastor, Gloria; Hidalgo Ternero, Carlos Manuel; Seghiri, Miriam (Peter Lang, 2020)
      This work presents a didactic proposal carried out in the subject Lengua y cultura "B" aplicadas a la Traducción e Interpretación (II) – inglés, taught in the first year of the Bachelor's Degree in Translation and Interpreting at the University of Malaga. The main objective of this proposal is to teach the possibilities that both monolingual and bilingual corpora can provide for the correct identification and interpretation of phraseological units with regard to their translation, paying special attention to those cases where the ambiguity of phraseological sequences may lead to multiple interpretations. We focus on somatisms and mainly use two Spanish monolingual corpora (CORPES XXI and esEuTenTen), an English monolingual corpus (enTenTen) and two parallel corpora (Europarl and Linguee, more specifically its English-Spanish subcorpus). Against this background, the proposal is divided into several learning activities. After a first seminar in which the concepts of corpus, phraseology and translation are introduced, learning activity 2 uses parallel corpora to find translation pairs that contain mistakes caused by phraseological ambiguity. In the third learning activity, we teach disambiguating elements that facilitate a correct identification and interpretation of the phraseological unit, so that its pragmatic and semantic weight can be conveyed in the target text. It is at this step that corpora can play a decisive role as documentation tools. Nevertheless, locating and interpreting phraseological units is not problem-free. Given the need for techniques that enable a more effective detection of phraseological units, in the fourth learning activity students learn an array of heuristic strategies to refine their searches in the consulted corpora and to select adequate equivalents after a correct interpretation of the results produced by these corpora.
    • Evaluation of a cross-lingual Romanian-English multi-document summariser

      Orăsan, C; Chiorean, OA (European Language Resources Association, 2008-01-01)
      The rapid growth of the Internet means that more information is available than ever before. Multilingual multi-document summarisation offers a way to access this information even when it is not in a language spoken by the reader, by extracting the gist from related documents and translating it automatically. This paper presents an experiment in which Maximal Marginal Relevance (MMR), a well-known multi-document summarisation method, is used to produce summaries from Romanian news articles. A task-based evaluation performed on both the original summaries and their automatically translated versions reveals that they still contain a significant portion of the important information from the original texts. However, direct evaluation of the automatically translated summaries shows that they are not very legible, which can put off readers who want to find out more about a topic.
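      For reference, MMR greedily selects sentences that are relevant to the document set but not redundant with those already chosen. A minimal extractive sketch using TF-IDF cosine similarity (the lambda value and similarity function are illustrative assumptions, not the paper's exact configuration):

        # Maximal Marginal Relevance (MMR) extractive summarisation sketch.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def mmr_summary(sentences, n_select=2, lam=0.7):
            """Greedily pick sentences relevant to the document centroid
            but not redundant with sentences already selected."""
            X = TfidfVectorizer().fit_transform(sentences)
            centroid = np.asarray(X.mean(axis=0))
            relevance = cosine_similarity(X, centroid).ravel()
            pairwise = cosine_similarity(X)

            selected = []
            candidates = list(range(len(sentences)))
            while candidates and len(selected) < n_select:
                def mmr_score(i):
                    redundancy = max((pairwise[i][j] for j in selected), default=0.0)
                    return lam * relevance[i] - (1 - lam) * redundancy
                best = max(candidates, key=mmr_score)
                selected.append(best)
                candidates.remove(best)
            return [sentences[i] for i in selected]

        docs = [
            "The new policy was announced by the government on Monday.",
            "Officials said the policy would take effect next month.",
            "Unrelated sports results were also reported that day.",
        ]
        print(mmr_summary(docs))

      Higher values of lam favour relevance to the centroid over diversity among the selected sentences.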
    • Evidence for the existence of geographic trends in university web site interlinking

      Thelwall, Mike (MCB UP Ltd, 2002)
      The Web is an important medium for scholarly communication of various types, perhaps eventually to replace entirely some traditional mechanisms such as print journals. Yet the Web analogue of citations, hyperlinks, is much more varied in use, and existing citation analysis techniques are difficult to generalise to the new medium. In this context, one new challenging object of study is the modern multi-faceted, multi-genre, partly unregulated university Web site. This paper develops a methodology to analyse the patterns of interlinking between university Web sites and uses it to indicate that the degree of interlinking decreases with distance, at least in the UK. This is perhaps not in itself a surprising result, despite claims of a paradigm shift from the traditional virtual college towards collaboratories, but the methodology developed can also be used to refine existing Web link metrics to produce more powerful tools for comparing groups of sites.
    • Exploiting Data-Driven Hybrid Approaches to Translation in the EXPERT Project

      Orăsan, Constantin; Escartín, Carla Parra; Torres, Lianet Sepúlveda; Barbu, Eduard; Ji, Meng; Oakes, Michael (Cambridge University Press, 2019-06-13)
      Technologies have transformed the way we work, and this is also applicable to the translation industry. In the past thirty to thirty-five years, professional translators have experienced an increased technification of their work. Barely thirty years ago, a professional translator would not have received a translation assignment attached to an e-mail or via FTP, and yet, for the younger generation of professional translators, receiving an assignment by electronic means is the only reality they know. In addition, as pointed out in several works such as Folaron (2010) and Kenny (2011), professional translators now have a myriad of tools available to use in the translation process.
    • Exploiting tweet sentiments in altmetrics large-scale data

      Hassan, Saeed-Ul; Aljohani, Naif Radi; Iqbal Tarar, Usman; Safder, Iqra; Sarwar, Raheem; Alelyani, Salem; Nawaz, Raheel (SAGE, 2022-12-31)
      This article aims to exploit social exchanges about scientific literature, specifically tweets, to analyse social media users' sentiments towards publications within a research field. First, we employ the SentiStrength tool, extended with newly created lexicon terms, to classify the sentiments of 6,482,260 tweets associated with 1,083,535 publications provided by Altmetric.com. Then, we propose harmonic mean-based statistical measures to generate a specialized lexicon, using positive and negative sentiment scores and frequency metrics. Next, we adopt a novel article-level summarization approach to domain-level sentiment analysis to gauge the opinion of social media users on Twitter about the scientific literature. Last, we propose and employ an aspect-based analytical approach to mine users' expressions relating to various aspects of the article, such as tweets on its title, abstract, methodology, conclusion, or results section. We show that research communities exhibit dissimilar sentiments towards their respective fields. The analysis of the field-wise distribution of article aspects shows that in Medicine, Economics, and Business & Decision Sciences, tweet aspects are focused on the results section, whereas in Physics & Astronomy, Materials Science, and Computer Science they are focused on the methodology section. Overall, the study helps us to understand the sentiments expressed in the online social exchanges of the scientific community about scientific literature. Specifically, such a fine-grained analysis may help research communities improve their social media exchanges about scientific articles in order to disseminate their findings effectively and further increase their societal impact.
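      The exact harmonic mean-based measures are defined in the paper; the sketch below only illustrates the general shape of such a measure, combining a hypothetical term's positive-sentiment score with its relative frequency:

        # Illustrative only: harmonic mean of two per-term statistics as a lexicon weight.
        def harmonic_mean(a, b):
            return 2 * a * b / (a + b) if a + b > 0 else 0.0

        # Hypothetical per-term statistics derived from a tweet corpus.
        term_stats = {
            "groundbreaking": {"pos_score": 0.9, "rel_freq": 0.6},
            "flawed":         {"pos_score": 0.1, "rel_freq": 0.3},
        }

        for term, s in term_stats.items():
            weight = harmonic_mean(s["pos_score"], s["rel_freq"])
            print(f"{term}: lexicon weight {weight:.2f}")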
    • An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers

      Ranasinghe, Tharindu; Orasan, Constantin; Mitkov, Ruslan (Association for Computational Linguistics, 2021-08-31)
      Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that these QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.
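      One common way to frame word-level QE, consistent with the multilingual transformer approach described above, is binary OK/BAD token classification over the concatenated source and machine-translated target. A hedged sketch with an untrained classification head (model choice and label scheme are assumptions, not the released system):

        # Word-level QE framed as token classification with a multilingual encoder.
        import torch
        from transformers import AutoTokenizer, AutoModelForTokenClassification

        model_name = "xlm-roberta-base"  # any multilingual encoder would do here
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

        # Source and machine-translated target are encoded as a pair so the model
        # can judge each target token in the context of the source sentence.
        src, mt = "The cat sat on the mat.", "Die Katze saß auf der Matte."
        inputs = tokenizer(src, mt, return_tensors="pt", truncation=True)

        with torch.no_grad():
            logits = model(**inputs).logits      # (1, seq_len, 2)
        labels = logits.argmax(dim=-1)[0]        # 0 = OK, 1 = BAD (head untrained here)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        print(list(zip(tokens, labels.tolist())))

      In practice the classification head would be fine-tuned on labelled QE data before the predictions mean anything.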
    • An exploratory study on multilingual quality estimation

      Sun, Shuo; Fomicheva, Marina; Blain, Frederic; Chaudhary, Vishrav; El-Kishky, Ahmed; Renduchintala, Adithya; Guzman, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-12-31)
      Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.
    • Extracción de fraseología para intérpretes a partir de corpus comparables compilados mediante reconocimiento automático del habla

      Corpas Pastor, Gloria; Gaber, Mahmoud; Bautista Zambrana, María Rosario; Hidalgo Ternero, Carlos Manuel (Editorial Comares, 2021-10-04)
      Today, automatic speech recognition is beginning to emerge strongly in the field of interpreting. Recent studies point to this technology as one of the main documentation resources for interpreters, among other possible uses. In this paper we present a novel documentation methodology that involves semi-automatic compilation of comparable corpora (transcriptions of oral speeches) and automatic corpus compilation of written documents on the same topic with a view to preparing an interpreting assignment. As a convenient background, we provide a brief overview of the use of automatic speech recognition in the context of interpreting technologies. Next, we will detail the protocol for designing and compiling our comparable corpora that we will exploit for analysis. In the last part of the paper, we will cover phraseology extraction and study some collocational patterns in both corpora. Mastering the specific phraseology of the specific subject matter of the assignment is one of the main stumbling blocks that interpreters face in their daily work. Our ultimate aim is to establish whether oral corpora could be of further benefit to the interpreter in the preliminary preparation phase.
    • FGFR1 expression and role in migration in low and high grade pediatric gliomas

      Egbivwie, Naomi; Cockle, Julia V.; Humphries, Matthew; Ismail, Azzam; Esteves, Filomena; Taylor, Claire; Karakoula, Katherine; Morton, Ruth; Warr, Tracy; Short, Susan C.; et al. (Frontiers Media, 2019-03-13)
      The heterogeneous and invasive nature of pediatric gliomas poses significant treatment challenges, highlighting the importance of identifying novel chemotherapeutic targets. Recently, recurrent Fibroblast growth factor receptor 1 (FGFR1) mutations in pediatric gliomas have been reported. Here, we explored the clinical relevance of FGFR1 expression and cell migration in low and high grade pediatric gliomas, and the role of FGFR1 in cell migration/invasion as a potential chemotherapeutic target. A high density tissue microarray (TMA) was used to investigate associations between FGFR1 and activated phosphorylated FGFR1 (pFGFR1) expression and various clinicopathologic parameters. Expression of FGFR1 and pFGFR1 was measured by immunofluorescence and by immunohistochemistry (IHC) in 3D spheroids in five rare patient-derived pediatric low-grade glioma (pLGG) and two established high-grade glioma (pHGG) cell lines. Two-dimensional (2D) and three-dimensional (3D) migration assays were performed for migration and inhibitor studies with three FGFR1 inhibitors. High FGFR1 expression was associated with age, malignancy, tumor location and tumor grade among astrocytomas. Membranous pFGFR1 was associated with malignancy and tumor grade. All glioma cell lines exhibited varying levels of FGFR1 and pFGFR1 expression and migratory phenotypes. There were significant anti-migratory effects on the pHGG cell lines with inhibitor treatment and anti-migratory or pro-migratory responses to FGFR1 inhibition in the pLGGs. Our findings support further research to target FGFR1 signaling in pediatric gliomas.
    • Figshare: A universal repository for academic resource sharing?

      Thelwall, Mike; Kousha, Kayvan (Emerald Group Publishing Limited, 2015-12-18)
      Purpose: A number of subject-orientated and general websites have emerged to host academic resources. It is important to evaluate the uptake of such services in order to decide which depositing strategies are effective and should be encouraged. Design/methodology/approach: This article evaluates the views and shares of resources in the generic repository Figshare by subject category and resource type. Findings: Figshare use and common resource types vary substantially by subject category but resources can be highly viewed even in subjects with few members. Subject areas with more resources deposited do not tend to have higher viewing or sharing statistics. Practical implications: Limited uptake of Figshare within a subject area should not be a barrier to its use. Several highly successful innovative uses for Figshare show that it can reach beyond a purely academic audience. Originality/value: This is the first analysis of the uptake and use of a generic academic resource sharing repository.
    • Finding similar academic Web sites with links, bibliometric couplings and colinks

      Thelwall, Mike; Wilkinson, David (Elsevier, 2004)
      A common task in both Webmetrics and Web information retrieval is to identify a set of Web pages or sites that are similar in content. In this paper we assess the extent to which links, colinks and couplings can be used to identify similar Web sites. As an experiment, a random sample of 500 pairs of domains from the UK academic Web was taken, and human assessments of site similarity, based upon content type, were compared against ratings for the three concepts. The results show that using a combination of all three gives the highest probability of identifying similar sites, but surprisingly this was only a marginal improvement over using links alone. Another unexpected result was that high values for either colink counts or couplings were associated with only a small increased likelihood of similarity. The principal advantage of using couplings and colinks was found to be greater coverage, in terms of a much larger number of pairs of sites being connected by these measures, rather than increased probability of similarity. In information retrieval terminology, this is improved recall rather than improved precision.
    • Findings of the WMT 2018 shared task on quality estimation

      Specia, Lucia; Blain, Frederic; Logacheva, Varvara; Astudillo, Ramón; Martins, André (Association for Computational Linguistics, 2018-11)
      We report the results of the WMT18 shared task on Quality Estimation, i.e. the task of predicting the quality of the output of machine translation systems at various granularity levels: word, phrase, sentence and document. This year we include four language pairs, three text domains, and translations produced by both statistical and neural machine translation systems. Participating teams from ten institutions submitted a variety of systems to different task variants and language pairs.
    • Findings of the WMT 2020 shared task on quality estimation

      Specia, Lucia; Blain, Frédéric; Fomicheva, Marina; Fonseca, Erick; Chaudhary, Vishrav; Guzmán, Francisco; Martins, André FT (Association for Computational Linguistics, 2020-11-30)
      We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. In total, 19 participating teams from 27 institutions submitted 1,374 systems to different task variants and language pairs.