• New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.
    • New versions of PageRank employing alternative Web document models

      Thelwall, Mike; Vaughan, Liwen (Emerald Group Publishing Limited, 2004)
      Introduces several new versions of PageRank (the link-based Web page ranking algorithm), based on an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based on directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms based on these alternatives were used to rank four sets of Web pages, and the ranking results were compared with human subjects’ rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites but did not work well for ranking pages from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.
    • News stories as evidence for research? BBC citations from articles, books and Wikipedia

      Kousha, Kayvan; Thelwall, Mike (Wiley, 2017-07-17)
      Although news stories target the general public and are sometimes inaccurate, they can serve as sources of real-world information for researchers. This article investigates the extent to which academics exploit journalism using content and citation analyses of online BBC News stories cited by Scopus articles. A total of 27,234 Scopus-indexed publications have cited at least one BBC News story, with a steady annual increase. Citations from arts and humanities (2.8% of publications in 2015) and social sciences (1.5%) were more likely than citations from medicine (0.1%) and science (<0.1%). Surprisingly, half of the sampled Scopus-cited science and technology (53%) and medicine and health (47%) stories were based on academic research, rather than otherwise unpublished information, suggesting that researchers have chosen a lower quality secondary source for their citations. Nevertheless, the BBC News stories that were most frequently cited by Scopus, Google Books and Wikipedia introduced new information from many different topics, including politics, business, economics, statistics, and reports about events. Thus, news stories are mediating real world knowledge into the academic domain, a potential cause for concern.
    • Nine terminology extraction tools: Are they useful for translators?

      Costa, Hernani; Zaretskaya, Anna; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (MultiLingual, 2016-04-01)
      Terminology extraction tools (TETs) have become an indispensable resource in education, research and business. Today, users can choose from a great variety of terminology extraction tools, each offering different features. Among many other areas, these tools are especially helpful in the professional translation setting. We do not know, however, whether the existing tools have all the features necessary for this kind of work. In search of the answer, we looked at nine selected tools available on the market to find out whether they provide the features that translators value most.
    • Not all international collaboration is beneficial: The Mendeley readership and citation impact of biochemical research collaboration

      Sud, Pardeep; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1SB UK (Wiley Blackwell, 2015-05-13)
      Biochemistry is a highly funded research area that is typified by large research teams and is important for many areas of the life sciences. This article investigates the citation impact and Mendeley readership impact of biochemistry research from 2011 in the Web of Science according to the type of collaboration involved. Negative binomial regression models are used that incorporate, for the first time, the inclusion of specific countries within a team. The results show that, holding other factors constant, larger teams robustly associate with higher impact research, but including additional departments has no effect and adding extra institutions tends to reduce the impact of research. Although international collaboration is apparently not advantageous in general, collaboration with the USA, and perhaps also with some other countries, seems to increase impact. In contrast, collaborations with some other nations associate with lower impact, although both findings could be due to factors such as differing national proportions of excellent researchers. As a methodological implication, simpler statistical models would have found international collaboration to be generally beneficial and so it is important to take into account specific countries when examining collaboration.
    • NP animacy identification for anaphora resolution

      Orasan, Constantin; Evans, Richard (American Association for Artificial Intelligence, 2007)
      In anaphora resolution for English, animacy identification can play an integral role in the application of agreement restrictions between pronouns and candidates, and as a result, can improve the accuracy of anaphora resolution systems. In this paper, two methods for animacy identification are proposed and evaluated using intrinsic and extrinsic measures. The first method is a rule-based one which uses information about the unique beginners in WordNet to classify NPs on the basis of their animacy. The second method relies on a machine learning algorithm which exploits a WordNet enriched with animacy information for each sense. The effect of word sense disambiguation on the two methods is also assessed. The intrinsic evaluation reveals that the machine learning method reaches human levels of performance. The extrinsic evaluation demonstrates that animacy identification can be beneficial in anaphora resolution, especially in the cases where animate entities are identified with high precision.
    • Patent citation analysis with Google

      Kousha, Kayvan; Thelwall, Mike (Wiley-Blackwell, 2015-09-23)
      Citations from patents to scientific publications provide useful evidence about the commercial impact of academic research, but automatically searchable databases are needed to exploit this connection for large-scale patent citation evaluations. Google covers multiple different international patent office databases but does not index patent citations or allow automatic searches. In response, this article introduces a semiautomatic indirect method via Bing to extract and filter patent citations from Google to academic papers with an overall precision of 98%. The method was evaluated with 322,192 science and engineering Scopus articles from every second year for the period 1996–2012. Although manual Google Patent searches give more results, especially for articles with many patent citations, the difference is not large enough to be a major problem. Within Biomedical Engineering, Biotechnology, and Pharmacology & Pharmaceutics, 7% to 10% of Scopus articles had at least one patent citation but other fields had far fewer, so patent citation analysis is only relevant for a minority of publications. Low but positive correlations between Google Patent citations and Scopus citations across all fields suggest that traditional citation counts cannot substitute for patent citations when evaluating research.
    • Predicting reading difficulty for readers with autism spectrum disorder

      Evans, Richard; Yaneva, Victoria; Temnikova, Irina (European Language Resources Association, 2016-05-23)
      People with autism experience various reading comprehension difficulties, which is one explanation for the early school dropout, reduced academic achievement and lower levels of employment in this population. To overcome this issue, content developers who want to make their textbooks, websites or social media accessible to people with autism (and thus for every other user) but who are not necessarily experts in autism, can benefit from tools which are easy to use, which can assess the accessibility of their content, and which are sensitive to the difficulties that autistic people might have when processing texts/websites. In this paper we present a preliminary machine learning readability model for English developed specifically for the needs of adults with autism. We evaluate the model on the ASD corpus, which has been developed specifically for this task and is, so far, the only corpus for which readability for people with autism has been evaluated. The results show that our model outperforms the baseline, which is the widely-used Flesch-Kincaid Grade Level formula.
    • Predicting the difficulty of multiple choice questions in a high-stakes medical exam

      Ha, Le; Yaneva, Victoria; Baldwin, Peter; Mee, Janet (Association for Computational Linguistics, 2019-08-02)
      Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.
    • Predicting the Type and Target of Offensive Posts in Social Media

      Zampieri, Marcos; Malmasi, Shervin; Nakov, Preslav; Rosenthal, Sara; Farra, Noura; Kumar, Ritesh (Association for Computational Linguistics, 2019-06-01)
      As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.
    • Profiling idioms: a sociolexical approach to the study of phraseological patterns

      Moze, Sara; Mohamed, Emad (Springer, 2019-12-31)
      This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociolexical profiling will have important implications for several areas of research, including corpus lexicography, translation, creative writing, forensic linguistics, and natural language processing.
    • Questing for Quality Estimation: A User Study

      Escartin, Carla Parra; Béchara, Hanna; Orăsan, Constantin (de Gruyter, 2017-06-06)
      Post-Editing of Machine Translation (MT) has become a reality in professional translation workflows. In order to optimize the management of projects that use post-editing and avoid underpayments and mistrust from professional translators, effective tools to assess the quality of Machine Translation (MT) systems need to be put in place. One field of study that could address this problem is Machine Translation Quality Estimation (MTQE), which aims to determine the quality of MT without an existing reference. Accurate and reliable MTQE can help project managers and translators alike, as it would allow estimating more precisely the cost of post-editing projects in terms of time and adequate fees by discarding those segments that are not worth post-editing (PE) and have to be translated from scratch. In this paper, we report on the results of an impact study which engages professional translators in PE tasks using MTQE. We measured translators’ productivity in different scenarios: translating from scratch, post-editing without using MTQE, and post-editing using MTQE. Our results show that QE information, when accurate, improves post-editing efficiency.
    • Reader and author gender and genre in Goodreads

      Thelwall, Mike (Sage, 2017-05-01)
      There are known gender differences in book preferences in terms of both genre and author gender but their extent and causes are not well understood. It is unclear whether reader preferences for author genders occur within any or all genres and whether readers evaluate books differently based on author genders within specific genres. This article exploits a major source of informal book reviews, the Goodreads.com website, to assess the influence of reader and author genders on book evaluations within genres. It uses a quantitative analysis of 201,560 books and their reviews, focusing on the top 50 user-specified genres. The results show strong gender differences in the ratings given by reviewers to books within genres, such as female reviewers rating contemporary romance more highly, with males preferring short stories. For most common book genres, reviewers give higher ratings to books authored by their own gender, confirming that gender bias is not confined to the literary elite. The main exception is the comic book, for which male reviewers prefer female authors, despite their scarcity. A word frequency analysis suggested that authors wrote, and reviewers valued, gendered aspects of books within a genre. For example, relationships and romance were disproportionately mentioned by women in mystery and fantasy novels. These results show that, perhaps for the first time, it is possible to get large scale evidence about the reception of books by typical readers, if they post reviews online.
    • The reading background of Goodreads book club members: A female fiction canon?

      Thelwall, Mike; Bourrier, Karen (Emerald, 2019-09-09)
      Purpose – Despite the social, educational and therapeutic benefits of book clubs, little is known about which books participants are likely to have read. In response, this article investigates the public bookshelves of those that have joined a group within the Goodreads social network site. Design/methodology/approach – Books listed as read by members of fifty large English language Goodreads groups – with a genre focus or other theme – were compiled by author and title. Findings – Recent and youth-oriented fiction dominate the fifty books most read by book club members, while almost half are works of literature frequently taught at the secondary and postsecondary level (literary classics). Whilst JK Rowling is almost ubiquitous (at least 63% as frequently listed as other authors in any group, including groups for other genres), most authors, including Shakespeare (15%), Goulding (6%) and Hemingway (9%), are little read by some groups. Nor are individual recent literary prize-winners or works in languages other than English frequently read. Research limitations/implications – Although these results are derived from a single popular website, knowing more about what book club members are likely to have read should help participants, organisers and moderators. For example, recent literary prize winners might be a good choice, given that few members may have read them. Originality/value – This is the first large scale study of book group members’ reading patterns. Whilst typical reading is likely to vary by group theme and average age, there seems to be a mainly female canon of about 14 authors and 19 books that Goodreads book club members are likely to have read.
    • Recepción en España de la literatura africana en lengua inglesa: generación de datos estadísticos con la base de datos bibliográfica especializada BDÁFRICA

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Fundacio per la Universitat Oberta de Catalunya, 2018-11-20)
      This article examines the reception in Spain of African literature written in English, drawing on BDÁFRICA, a bibliographic database that records works by African-born authors published in Spanish and in Spain between 1972 and 2014. It offers a critical reflection on the difficulties of defining African literature as an object of study, owing to its complexity and heterogeneity. It also provides a concise historiographical overview of how the canon of this literature has been shaped from the West. In addition, it demonstrates the lack of statistical studies on the reception of African literature in English in Spain. In response to this need, the article aims to detail and analyse the previously unpublished statistical data that the database provides, adopting a descriptive methodology. The results of this study, which contributes reliable and novel quantitative and qualitative data, are original insofar as they reflect and point out the problems of translating African literature in English in Spain. BDÁFRICA, which is free and available online, is intended as a resource and a source that will stimulate the development of research on postcolonial literature in Spain. This specialised bibliographic database is undoubtedly a very valuable tool, especially for researchers, translators and publishers interested in African literature.
    • Recursos documentales para la traducción de seguros turísticos en el par de lenguas inglés-español

      Corpas Pastor, Gloria; Seghiri Domínguez, Miriam; Postigo Pinazo, Encarnación (Universidad de Málaga, 2007-04-05)
      The following pages summarise part of the research carried out within an interdisciplinary, inter-university R&D project on Translation Technologies called TURICOR (BFF2003-04616, MCYT), whose main objectives are the virtual compilation of a multilingual corpus of tourism contracts from electronic resources and the development of a natural language generation (NLG) system, also multilingual. The Turicor corpus thus contains various types of documents relating to tourism contracts in the four languages involved (Spanish, English, German and Italian). Specifically, the text typology that has guided the selection of the documents making up the various subcorpora of Turicor covers the following: tourism legislation (international, European Community and national legislation of the respective countries included); general terms and conditions, forms and tourism contracts.
    • Refined Salience Weighting and Error Analysis in Anaphora Resolution

      Evans, Richard (The Research Group in Computational Linguistics, 2002)
      In this paper, the behaviour of an existing pronominal anaphora resolution system is modified so that different types of pronoun are treated in different ways. Weights are derived using a genetic algorithm for the outcomes of tests applied by this branching algorithm. Detailed evaluation and error analysis is undertaken. Proposals for future research are put forward.
    • Register-Specific Collocational Constructions in English and Spanish: A Usage-Based Approach

      Corpas Pastor, Gloria (Science Publications, 2015-03-01)
      Constructions are usage-based, conventionalised pairings of form and function within a cline of complexity and schematisation. Most research within Construction Grammar has focused on the monolingual description of schematic constructions, mainly in English but to a lesser extent in other languages as well. By contrast, very few constructional analyses have been carried out across languages. In this study we focus on a type of partially substantive construction from the point of view of contrastive analysis and translation; to the best of our knowledge, this is one of the first studies of its kind. The first half of the article lays down the theoretical foundations of the study and introduces Construction Grammar as well as other formalisms used in the literature in order to provide a construal account of collocations, a pervasive phenomenon in language. The experimental part describes the case study of V NP collocations with disease/enfermedad in comparable corpora in English and Spanish, both in the general domain and in the specialised medical domain. A comparative analysis of these constructions across domains and languages is provided in terms of token-type ratio (constructional restriction-rate), lexical function, type of determiner, frequency ranking of the verbal collocate and domain specificity of collocates, among others. New measures to assess construal bondness are put forward (lexical filledness rate and individual productivity rate), and special attention is paid to register-dependent equivalent semantic-functional counterparts in English and Spanish and to mismatches.
    • A report on the Third VarDial evaluation campaign

      Zampieri, Marcos; Malmasi, Shervin; Scherrer, Yves; Samardžić, Tanja; Tyers, Francis; Silfverberg, Miikka; Klyueva, Natalia; Pan, Tung-Le; Huang, Chu-Ren; Ionescu, Radu Tudor; et al. (Association for Computational Linguistics, 2019-12-31)
      In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.
    • Research dissemination and invocation on the Web

      Thelwall, Mike (MCB UP Ltd, 2002)
      The importance of the Web as a new medium for disseminating and promoting scholarly research is discussed. Particular attention is paid to its potential to provide evidence of wider impact for research than that which can be shown by citation analysis. Recommendations are made for basic strategies for the reporting of the online impact of research leading to the production of what is termed a Web Invocation Portfolio. A conceptual framework is also proposed to help funding and promotion committees assess and compare portfolios.