    • Language evolution and the spread of ideas on the Web: A procedure for identifying emergent hybrid word family members

      Thelwall, Mike; Price, Liz (Wiley, 2006)
      Word usage is of interest to linguists for its own sake as well as to social scientists and others who seek to track the spread of ideas, for example, in public debates over political decisions. The historical evolution of language can be analyzed with the tools of corpus linguistics through evolving corpora and the Web. But word usage statistics can only be gathered for known words. In this article, techniques are described and tested for identifying new words from the Web, focusing on the case when the words are related to a topic and have a hybrid form with a common sequence of letters. The results highlight the need to employ a combination of search techniques and show the wide potential of hybrid word family investigations in linguistics and social science.
    • Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions

      Taslimipoor, Shiva; Desantis, Anna; Cherchi, Manuela; Mitkov, Ruslan; Monti, Johanna (ceur-ws, 2016-12-05)
      This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.
    • Large-scale data harvesting for biographical data

      Plum, Alistair; Zampieri, Marcos; Orasan, Constantin; Wandl-Vogt, Eveline; Mitkov, R (CEUR, 2019-09-05)
      This paper explores automatic methods to identify relevant biography candidates in large databases, and extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though they are considered by many as an identical resource, our results show that the data from Wikipedia and Wikidata differs in some cases and they can be used in a complementary way providing more data for the compilation of biographies.
    • Laughing one's head off in Spanish subtitles: a corpus-based study on diatopic variation and its consequences for translation

      Corpas Pastor, Gloria; Mogorrón, Pedro; Martines, Vicent (John Benjamins, 2018-11-08)
      Looking for phraseological information is common practice among translators. When rendering idioms, information is mostly needed to find the appropriate equivalent, but, also, to check usage and diasystemic restrictions. One of the most complex issues in this respect is diatopic variation. English and Spanish are transnational languages that are spoken in several countries around the globe. Cross-variety differences as regards idiomaticity range from the actual choice of phraseological units, to different lexical or grammatical variants, usage preferences and differential distribution. In this respect, translators are severely underequipped as regards information found in dictionaries. While some diatopic marks are generally used to indicate geographical restrictions, not all idioms are clearly identified and very little information is provided about preferences and/or crucial differences that occur when the same idiom is used in various national varieties. In translation, source language textemes usually turn into target language repertoremes, i.e. established units within the target system. Toury’s law of growing standardisation helps explain why translated texts tend to be simpler, more conventional and more prototypical than non-translated texts, among other characteristic features. Provided a substantial part of translational Spanish is composed of textual repertoremes, any source textemes are bound to be ‘dissolved’ into typical ways of expressing in ‘standard’ Spanish. This means filtering source idiomatic diatopy through the ‘neutral, standard sieve’. This paper delves into the rendering into Spanish of the English idiom to laugh one’s head off. After a cursory look at the notions of transnational and translational Spanish(es) in Section 2, Section 3 analyses the translation strategies deployed in a giga-token parallel subcorpus of Spanish-English subtitles. In Section 4, dictionary and textual equivalents retrieved from the parallel corpus are studied against the background of two sets of synonymous idioms for ‘laughing out loud’ in 19 giga-token comparable subcorpora of Spanish national varieties. Corpas Pastor’s (2015) corpus-based research protocol will be adopted in order to uncover varietal differences, detect diatopic configurations and derive consequences for contrastive studies and translation, as summarised in Section 5. This is the first study, to the best of our knowledge, investigating the translation of to laugh one’s head off and also analysing the Spanish equivalent idioms in national and transnational corpora.
    • Leveraging large corpora for translation using the Sketch Engine

      Moze, Sara; Krek, Simon (Cambridge University Press, 2018)
    • Linguistic features of genre and method variation in translation: A computational perspective

      Lapshinova-Koltunski, Ekaterina; Zampieri, Marcos; Legallois, Dominique; Charnois, Thierry; Larjavaara, Meri (Mouton De Gruyter, 2018-04-09)
      In this contribution we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus. For this purpose we use linguistically motivated features representing texts using a combination of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis on the main differences between genres and methods of translation.
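The feature and classifier combination named in this abstract can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-one (Laplace) smoothing over part-of-speech n-grams; the function names, the tag set, and the class labels used with it are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tags, n_values=(2, 3, 4)):
    """Turn a POS-tag sequence into bigram, trigram and 4-gram features."""
    feats = []
    for n in n_values:
        for i in range(len(tags) - n + 1):
            feats.append("_".join(tags[i:i + n]))
    return feats


def train(docs):
    """docs: list of (tag_sequence, label) pairs. Returns the count tables."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tags, label in docs:
        label_counts[label] += 1
        for feat in pos_ngrams(tags):
            feat_counts[label][feat] += 1
            vocab.add(feat)
    return label_counts, feat_counts, vocab


def predict(model, tags, alpha=1.0):
    """Pick the label with the highest smoothed log-probability."""
    label_counts, feat_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(feat_counts[label].values()) + alpha * len(vocab)
        for feat in pos_ngrams(tags):
            if feat in vocab:  # features never seen in training are ignored
                score += math.log((feat_counts[label][feat] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Smoothing matters here because most 4-grams occur in only one class of a small corpus; without the `alpha` term a single unseen n-gram would drive a class probability to zero.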
    • Linguistic patterns of academic Web use in Western Europe

      Thelwall, Mike; Tang, Rong; Price, Liz (Springer, 2003)
      A survey of linguistic dimensions of Web site hosting and interlinking of the universities of sixteen European countries is described. The results show that English is the dominant language both for linking pages and for all pages. In a typical country approximately half the pages were in English and half in one or more national languages. Normalised interlinking patterns showed three trends: 1) international interlinking throughout Europe in English, and additionally in Swedish in Scandinavia; 2) linking between countries sharing a common language, and 3) countries extensively hosting international links in their own major languages. This provides evidence for the multilingual character of academic use of the Web in Western Europe, at least outside the UK and Eire. Evidence was found that Greece was significantly linguistically isolated from the rest of the EU but that outsiders Norway and Switzerland were not.
    • Linking Verb Pattern Dictionaries of English and Spanish

      Baisa, Vít; Moze, Sara; Renau, Irene (ELRA, 2016-05-24)
      The paper presents the first step in the creation of a new multilingual and corpus-driven lexical resource by means of linking existing monolingual pattern dictionaries of English and Spanish verbs. The two dictionaries were compiled through Corpus Pattern Analysis (CPA) – an empirical procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations found in corpus data. This paper provides a first look into a number of practical issues arising from the task of linking corresponding patterns across languages via both manual and automatic procedures. In order to facilitate manual pattern linking, we implemented a heuristic-based algorithm to generate automatic suggestions for candidate verb pattern pairs, which obtained 80% precision. Our goal is to kick-start the development of a new resource for verbs that can be used by language learners, translators, editors and the research community alike.
    • Long term productivity and collaboration in information science

      Thelwall, Mike; Levitt, Jonathan (Springer, 2016-07-02)
      Funding bodies have tended to encourage collaborative research because it is generally more highly cited than sole author research. But higher mean citation for collaborative articles does not imply collaborative researchers are in general more research productive. This article assesses the extent to which research productivity varies with the number of collaborative partners for long term researchers within three Web of Science subject areas: Information Science & Library Science, Communication and Medical Informatics. When using the whole number counting system, researchers who worked in groups of 2 or 3 were generally the most productive, in terms of producing the most papers and citations. However, when using fractional counting, researchers who worked in groups of 1 or 2 were generally the most productive. The findings need to be interpreted cautiously, however, because authors that produce few academic articles within a field may publish in other fields or leave academia and contribute to society in other ways.
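The difference between the two counting systems mentioned in this abstract can be shown with a small sketch. It assumes the usual definitions: whole counting credits each co-author with one full paper, while fractional counting credits each of n co-authors with 1/n. The function name and the author labels are hypothetical.

```python
from collections import defaultdict


def count_papers(papers):
    """papers: list of author lists. Returns (whole, fractional) credit per author."""
    whole = defaultdict(float)
    fractional = defaultdict(float)
    for authors in papers:
        share = 1 / len(authors)  # equal split among co-authors
        for author in authors:
            whole[author] += 1
            fractional[author] += share
    return dict(whole), dict(fractional)


# Hypothetical example: one solo paper, one pair, one group of four.
whole, fractional = count_papers([["A"], ["A", "B"], ["A", "B", "C", "D"]])
```

Under whole counting a researcher's total grows with every collaboration, whereas under fractional counting large groups contribute little per author, which is one way the two systems can rank the same researchers differently.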
    • Mendeley readership altmetrics for medical articles: An analysis of 45 fields

      Wilson, Paul; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY UK (Wiley Blackwell, 2015-05)
      ISSN: 2330-1643
    • Methodologies for crawler based Web surveys.

      Thelwall, Mike (MCB UP Ltd, 2002)
      There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well-known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
    • Microsoft Academic automatic document searches: accuracy for journal articles and suitability for citation analysis

      Thelwall, Mike (Elsevier, 2017-11-22)
      Microsoft Academic is a free academic search engine and citation index that is similar to Google Scholar but can be automatically queried. Its data is potentially useful for bibliometric analysis if it is possible to search effectively for individual journal articles. This article compares different methods to find journal articles in its index by searching for a combination of title, authors, publication year and journal name and uses the results for the widest published correlation analysis of Microsoft Academic citation counts for journal articles so far. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy to find articles with DOIs is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search for them by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% incorrect matches. The remaining articles seem to be mainly not indexed by Microsoft Academic or indexed with a different language version of their title. From the matches, Scopus citation counts and Microsoft Academic counts have an average Spearman correlation of 0.95, with the lowest for any single field being 0.63. Thus, Microsoft Academic citation counts are almost universally equivalent to Scopus citation counts for articles that are not recent but there are national biases in the results.
    • Monitoring Twitter strategies to discover resonating topics: The case of the UNDP

      Thelwall, Mike; Cugelman, Brian (EPI - El Profesional de la información., 2017-08-02)
      Many organizations use social media to attract supporters, disseminate information and advocate change. Services like Twitter can theoretically deliver messages to a huge audience that would be difficult to reach by other means. This article introduces a method to monitor an organization’s Twitter strategy and applies it to tweets from United Nations Development Programme (UNDP) accounts. The Resonating Topic Method uses automatic analyses with free software to detect successful themes within the organization’s tweets, categorizes the most successful tweets, and analyses a comparable organization to identify new successful strategies. In the case of UNDP tweets from November 2014 to March 2015, the results confirm the importance of official social media accounts as well as those of high profile individuals and general supporters. Official accounts seem to be more successful at encouraging action, which is a critical aspect of social media campaigning. An analysis of Oxfam found a successful social media approach that the UNDP had not adopted, showing the value of analyzing other organizations to find potential strategy gaps.
    • Motivations for academic web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication

      Wilkinson, David; Harries, Gareth; Thelwall, Mike; Price, Liz (Sage, 2003)
      The need to understand authors’ motivations for creating links between university web sites is addressed by a survey of a random collection of 414 such links from the ac.uk domain. A classification scheme was created and applied to this collection. Obtaining inter-classifier agreement as to the single main link creation cause was very difficult because of multiple potential motivations and the fluidity of genre on the Web. Nevertheless, it was clear that, whilst the vast majority, over 90%, was created for broadly scholarly reasons, only two were equivalent to journal citations. It is concluded that academic web link metrics will be dominated by a range of informal types of scholarly communication. Since formal communication can be extensively studied through citation analysis, this provides an exciting new window through which to investigate a facet of a previously obscured type of communication activity.
    • Multi-document summarization of news articles using an event-based framework

      Ou, Shiyan; Khoo, Christopher S.G.; Goh, Dion H. (Emerald, 2006)
      Purpose – The purpose of this research is to develop a method for automatic construction of multi-document summaries of sets of news articles that might be retrieved by a web search engine in response to a user query. Design/methodology/approach – Based on the cross-document discourse analysis, an event-based framework is proposed for integrating and organizing information extracted from different news articles. It has a hierarchical structure in which the summarized information is presented at the top level and more detailed information given at the lower levels. A tree-view interface was implemented for displaying a multi-document summary based on the framework. A preliminary user evaluation was performed by comparing the framework-based summaries against the sentence-based summaries. Findings – In a small evaluation, all the human subjects preferred the framework-based summaries to the sentence-based summaries. It indicates that the event-based framework is an effective way to summarize a set of news articles reporting an event or a series of relevant events. Research limitations/implications – Limited to event-based news articles only, not applicable to news critiques and other kinds of news articles. A summarization system based on the event-based framework is being implemented. Practical implications – Multi-document summarization of news articles can adopt the proposed event-based framework. Originality/value – An event-based framework for summarizing sets of news articles was developed and evaluated using a tree-view interface for displaying such summaries.
    • Multiword units in machine translation and translation technology

      Mitkov, Ruslan; Monti, Johanna; Corpas Pastor, Gloria; Seretan, Violeta (John Benjamins, 2018-07-20)
      The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but we believe that there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully. In this chapter, we present a survey of the field with particular reference to Machine Translation and Translation Technology.
    • Mutual terminology extraction using a statistical framework

      Ha, Le An; Mitkov, Ruslan; Corpas Pastor, Gloria (Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), 2008-06-16)
      In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to assess the proposition that automatic alignment can play an active role in bilingual terminology extraction and translate it into mutual bilingual terminology extraction. The results indicate that such models are valid and can show that mutual bilingual terminology extraction is indeed a viable approach.
    • National Scientific Performance Evolution Patterns: Retrenchment, Successful Expansion, or Overextension

      Thelwall, Mike; Levitt, Jonathan M. (Wiley-Blackwell, 2017-11-17)
      National governments would like to preside over an expanding and increasingly high impact science system but are these two goals largely independent or closely linked? This article investigates the relationship between changes in the share of the world’s scientific output and changes in relative citation impact for 2.6 million articles from 26 fields in the 25 countries with the most Scopus-indexed journal articles from 1996 to 2015. There is a negative correlation between expansion and relative citation impact but their relationship varies. China, Spain, Australia, and Poland were successful overall across the 26 fields, expanding both their share of the world’s output and its relative citation impact, whereas Japan, France, Sweden and Israel had decreased shares and relative citation impact. In contrast, the USA, UK, Germany, Italy, Russia, Netherlands, Switzerland, Finland, and Denmark all enjoyed increased relative citation impact despite a declining share of publications. Finally, India, South Korea, Brazil, Taiwan, and Turkey all experienced sustained expansion but a recent fall in relative citation impact. These results may partly reflect changes in the coverage of Scopus and the selection of fields.
    • New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.
    • New versions of PageRank employing alternative Web document models

      Thelwall, Mike; Vaughan, Liwen (Emerald Group Publishing Limited, 2004)
      Introduces several new versions of PageRank (the link based Web page ranking algorithm), based on an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based on directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms introduced based on these alternatives were used to rank four sets of Web pages. The ranking results were compared with human subjects’ rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that includes pages from different Web sites; however, it does not work well in ranking pages that are from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.
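The idea of ranking aggregated Web documents rather than individual pages can be sketched as follows. This is an illustration only: the abstract does not specify the algorithms, so the code assumes standard PageRank by power iteration (damping factor 0.85) applied to a link graph whose pages have been collapsed to domains or directories. The function names and URLs are hypothetical.

```python
from urllib.parse import urlparse


def aggregate_links(links, level="domain"):
    """Collapse page-to-page links into links between coarser documents."""
    def unit(url):
        parts = urlparse(url)
        if level == "domain":
            return parts.netloc
        # Directory level: host plus the path up to the last slash.
        return parts.netloc + parts.path.rsplit("/", 1)[0] + "/"

    merged = set()
    for src, dst in links:
        s, d = unit(src), unit(dst)
        if s != d:  # links inside one aggregated document are dropped
            merged.add((s, d))
    return merged


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a set of (source, target) links."""
    nodes = sorted({n for edge in links for n in edge})
    out = {n: [] for n in nodes}
    for s, d in links:
        out[s].append(d)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for s in nodes:
            for d in out[s]:
                new[d] += damping * rank[s] / len(out[s])
        rank = new
    return rank
```

Running `pagerank(aggregate_links(links))` ranks domains instead of pages. Because links within the same site are discarded during aggregation, such a model cannot distinguish pages from the same site, which is consistent with the limitation the abstract reports.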