• The first Automatic Translation Memory Cleaning Shared Task

      Barbu, Eduard; Parra Escartín, Carla; Bentivogli, Luisa; Negri, Matteo; Turchi, Marco; Orasan, Constantin; Federico, Marcello (Springer, 2017-01-21)
      This paper reports on the organization and results of the rst Automatic Translation Memory Cleaning Shared Task. This shared task is aimed at nding automatic ways of cleaning translation memories (TMs) that have not been properly curated and thus include incorrect translations. As a follow up of the shared task, we also conducted two surveys, one targeting the teams participating in the shared task, and the other one targeting professional translators. While the researchers-oriented survey aimed at gathering information about the opinion of participants on the shared task, the translators-oriented survey aimed to better understand what constitutes a good TM unit and inform decisions that will be taken in future editions of the task. In this paper, we report on the process of data preparation and the evaluation of the automatic systems submitted, as well as on the results of the collected surveys.
    • The influence of highly cited papers on field normalised indicators

      Thelwall, Mike (Springer, 2019-01-05)
      Field normalised average citation indicators are widely used to compare countries, universities and research groups. The most common variant, the Mean Normalised Citation Score (MNCS), is known to be sensitive to individual highly cited articles but the extent to which this is true for a log-based alternative, the Mean Normalised Log Citation Score (MNLCS), is unknown. This article investigates country-level highly cited outliers for MNLCS and MNCS for all Scopus articles from 2013 and 2012. The results show that MNLCS is influenced by outliers, as measured by kurtosis, but at a much lower level than MNCS. The largest outliers were affected by the journal classifications, with the Science-Metrix scheme producing much weaker outliers than the internal Scopus scheme. The high Scopus outliers were mainly due to uncitable articles reducing the average in some humanities categories. Although outliers have a numerically small influence on the outcome for individual countries, changing indicator or classification scheme influences the results enough to affect policy conclusions drawn from them. Future field normalised calculations should therefore explicitly address the influence of outliers in their methods and reporting.
    • The research production of nations and departments: A statistical model for the share of publications

      Thelwall, Mike (Elsevier, 2017-11-04)
      Policy makers and managers sometimes assess the share of research produced by a group (country, department, institution). This takes the form of the percentage of publications in a journal, field or broad area that has been published by the group. This quantity is affected by essentially random influences that obscure underlying changes over time and differences between groups. A model of research production is needed to help identify whether differences between two shares indicate underlying differences. This article introduces a simple production model for indicators that report the share of the world’s output in a journal or subject category, assuming that every new article has the same probability to be authored by a given group. With this assumption, confidence limits can be calculated for the underlying production capability (i.e., probability to publish). The results of a time series analysis of national contributions to 36 large monodisciplinary journals 1996-2016 are broadly consistent with this hypothesis. Follow up tests of countries and institutions in 26 Scopus subject categories support the conclusions but highlight the importance of ensuring consistent subject category coverage.
    • Three kinds of semantic resonance

      Hanks, Patrick (Ivane Javakhishvili Tbilisi University Press, 2016-09-06)
      This presentation suggests some reasons why lexicographers of the future will need to pay more attention to phraseology and non-literal meaning. It argues that not only do words have literal meaning, but also that much meaning is non-literal, being lexical, i.e. metaphorical or figurative, experiential, or intertextual.
    • Three practical field normalised alternative indicator formulae for research evaluation

      Thelwall, Mike (Elsevier, 2017-02)
      Although altmetrics and other web-based alternative indicators are now commonplace in publishers’ websites, they can be difficult for research evaluators to use because of the time or expense of the data, the need to benchmark in order to assess their values, the high proportion of zeros in some alternative indicators, and the time taken to calculate multiple complex indicators. These problems are addressed here by (a) a field normalisation formula, the Mean Normalised Log-transformed Citation Score (MNLCS) that allows simple confidence limits to be calculated and is similar to a proposal of Lundberg, (b) field normalisation formulae for the proportion of cited articles in a set, the Equalised Mean-based Normalised Proportion Cited (EMNPC) and the Mean-based Normalised Proportion Cited (MNPC), to deal with mostly uncited data sets, (c) a sampling strategy to minimise data collection costs, and (d) free unified software to gather the raw data, implement the sampling strategy, and calculate the indicator formulae and confidence limits. The approach is demonstrated (but not fully tested) by comparing the Scopus citations, Mendeley readers and Wikipedia mentions of research funded by Wellcome, NIH, and MRC in three large fields for 2013–2016. Within the results, statistically significant differences in both citation counts and Mendeley reader counts were found even for sets of articles that were less than six months old. Mendeley reader counts were more precise than Scopus citations for the most recent articles and all three funders could be demonstrated to have an impact in Wikipedia that was significantly above the world average.
    • Three target document range metrics for university web sites

      Thelwall, Mike; Wilkinson, David (Wiley, 2003)
      Three new metrics are introduced that measure the range of use of a university Web site by its peers through different heuristics for counting links targeted at its pages. All three give results that correlate significantly with the research productivity of the target institution. The directory range model, which is based upon summing the number of distinct directories targeted by each other university, produces the most promising results of any link metric yet. Based upon an analysis of changes between models, it is suggested that range models measure essentially the same quantity as their predecessors but are less susceptible to spurious causes of multiple links and are therefore more robust.
    • Toponym detection in the bio-medical domain: A hybrid approach with deep learning

      Plum, Alistair; Ranasinghe, Tharindu; Orăsan, Constantin (RANLP, 2019-09-02)
      This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
    • Translating English verbal collocations into Spanish: On distribution and other relevant differences related to diatopic variation

      Corpas Pastor, Gloria (John Benjamins Publishing Company, 2015-12-21)
      Language varieties should be taken into account in order to enhance fluency and naturalness of translated texts. In this paper we will examine the collocational verbal range for prima-facie translation equivalents of words like decision and dilemma, which in both languages denote the act or process of reaching a resolution after consideration, resolving a question or deciding something. We will be mainly concerned with diatopic variation in Spanish. To this end, we set out to develop a giga-token corpus-based protocol which includes a detailed and reproducible methodology sufficient to detect collocational peculiarities of transnational languages. To our knowledge, this is one of the first observational studies of this kind. The paper is organised as follows. Section 1 introduces some basic issues about the translation of collocations against the background of languages’ anisomorphism. Section 2 provides a feature characterisation of collocations. Section 3 deals with the choice of corpora, corpus tools, nodes and patterns. Section 4 covers the automatic retrieval of the selected verb + noun (object) collocations in general Spanish and the co-existing national varieties. Special attention is paid to comparative results in terms of similarities and mismatches. Section 5 presents conclusions and outlines avenues of further research.
    • Tweeting links to academic articles

      Thelwall, M.; Tsou, A.; Holmberg, K.; Haustein, S. (2013-01-01)
      Academic articles are now frequently tweeted and so Twitter seems to be a useful tool for scholars to use to help keep up with publications and discussions in their fields. Perhaps as a result of this, tweet counts are increasingly used by digital libraries and journal websites as indicators of an article's interest or impact. Nevertheless, it is not known whether tweets are typically positive, neutral or critical, or how articles are normally tweeted. These are problems for those wishing to tweet articles effectively and for those wishing to know whether tweet counts in digital libraries should be taken seriously. In response, a pilot study content analysis was conducted of 270 tweets linking to articles in four journals, four digital libraries and two DOI URLs, collected over a period of eight months in 2012. The vast majority of the tweets echoed an article title (42%) or a brief summary (41%). One reason for summarising an article seemed to be to translate it for a general audience. Few tweets explicitly praised an article and none were critical. Most tweets did not directly refer to the article author, but some did and others were clearly self-citations. In summary, tweets containing links to scholarly articles generally provide little more than publicity, and so whilst tweet counts may provide evidence of the popularity of an article, the contents of the tweets themselves are unlikely to give deep insights into scientists' reactions to publications, except perhaps in special cases.
    • Understanding the geographical development of social movements: a web-link analysis of Slow Food

      HENDRIKX, BAS; DORMANS, STEFAN; LAGENDIJK, ARNOUD; Thelwall, Mike; Geography, Planning and Environment; Radboud University Nijmegen; Netherlands; Institute for Management Research; Radboud University Nijmegen; Geography, Planning and Environment; Radboud University Nijmegen; Statistical Cybermetrics Research Group; University of Wolverhampton (Wiley-Blackwell, 2016-11-29)
      Slow Food (SF) is a global, grassroots movement aimed at enhancing and sustaining local food cultures and traditions worldwide. Since its establishment in the 1980s, Slow Food groups have emerged across the world and embedded in a wide range of different contexts. In this article, we explain how the movement, as a diverse whole, is being shaped by complex dynamics existing between grassroots flexibilities and emerging drives for movement coherence and harmonization. Unlike conventional studies on social movements, our approach helps one to understand transnational social movements as being simultaneously coherent and diverse bodies of collective action. Drawing on work in the fields of relational geography, assemblage theory and webometric research, we develop an analytical strategy that navigates and maps the entire Slow Food movement by exploring its ‘double articulation’ between the material-connective and ideational-expressive. Focusing on representations of this connectivity and articulation on the internet, we combine methodologies of computation research (webometrics) with more qualitative forms of (web) discourse analysis to achieve this. Our results point to the significance of particular networks and nodal points that support such double movements, each presenting core logistical channels of the movement's operations as well as points of relay of new ideas and practices. A network-based analysis of ‘double articulation’ thus shows how the co-evolution of ideas and material practices cascades into major trends without having to rely on a ‘grand', singular explanation of a movement's development.
    • Using gaze data to predict multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Yaneva, Victoria; Ha, Le An (INCOMA Ltd, 2017-09-01)
      In recent years gaze data has been increasingly used to improve and evaluate NLP models due to the fact that it carries information about the cognitive processing of linguistic phenomena. In this paper we conduct a preliminary study towards the automatic identification of multiword expressions based on gaze features from native and non-native speakers of English. We report comparisons between a part-ofspeech (POS) and frequency baseline to: i) a prediction model based solely on gaze data and ii) a combined model of gaze data, POS and frequency. In spite of the challenging nature of the task, best performance was achieved by the latter. Furthermore, we explore how the type of gaze data (from native versus non-native speakers) affects the prediction, showing that data from the two groups is discriminative to an equal degree. Finally, we show that late processing measures are more predictive than early ones, which is in line with previous research on idioms and other formulaic structures.
    • The way to analyse ‘way’: A case study in word-specific local grammar

      Hanks, Patrick; Može, Sara (Oxford Academic, 2019-02-11)
      Traditionally, dictionaries are meaning-driven—that is, they list different senses (or supposed senses) of each word, but do not say much about the phraseology that distinguishes one sense from another. Grammars, on the other hand, are structure-driven: they attempt to describe all possible structures of a language, but say little about meaning, phraseology, or collocation. In both disciplines during the 20th century, the practice of inventing evidence rather than discovering it led to intermittent and unpredictable distortions of fact. Since 1987, attempts have been made in both lexicography (Cobuild) and syntactic theory (pattern grammar, construction grammar) to integrate meaning and phraseology. Corpora now provide empirical evidence on a large scale for lexicosyntactic description, but there is still a long way to go. Many cherished beliefs must be abandoned before a synthesis between empirical lexical analysis and grammatical theory can be achieved. In this paper, by empirical analysis of just one word (the noun way), we show how corpus evidence can be used to tackle the complexities of lexical and constructional meaning, providing new insights into the lexis-grammar interface.
    • Web citations in patents: Evidence of technological impact?

      Enrique Orduna-Malea; Thelwall, Mike; Kousha, Kayvan; EC3 Research Group, Universitat Politècnica de València (UPV), 46022 Valencia, Spain (Wiley Blackwell, 2017-07-17)
      Patents sometimes cite web pages either as general background to the problem being addressed or to identify prior publications that will limit the scope of the patent granted. Counts of the number of patents citing an organisation’s website may therefore provide an indicator of its technological capacity or relevance. This article introduces methods to extract URL citations from patents and evaluates the usefulness of counts of patent web citations as a technology indicator. An analysis of patents citing 200 US universities or 177 UK universities found computer science and engineering departments to be frequently cited, as well as research-related web pages, such as Wikipedia, YouTube or Internet Archive. Overall, however, patent URL citations seem to be frequent enough to be useful for ranking major US and the top few UK universities if popular hosted subdomains are filtered out, but the hit count estimates on the first search engine results page should not be relied upon for accuracy.
    • Web impact factors and search engine coverage

      Thelwall, Mike (MCB UP Ltd, 2000)
      Search engines index only a proportion of the web and this proportion is not determined randomly but by following algorithms that take into account the properties that impact factors measure. A survey was conducted in order to test the coverage of search engines and to decide whether their partial coverage is indeed an obstacle to using them to calculate web impact factors. The results indicate that search engine coverage, even of large national domains is extremely uneven and would be likely to lead to misleading calculations.
    • Web issue analysis: an integrated water resource management case study

      Thelwall, Mike; Vann, Katie; Fairclough, Ruth (Wiley InterScience, 2006)
      In this article Web issue analysis is introduced as a new technique to investigate an issue as reflected on the Web. The issue chosen, integrated water resource management (IWRM), is a United Nations-initiated paradigm for managing water resources in an international context, particularly in developing nations. As with many international governmental initiatives, there is a considerable body of online information about it: 41,381 hypertext markup language (HTML) pages and 28,735 PDF documents mentioning the issue were downloaded. A page uniform resource locator (URL) and link analysis revealed the international and sectoral spread of IWRM. A noun and noun phrase occurrence analysis was used to identify the issues most commonly discussed, revealing some unexpected topics such as private sector and economic growth. Although the complexity of the methods required to produce meaningful statistics from the data is disadvantageous to easy interpretation, it was still possible to produce data that could be subject to a reasonably intuitive interpretation. Hence Web issue analysis is claimed to be a useful new technique for information science.
    • Web log file analysis: backlinks and queries

      Thelwall, Mike (MCB UP Ltd, 2001)
      As has been described else where, web log files are a useful source of information about visitor site use, navigation behaviour, and, to some extent, demographics. But log files can also reveal the existence of both web pages and search engine queries that are sources of new visitors.This study extracts such information from a single web log files and uses it to illustrate its value, not only to th site owner but also to those interested in investigating the online behaviour of web users.
    • Web users with autism: eye tracking evidence for differences

      Eraslan, Sukru; Yaneva, Victoria; Yesilada, Yeliz; Harper, Simon (Taylor and Francis, 2018-12-11)
      Anecdotal evidence suggests that people with autism may have different processing strategies when accessing the web. However, limited empirical evidence is available to support this. This paper presents an eye tracking study with 18 participants with high-functioning autism and 18 neurotypical participants to investigate the similarities and differences between these two groups in terms of how they search for information within web pages. According to our analysis, people with autism are likely to be less successful in completing their searching tasks. They also have a tendency to look at more elements on web pages and make more transitions between the elements in comparison to neurotypical people. In addition, they tend to make shorter but more frequent fixations on elements which are not directly related to a given search task. Therefore, this paper presents the first empirical study to investigate how people with autism differ from neurotypical people when they search for information within web pages based on an in-depth statistical analysis of their gaze patterns.
    • What jihad questions do Muslims ask?

      Emad Mohamed; Bakinaz Abdalla (Indiana University Press, 2017-05-01)
      Using digital humanities tools and methods, we extract, classify, and analyze 1,006 jihad fatwas from a corpus of 164,000 online fatwas. We use the questions and page hits to rank clusters of fatwas in order to discover what jihad questions Muslims ask, what jihad issues interest Muslims the most, and what the targets of jihad may be. We focus more on the questions than the answers, since it is the questions that give us a window into what may be called the “Muslim collective mind.” The results show that jihad questions are interwoven with several key topics, from performance of prayers to expiation for homosexuality. While the Prophet Muhammad's military expeditions were the most asked about and most viewed category, since they provide a model of what jihad is, the second most important category was concubinage. When there was a specific target, Jews were found in 73% of the questions.
    • What matters more: the size of the corpora or their quality? The case of automatic translation of multiword expressions using comparable corpora.

      Mitkov, Ruslan; Taslimipoor, Shiva (John Benjamins, 2016)
      This study investigates (and compares) the impact of the size and the similarity/quality of comparable corpora on the specific task of extracting translation equivalents of verb-noun collocations from such corpora. The comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general and perennial question: what matters more – the quantity or quality of corpora?
    • When are readership counts as useful as citation counts? Scopus versus Mendeley for LIS journals

      Thelwall, Mike; Maflahi, Nabeil (Wiley-Blackwell, 2014-06)
      In theory, articles can attract readers on the social reference sharing site Mendeley before they can attract citations, so Mendeley altmetrics could provide early indications of article impact. This article investigates the influence of time on the number of Mendeley readers of an article through a theoretical discussion and an investigation into the relationship between counts of readers of, and citations to, 4 general library and information science (LIS) journals. For this discipline, it takes about 7 years for articles to attract as many Scopus citations as Mendeley readers, and after this the Spearman correlation between readers and citers is stable at about 0.6 for all years. This suggests that Mendeley readership counts may be useful impact indicators for both newer and older articles. The lack of dates for individual Mendeley article readers and an unknown bias toward more recent articles mean that readership data should be normalized individually by year, however, before making any comparisons between articles published in different years.