• Understanding the geographical development of social movements: a web-link analysis of Slow Food

      Hendrikx, Bas; Dormans, Stefan; Lagendijk, Arnoud; Thelwall, Mike; Geography, Planning and Environment; Radboud University Nijmegen; Netherlands; Institute for Management Research; Radboud University Nijmegen; Geography, Planning and Environment; Radboud University Nijmegen; Statistical Cybermetrics Research Group; University of Wolverhampton (Wiley-Blackwell, 2016-11-29)
      Slow Food (SF) is a global, grassroots movement aimed at enhancing and sustaining local food cultures and traditions worldwide. Since its establishment in the 1980s, Slow Food groups have emerged across the world and embedded in a wide range of different contexts. In this article, we explain how the movement, as a diverse whole, is being shaped by complex dynamics existing between grassroots flexibilities and emerging drives for movement coherence and harmonization. Unlike conventional studies on social movements, our approach helps one to understand transnational social movements as being simultaneously coherent and diverse bodies of collective action. Drawing on work in the fields of relational geography, assemblage theory and webometric research, we develop an analytical strategy that navigates and maps the entire Slow Food movement by exploring its ‘double articulation’ between the material-connective and ideational-expressive. Focusing on representations of this connectivity and articulation on the internet, we combine methodologies of computation research (webometrics) with more qualitative forms of (web) discourse analysis to achieve this. Our results point to the significance of particular networks and nodal points that support such double movements, each presenting core logistical channels of the movement's operations as well as points of relay of new ideas and practices. A network-based analysis of ‘double articulation’ thus shows how the co-evolution of ideas and material practices cascades into major trends without having to rely on a ‘grand’, singular explanation of a movement's development.
    • Using gaze data to predict multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Yaneva, Victoria; Ha, Le An (INCOMA Ltd, 2017-09-01)
      In recent years gaze data has been increasingly used to improve and evaluate NLP models due to the fact that it carries information about the cognitive processing of linguistic phenomena. In this paper we conduct a preliminary study towards the automatic identification of multiword expressions based on gaze features from native and non-native speakers of English. We report comparisons between a part-of-speech (POS) and frequency baseline and: i) a prediction model based solely on gaze data and ii) a combined model of gaze data, POS and frequency. In spite of the challenging nature of the task, best performance was achieved by the latter. Furthermore, we explore how the type of gaze data (from native versus non-native speakers) affects the prediction, showing that data from the two groups is discriminative to an equal degree. Finally, we show that late processing measures are more predictive than early ones, which is in line with previous research on idioms and other formulaic structures.
    • Using natural language processing to predict item response times and improve test construction

      Baldwin, Peter; Yaneva, Victoria; Mee, Janet; Clauser, Brian E; Ha, Le An (Wiley, 2020-02-24)
      In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information‐retrieval‐based automatic question‐answering system finds an item challenging, and (c) dense word representations (word embeddings). Using a random forests algorithm, these data are then used to train a prediction model for item response times, and the predicted response times are then used to assemble test forms. Using empirical data from the United States Medical Licensing Examination, we show that timing demands are more consistent across these specially assembled forms than across forms comprising randomly‐selected items. Because an exam's timing conditions affect examinee performance, this result has implications for exam fairness whenever examinees are compared with each other or against a common standard.
    • Using semi-automatic compiled corpora for medical terminology and vocabulary building in the healthcare domain

      Gutiérrez Florido, Rut; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (Université Paris 13, 2013-10-28)
      English, Spanish and German are amongst the most spoken languages in Europe. Thus it is likely that patients from one EU member state seeking medical treatment in another will speak or understand one of these. However, there is a lack of resources to teach efficient communication between patients and medics. To combat this, the TELL-ME project will provide a fully targeted package. This includes learning materials for Medical English, Spanish and German aimed at medical staff already in the other countries or undertaking cross-border mobility. The learning process will be supported by computer-aided tools based on corpora. For this reason, in this workshop we present the semi-automatic compilation of the TELL-ME corpus, whose function is to support the e-learning platform of the TELL-ME project, together with its self-assessment exercises emphasising the importance of specialised terminology in the acquisition of communicative and language skills.
    • Verbal multiword expressions for identification of metaphor

      Rohanian, Omid; Rei, Marek; Taslimipoor, Shiva; Ha, Le (ACL, 2020-07-06)
      Metaphor is a linguistic device in which a concept is expressed by mentioning another. Identifying metaphorical expressions, therefore, requires a non-compositional understanding of semantics. Multiword Expressions (MWEs), on the other hand, are linguistic phenomena with varying degrees of semantic opacity and their identification poses a challenge to computational models. This work is the first attempt at analysing the interplay of metaphor and MWEs processing through the design of a neural architecture whereby classification of metaphors is enhanced by informing the model of the presence of MWEs. To the best of our knowledge, this is the first “MWE-aware” metaphor identification system paving the way for further experiments on the complex interactions of these phenomena. The results and analyses show that this proposed architecture reaches state-of-the-art performance on two different established metaphor datasets.
    • The way to analyse ‘way’: A case study in word-specific local grammar

      Hanks, Patrick; Može, Sara (Oxford Academic, 2019-02-11)
      Traditionally, dictionaries are meaning-driven—that is, they list different senses (or supposed senses) of each word, but do not say much about the phraseology that distinguishes one sense from another. Grammars, on the other hand, are structure-driven: they attempt to describe all possible structures of a language, but say little about meaning, phraseology, or collocation. In both disciplines during the 20th century, the practice of inventing evidence rather than discovering it led to intermittent and unpredictable distortions of fact. Since 1987, attempts have been made in both lexicography (Cobuild) and syntactic theory (pattern grammar, construction grammar) to integrate meaning and phraseology. Corpora now provide empirical evidence on a large scale for lexicosyntactic description, but there is still a long way to go. Many cherished beliefs must be abandoned before a synthesis between empirical lexical analysis and grammatical theory can be achieved. In this paper, by empirical analysis of just one word (the noun way), we show how corpus evidence can be used to tackle the complexities of lexical and constructional meaning, providing new insights into the lexis-grammar interface.
    • Web citations in patents: Evidence of technological impact?

      Orduna-Malea, Enrique; Thelwall, Mike; Kousha, Kayvan; EC3 Research Group, Universitat Politècnica de València (UPV), 46022 Valencia, Spain (Wiley Blackwell, 2017-07-17)
      Patents sometimes cite web pages either as general background to the problem being addressed or to identify prior publications that will limit the scope of the patent granted. Counts of the number of patents citing an organisation’s website may therefore provide an indicator of its technological capacity or relevance. This article introduces methods to extract URL citations from patents and evaluates the usefulness of counts of patent web citations as a technology indicator. An analysis of patents citing 200 US universities or 177 UK universities found computer science and engineering departments to be frequently cited, as well as research-related web pages, such as Wikipedia, YouTube or Internet Archive. Overall, however, patent URL citations seem to be frequent enough to be useful for ranking major US and the top few UK universities if popular hosted subdomains are filtered out, but the hit count estimates on the first search engine results page should not be relied upon for accuracy.
    • Web impact factors and search engine coverage

      Thelwall, Mike (MCB UP Ltd, 2000)
      Search engines index only a proportion of the web and this proportion is not determined randomly but by following algorithms that take into account the properties that impact factors measure. A survey was conducted in order to test the coverage of search engines and to decide whether their partial coverage is indeed an obstacle to using them to calculate web impact factors. The results indicate that search engine coverage, even of large national domains, is extremely uneven and would be likely to lead to misleading calculations.
    • Web issue analysis: an integrated water resource management case study

      Thelwall, Mike; Vann, Katie; Fairclough, Ruth (Wiley InterScience, 2006)
      In this article Web issue analysis is introduced as a new technique to investigate an issue as reflected on the Web. The issue chosen, integrated water resource management (IWRM), is a United Nations-initiated paradigm for managing water resources in an international context, particularly in developing nations. As with many international governmental initiatives, there is a considerable body of online information about it: 41,381 hypertext markup language (HTML) pages and 28,735 PDF documents mentioning the issue were downloaded. A page uniform resource locator (URL) and link analysis revealed the international and sectoral spread of IWRM. A noun and noun phrase occurrence analysis was used to identify the issues most commonly discussed, revealing some unexpected topics such as private sector and economic growth. Although the complexity of the methods required to produce meaningful statistics from the data is disadvantageous to easy interpretation, it was still possible to produce data that could be subject to a reasonably intuitive interpretation. Hence Web issue analysis is claimed to be a useful new technique for information science.
    • Web log file analysis: backlinks and queries

      Thelwall, Mike (MCB UP Ltd, 2001)
      As has been described elsewhere, web log files are a useful source of information about visitor site use, navigation behaviour, and, to some extent, demographics. But log files can also reveal the existence of both web pages and search engine queries that are sources of new visitors. This study extracts such information from a single web log file and uses it to illustrate its value, not only to the site owner but also to those interested in investigating the online behaviour of web users.
    • Web users with autism: eye tracking evidence for differences

      Eraslan, Sukru; Yaneva, Victoria; Yesilada, Yeliz; Harper, Simon (Taylor and Francis, 2018-12-11)
      Anecdotal evidence suggests that people with autism may have different processing strategies when accessing the web. However, limited empirical evidence is available to support this. This paper presents an eye tracking study with 18 participants with high-functioning autism and 18 neurotypical participants to investigate the similarities and differences between these two groups in terms of how they search for information within web pages. According to our analysis, people with autism are likely to be less successful in completing their searching tasks. They also have a tendency to look at more elements on web pages and make more transitions between the elements in comparison to neurotypical people. In addition, they tend to make shorter but more frequent fixations on elements which are not directly related to a given search task. Therefore, this paper presents the first empirical study to investigate how people with autism differ from neurotypical people when they search for information within web pages based on an in-depth statistical analysis of their gaze patterns.
    • What jihad questions do Muslims ask?

      Mohamed, Emad; Abdalla, Bakinaz (Indiana University Press, 2017-05-01)
      Using digital humanities tools and methods, we extract, classify, and analyze 1,006 jihad fatwas from a corpus of 164,000 online fatwas. We use the questions and page hits to rank clusters of fatwas in order to discover what jihad questions Muslims ask, what jihad issues interest Muslims the most, and what the targets of jihad may be. We focus more on the questions than the answers, since it is the questions that give us a window into what may be called the “Muslim collective mind.” The results show that jihad questions are interwoven with several key topics, from performance of prayers to expiation for homosexuality. While the Prophet Muhammad's military expeditions were the most asked about and most viewed category, since they provide a model of what jihad is, the second most important category was concubinage. When there was a specific target, Jews were found in 73% of the questions.
    • What matters more: the size of the corpora or their quality? The case of automatic translation of multiword expressions using comparable corpora.

      Mitkov, Ruslan; Taslimipoor, Shiva (John Benjamins, 2016)
      This study investigates (and compares) the impact of the size and the similarity/quality of comparable corpora on the specific task of extracting translation equivalents of verb-noun collocations from such corpora. The comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general and perennial question: what matters more – the quantity or quality of corpora?
    • When are readership counts as useful as citation counts? Scopus versus Mendeley for LIS journals

      Thelwall, Mike; Maflahi, Nabeil (Wiley-Blackwell, 2014-06)
      In theory, articles can attract readers on the social reference sharing site Mendeley before they can attract citations, so Mendeley altmetrics could provide early indications of article impact. This article investigates the influence of time on the number of Mendeley readers of an article through a theoretical discussion and an investigation into the relationship between counts of readers of, and citations to, 4 general library and information science (LIS) journals. For this discipline, it takes about 7 years for articles to attract as many Scopus citations as Mendeley readers, and after this the Spearman correlation between readers and citers is stable at about 0.6 for all years. This suggests that Mendeley readership counts may be useful impact indicators for both newer and older articles. The lack of dates for individual Mendeley article readers and an unknown bias toward more recent articles mean that readership data should be normalized individually by year, however, before making any comparisons between articles published in different years.
    • Which academic subjects have most online impact? A pilot study and a new classification process

      Thelwall, Mike; Vaughan, Liwen; Cothey, Viv; Li, Xuemei; Smith, Alastair G. (MCB UP Ltd, 2003)
      The use of the Web by academic researchers is discipline-dependent and highly variable. It is increasingly central for sharing information, disseminating results and publicising research projects. This pilot study seeks to identify the subjects that have the most impact on the Web, and look for national differences in online subject visibility. The highest impact sites were from computing, but there were major national differences in the impact of engineering and technology sites. Another difference was that Taiwan had more high impact non-academic sites hosted by universities. As a pilot study, the classification process itself was also investigated and the problems of applying subject classification to academic Web sites discussed. The study draws out a number of issues in this regard that have no simple solutions and points to the need to interpret the results with caution.
    • Which US and European Higher Education Institutions are visible in ResearchGate and what affects their RG Score?

      Lepori, Benedetto; Thelwall, Michael; Hoorani, Bareerah Hafeez (Elsevier, 2018-07-19)
      While ResearchGate has become the most popular academic social networking site in terms of regular users, not all institutions have joined and the scores it assigns to academics and institutions are controversial. This paper assesses the presence in ResearchGate of higher education institutions in Europe and the US in 2017, and the extent to which institutional ResearchGate Scores reflect institutional academic impact. Most of the 2258 European and 4355 US higher educational institutions included in the sample had an institutional ResearchGate profile, with near universal coverage for PhD-awarding institutions found in the Web of Science (WoS). For non-PhD awarding institutions that did not publish, size (number of staff members) was most associated with presence in ResearchGate. For PhD-awarding institutions in WoS, presence in RG was strongly related to the number of WoS publications. In conclusion, a) institutional RG scores reflect research volume more than visibility and b) this indicator is highly correlated to the number of WoS publications. Hence, the value of RG Scores for institutional comparisons is limited.
    • Why do papers have many Mendeley readers but few Scopus-indexed citations and vice versa?

      Thelwall, Mike (Sage, 2015-07-14)
      Counts of citations to academic articles are widely used as indicators of their scholarly impact. In addition, alternative indicators derived from social websites have been proposed to cover some of the shortcomings of citation counts. The most promising such indicator is counts of readers of an article in the social reference sharing site Mendeley. Although Mendeley reader counts tend to correlate strongly and positively with citation counts within scientific fields, an understanding of causes of citation-reader anomalies is needed before Mendeley reader counts can be used with confidence as indicators. In response, this article proposes a list of reasons for anomalies based upon an analysis of articles that are highly cited but have few Mendeley readers, or vice versa. The results show that there are both technical and legitimate reasons for differences, with the latter including communities that use research but do not cite it in Scopus-indexed publications or do not use Mendeley. The results also suggest that the lower of the two values (citation counts, reader counts) tends to underestimate the impact of an article and so taking the maximum is a reasonable strategy for a combined impact indicator.
    • WLV at SemEval-2018 task 3: Dissecting tweets in search of irony

      Rohanian, Omid; Taslimipoor, Shiva; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06-05)
      This paper describes the systems submitted to SemEval 2018 Task 3 “Irony detection in English tweets” for both subtasks A and B. The first system, leveraging a combination of sentiment, distributional semantic, and text surface features, is ranked third among 44 teams according to the official leaderboard of subtask A. The second system, with a slightly different representation of the features, ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast which leads to a vast coverage in detecting ironic content.
    • Wolves at SemEval-2018 task 10: Semantic discrimination based on knowledge and association

      Taslimipoor, Shiva; Rohanian, Omid; Ha, Le An; Corpas Pastor, Gloria; Mitkov, Ruslan (Association for Computational Linguistics, 2018-06)
      This paper describes the system submitted to SemEval 2018 shared task 10 ‘Capturing Discriminative Attributes’. We use a combination of knowledge-based and co-occurrence features to capture the semantic difference between two words in relation to an attribute. We define scores based on association measures, n-gram counts, word similarity, and ConceptNet relations. The system is ranked 4th (joint) on the official leaderboard of the task.
    • YouTube Science Channel Video Presenters and Comments: Female Friendly or Vestiges of Sexism?

      Mas-Bleda, Amalia; Thelwall, Mike (Emerald, 2018-01-15)
      Purpose: This paper analyses popular YouTube science video channels for evidence of attractiveness to a female audience. Design/methodology/approach: The influence of presenter gender and commenter sentiment towards males and females is investigated for 50 YouTube science channels with a combined view-count approaching ten billion. This is cross-referenced with commenter gender as a proxy for audience gender. Findings: The ratio of male to female commenters varies between 1 and 39 to 1, but the low proportions of females seem to be due to the topic or presentation style rather than the gender of the presenter or the attitudes of the commenters. Although male commenters were more hostile to other males than to females, a few posted inappropriate sexual references that may alienate females. Research limitations: Comments reflect a tiny and biased sample of YouTube science channel viewers and so their analysis provides weak evidence. Practical implications: Sexist behaviour in YouTube commenting needs to be combatted but the data suggests that gender balance in online science presenters should not be the primary concern of channel owners. Originality/value: This is the largest scale analysis of gender in YouTube science communication.