• Can museums find male or female audiences online with YouTube?

      Thelwall, Michael (Emerald, 2018-08-31)
      Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were used more frequently by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the differences could be explained by gendered interests in museum themes (e.g., military, art), but others were due to the topics chosen for online content and could address a gender-minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
    • Can Social News Websites Pay for Content and Curation? The SteemIt Cryptocurrency Model

      Thelwall, Mike (Sage, 2017-12-15)
      SteemIt is a Reddit-like social news site that pays members for posting and curating content. It uses micropayments backed by a tradeable currency, exploiting the Bitcoin cryptocurrency generation model to finance content provision in conjunction with advertising. If successful, this paradigm might change the way in which volunteer-based sites operate. This paper investigates 925,092 new members’ first posts for insights into what drives financial success on the site. Initial blog posts on average received $0.01, although the maximum accrued was $20,680.83. Longer, more sentiment-rich or more positive comments with personal information received the greatest financial reward, in contrast to more informational or topical content. Thus, there is a clear financial value in starting with a friendly introduction rather than immediately attempting to provide useful content, despite the latter being the ultimate site goal. Follow-up posts also tended to be more successful when more personal, suggesting that interpersonal communication rather than quality content provision has driven the site so far. It remains to be seen whether the model of small typical rewards, combined with the possibility that a post might generate substantially more, is enough to incentivise long-term participation or an eventual greater focus on informational posts.
    • Can the Web give useful information about commercial uses of scientific research?

      Thelwall, Mike (Emerald Group Publishing Limited, 2004)
      Invocations of pure and applied science journals in the Web were analysed, focussing on commercial sites, in order to assess whether the Web can yield useful information about university-industry knowledge transfer. On a macro level, evidence was found that applied research was more highly invoked on the non-academic Web than pure research, but only in one of the two fields studied. On a micro level, instances of clear evidence of the transfer of academic knowledge to a commercial setting were sparse. Science research on the Web seems to be invoked mainly for marketing purposes, although high technology companies can invoke published academic research as an organic part of a strategy to prove product effectiveness. It is conjectured that invoking academic research in business Web pages is rarely of clear commercial benefit to a company and that, except in unusual circumstances, benefits from research will be kept hidden to avoid giving intelligence to competitors.
    • Citation count distributions for large monodisciplinary journals

      Thelwall, Mike (Elsevier, 2016-07-25)
      Many different citation-based indicators are used by researchers and research evaluators to help evaluate the impact of scholarly outputs. Although the appropriateness of individual citation indicators depends in part on the statistical properties of citation counts, there is no universally agreed best-fitting statistical distribution against which to check them. The two current leading candidates are the discretised lognormal and the hooked or shifted power law. These have been mainly tested on sets of articles from a single field and year but these collections can include multiple specialisms that might dilute their properties. This article fits statistical distributions to 50 large subject-specific journals in the belief that individual journals can be purer than subject categories and may therefore give clearer findings. The results show that in most cases the discretised lognormal fits significantly better than the hooked power law, reversing previous findings for entire subcategories. This suggests that the discretised lognormal is the more appropriate distribution for modelling pure citation data. Thus, future analytical investigations of the properties of citation indicators can use the lognormal distribution to analyse their basic properties. This article also includes improved software for fitting the hooked power law.
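
      A minimal sketch (illustrative only; not the article's software, which uses maximum-likelihood fits of the properly discretised distributions) of approximating a lognormal fit to one journal's citation counts in Python:

        # Approximate a discretised lognormal fit by modelling ln(count + 1).
        import numpy as np
        from scipy import stats

        def fit_lognormal(citations):
            """Return (mu, sigma) of a normal fit to ln(count + 1)."""
            logs = np.log(np.asarray(citations) + 1)
            return logs.mean(), logs.std(ddof=1)

        def goodness_of_fit(citations):
            """Rough Kolmogorov-Smirnov check of the fitted distribution
            (the article compares candidate models by likelihood instead)."""
            logs = np.log(np.asarray(citations) + 1)
            mu, sigma = fit_lognormal(citations)
            return stats.kstest(logs, "norm", args=(mu, sigma))

        sample = np.random.lognormal(mean=1.2, sigma=1.0, size=1000).astype(int)
        print(fit_lognormal(sample))
        print(goodness_of_fit(sample))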
    • Classifying referential and non-referential it using gaze

      Yaneva, Victoria; Ha, Le An; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics (ACL), 2018-10-31)
      When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper, we use eye-tracking data to learn how humans perform this disambiguation. We use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguistic-based approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
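
      A minimal sketch of the general idea, with entirely hypothetical gaze and POS features rather than the authors' feature set:

        # Toy illustration: combining gaze measures with a simple POS context
        # feature to classify occurrences of "it" into three categories.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        # Each row: [total_fixation_ms, first_pass_ms, regressions, next_word_is_verb]
        X = np.array([
            [210.0, 180.0, 0, 1],   # nominal anaphoric (label 0)
            [420.0, 250.0, 2, 0],   # clause anaphoric (label 1)
            [150.0, 140.0, 0, 1],   # non-referential, e.g. "it rains" (label 2)
        ])
        y = np.array([0, 1, 2])

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)   # with real data, report cross-validated accuracy instead
        print(clf.predict([[300.0, 200.0, 1, 0]]))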
    • Co-saved, co-tweeted, and co-cited networks

      Didegah, Fereshteh; Thelwall, Mike (Wiley-Blackwell, 2018-05-14)
      Counts of tweets and Mendeley user libraries have been proposed as altmetric alternatives to citation counts for the impact assessment of articles. Although both have been investigated to discover whether they correlate with article citations, it is not known whether users tend to tweet or save (in Mendeley) the same kinds of articles that they cite. In response, this article compares pairs of articles that are tweeted, saved to a Mendeley library, or cited by the same user, but possibly a different user for each source. The study analyzes 1,131,318 articles published in 2012, with minimum thresholds for tweeting (10), saving to Mendeley (100), and citing (10). The results show surprisingly minor overall overlaps between the three phenomena. The importance of journals for Twitter and the presence of many bots at different levels of activity suggest that this site has little value for impact altmetrics. The moderate differences between patterns of saving and citation suggest that Mendeley can be used for some types of impact assessments, but sensitivity to the underlying differences is needed.
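
      A minimal sketch (illustrative only, not the study's code) of one way to quantify the overlap between such sets of article pairs, using the Jaccard coefficient:

        # Jaccard overlap between sets of unordered article-ID pairs.
        def jaccard(pairs_a, pairs_b):
            a = {frozenset(p) for p in pairs_a}
            b = {frozenset(p) for p in pairs_b}
            return len(a & b) / len(a | b) if a | b else 0.0

        co_tweeted = [("A1", "A2"), ("A2", "A3")]
        co_saved   = [("A1", "A2"), ("A3", "A4")]
        co_cited   = [("A2", "A3"), ("A3", "A4")]

        print(jaccard(co_tweeted, co_saved))   # 0.33...
        print(jaccard(co_saved, co_cited))     # 0.33...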
    • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

      Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
      Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.
    • Commercial Web site links.

      Thelwall, Mike (MCB UP Ltd, 2001)
      Every hyperlink pointing at a Web site is a potential source of new visitors, especially one near the top of a results page from a popular search engine. The order of the links in a search results page is often decided upon by an algorithm that takes into account the number and quality of links to all matching pages. The number of standard links targeted at a site is therefore doubly important, yet little research has touched on the actual interlinkage between business Web sites, which numerically dominate the Web. Discusses business use of the Web and related search engine design issues as well as research on general and academic links before reporting on a survey of the links published by a relatively random collection of business Web sites. The results indicate that around 66 percent of Web sites do carry external links, most of which are targeted at a specific purpose, but that about 17 percent publish general links, with implications for those designing and marketing Web sites.
    • Commercial Web sites: lost in cyberspace?

      Thelwall, Mike (MCB UP Ltd, 2000)
      How easy are business Web sites for potential customers to find? This paper reports on a survey of 60,087 Web sites from 42 of the major general and commercial domains around the world to extract statistics about their design and rate of search engine registration. Search engines are used by the majority of Web surfers to find information on the Web. However, 23 per cent of business Web sites in the survey were not registered at all in the five major search engines tested and 82 per cent were not registered in at least one, missing a sizeable potential audience. There are some simple steps that should also be taken to help a Web site to be indexed properly in search engines, primarily the use of HTML META tags for indexing, but only about a third of the site home pages in the survey used them. Wide national variations were found for both indexing and META tag inclusion.
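
      A minimal sketch of a check along these lines (an assumed approach, not the survey's software): fetch a home page and report whether it declares description or keywords META tags:

        from html.parser import HTMLParser
        from urllib.request import urlopen

        class MetaTagChecker(HTMLParser):
            """Collects the indexing-related META tags found in a page."""
            def __init__(self):
                super().__init__()
                self.found = set()

            def handle_starttag(self, tag, attrs):
                if tag == "meta":
                    name = (dict(attrs).get("name") or "").lower()
                    if name in ("description", "keywords"):
                        self.found.add(name)

        def check_meta_tags(url):
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
            checker = MetaTagChecker()
            checker.feed(html)
            return checker.found

        # Example: print(check_meta_tags("https://www.example.com/"))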
    • Communication-based influence components model

      Cugelman, Brian; Thelwall, Mike; Dawes, Philip L. (New York: ACM, 2009)
      This paper discusses problems faced by planners of real-world online behavioural change interventions who must select behavioural change frameworks from a variety of competing theories and taxonomies. As a solution, this paper examines approaches that isolate the components of behavioural influence and shows how these components can be placed within an adapted communication framework to aid the design and analysis of online behavioural change interventions. Finally, using this framework, a summary of behavioural change factors is presented from an analysis of 32 online interventions.
    • Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

      Costa, Hernani; Muñoz, Isabel Dúran; Pastor, Gloria Corpas; Mitkov, Ruslan (Universidade de Vigo & Universidade do Minho, 2016-07-22)
      Decisions taken prior to the compilation of a comparable corpus have a major impact on the way it is subsequently built and analysed. Several external variables and criteria are normally followed when building a corpus, but little has been investigated about its internal textual similarity distribution or its qualitative advantages for research. In an attempt to fill this gap, this article presents a simple yet efficient methodology capable of measuring the internal degree of similarity of a corpus. To this end, the proposed methodology uses several natural language processing techniques and various statistical methods in a successful attempt to assess the degree of similarity between documents. Our results show that the use of a list of common entities together with a set of distributional similarity measures is sufficient not only to describe and assess the degree of similarity between the documents of a comparable corpus, but also to rank them according to their degree of similarity and, consequently, to improve the quality of the corpus by eliminating irrelevant documents.
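
      A minimal sketch (illustrative only; not the authors' exact pipeline) of ranking the documents of a corpus by their average cosine similarity to the rest of the corpus, so that the least similar documents can be flagged for removal:

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def rank_by_similarity(documents):
            """Return (index, mean similarity to the other documents), best first."""
            tfidf = TfidfVectorizer().fit_transform(documents)
            sims = cosine_similarity(tfidf)
            np.fill_diagonal(sims, 0.0)                    # ignore self-similarity
            avg = sims.sum(axis=1) / (len(documents) - 1)
            return sorted(enumerate(avg), key=lambda x: x[1], reverse=True)

        corpus = [
            "terminology extraction from specialised comparable corpora",
            "building comparable corpora for terminology work",
            "a recipe for lemon cake",                      # likely irrelevant
        ]
        for idx, score in rank_by_similarity(corpus):
            print(idx, round(float(score), 3))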
    • Computational Phraseology light: automatic translation of multiword expressions without translation resources

      Mitkov, Ruslan (De Gruyter Mouton, 2016-11-07)
      This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent, with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted. Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for a) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise, which stipulates that translation equivalents share common words in their contexts, and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them, as long as the comparable corpora used are of minimal similarity.
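
      A minimal sketch of one of the four association measures compared in the paper, Log Dice, applied to hypothetical verb-noun frequencies (the figures are invented for illustration):

        import math

        def log_dice(f_pair, f_verb, f_noun):
            """logDice = 14 + log2(2 * f(v,n) / (f(v) + f(n)))."""
            return 14 + math.log2(2 * f_pair / (f_verb + f_noun))

        # (pair frequency, verb frequency, noun frequency) - hypothetical counts
        candidates = {
            ("take", "advantage"): (850, 120000, 4000),
            ("make", "sense"):     (1400, 150000, 9000),
            ("take", "table"):     (3, 120000, 2500),
        }
        for (verb, noun), freqs in candidates.items():
            print(verb, noun, round(log_dice(*freqs), 2))
        # Pairs scoring above a chosen threshold are kept as MWE candidates.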
    • Computer-based assessment in numeracy and data analysis

      Binns, Ray; Thelwall, Mike (University of Wolverhampton, 2001)
    • Computing Happiness from Textual Data

      Mohamed, Emad; Mostafa, Safa (MDPI, 2019-07-03)
      In this paper, we use a corpus of about 100,000 happy moments written by people of different genders, marital statuses, parenthood statuses, and ages to explore the following questions: Are there differences between men and women, married and unmarried individuals, parents and non-parents, and people of different age groups in terms of their causes of happiness and how they express happiness? Can gender, marital status, parenthood status and/or age be predicted from textual data expressing happiness? The first question is tackled in two steps: first, we transform the happy moments into a set of topics, lemmas, part of speech sequences, and dependency relations; then, we use each set as predictors in multi-variable binary and multinomial logistic regressions to rank these predictors in terms of their influence on each outcome variable (gender, marital status, parenthood status and age). For the prediction task, we use character, lexical, grammatical, semantic, and syntactic features in a machine learning document classification approach. The classification algorithms used include logistic regression, gradient boosting, and fastText. Our results show that textual data expressing moments of happiness can be quite beneficial in understanding the “causes of happiness” for different social groups, and that social characteristics like gender, marital status, parenthood status, and, to some extent, age can be successfully predicted from such textual data. This research aims to bring together elements from philosophy and psychology to be examined by computational corpus linguistics methods in a way that promotes the use of Natural Language Processing for the Humanities.
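
      A minimal sketch of a bag-of-words baseline for this kind of author-attribute prediction (toy texts and invented labels only; the paper's models also use character, grammatical, semantic and syntactic features):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        texts = [
            "I spent the afternoon baking with my daughter",
            "Finally finished rebuilding the engine on my old car",
            "My son took his first steps today",
            "Won the five-a-side match with my mates",
        ]
        labels = ["f", "m", "f", "m"]   # invented labels, illustration only

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        print(model.predict(["Had a lovely day at the park with the kids"]))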
    • Confidence intervals for normalised citation counts: Can they delimit underlying research capability?

      Thelwall, Mike (Elsevier, 2017-10-24)
      Normalised citation counts are routinely used to assess the average impact of research groups or nations. There is controversy over whether confidence intervals for them are theoretically valid or practically useful. In response, this article introduces the concept of a group’s underlying research capability to produce impactful research. It then investigates whether confidence intervals could delimit the underlying capability of a group in practice. From 123,120 confidence interval comparisons for the average citation impact of the national outputs of ten countries within 36 individual large monodisciplinary journals, moderately fewer than 95% of subsequent indicator values fall within 95% confidence intervals from prior years, with the percentage declining over time. This is consistent with confidence intervals effectively delimiting the research capability of a group, although it does not prove that this is the cause of the results. The results are unaffected by whether internationally collaborative articles are included.
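
      A minimal sketch of a confidence interval for a group's mean normalised citation score, computed here by bootstrapping for simplicity (the article's intervals are derived differently):

        import numpy as np

        def bootstrap_ci(scores, n_boot=10000, level=0.95, seed=0):
            """Percentile bootstrap interval for the mean of the scores."""
            rng = np.random.default_rng(seed)
            scores = np.asarray(scores, dtype=float)
            means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                              for _ in range(n_boot)])
            lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
            return scores.mean(), lo, hi

        # Hypothetical normalised scores (1.0 = world average) for one group.
        group = [0.4, 0.9, 1.3, 0.0, 2.8, 1.1, 0.7, 5.2, 0.3, 1.6]
        print(bootstrap_ci(group))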
    • Corpora for Computational Linguistics

      Orăsan, Constantin; Ha, Le An; Evans, Richard; Hasler, Laura; Mitkov, Ruslan (Universidade Federal de Santa Catarina, 2007-01-01)
      Since the mid-1990s, corpora have become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation and term extraction. Their influence on other fields is also briefly discussed.
    • Corpus, Tecnología y Traducción

      Corpas Pastor, Gloria; Casas, M; García Antuña, M (Servicio de Publicaciones de la Universidad de Cádiz, 2012-04-25)
      It is no coincidence that Corpus Linguistics flourished especially in the European context. Research into language technologies (or "language industries") has been the hallmark of European science policy, which has fostered such research as a way of safeguarding Europe's cultural diversity and multilingualism while, at the same time, overcoming the barriers and difficulties these entail for achieving the objectives common to all Europeans. Multilingualism, multiculturalism, translation and technology are inherent features of today's European society. It could also be said that these defining characteristics have contributed decisively to the development of language resources and applications designed to support European social policies and their ramifications in commerce, education and research. Although language technologies and corpora found their way into both the theoretical and the applied branches of Linguistics from a very early stage, it took several decades for translators and interpreters to finally climb aboard this bandwagon, already crowded with researchers from related disciplines. In this paper we offer a brief excursus on what the incorporation of such resources and tools has meant for the field of translation and interpreting, with particular reference to the technologies specific to the sector. (For an overview of European science policy on language technologies, see Corpas Pastor 2008.)
    • Could scientists use Altmetric.com scores to predict longer term citation counts?

      Thelwall, Mike; Nevill, Tamara (Elsevier, 2018-01-30)
      Altmetrics from Altmetric.com are widely used by publishers and researchers to give earlier evidence of attention than citation counts. This article assesses whether Altmetric.com scores are reliable early indicators of likely future impact and whether they may also reflect non-scholarly impacts. A preliminary factor analysis suggests that the main altmetric indicator of scholarly impact is Mendeley reader counts, with weaker news, informational and social network discussion/promotion dimensions in some fields. Based on a regression analysis of Altmetric.com data from November 2015 and Scopus citation counts from October 2017 for articles in 30 narrow fields, only Mendeley reader counts are consistent predictors of future citation impact. Most other Altmetric.com scores can help predict future impact in some fields. Overall, the results confirm that early Altmetric.com scores can predict later citation counts, although less well than journal impact factors, and the optimal strategy is to consider both Altmetric.com scores and journal impact factors. Altmetric.com scores can also reflect dimensions of non-scholarly impact in some fields.
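
      A minimal sketch (toy numbers, not the study's data) of the kind of relationship examined: regressing log-transformed citation counts on log-transformed Mendeley reader counts:

        import numpy as np

        readers   = np.array([0, 3, 10, 25, 60, 120, 4, 15, 80, 200], dtype=float)
        citations = np.array([0, 1, 4, 9, 20, 35, 2, 6, 28, 70], dtype=float)

        x, y = np.log1p(readers), np.log1p(citations)
        slope, intercept = np.polyfit(x, y, 1)           # simple log-log regression
        r = np.corrcoef(x, y)[0, 1]
        print(f"slope={slope:.2f} intercept={intercept:.2f} correlation={r:.2f}")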
    • Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets

      Agić, Željko; Tiedemann, Jörg; Merkler, Danijela; Krek, Simon; Dobrovoljc, Kaja; Moze, Sara (Association for Computational Linguistics, 2014)
      This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.
    • Custom interfaces for advanced queries in search engines

      Thelwall, Mike; Binns, Ray; Harries, Gareth; Page-Kennedy, Theresa; Price, Liz; Wilkinson, David (MCB UP Ltd, 2001)
      Those seeking information from the Internet often start from a search engine, using either its organised directory structure or its text query facility. In response to the difficulty in identifying the most relevant pages for some information needs, many search engines offer Boolean text matching and some, including Google, AltaVista and HotBot, offer the facility to integrate additional information into a more advanced request. Amongst web users, however, it is known that the employment of complex enquiries is far from universal, with very short queries being the norm. It is demonstrated that the gap between the provision of advanced search facilities and their use can be bridged, for specific information needs, by the construction of a simple interface in the form of a website that automatically formulates the necessary requests. It is argued that this kind of resource, perhaps employing additional knowledge domain specific information, is one that could be useful for websites or portals of common interest groups. The approach is illustrated by a website that enables a user to search the individual websites of university level institutions in European Union associated countries.