• Can Amazon.com reviews help to assess the wider impacts of books?

      Kousha, Kayvan; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY United Kingdom (2016-03)
    • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

      Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
      Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
    • Can Google's PageRank be used to find the most important academic Web pages?

      Thelwall, Mike (MCB UP Ltd, 2003)
      Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared with simple inlink counts. It was discovered that the most highly inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with powerful text matching techniques in order to get the quality of information retrieval results provided by Google.
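The abstract above contrasts PageRank with simple inlink counts. A minimal power-iteration sketch illustrates how both metrics can be computed and compared; the link graph and page names here are invented for illustration, not taken from the paper's data:

```python
# Toy PageRank via power iteration, compared against raw inlink counts.
# Hypothetical intra-site link graph (page -> pages it links to).
links = {
    "home": ["staff", "research"],
    "staff": ["home"],
    "research": ["home", "staff"],
    "orphan": ["home"],
}
pages = list(links)
d = 0.85  # damping factor from the original PageRank model
pr = {p: 1 / len(pages) for p in pages}

for _ in range(50):  # iterate until (approximately) converged
    new = {}
    for p in pages:
        # Each page q shares its current score equally among its outlinks.
        inflow = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * inflow
    pr = new

inlinks = {p: sum(p in links[q] for q in pages) for p in pages}
ranking_by_pr = sorted(pages, key=pr.get, reverse=True)
ranking_by_inlinks = sorted(pages, key=inlinks.get, reverse=True)
```

On this toy graph the two rankings happen to agree at the top; the paper's point is that on real site data they can diverge, and that neither ordering alone reflects page importance well.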
    • Can Microsoft Academic assess the early citation impact of in-press articles? A multi-discipline exploratory analysis

      Kousha, Kayvan; Abdoli, Mahshid; Thelwall, Mike (Elsevier, 2018-02-03)
      Many journals post accepted articles online before they are formally published in an issue. Early citation impact evidence for these articles could be helpful for timely research evaluation and to identify potentially important articles that quickly attract many citations. This article investigates whether Microsoft Academic can help with this task. For over 65,000 Scopus in-press articles from 2016 and 2017 across 26 fields, Microsoft Academic found 2-5 times as many citations as Scopus, depending on year and field. From manual checks of 1,122 Microsoft Academic citations not found in Scopus, Microsoft Academic’s citation indexing was faster but not much wider than Scopus for journals. It achieved this by associating citations to preprints with their subsequent in-press versions and by extracting citations from in-press articles. In some fields its coverage of scholarly digital libraries, such as arXiv.org, was also an advantage. Thus, Microsoft Academic seems to be a more comprehensive automatic source of citation counts for in-press articles than Scopus.
    • Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network

      Thelwall, Mike (Springer, 2018-03-07)
      Preprint archives play an important scholarly communication role within some fields. The impact of archives and individual preprints are difficult to analyse because online repositories are not indexed by the Web of Science or Scopus. In response, this article assesses whether the new Microsoft Academic can be used for citation analysis of preprint archives, focusing on the Social Science Research Network (SSRN). Although Microsoft Academic seems to index SSRN comprehensively, it groups a small fraction of SSRN papers into an easily retrievable set that has variations in character over time, making any field normalisation or citation comparisons untrustworthy. A brief parallel analysis of arXiv suggests that similar results would occur for other online repositories. Systematic analyses of preprint archives are nevertheless possible with Microsoft Academic when complete lists of archive publications are available from other sources because of its promising coverage and citation results.
    • Can museums find male or female audiences online with YouTube?

      Thelwall, Michael (Emerald Publishing Limited, 2018-08-31)
      Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
    • Can Social News Websites Pay for Content and Curation? The SteemIt Cryptocurrency Model

      Thelwall, Mike (SAGE Publishing, 2017-12-15)
      SteemIt is a Reddit-like social news site that pays members for posting and curating content. It uses micropayments backed by a tradeable currency, exploiting the Bitcoin cryptocurrency generation model to finance content provision in conjunction with advertising. If successful, this paradigm might change the way in which volunteer-based sites operate. This paper investigates 925,092 new members’ first posts for insights into what drives financial success in the site. Initial blog posts on average received $0.01, although the maximum accrued was $20,680.83. Longer, more sentiment-rich or more positive comments with personal information received the greatest financial reward in contrast to more informational or topical content. Thus, there is a clear financial value in starting with a friendly introduction rather than immediately attempting to provide useful content, despite the latter being the ultimate site goal. Follow-up posts also tended to be more successful when more personal, suggesting that interpersonal communication rather than quality content provision has driven the site so far. It remains to be seen whether the model of small typical rewards, combined with the possibility that a post might earn substantially more, is enough to incentivise long-term participation or an eventual shift towards informational posts.
    • Can the Web give useful information about commercial uses of scientific research?

      Thelwall, Mike (Emerald Group Publishing Limited, 2004)
      Invocations of pure and applied science journals in the Web were analysed, focussing on commercial sites, in order to assess whether the Web can yield useful information about university-industry knowledge transfer. On a macro level, evidence was found that applied research was more highly invoked on the non-academic Web than pure research, but only in one of the two fields studied. On a micro level, instances of clear evidence of the transfer of academic knowledge to a commercial setting were sparse. Science research on the Web seems to be invoked mainly for marketing purposes, although high technology companies can invoke published academic research as an organic part of a strategy to prove product effectiveness. It is conjectured that invoking academic research in business Web pages is rarely of clear commercial benefit to a company and that, except in unusual circumstances, benefits from research will be kept hidden to avoid giving intelligence to competitors.
    • Characters or morphemes: how to represent words?

      Üstün, Ahmet; Kurfalı, Murathan; Can, Burcu (Association for Computational Linguistics, 2018)
      In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units affects the quality of word representations positively. We introduce a morpheme-based model and compare it against word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism. We performed experiments on Turkish as a morphologically rich language and English with a comparably poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages compared to character-based and character n-gram level models, since the morphemes help to incorporate more syntactic knowledge in learning, which makes morpheme-based models better at syntactic tasks.
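The attention mechanism described above weights a word's candidate segmentations when composing its representation. A toy sketch of that idea, with invented vectors and relevance scores (the paper's actual scoring function and dimensions will differ):

```python
import math
import random

random.seed(1)
dim = 4  # toy embedding dimension

# Hypothetical vectors for three candidate segmentations of one word.
segmentation_vecs = [[random.random() for _ in range(dim)] for _ in range(3)]
scores = [0.2, 1.5, 0.3]  # invented relevance score per segmentation

# Softmax attention: each segmentation's weight is its exponentiated
# score normalised over all candidates.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# The word vector is the attention-weighted average of segmentation vectors.
word_vec = [sum(w * v[i] for w, v in zip(weights, segmentation_vecs))
            for i in range(dim)]
```

The segmentation with the highest score dominates the mixture, so a good segmenter steers the word towards its morphologically informed representation.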
    • Citation count distributions for large monodisciplinary journals

      Thelwall, Mike (Elsevier, 2016-07-25)
      Many different citation-based indicators are used by researchers and research evaluators to help evaluate the impact of scholarly outputs. Although the appropriateness of individual citation indicators depends in part on the statistical properties of citation counts, there is no universally agreed best-fitting statistical distribution against which to check them. The two current leading candidates are the discretised lognormal and the hooked or shifted power law. These have been mainly tested on sets of articles from a single field and year but these collections can include multiple specialisms that might dilute their properties. This article fits statistical distributions to 50 large subject-specific journals in the belief that individual journals can be purer than subject categories and may therefore give clearer findings. The results show that in most cases the discretised lognormal fits significantly better than the hooked power law, reversing previous findings for entire subcategories. This suggests that the discretised lognormal is the more appropriate distribution for modelling pure citation data. Thus, future analytical investigations of the properties of citation indicators can use the lognormal distribution to analyse their basic properties. This article also includes improved software for fitting the hooked power law.
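The discretised lognormal favoured above can be illustrated with a crude fit: treat log(citations + 1) as approximately normal and estimate its parameters by maximum likelihood. This is a simplified stand-in for the paper's actual fitting procedure, and the citation counts below are invented:

```python
import math

# Hypothetical citation counts for illustration (not data from the study).
citations = [0, 0, 1, 1, 2, 3, 5, 8, 20, 120]

# Crude discretised-lognormal fit: model log(n + 1) as approximately
# normal; the MLE for a normal is the sample mean and (biased) std dev.
logs = [math.log(n + 1) for n in citations]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
```

A proper analysis would maximise the discretised likelihood directly and compare fit against the hooked power law, as the article does.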
    • Classifying referential and non-referential it using gaze

      Yaneva, Victoria; Ha, Le An; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics (ACL), 2018-10-31)
      When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper, we use eye-tracking data to learn how humans perform this disambiguation. We use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguistic-based approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
    • Co-saved, co-tweeted, and co-cited networks

      Didegah, Fereshteh; Thelwall, Mike; Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government; Aarhus University; Aarhus Denmark; Statistical Cybermetrics Research Group, University of Wolverhampton, Wulfruna Street; Wolverhampton WV1 1LY UK (Wiley-Blackwell, 2018-05-14)
      Counts of tweets and Mendeley user libraries have been proposed as altmetric alternatives to citation counts for the impact assessment of articles. Although both have been investigated to discover whether they correlate with article citations, it is not known whether users tend to tweet or save (in Mendeley) the same kinds of articles that they cite. In response, this article compares pairs of articles that are tweeted, saved to a Mendeley library, or cited by the same user, but possibly a different user for each source. The study analyzes 1,131,318 articles published in 2012, applying minimum thresholds of 10 tweets, 100 Mendeley saves, and 10 citations. The results show surprisingly minor overall overlaps between the three phenomena. The importance of journals for Twitter and the presence of many bots at different levels of activity suggest that this site has little value for impact altmetrics. The moderate differences between patterns of saving and citation suggest that Mendeley can be used for some types of impact assessments, but sensitivity to the underlying differences is needed.
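The comparison above rests on building, for each source, the set of article pairs that co-occur for the same user, then measuring overlap between the sources. A minimal sketch with invented user data (the study's actual overlap statistics are more involved than a single Jaccard index):

```python
from itertools import combinations

def co_pairs(user_to_articles):
    """Unordered pairs of articles that share at least one user."""
    pairs = set()
    for arts in user_to_articles.values():
        pairs.update(frozenset(p) for p in combinations(sorted(arts), 2))
    return pairs

def jaccard(x, y):
    """Overlap between two pair sets as |intersection| / |union|."""
    return len(x & y) / len(x | y) if x | y else 0.0

# Hypothetical toy data: which articles each tweeter / Mendeley user touched.
tweeted = co_pairs({"u1": ["A", "B"], "u2": ["B", "C"]})
saved = co_pairs({"m1": ["A", "B"], "m2": ["C", "D"]})

overlap = jaccard(tweeted, saved)  # shared co-occurrence pairs across sources
```

Here only the pair {A, B} is both co-tweeted and co-saved, giving a low overlap, which mirrors the article's finding of surprisingly minor overlaps at scale.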
    • Collaborative machine translation service for scientific texts

      Lambert, Patrik; Senellart, Jean; Romary, Laurent; Schwenk, Holger; Zipser, Florian; Lopez, Patrice; Blain, Frederic (Association for Computational Linguistics, 2012-04-30)
      French researchers are required to frequently translate into French the description of their work published in English. At the same time, the need for French people to access articles in English, or for international researchers to access theses or papers in French, is poorly served by generic translation tools. We propose the demonstration of an end-to-end tool integrated in the HAL open archive for enabling efficient translation of scientific texts. This tool can give translation suggestions adapted to the scientific domain, improving the BLEU score of a generic system by more than 10 points. It also provides a post-editing service which captures user post-editing data that can be used to incrementally improve the translation engines. Thus it is helpful for users who need to translate or to access scientific texts.
    • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

      Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
      Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.
    • Combining quality estimation and automatic post-editing to enhance machine translation output

      Chatterjee, Rajen; Negri, Matteo; Turchi, Marco; Blain, Frédéric; Specia, Lucia (Association for Machine Translation in the Americas, 2018-03)
    • Commercial Web site links.

      Thelwall, Mike (MCB UP Ltd, 2001)
      Every hyperlink pointing at a Web site is a potential source of new visitors, especially one near the top of a results page from a popular search engine. The order of the links in a search results page is often decided upon by an algorithm that takes into account the number and quality of links to all matching pages. The number of standard links targeted at a site is therefore doubly important, yet little research has touched on the actual interlinkage between business Web sites, which numerically dominate the Web. Discusses business use of the Web and related search engine design issues as well as research on general and academic links before reporting on a survey of the links published by a relatively random collection of business Web sites. The results indicate that around 66 percent of Web sites do carry external links, most of which are targeted at a specific purpose, but that about 17 percent publish general links, with implications for those designing and marketing Web sites.
    • Commercial Web sites: lost in cyberspace?

      Thelwall, Mike (MCB UP Ltd, 2000)
      How easy are business Web sites for potential customers to find? This paper reports on a survey of 60,087 Web sites from 42 of the major general and commercial domains around the world to extract statistics about their design and rate of search engine registration. Search engines are used by the majority of Web surfers to find information on the Web. However, 23 per cent of business Web sites in the survey were not registered at all in the five major search engines tested and 82 per cent were not registered in at least one, missing a sizeable potential audience. There are some simple steps that should also be taken to help a Web site to be indexed properly in search engines, primarily the use of HTML META tags for indexing, but only about a third of the site home pages in the survey used them. Wide national variations were found for both indexing and META tag inclusion.
    • Communication-based influence components model

      Cugelman, Brian; Thelwall, Mike; Dawes, Philip L. (New York: ACM, 2009)
      This paper discusses problems faced by planners of real-world online behavioural change interventions who must select behavioural change frameworks from a variety of competing theories and taxonomies. As a solution, this paper examines approaches that isolate the components of behavioural influence and shows how these components can be placed within an adapted communication framework to aid the design and analysis of online behavioural change interventions. Finally, using this framework, a summary of behavioural change factors is presented from an analysis of 32 online interventions.
    • Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

      Costa, Hernani; Muñoz, Isabel Dúran; Pastor, Gloria Corpas; Mitkov, Ruslan (Universidade de Vigo & Universidade do Minho, 2016-07-22)
      Decisions taken before the compilation of a comparable corpus have a great impact on how it will subsequently be built and analysed. Various external variables and criteria are normally followed in the construction of a corpus, but little has been investigated about its internal distribution of textual similarity or its qualitative advantages for research. In an attempt to fill this gap, this article presents a simple yet efficient methodology capable of measuring the internal degree of similarity of a corpus. To this end, the proposed methodology uses several natural language processing techniques and various statistical methods in a successful attempt to assess the degree of similarity between documents. Our results show that using a list of common entities and a set of distributional similarity measures is sufficient not only to describe and assess the degree of similarity between the documents in a comparable corpus, but also to rank them according to their degree of similarity and, consequently, to improve the quality of the corpus by eliminating irrelevant documents.
    • Computational Phraseology light: automatic translation of multiword expressions without translation resources

      Mitkov, Ruslan (De Gruyter Mouton, 2016-10-27)
      This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent, with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to automatically compile the comparable corpora, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted.
Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise which stipulates that translation equivalents share common words in their contexts and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them as long as the comparable corpora used are of minimal similarity.
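The comparability score used above is a cosine similarity between document representations. A bag-of-words sketch shows the computation and the 0.45 threshold; the texts are invented and the ACCURAT toolkit's actual document features are richer than raw word counts:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical pair of documents (e.g. an English text and a translated
# representation of a Spanish candidate) as bag-of-words vectors.
doc_a = Counter("the bank raised interest rates today".split())
doc_b = Counter("the central bank raised rates".split())

score = cosine(doc_a, doc_b)
comparable = score > 0.45  # threshold reported in the study
```

Document pairs scoring above the threshold would be kept in the comparable corpus; the rest would be discarded before the association-measure and word2vec stages.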