• BDAFRICA: diseño e implementación de una base de datos de la literatura poscolonial africana publicada en España

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Universidad de Valladolid, 2016-01-10)
      Este trabajo demuestra que no existe un repositorio que incluya los autores poscoloniales africanos publicados hasta el momento en España y que permita, por ende, realizar estudios cuantitativos y cualitativos del impacto de esta literatura con la precisión deseable. Esto supone una carencia tanto para investigaciones académicas como para el sector editorial a la hora de analizar tendencias de selección y recepción en el mercado. Ante esta situación, el objetivo primordial de este trabajo es diseñar e implementar una base de datos, basada en MySQL y delimitada por unos parámetros muy concretos, que recoja todas las obras de autores africanos publicadas en castellano en España entre 1972 (año en que España se unió al sistema ISBN) y 2014. Tras determinar unos criterios de diseño y unos protocolos de compilación específcos, el desarrollo metodológico se ha dividido en cuatro fases: recopilación, almacenamiento, tratamiento y difusión de los datos. Así, la base de datos BDÁFRICA consigue un doble objetivo: por un lado, proporciona a los investigadores datos fables en los que basar sus estudios y, por otro, permitiría ofrecer por primera vez datos estadísticos de la evolución de la publicación de obras de autores africanos en España en los últimos 42 años.
    • Bilingual contexts from comparable corpora to mine for translations of collocations

      Taslimipoor, Shiva; Mitkov, Ruslan; Corpas Pastor, Gloria; Fazly, Afsaneh (Springer, 2018-03-21)
      Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
    • Blog Searching: The First General-Purpose Source of Retrospective Public Opinion in the Social Sciences?

      Thelwall, Mike (Emerald, 2007)
      Purpose – To demonstrate how blog searching can be used as a retrospective source of public opinion. Design/methodology/approach - In this paper a variety of blog searching techniques are described and illustrated with a case study of the Danish cartoons affair. Findings - A time series analysis of related blog postings suggests that the Danish cartoons issue attracted little attention in the English-speaking world for four months after the initial publication of the cartoons, exploding only after the simultaneous start of diplomatic sanctions and a commercial boycott. Research limitations/implications – Blogs only reveal the opinions of bloggers, and blog analysis is language-specific. Sections of the world and the population of individual countries that do not have access to the internet will not be adequately represented in blogspace. Moreover, bloggers are self-selected and probably not representative of internet users. Originality/value - The existence of blog search engines now allows researchers to search blogspace for posts relating to any given debate, seeking either the opinions of blogging pundits or casual mentions in personal journals. It is possible to use blogs to examine topics before they first attracted mass media attention, as well as to dissect ongoing discussions. This gives a retrospective source of public opinion that is unique to blog search engines.
    • Book genre and author gender: romance>paranormal-romance to autobiography>memoir

      Thelwall, Mike (Wiley-Blackwell, 2016-05-20)
      Although gender differences are known to exist in the publishing industry and in reader preferences, there is little public systematic evidence about them. This article uses evidence from the book-based social website Goodreads to provide a large scale analysis of 50 major English book genres based on author genders. The results show gender differences in authorship in almost all categories and gender differences the level of interest in, and ratings of, books in a minority of categories. Perhaps surprisingly in this context, there is not a clear gender-based relationship between the success of an author and their prevalence within a genre. The unexpected almost universal authorship gender differences should give new impetus to investigations of the importance of gender in fiction and the success of minority genders in some genres should encourage publishers and librarians to take their work seriously, except perhaps for most male-authored chick-lit.
    • Bridging the gap: attending to discontinuity in identification of multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Kouchaki, Samaneh; Ha, Le An; Mitkov, Ruslan (Association for Computational Linguistics, 2019-06-05)
      We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.
    • Brief Communication: The clustering power of low frequency words in academic Webs

      Price, Liz; Thelwall, Mike (Wiley, 2005)
      The value of low frequency words for subject-based academic Web site clustering is assessed. A new technique is introduced to compare the relative clustering power of different vocabularies. The technique is designed for word frequency tests in large document clustering exercises. Results for the Australian and New Zealand academic Web spaces indicate that low frequency words are useful for clustering academic Web sites along subject lines; removing low frequency words results in sites becoming, on average, less dissimilar to sites from other subjects.
    • Can alternative indicators overcome language biases in citation counts? A comparison of Spanish and UK research

      Mas-Bleda, Amalia; Thelwall, Mike (Springer, 2016-09-09)
      This study compares Spanish and UK research in eight subject fields using a range of bibliometric and social media indicators. For each field, lists of Spanish and UK journal articles published in the year 2012 and their citation counts were extracted from Scopus. The software Webometric Analyst was then used to extract a range of altmetrics for these articles, including patent citations, online presentation mentions, online course syllabus mentions, Wikipedia mentions and Mendeley reader counts and Altmetric.com was used to extract Twitter mentions. Results show that Mendeley is the altmetric source with the highest coverage, with 80% of sampled articles having one or more Mendeley readers, followed by Twitter (34%). The coverage of the remaining sources was lower than 3%. All of the indicators checked either have too little data or increase the overall difference between Spain and the UK and so none can be suggested as alternatives to reduce the bias against Spain in traditional citation indexes.
    • Can Amazon.com reviews help to assess the wider impacts of books?

      Kousha, Kayvan; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY United Kingdom; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY United Kingdom (2016-03)
    • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

      Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
      Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
    • Can Google's PageRank be used to find the most important academic Web pages?

      Thelwall, Mike (MCB UP Ltd, 2003)
      Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared with simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with a powerful text matching techniques in order to get the quality of information retrieval results provided by Google.
    • Can Microsoft Academic assess the early citation impact of in-press articles? A multi-discipline exploratory analysis

      Kousha, Kayvan; Abdoli, Mahshid; Thelwall, Mike (Elsevier, 2018-02-03)
      Many journals post accepted articles online before they are formally published in an issue. Early citation impact evidence for these articles could be helpful for timely research evaluation and to identify potentially important articles that quickly attract many citations. This article investigates whether Microsoft Academic can help with this task. For over 65,000 Scopus in-press articles from 2016 and 2017 across 26 fields, Microsoft Academic found 2-5 times as many citations as Scopus, depending on year and field. From manual checks of 1,122 Microsoft Academic citations not found in Scopus, Microsoft Academic’s citation indexing was faster but not much wider than Scopus for journals. It achieved this by associating citations to preprints with their subsequent in-press versions and by extracting citations from in-press articles. In some fields its coverage of scholarly digital libraries, such as arXiv.org, was also an advantage. Thus, Microsoft Academic seems to be a more comprehensive automatic source of citation counts for in-press articles than Scopus.
    • Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network

      Thelwall, Mike (Springer, 2018-03-07)
      Preprint archives play an important scholarly communication role within some fields. The impact of archives and individual preprints are difficult to analyse because online repositories are not indexed by the Web of Science or Scopus. In response, this article assesses whether the new Microsoft Academic can be used for citation analysis of preprint archives, focusing on the Social Science Research Network (SSRN). Although Microsoft Academic seems to index SSRN comprehensively, it groups a small fraction of SSRN papers into an easily retrievable set that has variations in character over time, making any field normalisation or citation comparisons untrustworthy. A brief parallel analysis of arXiv suggests that similar results would occur for other online repositories. Systematic analyses of preprint archives are nevertheless possible with Microsoft Academic when complete lists of archive publications are available from other sources because of its promising coverage and citation results.
    • Can museums find male or female audiences online with YouTube?

      Thelwall, Michael (Emerald, 2018-08-31)
      Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
    • Can Social News Websites Pay for Content and Curation? The SteemIt Cryptocurrency Model

      Thelwall, Mike (Sage, 2017-12-15)
      SteemIt is a Reddit-like social news site that pays members for posting and curating content. It uses micropayments backed by a tradeable currency, exploiting the Bitcoin cryptocurrency generation model to finance content provision in conjunction with advertising. If successful, this paradigm might change the way in which volunteer-based sites operate. This paper investigates 925,092 new members’ first posts for insights into what drives financial success in the site. Initial blog posts on average received $0.01, although the maximum accrued was $20,680.83. Longer, more sentiment-rich or more positive comments with personal information received the greatest financial reward in contrast to more informational or topical content. Thus, there is a clear financial value in starting with a friendly introduction rather than immediately attempting to provide useful content, despite the latter being the ultimate site goal. Follow-up posts also tended to be more successful when more personal, suggesting that interpersonal communication rather than quality content provision has driven the site so far. It remains to be seen whether the model of small typical rewards and the possibility that a post might generate substantially more are enough to incentivise long term participation or a greater focus on informational posts in the long term.
    • Can the Web give useful information about commercial uses of scientific research?

      Thelwall, Mike (Emerald Group Publishing Limited, 2004)
      Invocations of pure and applied science journals in the Web were analysed, focussing on commercial sites, in order to assess whether the Web can yield useful information about university-industry knowledge transfer. On a macro level, evidence was found that applied research was more highly invoked on the non-academic Web than pure research, but only in one of the two fields studied. On a micro level, instances of clear evidence of the transfer of academic knowledge to a commercial setting were sparse. Science research on the Web seems to be invoked mainly for marketing purposes, although high technology companies can invoke published academic research as an organic part of a strategy to prove product effectiveness. It is conjectured that invoking academic research in business Web pages is rarely of clear commercial benefit to a company and that, except in unusual circumstances, benefits from research will be kept hidden to avoid giving intelligence to competitors.
    • Citation count distributions for large monodisciplinary journals

      Thelwall, Mike (Elsevier, 2016-07-25)
      Many different citation-based indicators are used by researchers and research evaluators to help evaluate the impact of scholarly outputs. Although the appropriateness of individual citation indicators depends in part on the statistical properties of citation counts, there is no universally agreed best-fitting statistical distribution against which to check them. The two current leading candidates are the discretised lognormal and the hooked or shifted power law. These have been mainly tested on sets of articles from a single field and year but these collections can include multiple specialisms that might dilute their properties. This article fits statistical distributions to 50 large subject-specific journals in the belief that individual journals can be purer than subject categories and may therefore give clearer findings. The results show that in most cases the discretised lognormal fits significantly better than the hooked power law, reversing previous findings for entire subcategories. This suggests that the discretised lognormal is the more appropriate distribution for modelling pure citation data. Thus, future analytical investigations of the properties of citation indicators can use the lognormal distribution to analyse their basic properties. This article also includes improved software for fitting the hooked power law.
    • Classifying referential and non-referential it using gaze

      Yaneva, Victoria; Ha, Le An; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics (ACL), 2018-10-31)
      When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper, we use eye-tracking data to learn how humans perform this disambiguation. We use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguisticbased approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
    • Co-saved, co-tweeted, and co-cited networks

      Didegah, Fereshteh; Thelwall, Mike; Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government; Aarhus University; Aarhus Denmark; Statistical Cybermetrics Research Group, University of Wolverhampton, Wulfruna Street; Wolverhampton WV1 1LY UK (Wiley-Blackwell, 2018-05-14)
      Counts of tweets and Mendeley user libraries have been proposed as altmetric alternatives to citation counts for the impact assessment of articles. Although both have been investigated to discover whether they correlate with article citations, it is not known whether users tend to tweet or save (in Mendeley) the same kinds of articles that they cite. In response, this article compares pairs of articles that are tweeted, saved to a Mendeley library, or cited by the same user, but possibly a different user for each source. The study analyzes 1,131,318 articles published in 2012, with minimum tweeted (10), saved to Mendeley (100), and cited (10) thresholds. The results show surprisingly minor overall overlaps between the three phenomena. The importance of journals for Twitter and the presence of many bots at different levels of activity suggest that this site has little value for impact altmetrics. The moderate differences between patterns of saving and citation suggest that Mendeley can be used for some types of impact assessments, but sensitivity is needed for underlying differences.
    • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

      Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
      Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.
    • Commercial Web site links.

      Thelwall, Mike (MCB UP Ltd, 2001)
      Every hyperlink pointing at a Web site is a potential source of new visitors, especially one near the top of a results page from a popular search engine. The order of the links in a search results page is often decided upon by an algorithm that takes into account the number and quality of links to all matching pages. The number of standard links targeted at a site is therefore doubly important, yet little research has touched on the actual interlinkage between business Web sites, which numerically dominate the Web. Discusses business use of the Web and related search engine design issues as well as research on general and academic links before reporting on a survey of the links published by a relatively random collection of business Web sites. The results indicate that around 66 percent of Web sites do carry external links, most of which are targeted at a specific purpose, but that about 17 percent publish general links, with implications for those designing and marketing Web sites.