• Backtranslation feedback improves user confidence in MT, not quality

      Zouhar, Vilém; Novák, Michal; Žilinec, Matúš; Bojar, Ondřej; Obregón, Mateo; Hill, Robin L; Blain, Frédéric; Fomicheva, Marina; Specia, Lucia; Yankovskaya, Lisa; et al. (Association for Computational Linguistics, 2021-06-01)
      Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
    • BDAFRICA: diseño e implementación de una base de datos de la literatura poscolonial africana publicada en España

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Universidad de Valladolid, 2016-01-10)
      This work shows that no repository exists that includes the postcolonial African authors published so far in Spain and that would therefore allow quantitative and qualitative studies of the impact of this literature with the desired precision. This is a shortcoming both for academic research and for the publishing sector when analysing selection and reception trends in the market. Given this situation, the main aim of this work is to design and implement a database, based on MySQL and delimited by very specific parameters, that gathers all works by African authors published in Spanish in Spain between 1972 (the year in which Spain joined the ISBN system) and 2014. After establishing design criteria and specific compilation protocols, the methodological development was divided into four phases: collection, storage, processing and dissemination of the data. Thus, the BDÁFRICA database achieves a twofold objective: on the one hand, it provides researchers with reliable data on which to base their studies and, on the other, it would make it possible to offer, for the first time, statistical data on the evolution of the publication of works by African authors in Spain over the last 42 years.
    • BERGAMOT-LATTE submissions for the WMT20 quality estimation shared task

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Chaudhary, Vishrav; Fishel, Mark; Guzmán, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-11-30)
      This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2 focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.
    • Bilexical embeddings for quality estimation

      Blain, Frédéric; Scarton, Carolina; Specia, Lucia (Association for Computational Linguistics, 2017-09)
    • Bilingual contexts from comparable corpora to mine for translations of collocations

      Taslimipoor, Shiva; Mitkov, Ruslan; Corpas Pastor, Gloria; Fazly, Afsaneh (Springer, 2018-03-21)
      Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
    • Blog Searching: The First General-Purpose Source of Retrospective Public Opinion in the Social Sciences?

      Thelwall, Mike (Emerald, 2007)
      Purpose - To demonstrate how blog searching can be used as a retrospective source of public opinion. Design/methodology/approach - In this paper a variety of blog searching techniques are described and illustrated with a case study of the Danish cartoons affair. Findings - A time series analysis of related blog postings suggests that the Danish cartoons issue attracted little attention in the English-speaking world for four months after the initial publication of the cartoons, exploding only after the simultaneous start of diplomatic sanctions and a commercial boycott. Research limitations/implications - Blogs only reveal the opinions of bloggers, and blog analysis is language-specific. Sections of the world and the population of individual countries that do not have access to the internet will not be adequately represented in blogspace. Moreover, bloggers are self-selected and probably not representative of internet users. Originality/value - The existence of blog search engines now allows researchers to search blogspace for posts relating to any given debate, seeking either the opinions of blogging pundits or casual mentions in personal journals. It is possible to use blogs to examine topics before they first attracted mass media attention, as well as to dissect ongoing discussions. This gives a retrospective source of public opinion that is unique to blog search engines.
    • Book genre and author gender: romance>paranormal-romance to autobiography>memoir

      Thelwall, Mike (Wiley-Blackwell, 2016-12-21)
      Although gender differences are known to exist in the publishing industry and in reader preferences, there is little public systematic evidence about them. This article uses evidence from the book-based social website Goodreads to provide a large scale analysis of 50 major English book genres based on author genders. The results show gender differences in authorship in almost all categories and gender differences in the level of interest in, and ratings of, books in a minority of categories. Perhaps surprisingly in this context, there is not a clear gender-based relationship between the success of an author and their prevalence within a genre. The unexpected almost universal authorship gender differences should give new impetus to investigations of the importance of gender in fiction and the success of minority genders in some genres should encourage publishers and librarians to take their work seriously, except perhaps for most male-authored chick-lit.
    • Bridging the gap: attending to discontinuity in identification of multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Kouchaki, Samaneh; Ha, Le An; Mitkov, Ruslan (Association for Computational Linguistics, 2019-06-05)
      We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.
    • Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

      Hidalgo-Ternero, Carlos Manuel; Pastor, Gloria Corpas (Walter de Gruyter GmbH, 2020-11-25)
      The present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) will be carried out in order to evaluate to what extent gApp can optimise the performance of the two main free open-source NMT systems —Google Translate and DeepL— under the challenge of MWE discontinuity in the Spanish into English directionality. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
    • Brief Communication: The clustering power of low frequency words in academic Webs

      Price, Liz; Thelwall, Mike (Wiley, 2005)
      The value of low frequency words for subject-based academic Web site clustering is assessed. A new technique is introduced to compare the relative clustering power of different vocabularies. The technique is designed for word frequency tests in large document clustering exercises. Results for the Australian and New Zealand academic Web spaces indicate that low frequency words are useful for clustering academic Web sites along subject lines; removing low frequency words results in sites becoming, on average, less dissimilar to sites from other subjects.
    • CAG: stylometric authorship attribution of multi-author documents using a co-authorship graph

      Sarwar, R; Urailertprasert, N; Vannaboot, N; Yu, C; Rakthanmanon, T; Chuangsuwanich, E; Nutanong, S (Institute of Electrical and Electronics Engineers (IEEE), 2020-01-17)
      Stylometry has been successfully applied to perform authorship identification of single-author documents (AISD). The AISD task is concerned with identifying the original author of an anonymous document from a group of candidate authors. However, AISD techniques are not applicable to the authorship identification of multi-author documents (AIMD). Unlike AISD, where each document is written by one single author, AIMD focuses on handling multi-author documents. Due to the combinatoric nature of documents, AIMD lacks the ground truth information - that is, information on writing and non-writing authors in a multi-author document - which makes this problem more challenging to solve. Previous AIMD solutions have a number of limitations: (i) the best stylometry-based AIMD solution has a low accuracy, less than 30%; (ii) increasing the number of co-authors of papers adversely affects the performance of AIMD solutions; and (iii) AIMD solutions were not designed to handle the non-writing authors (NWAs). However, NWAs exist in real-world cases - that is, there are papers for which not every co-author listed has contributed as a writer. This paper proposes an AIMD framework called the Co-Authorship Graph that can be used to (i) capture the stylistic information of each author in a corpus of multi-author documents and (ii) make a multi-label prediction for a multi-author query document. We conducted extensive experimental studies on one synthetic and three real-world corpora. Experimental results show that our proposed framework (i) significantly outperformed competitive techniques; (ii) can effectively handle a larger number of co-authors in comparison with competitive techniques; and (iii) can effectively handle NWAs in multi-author documents.
    • Can alternative indicators overcome language biases in citation counts? A comparison of Spanish and UK research

      Mas-Bleda, Amalia; Thelwall, Mike (Springer, 2016-09-09)
      This study compares Spanish and UK research in eight subject fields using a range of bibliometric and social media indicators. For each field, lists of Spanish and UK journal articles published in the year 2012 and their citation counts were extracted from Scopus. The software Webometric Analyst was then used to extract a range of altmetrics for these articles, including patent citations, online presentation mentions, online course syllabus mentions, Wikipedia mentions and Mendeley reader counts, while Altmetric.com was used to extract Twitter mentions. Results show that Mendeley is the altmetric source with the highest coverage, with 80% of sampled articles having one or more Mendeley readers, followed by Twitter (34%). The coverage of the remaining sources was lower than 3%. All of the indicators checked either have too little data or increase the overall difference between Spain and the UK and so none can be suggested as alternatives to reduce the bias against Spain in traditional citation indexes.
    • Can Amazon.com reviews help to assess the wider impacts of books?

      Kousha, Kayvan; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1LY United Kingdom (2016-03)
    • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

      Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
      Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
    • Can Google's PageRank be used to find the most important academic Web pages?

      Thelwall, Mike (MCB UP Ltd, 2003)
      Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared with simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with powerful text matching techniques in order to get the quality of information retrieval results provided by Google.
    • Can Microsoft Academic assess the early citation impact of in-press articles? A multi-discipline exploratory analysis

      Kousha, Kayvan; Abdoli, Mahshid; Thelwall, Mike (Elsevier, 2018-02-03)
      Many journals post accepted articles online before they are formally published in an issue. Early citation impact evidence for these articles could be helpful for timely research evaluation and to identify potentially important articles that quickly attract many citations. This article investigates whether Microsoft Academic can help with this task. For over 65,000 Scopus in-press articles from 2016 and 2017 across 26 fields, Microsoft Academic found 2-5 times as many citations as Scopus, depending on year and field. From manual checks of 1,122 Microsoft Academic citations not found in Scopus, Microsoft Academic’s citation indexing was faster but not much wider than Scopus for journals. It achieved this by associating citations to preprints with their subsequent in-press versions and by extracting citations from in-press articles. In some fields its coverage of scholarly digital libraries, such as arXiv.org, was also an advantage. Thus, Microsoft Academic seems to be a more comprehensive automatic source of citation counts for in-press articles than Scopus.
    • Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network

      Thelwall, Mike (Springer, 2018-03-07)
      Preprint archives play an important scholarly communication role within some fields. The impact of archives and individual preprints is difficult to analyse because online repositories are not indexed by the Web of Science or Scopus. In response, this article assesses whether the new Microsoft Academic can be used for citation analysis of preprint archives, focusing on the Social Science Research Network (SSRN). Although Microsoft Academic seems to index SSRN comprehensively, it groups a small fraction of SSRN papers into an easily retrievable set that has variations in character over time, making any field normalisation or citation comparisons untrustworthy. A brief parallel analysis of arXiv suggests that similar results would occur for other online repositories. Systematic analyses of preprint archives are nevertheless possible with Microsoft Academic when complete lists of archive publications are available from other sources because of its promising coverage and citation results.
    • Can museums find male or female audiences online with YouTube?

      Thelwall, Michael (Emerald Publishing Limited, 2018-08-31)
      Purpose: This article investigates if and why audience gender ratios vary between museum YouTube channels, including for museums of the same type. Design/methodology/approach: Gender ratios were examined for public comments on YouTube videos from 50 popular museums in English-speaking nations. Terms that were more frequently used by males or females in comments were also examined for gender differences. Findings: The ratio of female to male YouTube commenters varies almost a hundredfold between museums. Some of the difference could be explained by gendered interests in museum themes (e.g., military, art) but others were due to the topics chosen for online content and could address a gender minority audience. Practical implications: Museums can attract new audiences online with YouTube videos that target outside their expected demographics. Originality/value: This is the first analysis of YouTube audience gender for museums.
    • Can Social News Websites Pay for Content and Curation? The SteemIt Cryptocurrency Model

      Thelwall, Mike (SAGE Publishing, 2017-12-15)
      SteemIt is a Reddit-like social news site that pays members for posting and curating content. It uses micropayments backed by a tradeable currency, exploiting the Bitcoin cryptocurrency generation model to finance content provision in conjunction with advertising. If successful, this paradigm might change the way in which volunteer-based sites operate. This paper investigates 925,092 new members’ first posts for insights into what drives financial success in the site. Initial blog posts on average received $0.01, although the maximum accrued was $20,680.83. Longer, more sentiment-rich or more positive comments with personal information received the greatest financial reward in contrast to more informational or topical content. Thus, there is a clear financial value in starting with a friendly introduction rather than immediately attempting to provide useful content, despite the latter being the ultimate site goal. Follow-up posts also tended to be more successful when more personal, suggesting that interpersonal communication rather than quality content provision has driven the site so far. It remains to be seen whether the model of small typical rewards and the possibility that a post might generate substantially more are enough to incentivise long term participation or a greater focus on informational posts in the long term.
    • Can the Web give useful information about commercial uses of scientific research?

      Thelwall, Mike (Emerald Group Publishing Limited, 2004)
      Invocations of pure and applied science journals in the Web were analysed, focussing on commercial sites, in order to assess whether the Web can yield useful information about university-industry knowledge transfer. On a macro level, evidence was found that applied research was more highly invoked on the non-academic Web than pure research, but only in one of the two fields studied. On a micro level, instances of clear evidence of the transfer of academic knowledge to a commercial setting were sparse. Science research on the Web seems to be invoked mainly for marketing purposes, although high technology companies can invoke published academic research as an organic part of a strategy to prove product effectiveness. It is conjectured that invoking academic research in business Web pages is rarely of clear commercial benefit to a company and that, except in unusual circumstances, benefits from research will be kept hidden to avoid giving intelligence to competitors.