• Assessing the teaching value of non-English academic books: The case of Spain

      Mas Bleda, Amalia; Thelwall, Mike (Consejo Superior de Investigaciones Científicas, 2018-12-01)
      This study examines the educational value of 15,117 Spanish-language books published by Spanish publishers in social sciences and humanities fields in the period 2002-2011, based on mentions of them extracted automatically from online course syllabi. A method was developed to collect syllabus mentions and filter out false matches. Manual checks of the 52,716 syllabus mentions found gave an estimated accuracy of 99.5% for filtering out false mentions and 74.7% for identifying correct mentions. A fifth of the sampled books (2,849; 19%) were mentioned at least once in online syllabi and almost all (95%) were from a third of the publishers included in the study. An in-depth analysis of the 23 books recommended most often in online syllabi showed that they are mostly single-authored humanities monographs that were originally written in Spanish. The syllabus mentions originated from 379 domains, but mostly from Spanish university websites. In conclusion, it is possible to make indicators from online syllabus mentions to assess the teaching value of Spanish-language books, although manual checks are needed if the values are used for assessing individual books.
    • Attention: there is an inconsistency between android permissions and application metadata!

      Alecakir, Huseyin; Can, Burcu; Sen, Sevil (Springer Science and Business Media LLC, 2021-01-07)
      Since mobile applications make our lives easier, there is a large number of mobile applications customized for our needs in the application markets. While the application markets provide us a platform for downloading applications, they are also used by malware developers to distribute their malicious applications. In Android, permissions raise users’ awareness and so help prevent them from installing applications that might violate their privacy. From the privacy and security point of view, if the functionality of applications is given in sufficient detail in their descriptions, then the need for the requested permissions can be well understood. This is defined as description-to-permission fidelity in the literature. In this study, we propose two novel models that address the inconsistencies between the application descriptions and the requested permissions. The proposed models are based on the current state-of-the-art neural architectures called attention mechanisms. Here, we aim to find the permission statement words or sentences in app descriptions by using the attention mechanism along with recurrent neural networks. The lack of such permission statements in application descriptions raises suspicion. Hence, the proposed approach could assist static analysis techniques in finding suspicious apps and in prioritizing apps for more resource-intensive analysis techniques. The experimental results show that the proposed approach achieves high accuracy.
    • Author verification of Nahj Al-Balagha

      Sarwar, Raheem; Mohamed, Emad (Oxford University Press, 2022-01-20)
      The primary purpose of this paper is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and over which Sunni and Shi’i Muslims are proposing different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words, and that regardless of the feature sets, the three sub-corpora cluster into three distinct groups using Principal Component Analysis, Hierarchical Clustering, Multi-dimensional Scaling and Bootstrap Consensus Trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging and Decision Trees confirm the same results, which is a clear indication that (a) the book is internally consistent and can thus be attributed to a single person, and (b) it was not authored by either of the suspected authors.
    • Autism and the web: using web-searching tasks to detect autism and improve web accessibility

      Yaneva, Victoria (Association for Computing Machinery (ACM), 2018-08-02)
      People with autism consistently exhibit different attention-shifting patterns compared to neurotypical people. Research has shown that these differences can be successfully captured using eye tracking. In this paper, we summarise our recent research on using gaze data from web-related tasks to address two problems: improving web accessibility for people with autism and detecting autism automatically. We first examine the way a group of participants with autism and a control group process the visual information from web pages and provide empirical evidence of different visual searching strategies. We then use these differences in visual attention to train a machine learning classifier, which can successfully use the gaze data to distinguish between the two groups with an accuracy of 0.75. At the end of this paper we review the way forward to improving web accessibility and automatic autism detection, as well as the practical implications and alternatives for using eye tracking in these research areas.
    • Automated Web issue analysis: A nurse prescribing case study

      Thelwall, Mike; Thelwall, Saheeda; Fairclough, Ruth (Elsevier, 2006)
      Web issue analysis, a new automated technique designed to rapidly give timely management intelligence about a topic from an automated large-scale analysis of relevant pages from the Web, is introduced and demonstrated. The technique includes hyperlink and URL analysis to identify common direct and indirect sources of Web information. In addition, text analysis through natural language processing techniques is used to identify relevant common nouns and noun phrases. A case study approach is taken, applying Web issue analysis to the topic of nurse prescribing. The results are presented in descriptive form and a qualitative analysis is used to argue that new information has been found. The nurse prescribing results demonstrate interesting new findings, such as the parochial nature of the topic in the UK, an apparent absence of similar concepts internationally, at least in the English-speaking world, and a significant concern with mental health issues. These demonstrate that automated Web issue analysis is capable of quickly delivering new insights into a problem. General limitations are that the success of Web issue analysis is dependent upon the particular topic chosen and the ability to find a phrase that accurately captures the topic and is not used in other contexts, as well as being language-specific.
    • Automatic multidocument summarization of research abstracts: Design and user evaluation

      Ou, Shiyan; Khoo, Christopher S.G.; Goh, Dion H. (Wiley, 2007)
      The purpose of this study was to develop a method for automatic construction of multidocument summaries of sets of research abstracts that may be retrieved by a digital library or search engine in response to a user query. Sociology dissertation abstracts were selected as the sample domain in this study. A variable-based framework was proposed for integrating and organizing research concepts and relationships as well as research methods and contextual relations extracted from different dissertation abstracts. Based on the framework, a new summarization method was developed, which parses the discourse structure of abstracts, extracts research concepts and relationships, integrates the information across different abstracts, and organizes and presents them in a Web-based interface. The focus of this article is on the user evaluation that was performed to assess the overall quality and usefulness of the summaries. Two types of variable-based summaries generated using the summarization method - with or without the use of a taxonomy - were compared against a sentence-based summary that lists only the research-objective sentences extracted from each abstract and another sentence-based summary generated using the MEAD system that extracts important sentences. The evaluation results indicate that the majority of sociological researchers (70%) and general users (64%) preferred the variable-based summaries generated with the use of the taxonomy.
    • Automatic summarisation: 25 years On

      Orăsan, Constantin (Cambridge University Press (CUP), 2019-09-19)
      Automatic text summarisation is a topic that has been receiving attention from the research community from the early days of computational linguistics, but it really took off around 25 years ago. This article presents the main developments from the last 25 years. It starts by defining what a summary is and how its definition changed over time as a result of the interest in processing new types of documents. The article continues with a brief history of the field and highlights the main challenges posed by the evaluation of summaries. The article finishes with some thoughts about the future of the field.
    • Avoiding obscure topics and generalising findings produces higher impact research

      Thelwall, Mike (Springer, 2016-10-11)
      Much academic research is never cited and may be rarely read, indicating wasted effort from the authors, referees and publishers. One reason that an article could be ignored is that its topic is, or appears to be, too obscure to be of wide interest, even if excellent scholarship produced it. This paper reports a word frequency analysis of 874,411 English article titles from 18 different Scopus natural, formal, life and health sciences categories 2009-2015 to assess the likelihood that research on obscure (rarely researched) topics is less cited. In all categories examined, unusual words in article titles associate with below average citation impact research. Thus, researchers considering obscure topics may wish to reconsider, generalise their study, or to choose a title that reflects the wider lessons that can be drawn. Authors should also consider including multiple concepts and purposes within their titles in order to attract a wider audience.
    • BDAFRICA: diseño e implementación de una base de datos de la literatura poscolonial africana publicada en España

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Universidad de Valladolid, 2016-01-10)
      This work shows that no repository exists that includes the postcolonial African authors published to date in Spain and that would therefore allow quantitative and qualitative studies of the impact of this literature to be carried out with the desired precision. This is a shortcoming both for academic research and for the publishing sector when analysing trends in selection and reception in the market. Given this situation, the main aim of this work is to design and implement a database, based on MySQL and delimited by very specific parameters, that records all works by African authors published in Spanish in Spain between 1972 (the year in which Spain joined the ISBN system) and 2014. After establishing design criteria and specific compilation protocols, the methodological development was divided into four phases: data collection, storage, processing and dissemination. The BDÁFRICA database thus achieves a twofold objective: on the one hand, it provides researchers with reliable data on which to base their studies and, on the other, it would make it possible to offer, for the first time, statistical data on the evolution of the publication of works by African authors in Spain over the last 42 years.
    • Blog Searching: The First General-Purpose Source of Retrospective Public Opinion in the Social Sciences?

      Thelwall, Mike (Emerald, 2007)
      Purpose – To demonstrate how blog searching can be used as a retrospective source of public opinion. Design/methodology/approach – In this paper a variety of blog searching techniques are described and illustrated with a case study of the Danish cartoons affair. Findings – A time series analysis of related blog postings suggests that the Danish cartoons issue attracted little attention in the English-speaking world for four months after the initial publication of the cartoons, exploding only after the simultaneous start of diplomatic sanctions and a commercial boycott. Research limitations/implications – Blogs only reveal the opinions of bloggers, and blog analysis is language-specific. Sections of the world and the population of individual countries that do not have access to the internet will not be adequately represented in blogspace. Moreover, bloggers are self-selected and probably not representative of internet users. Originality/value – The existence of blog search engines now allows researchers to search blogspace for posts relating to any given debate, seeking either the opinions of blogging pundits or casual mentions in personal journals. It is possible to use blogs to examine topics before they first attracted mass media attention, as well as to dissect ongoing discussions. This gives a retrospective source of public opinion that is unique to blog search engines.
    • Book genre and author gender: romance>paranormal-romance to autobiography>memoir

      Thelwall, Mike (Wiley-Blackwell, 2016-12-21)
      Although gender differences are known to exist in the publishing industry and in reader preferences, there is little public systematic evidence about them. This article uses evidence from the book-based social website Goodreads to provide a large-scale analysis of 50 major English book genres based on author genders. The results show gender differences in authorship in almost all categories and gender differences in the level of interest in, and ratings of, books in a minority of categories. Perhaps surprisingly in this context, there is not a clear gender-based relationship between the success of an author and their prevalence within a genre. The unexpected almost universal authorship gender differences should give new impetus to investigations of the importance of gender in fiction and the success of minority genders in some genres should encourage publishers and librarians to take their work seriously, except perhaps for most male-authored chick-lit.
    • Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

      Hidalgo-Ternero, Carlos Manuel; Pastor, Gloria Corpas (Walter de Gruyter GmbH, 2020-11-25)
      The present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) will be carried out in order to evaluate to what extent gApp can optimise the performance of the two main free open-source NMT systems —Google Translate and DeepL— under the challenge of MWE discontinuity in the Spanish into English directionality. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
    • Brief Communication: The clustering power of low frequency words in academic Webs

      Price, Liz; Thelwall, Mike (Wiley, 2005)
      The value of low frequency words for subject-based academic Web site clustering is assessed. A new technique is introduced to compare the relative clustering power of different vocabularies. The technique is designed for word frequency tests in large document clustering exercises. Results for the Australian and New Zealand academic Web spaces indicate that low frequency words are useful for clustering academic Web sites along subject lines; removing low frequency words results in sites becoming, on average, less dissimilar to sites from other subjects.
    • CAG: stylometric authorship attribution of multi-author documents using a co-authorship graph

      Sarwar, R; Urailertprasert, N; Vannaboot, N; Yu, C; Rakthanmanon, T; Chuangsuwanich, E; Nutanong, S (Institute of Electrical and Electronics Engineers (IEEE), 2020-01-17)
      Stylometry has been successfully applied to perform authorship identification of single-author documents (AISD). The AISD task is concerned with identifying the original author of an anonymous document from a group of candidate authors. However, AISD techniques are not applicable to the authorship identification of multi-author documents (AIMD). Unlike AISD, where each document is written by one single author, AIMD focuses on handling multi-author documents. Due to the combinatoric nature of documents, AIMD lacks the ground truth information - that is, information on writing and non-writing authors in a multi-author document - which makes this problem more challenging to solve. Previous AIMD solutions have a number of limitations: (i) the best stylometry-based AIMD solution has a low accuracy, less than 30%; (ii) increasing the number of co-authors of papers adversely affects the performance of AIMD solutions; and (iii) AIMD solutions were not designed to handle the non-writing authors (NWAs). However, NWAs exist in real-world cases - that is, there are papers for which not every co-author listed has contributed as a writer. This paper proposes an AIMD framework called the Co-Authorship Graph that can be used to (i) capture the stylistic information of each author in a corpus of multi-author documents and (ii) make a multi-label prediction for a multi-author query document. We conducted extensive experimental studies on one synthetic and three real-world corpora. Experimental results show that our proposed framework (i) significantly outperformed competitive techniques; (ii) can effectively handle a larger number of co-authors in comparison with competitive techniques; and (iii) can effectively handle NWAs in multi-author documents.
    • Can alternative indicators overcome language biases in citation counts? A comparison of Spanish and UK research

      Mas-Bleda, Amalia; Thelwall, Mike (Springer, 2016-09-09)
      This study compares Spanish and UK research in eight subject fields using a range of bibliometric and social media indicators. For each field, lists of Spanish and UK journal articles published in the year 2012 and their citation counts were extracted from Scopus. The software Webometric Analyst was then used to extract a range of altmetrics for these articles, including patent citations, online presentation mentions, online course syllabus mentions, Wikipedia mentions and Mendeley reader counts, and Altmetric.com was used to extract Twitter mentions. Results show that Mendeley is the altmetric source with the highest coverage, with 80% of sampled articles having one or more Mendeley readers, followed by Twitter (34%). The coverage of the remaining sources was lower than 3%. All of the indicators checked either have too little data or increase the overall difference between Spain and the UK and so none can be suggested as alternatives to reduce the bias against Spain in traditional citation indexes.
    • Can Amazon.com reviews help to assess the wider impacts of books?

      Kousha, Kayvan; Thelwall, Mike; Statistical Cybermetrics Research Group, School of Mathematics and Computer Science, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, United Kingdom (2016-03)
    • Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

      Kousha, Kayvan; Thelwall, Mike (Elsevier, 2019-03-11)
      Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess. In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google. The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013 to 2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories. The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader. Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities. Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations. In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.
    • Can Google's PageRank be used to find the most important academic Web pages?

      Thelwall, Mike (MCB UP Ltd, 2003)
      Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared with simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with powerful text matching techniques in order to get the quality of information retrieval results provided by Google.
    • Can Microsoft Academic assess the early citation impact of in-press articles? A multi-discipline exploratory analysis

      Kousha, Kayvan; Abdoli, Mahshid; Thelwall, Mike (Elsevier, 2018-02-03)
      Many journals post accepted articles online before they are formally published in an issue. Early citation impact evidence for these articles could be helpful for timely research evaluation and to identify potentially important articles that quickly attract many citations. This article investigates whether Microsoft Academic can help with this task. For over 65,000 Scopus in-press articles from 2016 and 2017 across 26 fields, Microsoft Academic found 2-5 times as many citations as Scopus, depending on year and field. From manual checks of 1,122 Microsoft Academic citations not found in Scopus, Microsoft Academic’s citation indexing was faster but not much wider than Scopus for journals. It achieved this by associating citations to preprints with their subsequent in-press versions and by extracting citations from in-press articles. In some fields its coverage of scholarly digital libraries, such as arXiv.org, was also an advantage. Thus, Microsoft Academic seems to be a more comprehensive automatic source of citation counts for in-press articles than Scopus.
    • Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network

      Thelwall, Mike (Springer, 2018-03-07)
      Preprint archives play an important scholarly communication role within some fields. The impact of archives and individual preprints is difficult to analyse because online repositories are not indexed by the Web of Science or Scopus. In response, this article assesses whether the new Microsoft Academic can be used for citation analysis of preprint archives, focusing on the Social Science Research Network (SSRN). Although Microsoft Academic seems to index SSRN comprehensively, it groups a small fraction of SSRN papers into an easily retrievable set that has variations in character over time, making any field normalisation or citation comparisons untrustworthy. A brief parallel analysis of arXiv suggests that similar results would occur for other online repositories. Systematic analyses of preprint archives are nevertheless possible with Microsoft Academic when complete lists of archive publications are available from other sources because of its promising coverage and citation results.