• GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks

      Taslimipoor, Shiva; Rohanian, Omid; Može, Sara (Association for Computational Linguistics, 2019-06-06)
      This paper describes the system submitted to the SemEval 2019 shared task 1 ‘Cross-lingual Semantic Parsing with UCCA’. We rely on the semantic dependency parse trees provided in the shared task, which are converted from the original UCCA files, and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CoNLL-U format of the input data and is best suited for semantic dependency parsing.
    • Gender and image sharing on Facebook, Twitter, Instagram, Snapchat and WhatsApp in the UK: Hobbying alone or filtering for friends?

      Thelwall, Mike; Vis, Farida (Emerald, 2017-10-01)
      Purpose: Despite the ongoing shift from text-based to image-based communication in the social web, supported by the affordances of smartphones, little is known about the new image sharing practices. Both gender and platform type seem likely to be important, but it is unclear how. Design/methodology/approach: This article surveys an age-balanced sample of UK Facebook, Twitter, Instagram, Snapchat and WhatsApp image sharers with a range of exploratory questions about platform use, privacy, interactions, technology use and profile pictures. Findings: Females shared photos more often overall and shared images more frequently on Snapchat, but males shared more images on Twitter, particularly for hobbies. Females also tended to have more privacy-related concerns but were more willing, in principle, to share pictures of their children. Females also interacted more through others’ images by liking and commenting on them. Both genders used supporting apps but in different ways: females applied filters and posted to albums whereas males retouched photos and used photo organising apps. Finally, males were more likely to be alone in their profile pictures. Practical implications: Those designing visual social web communication strategies to reach out to users should consider the different ways in which platforms are used by males and females to optimise their message for their target audience. Social implications: There are clear gender and platform differences in visual communication strategies. Overall, males may tend to have more informational, and females more relationship-based, skills or needs. Originality/value: This is the first detailed survey of electronic image sharing practices and the first to systematically compare the current generation of platforms.
    • Gender and research Publishing in India: Uniformly high inequality?

      Thelwall, Mike; Bailey, Carol; Makita, Meiko; Sud, Pardeep; Madalli, Devika P. (Elsevier, 2018-12-18)
      Gender inequalities have been a persistent feature of all modern societies. Although employment-related gender discrimination in various forms is legally prohibited, prejudice and violence against females have not been eradicated. Moreover, gendered social expectations can constrain the career choices of both males and females. Within academia, continuing gender imbalances have been found in many countries (Larivière, Ni, Gingras, Cronin, & Sugimoto, 2013), and particularly at senior levels (e.g., Ucal, O'Neil, & Toktas, 2015; Weisshaar, 2017; Winchester & Browning, 2015). India was the fifth largest research producer in 2017, according to Scopus, but has the highest United Nations Development Programme (UNDP) gender inequality index of the 30 largest research producers in Scopus (hdr.undp.org/en/data) and so is an important case for global science. Moreover, the complex web of influences that have led to women being underrepresented in science in India is not well understood (Gupta, 2015). The absence of basic information about gender inequalities is a serious limitation because gender issues in India differ from the better researched case of the USA, due to economic conditions, probably stronger family influences (Vindhya, 2007), greater female safety concerns (Vindhya, 2007), and differing cultural expectations (Chandrakar, 2014).
    • Gender bias in machine learning for sentiment analysis

      Thelwall, Mike (Emerald Publishing Limited, 2018-01-01)
      Purpose: This paper investigates whether machine learning induces gender biases in the sense of results that are more accurate for male authors than for female authors. It also investigates whether training separate male and female variants could improve the accuracy of machine learning for sentiment analysis. Design/methodology/approach: This article uses ratings-balanced sets of reviews of restaurants and hotels (3 sets) to train algorithms with and without gender selection. Findings: Accuracy is higher on female-authored reviews than on male-authored reviews for all data sets, so applications of sentiment analysis using mixed gender datasets will overrepresent the opinions of women. Training on same gender data improves performance less than having additional data from both genders. Practical implications: End users of sentiment analysis should be aware that its small gender biases can affect the conclusions drawn from it and apply correction factors when necessary. Users of systems that incorporate sentiment analysis should be aware that performance will vary by author gender. Developers do not need to create gender-specific algorithms unless they have more training data than their system can cope with. Originality/value: This is the first demonstration of gender bias in machine learning sentiment analysis.
    • Gender bias in sentiment analysis

      Thelwall, Mike (Emerald, 2018-02-14)
      Purpose: To test if there are biases in lexical sentiment analysis accuracy between reviews authored by males and females. Design: This paper uses datasets of TripAdvisor reviews of hotels and restaurants in the UK written by UK residents to contrast the accuracy of lexical sentiment analysis for males and females. Findings: Male sentiment is harder to detect because it is less explicit. There was no evidence that this problem could be solved by gender-specific lexical sentiment analysis. Research limitations: Only one lexical sentiment analysis algorithm was used. Practical implications: Care should be taken when drawing conclusions about gender differences from automatic sentiment analysis results. When comparing opinions for product aspects that appeal differently to men and women, female sentiments are likely to be overrepresented, biasing the results. Originality/value: This is the first evidence that lexical sentiment analysis is less able to detect the opinions of one gender than another.
    • Gender differences in research areas, methods and topics: Can people and thing orientations explain the results?

      Thelwall, Mike; Bailey, Carol; Tobin, Catherine; Bradshaw, Noel-Ann (Elsevier, 2018-12-26)
      Although the gender gap in academia has narrowed, females are underrepresented within some fields in the USA. Prior research suggests that the imbalances between science, technology, engineering and mathematics fields may be partly due to greater male interest in things and greater female interest in people, or to off-putting masculine cultures in some disciplines. To seek more detailed insights across all subjects, this article compares practising US male and female researchers between and within 285 narrow Scopus fields inside 26 broad fields from their first-authored articles published in 2017. The comparison is based on publishing fields and the words used in article titles, abstracts, and keywords. The results cannot be fully explained by the people/thing dimensions. Exceptions include greater female interest in veterinary science and cell biology and greater male interest in abstraction, patients, and power/control fields, such as politics and law. These may be due to other factors, such as the ability of a career to provide status or social impact or the availability of alternative careers. As a possible side effect of the partial people/thing relationship, females are more likely to use exploratory and qualitative methods and males are more likely to use quantitative methods. The results suggest that the necessary steps of eliminating explicit and implicit gender bias in academia are insufficient and might be complemented by measures to make fields more attractive to minority genders.
    • Goodreads Reviews to Assess the Wider Impacts of Books

      Kousha, Kayvan; Thelwall, Mike; Abdoli, Mahshid (John Wiley & Sons, 2017-06-01)
      Although peer-review and citation counts are commonly used to help assess the scholarly impact of published research, informal reader feedback might also be exploited to help assess the wider impacts of books, such as their educational or cultural value. The social website Goodreads seems to be a reasonable source for this purpose because it includes a large number of book reviews and ratings by many users inside and outside of academia. To check this, Goodreads book metrics were compared with different book-based impact indicators for 15,928 academic books across broad fields. Goodreads engagements were numerous enough in the Arts (85% of books had at least one), Humanities (80%) and Social Sciences (67%) for use as a source of impact evidence. Low and moderate correlations between Goodreads book metrics and scholarly or non-scholarly indicators suggest that reader feedback in Goodreads reflects the many purposes of books rather than a single type of impact. Although Goodreads book metrics can be manipulated they could be used guardedly by academics, authors, and publishers in evaluations.
    • Goodreads: A social network site for book readers

      Thelwall, Mike; Kousha, Kayvan (John Wiley & Sons, Inc., 2016-12-21)
      Goodreads is an Amazon‐owned book‐based social web site for members to share books, read, review books, rate books, and connect with other readers. Goodreads has tens of millions of book reviews, recommendations, and ratings that may help librarians and readers to select relevant books. This article describes a first investigation of the properties of Goodreads users, using a random sample of 50,000 members. The results suggest that about three quarters of members with a public profile are female, and that there is little difference between male and female users in patterns of behavior, except for females registering more books and rating them less positively. Goodreads librarians and super‐users engage extensively with most features of the site. The absence of strong correlations between book‐based and social usage statistics (e.g., numbers of friends, followers, books, reviews, and ratings) suggests that members choose their own individual balance of social and book activities and rarely ignore one at the expense of the other. Goodreads is therefore neither primarily a book‐based website nor primarily a social network site but is a genuine hybrid, social navigation site.
    • Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories

      Martín-Martín, Alberto; Orduna-Malea, Enrique; Thelwall, Mike; Delgado López-Cózar, Emilio (Elsevier, 2018-10-05)
      Despite citation counts from Google Scholar (GS), Web of Science (WoS), and Scopus being widely consulted by researchers and sometimes used in research evaluations, there is no recent or systematic evidence about the differences between them. In response, this paper investigates 2,448,055 citations to 2299 English-language highly-cited documents from 252 GS subject categories published in 2006, comparing GS, the WoS Core Collection, and Scopus. GS consistently found the largest percentage of citations across all areas (93%–96%), far ahead of Scopus (35%–77%) and WoS (27%–73%). GS found nearly all the WoS (95%) and Scopus (92%) citations. Most citations found only by GS were from non-journal sources (48%–65%), including theses, books, conference papers, and unpublished materials. Many were non-English (19%–38%), and they tended to be much less cited than citing sources that were also in Scopus or WoS. Despite the many unique GS citing sources, Spearman correlations between citation counts in GS and WoS or Scopus are high (0.78-0.99). They are lower in the Humanities, and lower between GS and WoS than between GS and Scopus. The results suggest that in all areas GS citation data is essentially a superset of WoS and Scopus, with substantial extra coverage.
    • Grammatical annotation of historical Portuguese: Generating a corpus-based diachronic dictionary

      Bick, Eckhard; Zampieri, Marcos (Springer, 2016-09-03)
      In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method allows the creation of tailor-made standardization dictionaries for historical Portuguese with optional period or author frequencies.
    • Graph structure in three national academic Webs: Power laws with anomalies

      Thelwall, Mike; Wilkinson, David (Wiley, 2003)
      The graph structures of three national university publicly indexable Webs from Australia, New Zealand, and the UK were analyzed. Strong scale-free regularities for page indegrees, outdegrees, and connected component sizes were in evidence, resulting in power laws similar to those previously identified for individual university Web sites and for the AltaVista-indexed Web. Anomalies were also discovered in most distributions and were tracked down to root causes. As a result, resource driven Web sites and automatically generated pages were identified as representing a significant break from the assumptions of previous power law models. It follows that attempts to track average Web linking behavior would benefit from using techniques to minimize or eliminate the impact of such anomalies.
    • A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English

      Sivakumar, Jasivan; Muga, Jake; Spadavecchia, Flavio; White, Daniel; Can Buglalilar, Burcu (IEEE, 2022-06-30)
      In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features: word and sentence boundaries, periods, commas, and capitalisation for unformatted English text. We approach feature restoration as a binary classification task where the model learns to predict whether a feature should be restored or not. A pipeline approach is proposed, in which only one feature (word boundary, sentence boundary, punctuation, capitalisation) is restored in each component of the pipeline model. To optimise the model, we conducted a grid search on the parameters. The effect of changing the order of the pipeline is also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specific action points with optimisation potential to be targeted in follow-up research.
    • Guideline references and academic citations as evidence of the clinical value of health research

      Thelwall, Mike; Maflahi, Nabeil; Statistical Cybermetrics Research Group, School of Mathematics and Computer Science, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, United Kingdom (John Wiley & Sons, Ltd, 2015-03-17)
      This article introduces a new source of evidence of the value of medical-related research: citations from clinical guidelines. These give evidence that research findings have been used to inform the day-to-day practice of medical staff. To identify whether citations from guidelines can give different information from that of traditional citation counts, this article assesses the extent to which references in clinical guidelines tend to be highly cited in the academic literature and highly read in Mendeley. Using evidence from the United Kingdom, references associated with the UK's National Institute of Health and Clinical Excellence (NICE) guidelines tended to be substantially more cited than comparable articles, unless they had been published in the most recent 3 years. Citation counts also seemed to be stronger indicators than Mendeley readership altmetrics. Hence, although presence in guidelines may be particularly useful to highlight the contributions of recently published articles, for older articles citation counts may already be sufficient to recognize their contributions to health in society.
    • Guiding neural machine translation decoding with external knowledge

      Chatterjee, Rajen; Negri, Matteo; Turchi, Marco; Federico, Marcello; Specia, Lucia; Blain, Frédéric (Association for Computational Linguistics, 2017-09)
      Chatterjee, R., Negri, M., Turchi, M., Federico, M. et al. (2017) Guiding neural machine translation decoding with external knowledge. In, Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 157-168.
    • Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multiple task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      This article analyses a case study carried out with students of the course Traducción General «BA-AB» (II) - Inglés-Español / Español-Inglés (General Translation, English-Spanish / Spanish-English), taught in the second semester of the second year of the Degree in Translation and Interpreting at the Universidad de Málaga. In a first phase, the students were taught how to make the most of different electronic documentation resources and tools (linguistic corpora, lexicographical resources and the web, among others) to create textual equivalences in those cases where, as a result of interlingual phraseological anisomorphism, the creative modification of phraseological units (PUs) in the source text and the absence of one-to-one correspondences pose serious difficulties for the translation process. Thus, a first training activity on the translation of creative uses of phraseological units was followed by a practical session in which the students had to deal with different cases of manipulation in the source text. The analysis of these results will show to what extent the different documentation resources help trainee translators to overcome the challenge of phraseological manipulation.
    • How quickly do publications get read? The evolution of Mendeley reader counts for new articles

      Maflahi, Nabeil; Thelwall, Mike (Wiley-Blackwell, 2017-08-29)
      Within science, citation counts are widely used to estimate research impact but publication delays mean that they are not useful for recent research. This gap can be filled by Mendeley reader counts, which are valuable early impact indicators for academic articles because they appear before citations and correlate strongly with them. Nevertheless, it is not known how Mendeley readership counts accumulate within the year of publication, and so it is unclear how soon they can be used. In response, this paper reports a longitudinal weekly study of the Mendeley readers of articles in six library and information science journals from 2016. The results suggest that Mendeley readers accrue from when articles are first available online and continue to steadily build. For journals with large publication delays, articles can already have substantial numbers of readers by their publication date. Thus, Mendeley reader counts may even be useful as early impact indicators for articles before they have been officially published in a journal issue. If field normalised indicators are needed, then these can be generated when journal issues are published using the online first date.
    • Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing

      Mohamed, Emad; Sadat, Fatiha (Elsevier BV, 2014-11-08)
      Arabic is a highly inflected language and a morpho-syntactically complex language with many differences compared to several languages that are heavily studied. It may thus require good pre-processing as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under scarce- and large-resource conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
    • Hyperlinks as a data source for science mapping

      Harries, Gareth; Wilkinson, David; Price, Liz; Fairclough, Ruth; Thelwall, Mike (Sage, 2004)
      Hyperlinks between academic web sites, like citations, can potentially be used to map disciplinary structures and identify evidence of connections between disciplines. In this paper we classified a sample of links originating in three different disciplines: maths, physics and sociology. Links within a discipline were found to be different in character to links between pages in different disciplines. There were also disciplinary differences in both types of link. As a consequence, we argue that interpretations of web science maps covering multiple disciplines will need to be sensitive to the contexts of the links mapped.
    • Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh; Markantonatou, Stella; Ramisch, Carlos; Savary, Agata; Vincze, Veronika (Language Science Press, 2018-10-25)