• How quickly do publications get read? The evolution of Mendeley reader counts for new articles

      Maflahi, Nabeil; Thelwall, Mike (Wiley-Blackwell, 2017-08-29)
      Within science, citation counts are widely used to estimate research impact but publication delays mean that they are not useful for recent research. This gap can be filled by Mendeley reader counts, which are valuable early impact indicators for academic articles because they appear before citations and correlate strongly with them. Nevertheless, it is not known how Mendeley readership counts accumulate within the year of publication, and so it is unclear how soon they can be used. In response, this paper reports a longitudinal weekly study of the Mendeley readers of articles in six library and information science journals from 2016. The results suggest that Mendeley readers accrue from when articles are first available online and continue to steadily build. For journals with large publication delays, articles can already have substantial numbers of readers by their publication date. Thus, Mendeley reader counts may even be useful as early impact indicators for articles before they have been officially published in a journal issue. If field normalised indicators are needed, then these can be generated when journal issues are published using the online first date.
    • Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing

      Mohamed, Emad; Sadat, Fatiha (Elsevier BV, 2014-11-08)
      Arabic is a highly inflected language and a morpho-syntactically complex language with many differences compared to several languages that are heavily studied. It may thus require good pre-processing as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under scarce- and large-resource conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
    • Hyperlinks as a data source for science mapping

      Harries, Gareth; Wilkinson, David; Price, Liz; Fairclough, Ruth; Thelwall, Mike (Sage, 2004)
      Hyperlinks between academic web sites, like citations, can potentially be used to map disciplinary structures and identify evidence of connections between disciplines. In this paper we classified a sample of links originating in three different disciplines: maths, physics and sociology. Links within a discipline were found to be different in character to links between pages in different disciplines. There were also disciplinary differences in both types of link. As a consequence, we argue that interpretations of web science maps covering multiple disciplines will need to be sensitive to the contexts of the links mapped.
    • Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh; Markantonatou, Stella; Ramisch, Carlos; Savary, Agata; Vincze, Veronika (Language Science Press, 2018-10-25)
    • Identification of translationese: a machine learning approach

      Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
      This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.
    • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

      Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018-10-31)
      This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different components reveals acceptable performance in rewriting sentences containing compound clauses but less accuracy when rewriting sentences containing nominally bound relative clauses. A detailed error analysis revealed that the major sources of error include inaccurate sign tagging, the relatively limited coverage of the rules used to rewrite sentences, and an inability to discriminate between various subtypes of clause coordination. Despite this, the system performed well in comparison with two baselines. This finding was reinforced by automatic estimations of the readability of system output and by surveys of readers’ opinions about the accuracy, accessibility, and meaning of this output.
    • Improving translation memory matching and retrieval using paraphrases

      Gupta, Rohit; Orasan, Constantin; Zampieri, Marcos; Vela, Mihaela; van Genabith, Josef; Mitkov, Ruslan (Springer Nature, 2016-11-02)
      Most of the current Translation Memory (TM) systems work on string level (character or word level) and lack semantic knowledge while matching. They use simple edit-distance calculated on surface-form or some variation on it (stem, lemma), which does not take into consideration any semantic aspects in matching. This paper presents a novel and efficient approach to incorporating semantic information in the form of paraphrasing in the edit-distance metric. The approach computes edit-distance while efficiently considering paraphrases using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics like BLEU and METEOR, we have carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER, HMETEOR, and carried out three rounds of subjective evaluations. Our results show that paraphrasing substantially improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
    • Intelligent Natural Language Processing: Trends and Applications

      Orăsan, Constantin; Evans, Richard; Mitkov, Ruslan (Springer, 2017)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Interpreting correlations between citation counts and other indicators

      Thelwall, Mike (Springer, 2016-05-09)
      Altmetrics or other indicators for the impact of academic outputs are often correlated with citation counts in order to help assess their value. Nevertheless, there are no guidelines about how to assess the strengths of the correlations found. This is a problem because this value affects the conclusions that should be drawn. In response, this article uses experimental simulations to assess the correlation strengths to be expected under various different conditions. The results show that the correlation strength reflects not only the underlying degree of association but also the average magnitude of the numbers involved. Overall, the results suggest that due to the number of assumptions that must be made in practice it will rarely be possible to make a realistic interpretation of the strength of a correlation coefficient.
    • Interpreting social science link analysis research: A theoretical framework

      Thelwall, Mike (Wiley, 2006)
      Link analysis in various forms is now an established technique in many different subjects, reflecting the perceived importance of links and of the Web. A critical but very difficult issue is how to interpret the results of social science link analyses. It is argued that the dynamic nature of the Web, its lack of quality control, and the online proliferation of copying and imitation mean that methodologies operating within a highly positivist, quantitative framework are ineffective. Conversely, the sheer variety of the Web makes application of qualitative methodologies and pure reason very problematic to large-scale studies. Methodology triangulation is consequently advocated, in combination with a warning that the Web is incapable of giving definitive answers to large-scale link analysis research questions concerning social factors underlying link creation. Finally, it is claimed that although theoretical frameworks are appropriate for guiding research, a Theory of Link Analysis is not possible.
    • Is Medical Research Informing Professional Practice More Highly Cited? Evidence from AHFS DI Essentials in Drugs.com

      Thelwall, Mike; Kousha, Kayvan; Abdoli, Mahshid (Springer, 2017-02-21)
      Citation-based indicators are often used to help evaluate the impact of published medical studies, even though the research has the ultimate goal of improving human wellbeing. One direct way of influencing health outcomes is by guiding physicians and other medical professionals about which drugs to prescribe. A high profile source of this guidance is the AHFS DI Essentials product of the American Society of Health-System Pharmacists, which gives systematic information for drug prescribers. AHFS DI Essentials documents, which are also indexed by Drugs.com, include references to academic studies and the referenced work is therefore helping patients by guiding drug prescribing. This article extracts AHFS DI Essentials documents from Drugs.com and assesses whether articles referenced in these information sheets have their value recognised by higher Scopus citation counts. A comparison of mean log-transformed citation counts between articles that are and are not referenced in AHFS DI Essentials shows that AHFS DI Essentials references are more highly cited than average for the publishing journal. This suggests that medical research influencing drug prescribing is more cited than average.
    • Language evolution and the spread of ideas on the Web: A procedure for identifying emergent hybrid word family members

      Thelwall, Mike; Price, Liz (Wiley, 2006)
      Word usage is of interest to linguists for its own sake as well as to social scientists and others who seek to track the spread of ideas, for example, in public debates over political decisions. The historical evolution of language can be analyzed with the tools of corpus linguistics through evolving corpora and the Web. But word usage statistics can only be gathered for known words. In this article, techniques are described and tested for identifying new words from the Web, focusing on the case when the words are related to a topic and have a hybrid form with a common sequence of letters. The results highlight the need to employ a combination of search techniques and show the wide potential of hybrid word family investigations in linguistics and social science.
    • Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions

      Taslimipoor, Shiva; Desantis, Anna; Cherchi, Manuela; Mitkov, Ruslan; Monti, Johanna (ceur-ws, 2016-12-05)
      This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.
    • Large-scale data harvesting for biographical data

      Plum, Alistair; Zampieri, Marcos; Orasan, Constantin; Wandl-Vogt, Eveline; Mitkov, R (CEUR, 2019-09-05)
      This paper explores automatic methods to identify relevant biography candidates in large databases, and extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though they are considered by many as an identical resource, our results show that the data from Wikipedia and Wikidata differs in some cases and they can be used in a complementary way providing more data for the compilation of biographies.
    • Leveraging large corpora for translation using the Sketch Engine

      Moze, Sara; Krek, Simon (Cambridge University Press, 2018)
    • Linguistic features of genre and method variation in translation: A computational perspective

      Lapshinova-Koltunski, Ekaterina; Zampieri, Marcos; Legallois, Dominique; Charnois, Thierry; Larjavaara, Meri (Mouton De Gruyter, 2018-04-09)
      In this contribution we describe the use of text classification methods to investigate genre and method variation in an English–German translation corpus. For this purpose we use linguistically motivated features representing texts using a combination of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis on the main difference between genres and methods of translation.
    • Linguistic patterns of academic Web use in Western Europe

      Thelwall, Mike; Tang, Rong; Price, Liz (Springer, 2003)
      A survey of linguistic dimensions of Web site hosting and interlinking of the universities of sixteen European countries is described. The results show that English is the dominant language both for linking pages and for all pages. In a typical country approximately half the pages were in English and half in one or more national languages. Normalised interlinking patterns showed three trends: 1) international interlinking throughout Europe in English, and additionally in Swedish in Scandinavia; 2) linking between countries sharing a common language; and 3) countries extensively hosting international links in their own major languages. This provides evidence for the multilingual character of academic use of the Web in Western Europe, at least outside the UK and Eire. Evidence was found that Greece was significantly linguistically isolated from the rest of the EU but that outsiders Norway and Switzerland were not.
    • Linking Verb Pattern Dictionaries of English and Spanish

      Baisa, Vít; Moze, Sara; Renau, Irene (ELRA, 2016-05-24)
      The paper presents the first step in the creation of a new multilingual and corpus-driven lexical resource by means of linking existing monolingual pattern dictionaries of English and Spanish verbs. The two dictionaries were compiled through Corpus Pattern Analysis (CPA) – an empirical procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations found in corpus data. This paper provides a first look into a number of practical issues arising from the task of linking corresponding patterns across languages via both manual and automatic procedures. In order to facilitate manual pattern linking, we implemented a heuristic-based algorithm to generate automatic suggestions for candidate verb pattern pairs, which obtained 80% precision. Our goal is to kick-start the development of a new resource for verbs that can be used by language learners, translators, editors and the research community alike.
    • Long term productivity and collaboration in information science

      Thelwall, Mike; Levitt, Jonathan (Springer, 2016-07-02)
      Funding bodies have tended to encourage collaborative research because it is generally more highly cited than sole author research. But higher mean citation for collaborative articles does not imply collaborative researchers are in general more research productive. This article assesses the extent to which research productivity varies with the number of collaborative partners for long term researchers within three Web of Science subject areas: Information Science & Library Science, Communication and Medical Informatics. When using the whole number counting system, researchers who worked in groups of 2 or 3 were generally the most productive, in terms of producing the most papers and citations. However, when using fractional counting, researchers who worked in groups of 1 or 2 were generally the most productive. The findings need to be interpreted cautiously, however, because authors that produce few academic articles within a field may publish in other fields or leave academia and contribute to society in other ways.
    • Mendeley readership altmetrics for medical articles: An analysis of 45 fields

      Wilson, Paul; Thelwall, Mike; Statistical Cybermetrics Research Group, School of Mathematics and Computer Science, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, UK (Wiley Blackwell, 2015-05)
      2330-1643