• Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh; Markantonatou, Stella; Ramisch, Carlos; Savary, Agata; Vincze, Veronika (Language Science Press, 2018-10-25)
    • Identification of translationese: a machine learning approach

      Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
      This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach a success rate of up to 97.62% on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improvement in accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. These findings may therefore be considered an argument for the existence of the Simplification Universal.
    • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

      Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018-10-31)
      This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different components reveals acceptable performance in rewriting sentences containing compound clauses but less accuracy when rewriting sentences containing nominally bound relative clauses. A detailed error analysis revealed that the major sources of error include inaccurate sign tagging, the relatively limited coverage of the rules used to rewrite sentences, and an inability to discriminate between various subtypes of clause coordination. Despite this, the system performed well in comparison with two baselines. This finding was reinforced by automatic estimations of the readability of system output and by surveys of readers’ opinions about the accuracy, accessibility, and meaning of this output.
    • Improving translation memory matching and retrieval using paraphrases

      Gupta, Rohit; Orasan, Constantin; Zampieri, Marcos; Vela, Mihaela; van Genabith, Josef; Mitkov, Ruslan (Springer Nature, 2016-11-02)
      Most current Translation Memory (TM) systems work at the string level (character or word level) and lack semantic knowledge during matching: they use a simple edit distance calculated on surface forms, or on some variation of them (stems, lemmas), which takes no semantic aspects into consideration. This paper presents a novel and efficient approach to incorporating semantic information, in the form of paraphrasing, into the edit-distance metric. The approach computes edit distance while efficiently considering paraphrases using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics such as BLEU and METEOR, we carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER and HMETEOR, and conducted three rounds of subjective evaluation. Our results show that paraphrasing substantially improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
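      The core idea of the abstract above — treating a known paraphrase substitution as a free edit inside the edit-distance dynamic programme — can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the paraphrase table is invented, and only single-word paraphrases are handled (the paper also covers multi-word paraphrases via greedy approximation).

```python
# Word-level edit distance in which substituting a known paraphrase
# pair costs 0 instead of 1. Toy paraphrase table for illustration.
PARAPHRASES = {("begin", "start"), ("start", "begin")}

def paraphrase_edit_distance(source, target):
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                sub = 0
            elif (source[i - 1], target[j - 1]) in PARAPHRASES:
                sub = 0  # paraphrase match treated as identical
            else:
                sub = 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (paraphrase) substitution
    return d[m][n]

seg1 = "please begin the engine".split()
seg2 = "please start the engine".split()
print(paraphrase_edit_distance(seg1, seg2))  # 0: "begin"/"start" is a paraphrase
```

      Under plain edit distance the two segments above would differ by one substitution; with the paraphrase table the TM would retrieve them as an exact-equivalent match.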
    • Intelligent Natural Language Processing: Trends and Applications

      Orăsan, Constantin; Evans, Richard; Mitkov, Ruslan (Springer, 2017)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are no less readable than those produced more slowly through onerous unaided conversion, and that they are significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Interpreting correlations between citation counts and other indicators

      Thelwall, Mike (Springer, 2016-05-09)
      Altmetrics or other indicators for the impact of academic outputs are often correlated with citation counts in order to help assess their value. Nevertheless, there are no guidelines about how to assess the strengths of the correlations found. This is a problem because this value affects the conclusions that should be drawn. In response, this article uses experimental simulations to assess the correlation strengths to be expected under various different conditions. The results show that the correlation strength reflects not only the underlying degree of association but also the average magnitude of the numbers involved. Overall, the results suggest that due to the number of assumptions that must be made in practice it will rarely be possible to make a realistic interpretation of the strength of a correlation coefficient.
    • Interpreting social science link analysis research: A theoretical framework

      Thelwall, Mike (Wiley, 2006)
      Link analysis in various forms is now an established technique in many different subjects, reflecting the perceived importance of links and of the Web. A critical but very difficult issue is how to interpret the results of social science link analyses. It is argued that the dynamic nature of the Web, its lack of quality control, and the online proliferation of copying and imitation mean that methodologies operating within a highly positivist, quantitative framework are ineffective. Conversely, the sheer variety of the Web makes application of qualitative methodologies and pure reason very problematic to large-scale studies. Methodology triangulation is consequently advocated, in combination with a warning that the Web is incapable of giving definitive answers to large-scale link analysis research questions concerning social factors underlying link creation. Finally, it is claimed that although theoretical frameworks are appropriate for guiding research, a Theory of Link Analysis is not possible.
    • Is Medical Research Informing Professional Practice More Highly Cited? Evidence from AHFS DI Essentials in Drugs.com

      Thelwall, Mike; Kousha, Kayvan; Abdoli, Mahshid (Springer, 2017-02-21)
      Citation-based indicators are often used to help evaluate the impact of published medical studies, even though the research has the ultimate goal of improving human wellbeing. One direct way of influencing health outcomes is by guiding physicians and other medical professionals about which drugs to prescribe. A high profile source of this guidance is the AHFS DI Essentials product of the American Society of Health-System Pharmacists, which gives systematic information for drug prescribers. AHFS DI Essentials documents, which are also indexed by Drugs.com, include references to academic studies and the referenced work is therefore helping patients by guiding drug prescribing. This article extracts AHFS DI Essentials documents from Drugs.com and assesses whether articles referenced in these information sheets have their value recognised by higher Scopus citation counts. A comparison of mean log-transformed citation counts between articles that are and are not referenced in AHFS DI Essentials shows that AHFS DI Essentials references are more highly cited than average for the publishing journal. This suggests that medical research influencing drug prescribing is more cited than average.
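      The comparison of mean log-transformed citation counts mentioned above is a standard way to reduce the skew of citation data before averaging. A minimal sketch, with citation counts invented for illustration (not drawn from the paper's data):

```python
import math

# Citation counts are highly skewed, so means of ln(1 + citations)
# are compared instead of raw means. Toy data for two article groups.
referenced = [12, 30, 8, 55, 21]     # articles cited in the drug information sheets
not_referenced = [3, 10, 1, 7, 15]   # other articles from the same journal

def mean_log(counts):
    """Mean of ln(1 + c), which tames the long tail of citation counts."""
    return sum(math.log(1 + c) for c in counts) / len(counts)

print(mean_log(referenced) > mean_log(not_referenced))  # True for this toy data
```

      The `1 +` offset is the usual device for keeping zero-citation articles in the calculation.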
    • Language evolution and the spread of ideas on the Web: A procedure for identifying emergent hybrid word family members

      Thelwall, Mike; Price, Liz (Wiley, 2006)
      Word usage is of interest to linguists for its own sake as well as to social scientists and others who seek to track the spread of ideas, for example, in public debates over political decisions. The historical evolution of language can be analyzed with the tools of corpus linguistics through evolving corpora and the Web. But word usage statistics can only be gathered for known words. In this article, techniques are described and tested for identifying new words from the Web, focusing on the case when the words are related to a topic and have a hybrid form with a common sequence of letters. The results highlight the need to employ a combination of search techniques and show the wide potential of hybrid word family investigations in linguistics and social science.
    • Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions

      Taslimipoor, Shiva; Desantis, Anna; Cherchi, Manuela; Mitkov, Ruslan; Monti, Johanna (ceur-ws, 2016-12-05)
      This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.
    • Large-scale data harvesting for biographical data

      Plum, Alistair; Zampieri, Marcos; Orasan, Constantin; Wandl-Vogt, Eveline; Mitkov, Ruslan (CEUR, 2019-09-05)
      This paper explores automatic methods to identify relevant biography candidates in large databases, and to extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though they are considered by many to be identical resources, our results show that the data from Wikipedia and Wikidata differ in some cases, and that the two can be used in a complementary way, providing more data for the compilation of biographies.
    • Leveraging large corpora for translation using the Sketch Engine

      Moze, Sarah; Krek, Simon (Cambridge University Press, 2018)
    • Linguistic features of genre and method variation in translation: A computational perspective

      Lapshinova-Koltunski, Ekaterina; Zampieri, Marcos; Legallois, Dominique; Charnois, Thierry; Larjavaara, Meri (Mouton De Gruyter, 2018-04-09)
      In this contribution we describe the use of text classification methods to investigate genre and method variation in an English–German translation corpus. For this purpose we use linguistically motivated features representing texts using a combination of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis of the main differences between genres and methods of translation.
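      The combination described above — POS-tag n-gram features fed to a Laplace-smoothed Bayesian classifier — can be sketched as follows. The tag sequences, class labels, and two-class setup are invented for illustration; this is a generic multinomial Naive Bayes, not the paper's exact configuration.

```python
import math
from collections import Counter

def pos_ngrams(tags, n):
    """Part-of-speech tags combined into n-grams, as in the feature set."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.counts = {c: Counter() for c in self.classes}
        self.priors = Counter(labels)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        def log_prob(c):
            total = sum(self.counts[c].values())
            denom = total + len(self.vocab)  # add-one smoothing denominator
            lp = math.log(self.priors[c] / sum(self.priors.values()))
            for f in feats:
                lp += math.log((self.counts[c][f] + 1) / denom)
            return lp
        return max(self.classes, key=log_prob)

# Toy POS-tag sequences for two invented text categories.
train = [pos_ngrams(t.split(), 2) for t in
         ["DT NN VBZ DT NN", "DT JJ NN VBZ JJ",
          "PRP VBP RB JJ", "PRP VBD PRP"]]
labels = ["essay", "essay", "fiction", "fiction"]
clf = NaiveBayes().fit(train, labels)
print(clf.predict(pos_ngrams("DT NN VBZ JJ".split(), 2)))  # "essay"
```

      Laplace smoothing is what keeps an unseen n-gram from zeroing out a class probability, which matters because most POS n-grams occur in only a few training texts.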
    • Linguistic patterns of academic Web use in Western Europe

      Thelwall, Mike; Tang, Rong; Price, Liz (Springer, 2003)
      A survey of linguistic dimensions of Web site hosting and interlinking of the universities of sixteen European countries is described. The results show that English is the dominant language both for linking pages and for all pages. In a typical country approximately half the pages were in English and half in one or more national languages. Normalised interlinking patterns showed three trends: 1) international interlinking throughout Europe in English, and additionally in Swedish in Scandinavia; 2) linking between countries sharing a common language, and 3) countries extensively hosting international links in their own major languages. This provides evidence for the multilingual character of academic use of the Web in Western Europe, at least outside the UK and Eire. Evidence was found that Greece was significantly linguistically isolated from the rest of the EU but that outsiders Norway and Switzerland were not.
    • Linking Verb Pattern Dictionaries of English and Spanish

      Baisa, Vít; Moze, Sara; Renau, Irene (ELRA, 2016-05-24)
      The paper presents the first step in the creation of a new multilingual and corpus-driven lexical resource by means of linking existing monolingual pattern dictionaries of English and Spanish verbs. The two dictionaries were compiled through Corpus Pattern Analysis (CPA) – an empirical procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations found in corpus data. This paper provides a first look into a number of practical issues arising from the task of linking corresponding patterns across languages via both manual and automatic procedures. In order to facilitate manual pattern linking, we implemented a heuristic-based algorithm to generate automatic suggestions for candidate verb pattern pairs, which obtained 80% precision. Our goal is to kick-start the development of a new resource for verbs that can be used by language learners, translators, editors and the research community alike.
    • Long term productivity and collaboration in information science

      Thelwall, Mike; Levitt, Jonathan (Springer, 2016-07-02)
      Funding bodies have tended to encourage collaborative research because it is generally more highly cited than sole-author research. But a higher mean citation count for collaborative articles does not imply that collaborative researchers are in general more productive. This article assesses the extent to which research productivity varies with the number of collaborative partners for long-term researchers within three Web of Science subject areas: Information Science & Library Science, Communication and Medical Informatics. When using the whole number counting system, researchers who worked in groups of 2 or 3 were generally the most productive, in terms of producing the most papers and citations. However, when using fractional counting, researchers who worked in groups of 1 or 2 were generally the most productive. The findings need to be interpreted cautiously, however, because authors who produce few academic articles within a field may publish in other fields or leave academia and contribute to society in other ways.
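      The distinction between the two counting systems mentioned above is easy to state in code: whole counting credits every co-author with 1 per paper, while fractional counting divides each paper's single credit among its authors. The author names and paper list below are invented for illustration.

```python
from collections import defaultdict

# Whole counting: each co-author gets 1 credit per paper.
# Fractional counting: each co-author of an n-author paper gets 1/n.
papers = [["A"], ["A", "B"], ["B", "C", "D"], ["C", "D"]]

def productivity(papers, fractional=False):
    credit = defaultdict(float)
    for authors in papers:
        share = 1 / len(authors) if fractional else 1
        for a in authors:
            credit[a] += share
    return dict(credit)

print(productivity(papers))                   # whole: everyone gets 2.0
print(productivity(papers, fractional=True))  # fractional: A=1.5, B and C ~0.83
```

      The example shows how the two schemes can reorder a productivity ranking: under whole counting all four authors tie, while under fractional counting the mostly-solo author A comes out ahead.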
    • Mendeley readership altmetrics for medical articles: An analysis of 45 fields

      Wilson, Paul; Thelwall, Mike (Wiley Blackwell, 2015-05)
    • Methodologies for crawler based Web surveys

      Thelwall, Mike (MCB UP Ltd, 2002)
      There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well-known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
    • Microsoft Academic automatic document searches: accuracy for journal articles and suitability for citation analysis

      Thelwall, Mike (Elsevier, 2017-11-22)
      Microsoft Academic is a free academic search engine and citation index that is similar to Google Scholar but can be automatically queried. Its data is potentially useful for bibliometric analysis if it is possible to search effectively for individual journal articles. This article compares different methods to find journal articles in its index by searching for a combination of title, authors, publication year and journal name and uses the results for the widest published correlation analysis of Microsoft Academic citation counts for journal articles so far. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy to find articles with DOIs is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search for them by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% incorrect matches. The remaining articles seem to be mainly not indexed by Microsoft Academic or indexed with a different language version of their title. From the matches, Scopus citation counts and Microsoft Academic counts have an average Spearman correlation of 0.95, with the lowest for any single field being 0.63. Thus, Microsoft Academic citation counts are almost universally equivalent to Scopus citation counts for articles that are not recent but there are national biases in the results.
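      The headline statistic above is a Spearman correlation, i.e. a Pearson correlation computed on ranks rather than raw citation counts, which makes it robust to the heavy skew of citation data. A self-contained sketch follows; the citation counts are invented, not the study's data.

```python
def ranks(xs):
    """Average ranks (ties share their mean position), 1-based."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Invented citation counts for the same five articles in two indexes.
scopus = [10, 5, 0, 22, 3]
microsoft_academic = [12, 6, 1, 25, 2]
print(round(spearman(scopus, microsoft_academic), 2))
```

      Because the two toy lists order the articles identically, the sketch yields a rho of 1.0 despite the differing raw counts; real Scopus and Microsoft Academic counts disagree on some orderings, hence the 0.95 average reported above.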
    • Monitoring Twitter strategies to discover resonating topics: The case of the UNDP

      Thelwall, Mike; Cugelman, Brian (EPI - El Profesional de la información., 2017-08-02)
      Many organizations use social media to attract supporters, disseminate information and advocate change. Services like Twitter can theoretically deliver messages to a huge audience that would be difficult to reach by other means. This article introduces a method to monitor an organization’s Twitter strategy and applies it to tweets from United Nations Development Programme (UNDP) accounts. The Resonating Topic Method uses automatic analyses with free software to detect successful themes within the organization’s tweets, categorizes the most successful tweets, and analyses a comparable organization to identify new successful strategies. In the case of UNDP tweets from November 2014 to March 2015, the results confirm the importance of official social media accounts as well as those of high profile individuals and general supporters. Official accounts seem to be more successful at encouraging action, which is a critical aspect of social media campaigning. An analysis of Oxfam found a successful social media approach that the UNDP had not adopted, showing the value of analyzing other organizations to find potential strategy gaps.