• Handling cross and out-of-domain samples in Thai word segmentation

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2021-08-01)
      While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multiple task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.
    • Herramientas y recursos electrónicos para la traducción de la manipulación fraseológica: un estudio de caso centrado en el estudiante

      Hidalgo Ternero, Carlos Manuel; Corpas Pastor, Gloria (Ediciones Universidad de Salamanca, 2021-05-13)
      This article analyses a case study carried out with students of the course Traducción General «BA-AB» (II) - Inglés-Español / Español-Inglés, taught in the second semester of the second year of the Degree in Translation and Interpreting at the Universidad de Málaga. In a first phase, the students were taught how to make the most of different electronic documentation tools and resources (linguistic corpora, lexicographic resources or the web, among others) to create textual equivalences in cases where, as a result of interlingual phraseological anisomorphism, the creative modification of phraseological units (PUs) in the source text and the absence of one-to-one correspondences pose serious difficulties for the translation process. Thus, an initial training activity on translating creative uses of phraseological units was followed by a practical session in which the students had to deal with different cases of manipulation in the source text. The analysis of these results will show to what extent the different documentation resources help trainee translators overcome the challenge of phraseological manipulation.
    • How quickly do publications get read? The evolution of Mendeley reader counts for new articles

      Maflahi, Nabeil; Thelwall, Mike (Wiley-Blackwell, 2017-08-29)
      Within science, citation counts are widely used to estimate research impact but publication delays mean that they are not useful for recent research. This gap can be filled by Mendeley reader counts, which are valuable early impact indicators for academic articles because they appear before citations and correlate strongly with them. Nevertheless, it is not known how Mendeley readership counts accumulate within the year of publication, and so it is unclear how soon they can be used. In response, this paper reports a longitudinal weekly study of the Mendeley readers of articles in six library and information science journals from 2016. The results suggest that Mendeley readers accrue from when articles are first available online and continue to steadily build. For journals with large publication delays, articles can already have substantial numbers of readers by their publication date. Thus, Mendeley reader counts may even be useful as early impact indicators for articles before they have been officially published in a journal issue. If field normalised indicators are needed, then these can be generated when journal issues are published using the online first date.
    • Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing

      Mohamed, Emad; Sadat, Fatiha (Elsevier BV, 2014-11-08)
      Arabic is a highly inflected language and a morpho-syntactically complex language with many differences compared to several languages that are heavily studied. It may thus require good pre-processing as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under scarce- and large-resource conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
    • Hyperlinks as a data source for science mapping

      Harries, Gareth; Wilkinson, David; Price, Liz; Fairclough, Ruth; Thelwall, Mike (Sage, 2004)
      Hyperlinks between academic web sites, like citations, can potentially be used to map disciplinary structures and identify evidence of connections between disciplines. In this paper we classified a sample of links originating in three different disciplines: maths, physics and sociology. Links within a discipline were found to be different in character to links between pages in different disciplines. There were also disciplinary differences in both types of link. As a consequence, we argue that interpretations of web science maps covering multiple disciplines will need to be sensitive to the contexts of the links mapped.
    • Identification of multiword expressions: A fresh look at modelling and evaluation

      Taslimipoor, Shiva; Rohanian, Omid; Mitkov, Ruslan; Fazly, Afsaneh; Markantonatou, Stella; Ramisch, Carlos; Savary, Agata; Vincze, Veronika (Language Science Press, 2018-10-25)
    • Identification of translationese: a machine learning approach

      Ilisei, Iustina; Inkpen, Diana; Corpas Pastor, Gloria; Mitkov, Ruslan; Gelbukh, A (Springer, 2010)
      This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.
    • Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

      Evans, Richard; Orasan, Constantin (Cambridge University Press, 2018-10-31)
      This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different components reveals acceptable performance in rewriting sentences containing compound clauses but less accuracy when rewriting sentences containing nominally bound relative clauses. A detailed error analysis revealed that the major sources of error include inaccurate sign tagging, the relatively limited coverage of the rules used to rewrite sentences, and an inability to discriminate between various subtypes of clause coordination. Despite this, the system performed well in comparison with two baselines. This finding was reinforced by automatic estimations of the readability of system output and by surveys of readers’ opinions about the accuracy, accessibility, and meaning of this output.
    • Improving translation memory matching and retrieval using paraphrases

      Gupta, Rohit; Orasan, Constantin; Zampieri, Marcos; Vela, Mihaela; van Genabith, Josef; Mitkov, Ruslan (Springer Nature, 2016-11-02)
      Most of the current Translation Memory (TM) systems work on string level (character or word level) and lack semantic knowledge while matching. They use simple edit-distance calculated on surface-form or some variation on it (stem, lemma), which does not take into consideration any semantic aspects in matching. This paper presents a novel and efficient approach to incorporating semantic information in the form of paraphrasing in the edit-distance metric. The approach computes edit-distance while efficiently considering paraphrases using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics like BLEU and METEOR, we have carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER, HMETEOR, and carried out three rounds of subjective evaluations. Our results show that paraphrasing substantially improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
    • Incorporating word embeddings in unsupervised morphological segmentation

      Üstün, Ahmet; Can, Burcu (Cambridge University Press (CUP), 2020-07-10)
      We investigate the usage of semantic information for morphological segmentation since words that are derived from each other will remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP) by incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training purposes. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
    • Incremental adaptation using translation informations and post-editing analysis

      Blain, Frederic; Schwenk, Holger; Senellart, Jean (IWSLT, 2012-12-06)
      It is well known that statistical machine translation systems perform best when they are adapted to the task. In this paper we propose new methods to quickly perform incremental adaptation without the need to obtain word-by-word alignments from GIZA or similar tools. The main idea is to use an automatic translation as pivot to infer alignments between the source sentence and the reference translation, or user correction. We compared our approach to the standard method to perform incremental re-training. We achieve similar results in the BLEU score using fewer computational resources. Fast retraining is particularly interesting when we want to almost instantly integrate user feedback, for instance in a post-editing context or a machine-translation-assisted CAT tool. We also explore several methods to combine the translation models.
    • Inteliterm: in search of efficient terminology lookup tools for translators

      Corpas Pastor, G.; Durán-Muñoz, Isabel; Domínguez Vázquez, María José; Mirazo Balsa, Mónica; Valcárcel Riveiro, Carlos (De Gruyter, 2019-12-16)
    • Intelligent Natural Language Processing: Trends and Applications

      Orăsan, Constantin; Evans, Richard; Mitkov, Ruslan (Springer, 2017)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Intelligent text processing to help readers with autism

      Orăsan, C; Evans, R; Mitkov, R (Springer International Publishing, 2017-11-18)
      Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
    • Interpreting correlations between citation counts and other indicators

      Thelwall, Mike (Springer, 2016-05-09)
      Altmetrics or other indicators for the impact of academic outputs are often correlated with citation counts in order to help assess their value. Nevertheless, there are no guidelines about how to assess the strengths of the correlations found. This is a problem because this value affects the conclusions that should be drawn. In response, this article uses experimental simulations to assess the correlation strengths to be expected under various different conditions. The results show that the correlation strength reflects not only the underlying degree of association but also the average magnitude of the numbers involved. Overall, the results suggest that due to the number of assumptions that must be made in practice it will rarely be possible to make a realistic interpretation of the strength of a correlation coefficient.
    • Interpreting social science link analysis research: A theoretical framework

      Thelwall, Mike (Wiley, 2006)
      Link analysis in various forms is now an established technique in many different subjects, reflecting the perceived importance of links and of the Web. A critical but very difficult issue is how to interpret the results of social science link analyses. It is argued that the dynamic nature of the Web, its lack of quality control, and the online proliferation of copying and imitation mean that methodologies operating within a highly positivist, quantitative framework are ineffective. Conversely, the sheer variety of the Web makes application of qualitative methodologies and pure reason very problematic to large-scale studies. Methodology triangulation is consequently advocated, in combination with a warning that the Web is incapable of giving definitive answers to large-scale link analysis research questions concerning social factors underlying link creation. Finally, it is claimed that although theoretical frameworks are appropriate for guiding research, a Theory of Link Analysis is not possible.
    • Introduction

      Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
    • Is Medical Research Informing Professional Practice More Highly Cited? Evidence from AHFS DI Essentials in Drugs.com

      Thelwall, Mike; Kousha, Kayvan; Abdoli, Mahshid (Springer, 2017-02-21)
      Citation-based indicators are often used to help evaluate the impact of published medical studies, even though the research has the ultimate goal of improving human wellbeing. One direct way of influencing health outcomes is by guiding physicians and other medical professionals about which drugs to prescribe. A high profile source of this guidance is the AHFS DI Essentials product of the American Society of Health-System Pharmacists, which gives systematic information for drug prescribers. AHFS DI Essentials documents, which are also indexed by Drugs.com, include references to academic studies and the referenced work is therefore helping patients by guiding drug prescribing. This article extracts AHFS DI Essentials documents from Drugs.com and assesses whether articles referenced in these information sheets have their value recognised by higher Scopus citation counts. A comparison of mean log-transformed citation counts between articles that are and are not referenced in AHFS DI Essentials shows that AHFS DI Essentials references are more highly cited than average for the publishing journal. This suggests that medical research influencing drug prescribing is more cited than average.
    • Joint learning of morphology and syntax with cross-level contextual information flow

      Can Buglalilar, Burcu; Aleçakır, Hüseyin; Manandhar, Suresh; Bozşahin, Cem (Cambridge University Press, 2022-06-30)
      We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and syntactic parsing onto dependencies, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism at horizontal flow. Our model extends the work of Nguyen and Verspoor (2018) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. The primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for individual tasks. As this is one of the earlier efforts in joint modeling of syntax and morphology along with dependencies, we discuss prospective guidelines for future comparison.
    • “Keep it simple!”: an eye-tracking study for exploring complexity and distinguishability of web pages for people with autism

      Eraslan, Sukru; Yesilada, Yeliz; Yaneva, Victoria; Ha, Le An (Springer Science and Business Media LLC, 2020-02-03)
      A major limitation of the internationally well-known standard web accessibility guidelines for people with cognitive disabilities is that they have not been empirically evaluated by using relevant user groups. Instead, they aim to anticipate issues that may arise following the diagnostic criteria. In this paper, we address this problem by empirically evaluating two of the most popular guidelines related to the visual complexity of web pages and the distinguishability of web-page elements. We conducted a comparative eye-tracking study with 19 verbal and highly independent people with autism and 19 neurotypical people on eight web pages with varying levels of visual complexity and distinguishability, with synthesis and browsing tasks. Our results show that people with autism have a higher number of fixations and make more transitions with synthesis tasks. When we consider the number of elements which are not related to given tasks, our analysis shows that they look at more irrelevant elements while completing the synthesis task on visually complex pages or on pages whose elements are not easily distinguishable. To the best of our knowledge, this is the first empirical behavioural study which evaluates these guidelines by showing that the high visual complexity of pages or the low distinguishability of page elements causes a non-equivalent experience for people with autism.