• Register-Specific Collocational Constructions in English and Spanish: A Usage-Based Approach

      Pastor, Gloria Corpas (Science Publications, 2015-03-01)
      Constructions are usage-based, conventionalised pairings of form and function within a cline of complexity and schematisation. Most research within Construction Grammar has focused on the monolingual description of schematic constructions, mainly in English but, to a lesser extent, in other languages as well. By contrast, very few constructional analyses have been carried out across languages. In this study we focus on a type of partially substantive construction from the point of view of contrastive analysis and translation; to the best of our knowledge, this is one of the first studies of its kind. The first half of the article lays down the theoretical foundations of the study and introduces Construction Grammar, as well as other formalisms used in the literature, in order to provide a construal account of collocations, a pervasive phenomenon in language. The experimental part describes a case study of V NP collocations with disease/enfermedad in comparable corpora in English and Spanish, both in the general domain and in the specialised medical domain. A comparative analysis of these constructions across domains and languages is provided in terms of token-type ratio (constructional restriction rate), lexical function, type of determiner, frequency ranking of the verbal collocate and domain specificity of collocates, among others. New measures to assess construal bondness are put forward (lexical filledness rate and individual productivity rate), and special attention is paid to register-dependent equivalent semantic-functional counterparts in English and Spanish and to mismatches.
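As a rough illustration of the token-type ratio mentioned in this abstract, the sketch below counts verbal collocates of a node noun. The function name `token_type_ratio` and the sample data are hypothetical, and the assumption that a type is one distinct verb lemma is ours; the paper's own operationalisation of the constructional restriction rate may differ.

```python
from collections import Counter


def token_type_ratio(collocates):
    """Return tokens / types for a list of verbal collocates.

    Assumption (not from the paper): a token is one occurrence of a
    V NP collocation and a type is one distinct verb lemma.
    """
    counts = Counter(collocates)
    if not counts:
        raise ValueError("no collocates given")
    return sum(counts.values()) / len(counts)


# Invented sample: verb lemmas observed with the node noun "disease".
sample = ["cause", "treat", "cause", "prevent", "treat", "cause"]
print(token_type_ratio(sample))  # 6 tokens over 3 types
```

Under this reading, a higher ratio means the same few verbs recur with the noun, which is one way to interpret a more restricted construction.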
    • BDAFRICA: diseño e implementación de una base de datos de la literatura poscolonial africana publicada en España

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Universidad de Valladolid, 2016-01-10)
      This paper shows that no repository exists that includes the postcolonial African authors published to date in Spain and that would therefore allow quantitative and qualitative studies of the impact of this literature to be carried out with the desired precision. This is a gap both for academic research and for the publishing sector when it comes to analysing selection and reception trends in the market. Given this situation, the main aim of this paper is to design and implement a database, based on MySQL and delimited by very specific parameters, that records all works by African authors published in Spanish in Spain between 1972 (the year in which Spain joined the ISBN system) and 2014. After establishing design criteria and specific compilation protocols, the methodological development was divided into four phases: collection, storage, processing and dissemination of the data. The BDÁFRICA database thus achieves a twofold objective: on the one hand, it provides researchers with reliable data on which to base their studies and, on the other, it would make it possible to offer, for the first time, statistical data on the evolution of the publication of works by African authors in Spain over the last 42 years.
    • Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

      Costa, Hernani; Muñoz, Isabel Dúran; Pastor, Gloria Corpas; Mitkov, Ruslan (Universidade de Vigo & Universidade do Minho, 2016-07-22)
      Decisions taken before a comparable corpus is compiled have a great impact on how it will subsequently be built and analysed. Several variables and external criteria are normally followed when building a corpus, but little research has been done on its internal distribution of textual similarity or on its qualitative advantages for research. In an attempt to fill this gap, this article presents a simple yet efficient methodology capable of measuring the degree of internal similarity of a corpus. To this end, the proposed methodology uses several natural language processing techniques and various statistical methods, in a successful attempt to assess the degree of similarity between documents. Our results show that a list of common entities and a set of distributional similarity measures are sufficient not only to describe and assess the degree of similarity between the documents in a comparable corpus, but also to rank them according to their degree of similarity and, consequently, to improve the quality of the corpus by eliminating irrelevant documents.
    • Translating English verbal collocations into Spanish: On distribution and other relevant differences related to diatopic variation

      Corpas Pastor, Gloria (John Benjamins Publishing Company, 2015-12-21)
      Language varieties should be taken into account in order to enhance fluency and naturalness of translated texts. In this paper we will examine the collocational verbal range for prima-facie translation equivalents of words like decision and dilemma, which in both languages denote the act or process of reaching a resolution after consideration, resolving a question or deciding something. We will be mainly concerned with diatopic variation in Spanish. To this end, we set out to develop a giga-token corpus-based protocol which includes a detailed and reproducible methodology sufficient to detect collocational peculiarities of transnational languages. To our knowledge, this is one of the first observational studies of this kind. The paper is organised as follows. Section 1 introduces some basic issues about the translation of collocations against the background of languages’ anisomorphism. Section 2 provides a feature characterisation of collocations. Section 3 deals with the choice of corpora, corpus tools, nodes and patterns. Section 4 covers the automatic retrieval of the selected verb + noun (object) collocations in general Spanish and the co-existing national varieties. Special attention is paid to comparative results in terms of similarities and mismatches. Section 5 presents conclusions and outlines avenues of further research.
    • Recepción en España de la literatura africana en lengua inglesa: generación de datos estadísticos con la base de datos bibliográfica especializada BDÁFRICA

      Fernández Ruiz, MR; Corpas Pastor, G; Seghiri, M (Fundacio per la Universitat Oberta de Catalunya, 2018-11-20)
      This article examines the reception of African literature in English in Spain on the basis of BDÁFRICA, a bibliographic database that records works by authors born in Africa and published in Spanish and in Spain between 1972 and 2014. It offers a critical reflection on the difficulties of defining African literature as an object of study, owing to its complexity and heterogeneity. It also proposes a concise historiographical overview of how the canon of this literature has been shaped from the West. In addition, it demonstrates the lack of statistical studies on the reception of African literature in English in Spain. In response to this need, the aim of the article is to detail and analyse the previously unpublished statistical data provided by the database, adopting a descriptive methodology. The results of this study, which provides reliable and novel quantitative and qualitative data, are original insofar as they reflect and point out the problems of translating African literature in English in Spain. BDÁFRICA, which is free of charge and available online, is intended as a resource and a source that stimulates the development of research on postcolonial literature in Spain. This specialised bibliographic database is undoubtedly a very valuable tool, especially for researchers, translators and publishers interested in African literature.
    • El EEES y la competencia tecnológica: los nuevos grados en Traducción

      Corpas Pastor, Gloria; Muñoz, María (Universidad de Las Palmas de Gran Canaria, Servicio de Publicaciones y Difusión Científica, 2015-04-23)
      This paper takes as its starting point the research described in Muñoz Ramos (2012). In it we briefly summarise the origin and evolution of the European Higher Education Area (EEES) up to the present day and its impact on Translation studies. We describe the close interweaving between the founding principles of the Bologna Process and Information and Communication Technologies (ICT), which are positioned as the ideal companions for achieving the objectives of the Bologna Declaration. Finally, we show how these two points converge in the new Spanish degrees in Translation, which conform to the EEES and find in translation technology subjects the cornerstone of their raison d'être.
    • Nine terminology extraction Tools: Are they useful for translators?

      Costa, Hernani; Zaretskaya, Anna; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (MultiLingual, 2016-04-01)
      Terminology extraction tools (TETs) have become an indispensable resource in education, research and business. Today, users can find a great variety of terminology extraction tools, each offering different features. Among many other areas, these tools are especially helpful in the professional translation setting. We do not know, however, whether the existing tools have all the features necessary for this kind of work. In search of an answer, we looked at nine selected tools available on the market to find out whether they provide the features translators value most.
    • Corpus, Tecnología y Traducción

      Corpas Pastor, Gloria; Casas, M; García Antuña, M (Servicio de Publicaciones de la Universidad de Cádiz, 2012-04-25)
      It is no coincidence that Corpus Linguistics flourished especially in the European context. Let us recall that research in language technologies (or "language industries") has been the hallmark of European science policy. [1] From there, research in language technologies has been promoted as a way of safeguarding, on the one hand, Europe's cultural diversity and multilingualism and, at the same time, of overcoming the barriers and difficulties this entails for achieving the goals common to all Europeans. Multilingualism, multiculturality, translation and technologies are inherent features of today's European society. It could also be said that these defining characteristics have contributed decisively to the development of applications and language resources aimed at supporting European social policies and their ramifications in trade, education and research. Although language technologies and corpora made their way from a very early stage into the theoretical and applied branches of Linguistics, it has taken several decades for translators and interpreters to finally jump on this bandwagon, which was already full of researchers from other related disciplines. In this paper we give a brief overview of what the incorporation of such resources and tools has meant for the field of translation and interpreting, with special reference to the technologies specific to the sec- [1] For an overview of European science policy on language technologies, see Corpas Pastor (2008).
    • Corpora for Computational Linguistics

      Orăsan, Constantin; Ha, Le An; Evans, Richard; Hasler, Laura; Mitkov, Ruslan (Universidade Federal de Santa Catarina, 2007-01-01)
      Since the mid-1990s, corpora have become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation and term extraction. Their influence on other fields is also briefly discussed.
    • Computing Happiness from Textual Data

      Mohamed, Emad; Mostafa, Safa (MDPI, 2019-07-03)
      In this paper, we use a corpus of about 100,000 happy moments written by people of different genders, marital statuses, parenthood statuses, and ages to explore the following questions: Are there differences between men and women, married and unmarried individuals, parents and non-parents, and people of different age groups in terms of their causes of happiness and how they express happiness? Can gender, marital status, parenthood status and/or age be predicted from textual data expressing happiness? The first question is tackled in two steps: first, we transform the happy moments into a set of topics, lemmas, part-of-speech sequences, and dependency relations; then, we use each set as predictors in multi-variable binary and multinomial logistic regressions to rank these predictors in terms of their influence on each outcome variable (gender, marital status, parenthood status and age). For the prediction task, we use character, lexical, grammatical, semantic, and syntactic features in a machine learning document classification approach. The classification algorithms used include logistic regression, gradient boosting, and fastText. Our results show that textual data expressing moments of happiness can be quite beneficial in understanding the “causes of happiness” for different social groups, and that social characteristics like gender, marital status, parenthood status, and, to some extent, age can be successfully predicted from such textual data. This research aims to bring together elements from philosophy and psychology to be examined by computational corpus linguistics methods in a way that promotes the use of Natural Language Processing for the Humanities.
    • Exploiting Data-Driven Hybrid Approaches to Translation in the EXPERT Project

      Orăsan, Constantin; Escartín, Carla Parra; Torres, Lianet Sepúlveda; Barbu, Eduard; Ji, Meng; Oakes, Michael (Cambridge University Press, 2019-06-13)
      Technologies have transformed the way we work, and this is also applicable to the translation industry. In the past thirty to thirty-five years, professional translators have experienced an increased technification of their work. Barely thirty years ago, a professional translator would not have received a translation assignment attached to an e-mail or via an FTP and yet, for the younger generation of professional translators, receiving an assignment by electronic means is the only reality they know. In addition, as pointed out in several works such as Folaron (2010) and Kenny (2011), professional translators now have a myriad of tools available to use in the translation process.
    • GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks

      Taslimipoor, Shiva; Rohanian, Omid; Može, Sara (Association for Computational Linguistics, 2019-06-06)
      This paper describes the system submitted to the SemEval 2019 shared task 1 ‘Cross-lingual Semantic Parsing with UCCA’. We rely on the semantic dependency parse trees provided in the shared task which are converted from the original UCCA files and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CONLLU format of the input data and is best suited for semantic dependency parsing.
    • Adults with High-functioning Autism Process Web Pages With Similar Accuracy but Higher Cognitive Effort Compared to Controls

      Yaneva, Victoria; Ha, Le; Eraslan, Sukru; Yesilada, Yeliz (ACM, 2019-05-13)
      To accommodate the needs of web users with high-functioning autism, a designer's only option at present is to rely on guidelines that: i) have not been empirically evaluated and ii) do not account for the different levels of autism severity. Before designing effective interventions, we need to obtain an empirical understanding of the aspects that specific user groups need support with. This has not yet been done for web users at the high ends of the autism spectrum, as often they appear to execute tasks effortlessly, without facing barriers related to their neurodiverse processing style. This paper investigates the accuracy and efficiency with which high-functioning web users with autism and a control group of neurotypical participants obtain information from web pages. Measures include answer correctness and a number of eye-tracking features. The results indicate similar levels of accuracy for the two groups at the expense of efficiency for the autism group, showing that the autism group invests more cognitive effort in order to achieve the same results as their neurotypical counterparts.
    • Evaluation of a cross-lingual Romanian-English multi-document summariser

      Orǎsan, C; Chiorean, OA (European Language Resources Association, 2008-01-01)
      The rapid growth of the Internet means that more information is available than ever before. Multilingual multi-document summarisation offers a way to access this information even when it is not in a language spoken by the reader by extracting the GIST from related documents and translating it automatically. This paper presents an experiment in which Maximal Marginal Relevance (MMR), a well known multi-document summarisation method, is used to produce summaries from Romanian news articles. A task-based evaluation performed on both the original summaries and on their automatically translated versions reveals that they still contain a significant portion of the important information from the original texts. However, direct evaluation of the automatically translated summaries shows that they are not very legible and this can put off some readers who want to find out more about a topic.
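For readers unfamiliar with Maximal Marginal Relevance, the following is a minimal sketch of the greedy selection step, assuming bag-of-words cosine similarity and a trade-off weight `lam`. The function names, the similarity measure and the toy sentences are our own assumptions for illustration; the abstract does not describe the configuration actually used in the experiment.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedily pick k sentences by Maximal Marginal Relevance.

    Each step picks the candidate maximising
    lam * sim(candidate, query) - (1 - lam) * max sim(candidate, selected).
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    qvec = Counter(query.lower().split())
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], qvec)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Invented toy input: two identical sentences and one novel sentence.
docs = [
    "floods hit the city",
    "floods hit the city",
    "rescue teams arrived in the city",
]
print(mmr_select(docs, "floods city", k=2, lam=0.5))
```

With `lam=0.5` the redundancy penalty outweighs the duplicate's relevance, so the second pick is the novel sentence; a larger `lam` favours relevance and may select the duplicate instead.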
    • New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.
    • Predicting reading difficulty for readers with autism spectrum disorder

      Evans, Richard; Yaneva, Victoria; Temnikova, Irina (European Language Resources Association, 2016-05-23)
      People with autism experience various reading comprehension difficulties, which is one explanation for the early school dropout, reduced academic achievement and lower levels of employment in this population. To overcome this issue, content developers who want to make their textbooks, websites or social media accessible to people with autism (and thus for every other user) but who are not necessarily experts in autism, can benefit from tools which are easy to use, which can assess the accessibility of their content, and which are sensitive to the difficulties that autistic people might have when processing texts/websites. In this paper we present a preliminary machine learning readability model for English developed specifically for the needs of adults with autism. We evaluate the model on the ASD corpus, which has been developed specifically for this task and is, so far, the only corpus for which readability for people with autism has been evaluated. The results show that our model outperforms the baseline, which is the widely used Flesch-Kincaid Grade Level formula.
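The Flesch-Kincaid Grade Level baseline named in this abstract is a fixed formula over word, sentence and syllable counts. The sketch below applies the standard coefficients; the syllable counter is a crude vowel-group heuristic of our own, and the function names are hypothetical, so scores from `fk_grade_text` will differ from those of tools that use a pronunciation dictionary.

```python
import re


def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fk_grade_text(text):
    """Estimate the grade level of a text with naive tokenisation.

    Assumptions: words are runs of letters/apostrophes, sentences end
    at '.', '!' or '?', and syllables are runs of vowels (minimum one
    per word). These heuristics are illustrative only.
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return fk_grade(len(words), sentences, syllables)


# 100 words, 5 sentences, 150 syllables.
print(round(fk_grade(100, 5, 150), 2))
```

Keeping the formula separate from the counting heuristics makes it easy to swap in better tokenisation or syllabification without touching the coefficients.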
    • Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing

      Mohamed, Emad; Sadat, Fatiha (Elsevier BV, 2014-11-08)
      Arabic is a highly inflected and morpho-syntactically complex language that differs in many ways from several heavily studied languages. It may thus require good pre-processing, as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under scarce- and large-resource conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
    • What jihad questions do Muslims ask?

      Emad Mohamed; Bakinaz Abdalla (Indiana University Press, 2017-05-01)
      Using digital humanities tools and methods, we extract, classify, and analyze 1,006 jihad fatwas from a corpus of 164,000 online fatwas. We use the questions and page hits to rank clusters of fatwas in order to discover what jihad questions Muslims ask, what jihad issues interest Muslims the most, and what the targets of jihad may be. We focus more on the questions than the answers, since it is the questions that give us a window into what may be called the “Muslim collective mind.” The results show that jihad questions are interwoven with several key topics, from performance of prayers to expiation for homosexuality. While the Prophet Muhammad's military expeditions were the most asked about and most viewed category, since they provide a model of what jihad is, the second most important category was concubinage. When there was a specific target, Jews were found in 73% of the questions.
    • The Role of Corpus Pattern Analysis in Machine Translation Evaluation

      Bechara, Hanna; Moze, Sara; El-Maarouf, Ismail; Orasan, Constantin; Hanks, Patrick; Mitkov, Ruslan (Tradulex, 2015)
      This paper takes a preliminary look at the relation between verb pattern matches in the Pattern Dictionary of English Verbs (PDEV) and translation quality through a qualitative analysis of human-ranked sentences from 5 different machine translation systems. The purpose of the analysis is not only to determine whether verbs in the automatic translations and their immediate contexts match any pre-existing semanto-syntactic pattern in PDEV, but also to establish links between hypothesis sentences and the verbs in the reference translation. It attempts to answer the question of whether or not the semantic and syntactic information captured by Corpus Pattern Analysis (CPA) can indicate whether a sentence is a “good” translation. Two human annotators manually identified the occurrence of patterns in 50 translations and indicated whether these patterns match any identified pattern in the corresponding reference translation. Results indicate that CPA can be used to distinguish between well-formed and ill-formed sentences.
    • Linking Verb Pattern Dictionaries of English and Spanish

      Baisa, Vít; Moze, Sara; Renau, Irene (ELRA, 2016-05-24)
      The paper presents the first step in the creation of a new multilingual and corpus-driven lexical resource by means of linking existing monolingual pattern dictionaries of English and Spanish verbs. The two dictionaries were compiled through Corpus Pattern Analysis (CPA) – an empirical procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations found in corpus data. This paper provides a first look into a number of practical issues arising from the task of linking corresponding patterns across languages via both manual and automatic procedures. In order to facilitate manual pattern linking, we implemented a heuristic-based algorithm to generate automatic suggestions for candidate verb pattern pairs, which obtained 80% precision. Our goal is to kick-start the development of a new resource for verbs that can be used by language learners, translators, editors and the research community alike.