    • Unsupervised quality estimation for neural machine translation

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Guzmán, Francisco; Fishel, Mark; Aletras, Nikolaos; Chaudhary, Vishrav; Specia, Lucia (Association for Computational Linguistics, 2020-09-01)
      Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
    • Incorporating word embeddings in unsupervised morphological segmentation

      Üstün, Ahmet; Can, Burcu (Cambridge University Press (CUP), 2020-07-10)
      We investigate the usage of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP), incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training purposes. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
    • Transfer learning for Turkish named entity recognition on noisy text

      Kagan Akkaya, E; Can, B (Cambridge University Press (CUP), 2020-01-28)
      In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this regard, we utilize character-level, character n-gram-level, morpheme-level, and orthographic character-level word representations. Since noisy data with NER annotation are scarce for Turkish, we introduce a transfer learning model in order to learn infrequent entity types as an extension to the Bi-LSTM-CRF architecture by incorporating an additional conditional random field (CRF) layer that is trained on a larger (but formal) text and a noisy text simultaneously. This allows us to learn from both formal and informal/noisy text, thus further improving the performance of our model for rarely seen entity types. We experimented on Turkish as a morphologically rich language and English as a relatively morphologically poor language. We obtained an entity-level F1 score of 67.39% on Turkish noisy data and 45.30% on English noisy data, which outperforms the current state-of-the-art models on noisy text. The English scores are lower than the Turkish scores because of the intense sparsity in the data introduced by the user writing styles. The results prove that using subword information significantly contributes to learning latent features for morphologically rich languages.
    • Methods and algorithms for unsupervised learning of morphology

      Can, Burcu; Manandhar, Suresh (Springer, 2014-12-31)
      This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods, covering methods based on MDL (minimum description length), MLE (maximum likelihood estimation), and MAP (maximum a posteriori), as well as parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided, along with a summary of evaluation results on the Morpho Challenge evaluations.
    • Qualitative analysis of post-editing for high quality machine translation

      Blain, Frédéric; Senellart, Jean; Schwenk, Holger; Plitt, Mirko; Roturier, Johann; AAMT, Asia-Pacific Association for Machine Translation (Asia-Pacific Association for Machine Translation, 2011-09-30)
      In the context of massive adoption of Machine Translation (MT) by human localization services in Post-Editing (PE) workflows, we analyze the activity of post-editing high-quality translations through a novel PE analysis methodology. We define and introduce a new unit for evaluating post-editing effort, the Post-Editing Action (PEA), for which we provide human evaluation guidelines and propose a process to automatically evaluate these PEAs. We applied this methodology to data sets from two technologically different MT systems. In that context, we show that more than 35% of the remaining effort can be saved by introducing global PEA and edit propagation.
    • Collaborative machine translation service for scientific texts

      Lambert, Patrik; Senellart, Jean; Romary, Laurent; Schwenk, Holger; Zipser, Florian; Lopez, Patrice; Blain, Frederic (Association for Computational Linguistics, 2012-04-30)
      French researchers are frequently required to translate into French the descriptions of their work published in English. At the same time, the need for French people to access articles in English, or for international researchers to access theses or papers in French, is poorly served by generic translation tools. We propose a demonstration of an end-to-end tool integrated in the HAL open archive that enables efficient translation of scientific texts. This tool can give translation suggestions adapted to the scientific domain, improving the BLEU score of a generic system by more than 10 points. It also provides a post-editing service which captures user post-editing data that can be used to incrementally improve the translation engines. It is thus helpful for users who need to translate or access scientific texts.
    • Unsupervised joint PoS tagging and stemming for agglutinative languages

      Bolucu, Necva; Can, Burcu (Association for Computing Machinery (ACM), 2019-01-25)
      The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.
    • Automatic translation of scientific documents in the HAL archive

      Lambert, Patrik; Schwenk, Holger; Blain, Frederic (European Language Resources Association (ELRA), 2012-05-31)
      This paper describes the development of a statistical machine translation system between French and English for scientific papers. This system will be closely integrated into the French HAL open archive, a collection of more than 100,000 scientific papers. We describe the creation of in-domain parallel and monolingual corpora, the development of a domain-specific translation system with the created resources, and its adaptation using monolingual resources only. These techniques allowed us to improve a generic system by more than 10 BLEU points.
    • Incremental adaptation using translation informations and post-editing analysis

      Blain, Frederic; Schwenk, Holger; Senellart, Jean (IWSLT, 2012-12-06)
      It is well known that statistical machine translation systems perform best when they are adapted to the task. In this paper we propose new methods to quickly perform incremental adaptation without the need to obtain word-by-word alignments from GIZA or similar tools. The main idea is to use an automatic translation as a pivot to infer alignments between the source sentence and the reference translation, or user correction. We compared our approach to the standard method for incremental re-training. We achieve similar BLEU scores using fewer computational resources. Fast retraining is particularly interesting when we want to integrate user feedback almost instantly, for instance in a post-editing context or a machine-translation-assisted CAT tool. We also explore several methods to combine the translation models.
    • The Matecat Tool

      Federico, Marcello; Bertoldi, Nicola; Cettolo, Mauro; Negri, Matteo; Turchi, Marco; Trombetti, Marco; Cattelan, Alessandro; Farina, Antonio; Lupinetti, Domenico; Marines, Andrea; et al. (Dublin City University and Association for Computational Linguistics, 2014-08-31)
      We present a new web-based CAT tool providing translators with a professional work environment, integrating translation memories, terminology bases, concordancers, and machine translation. The tool is developed entirely as open source software and has already been successfully deployed for business, research and education. The MateCat Tool is probably the best open source platform available today for investigating, integrating, and evaluating under realistic conditions the impact of new machine translation technology on human post-editing.
    • Project adaptation over several days

      Blain, Frederic; Hazem, Amir; Bougares, Fethi; Barrault, Loic; Schwenk, Holger (Johannes Gutenberg University of Mainz, 2015-01-30)
    • Continuous adaptation to user feedback for statistical machine translation

      Blain, Frédéric; Bougares, Fethi; Hazem, Amir; Barrault, Loïc; Schwenk, Holger (Association for Computational Linguistics, 2015-06-30)
      This paper reports detailed experimental feedback on different approaches to adapting a statistical machine translation system to a targeted translation project, using only small amounts of parallel in-domain data. The experiments were performed by professional translators under realistic working conditions using a computer-assisted translation tool. We analyze the influence of these adaptations on translator productivity and on the overall post-editing effort. We show that significant improvements can be obtained by using the presented adaptation techniques.
    • Characters or morphemes: how to represent words?

      Üstün, Ahmet; Kurfalı, Murathan; Can, Burcu (Association for Computational Linguistics, 2018)
      In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units affects the quality of the word representations positively. We introduce a morpheme-based model and compare it against word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism. We performed experiments on Turkish as a morphologically rich language and English with a comparably poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages than character-based and character n-gram level models, since the morphemes help to incorporate more syntactic knowledge in learning, which makes morpheme-based models better at syntactic tasks.
    • SHEF-NN: translation quality estimation with neural networks

      Shah, Kashif; Logacheva, Varvara; Paetzold, G; Blain, Frederic; Beck, Daniel; Bougares, Fethi; Specia, Lucia (Association for Computational Linguistics, 2015-09-30)
      We describe our systems for Tasks 1 and 2 of the WMT15 Shared Task on Quality Estimation. Our submissions use (i) a continuous space language model to extract additional features for Task 1 (SHEF-GP, SHEF-SVM), (ii) a continuous bag-of-words model to produce word embeddings as features for Task 2 (SHEF-W2V) and (iii) a combination of features produced by QuEst++ and a feature produced with word embedding models (SHEF-QuEst++). Our systems outperform the baseline as well as many other submissions. The results are especially encouraging for Task 2, where our best performing system (SHEF-W2V) only uses features learned in an unsupervised fashion.
    • Phrase level segmentation and labelling of machine translation errors

      Blain, Frédéric; Logacheva, Varvara; Specia, Lucia; Calzolari, Nicoletta (Conference Chair); Choukri, Khalid; Declerck, Thierry; Grobelnik, Marko; Maegaard, Bente; Mariani, Joseph; Moreno, Asuncion; et al. (European Language Resources Association (ELRA), 2016-05)
      This paper presents our work towards a novel approach for Quality Estimation (QE) of machine translation based on sequences of adjacent words, the so-called phrases. This new level of QE aims to provide a natural balance between QE at word level and at sentence level, which are respectively too fine-grained and too coarse for some applications. However, phrase-level QE implies an intrinsic challenge: how to segment a machine translation into sequences of words (contiguous or not) that represent an error. We discuss three possible segmentation strategies to automatically extract erroneous phrases. We evaluate these strategies against phrase-level annotations produced by humans, using a new dataset collected for this purpose.
    • USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box

      Aker, Ahmet; Blain, Frederic; Duque, Andres; Fomicheva, Marina; Seva, Jurica; Shah, Kashif; Beck, Daniel (Association for Computational Linguistics, 2016-06)
      Aker, A., Blain, F., Duque, A., Fomicheva, M. et al. (2016) USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box. In, Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Bethard, S., Carpuat, M., Cer, D., Jurgens, D. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 609-613.
    • The QT21/HimL combined machine translation system

      Peter, Jan-Thorsten; Alkhouli, Tamer; Ney, Hermann; Huck, Matthias; Braune, Fabienne; Fraser, Alexander; Tamchyna, Aleš; Bojar, Ondřej; Haddow, Barry; Sennrich, Rico; et al. (Association for Computational Linguistics, 2016-08)
      Peter, J.-T., Alkhouli, T., Ney, H., Huck, M. et al. (2016) The QT21/HimL combined machine translation system. In, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 344-355.
    • USFD’s phrase-level quality estimation systems

      Logacheva, Varvara; Blain, Frédéric; Specia, Lucia (Association for Computational Linguistics, 2016-08)
      Logacheva, V., Blain, F. and Specia, L. (2016) USFD’s phrase-level quality estimation systems. In, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 800-805.
    • Guiding neural machine translation decoding with external knowledge

      Chatterjee, Rajen; Negri, Matteo; Turchi, Marco; Federico, Marcello; Specia, Lucia; Blain, Frédéric (Association for Computational Linguistics, 2017-09)
      Chatterjee, R., Negri, M., Turchi, M., Federico, M. et al. (2017) Guiding neural machine translation decoding with external knowledge. In, Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 157-168.
    • Sheffield systems for the English-Romanian translation task

      Blain, Frédéric; Song, Xingyi; Specia, Lucia (Association for Computational Linguistics, 2016-08)