• A cascaded unsupervised model for PoS tagging

      Bölücü, Necva; Can, Burcu (ACM, 2021-12-31)
      Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing (NLP). It assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g. dependency parsing) and thereby extract the meaning of the sentence (e.g. semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One widely used family of methods for the tagging problem is log-linear models. Initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thereby transfer knowledge between two different unsupervised models to improve the PoS tagging results, with the log-linear model benefiting from the Bayesian model’s expertise. We present results for Turkish as a morphologically rich language and for English as a comparably morphologically poor language in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
    • Constructional idioms of ‘insanity’ in English and Spanish: A corpus-based study

      Corpas Pastor, Gloria (Elsevier, 2021-02-10)
      This paper presents a corpus-based study of constructions in English and Spanish, with a special emphasis on equivalent semantic-functional counterparts, and potential mismatches. Although usage/corpus-based Construction Grammar (CxG) has attracted much attention in recent years, most studies have dealt exclusively with monolingual constructions. In this paper we will focus on two constructions that represent conventional ways to express ‘insanity’ in both languages. The analysis will cover grammatical, semantic and informative aspects in order to establish a multi-linguistic prototype of the constructions. To that end, data from several giga-token corpora of contemporary spoken English and Spanish (parallel and comparable) have been selected. This study advances the explanatory potential of constructional idioms for the study of idiomaticity, variability and cross-language analysis. In addition, relevant findings on the dialectal distribution of certain idiom features across both languages and their national varieties are also reported.
    • Attention: there is an inconsistency between android permissions and application metadata!

      Alecakir, Huseyin; Can, Burcu; Sen, Sevil (Springer Science and Business Media LLC, 2021-01-07)
      Since mobile applications make our lives easier, application markets offer a large number of applications customized for our needs. While these markets provide a platform for downloading applications, they are also used by malware developers to distribute malicious applications. In Android, permissions are used to raise users’ awareness and thereby prevent them from installing applications that might violate their privacy. From the privacy and security point of view, if the functionality of an application is described in sufficient detail, then the need for its requested permissions can be well understood. This is defined as description-to-permission fidelity in the literature. In this study, we propose two novel models that address the inconsistencies between application descriptions and requested permissions. The proposed models are based on attention mechanisms, a current state-of-the-art neural architecture. We aim to find the words or sentences in app descriptions that state permissions by using the attention mechanism along with recurrent neural networks. The lack of such permission statements in an application description raises suspicion. Hence, the proposed approach could assist static analysis techniques in finding suspicious apps and in prioritizing apps for more resource-intensive analysis. The experimental results show that the proposed approach achieves high accuracy.
    • Turkish music generation using deep learning

      Aydıngün, Anıl; Bağdatlıoğlu, Denizcan; Canbaz, Burak; Kökbıyık, Abdullah; Yavuz, M Furkan; Bölücü, Necva; Can, Burcu (IEEE, 2021-01-07)
      In this work, a new model is introduced for Turkish song generation using deep learning. The lyrics are generated automatically by a language model based on Recurrent Neural Networks, the notes that make up the melody are generated analogously by a neural language model, and singing synthesis is then performed by combining the lyrics with the melody. This is the first work on Turkish song generation.
    • Sarcasm target identification with LSTM networks

      Bölücü, Necva; Can, Burcu (IEEE, 2021-01-07)
      Earlier work on sarcastic texts mainly concentrated on detecting whether a given text contains sarcasm. With the spread of cyberbullying through social media, it has also become essential to identify the target of the sarcasm in addition to detecting it. In this study, we propose a deep learning model for target identification in sarcastic texts and compare its results with similar work on English in the literature. The results show that our model outperforms related work on sarcasm target identification.
    • Webometrics: evolution of social media presence of universities

      Sarwar, Raheem; Zia, Afifa; Nawaz, Raheel; Fayoumi, Ayman; Aljohani, Naif Radi; Hassan, Saeed-Ul (Springer Science and Business Media LLC, 2021-01-03)
      This paper addresses the task of computing the webometrics university ranking and investigating whether it correlates with the rankings provided by prominent world university rankers, such as the QS world university ranking, for the period 2005–2016. However, the webometrics portal provides the required data only for recent years, starting from 2012, which is insufficient for such an investigation. The rest of the required data can be obtained from the internet archive. Existing data extraction tools, however, cannot extract the required data from the internet archive because of its unusual link structure, which consists of the web archive link, year, date, and target links. We developed an internet archive scraper and extracted the required data for the period 2012–2016. After extracting the data, the webometrics indicators were quantified, and the universities were ranked accordingly. We used the Spearman and Pearson correlation measures to identify the relationship between the webometrics university ranking computed by us and the original webometrics university ranking. Our findings indicate a strong correlation between our ranking and the webometrics university ranking, which indicates that the applied methodology can be used to compute the webometrics university ranking for those years for which the ranking is not available, i.e., from 2005 to 2011. We compute the webometrics ranking of the top 30 universities of North America, Europe and Asia for the period 2005–2016. Our findings indicate a positive correlation for North American and European universities, but a weak correlation for Asian universities. This can be explained by the fact that Asian universities did not pay as much attention to their websites as North American and European universities did. The overall results reveal that North American and European universities rank higher than Asian universities. To the best of our knowledge, this is the first investigation of its kind, and no comparable work has been reported before.
    • An exploratory study on multilingual quality estimation

      Sun, Shuo; Fomicheva, Marina; Blain, Frederic; Chaudhary, Vishrav; El-Kishky, Ahmed; Renduchintala, Adithya; Guzman, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-12-31)
      Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.
    • La tecnología habla-texto como herramienta de documentación para intérpretes: Nuevo método para compilar un corpus ad hoc y extraer terminología a partir de discursos orales en vídeo

      Gaber, Mahmoud; Corpas Pastor, Gloria; Omer, Ahmed (Malaga University, 2020-12-22)
      Although interpreting has not yet benefited from technology as much as its sister field, translation, interest in developing tailor-made solutions for interpreters has risen sharply in recent years. In particular, Automatic Speech Recognition (ASR) is being used as a central component of Computer-Assisted Interpreting (CAI) tools, either bundled or standalone. This study pursues three main aims: (i) to establish the most suitable ASR application for building ad hoc corpora by comparing several ASR tools and assessing their performance; (ii) to use ASR in order to extract terminology from the transcriptions obtained from video-recorded speeches, in this case talks on climate change and adaptation; and (iii) to promote the adoption of ASR as a new documentation tool among interpreters. To the best of our knowledge, this is one of the first studies to explore the possibility of Speech-to-Text (S2T) technology for meeting the preparatory needs of interpreters as regards terminology and background/domain knowledge.
    • Findings of the WMT 2020 shared task on quality estimation

      Specia, Lucia; Blain, Frédéric; Fomicheva, Marina; Fonseca, Erick; Chaudhary, Vishrav; Guzmán, Francisco; Martins, André FT (Association for Computational Linguistics, 2020-11-30)
      We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.
    • BERGAMOT-LATTE submissions for the WMT20 quality estimation shared task

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Chaudhary, Vishrav; Fishel, Mark; Guzmán, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-11-30)
      This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2, focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.
    • Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

      Hidalgo-Ternero, Carlos Manuel; Pastor, Gloria Corpas (Walter de Gruyter GmbH, 2020-11-25)
      The present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) is carried out to evaluate to what extent gApp can optimise the performance of two main freely available NMT systems (Google Translate and DeepL) under the challenge of MWE discontinuity in the Spanish-to-English direction. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
    • Domain adaptation of Thai word segmentation models using stacked ensemble

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2020-11-12)
      Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable when we can interact only with the input and output layers of a model, also known as a “black box”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and performs comparably to the transfer learning method.
    • Unsupervised quality estimation for neural machine translation

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Guzmán, Francisco; Fishel, Mark; Aletras, Nikolaos; Chaudhary, Vishrav; Specia, Lucia (Association for Computational Linguistics, 2020-09-01)
      Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
    • Incorporating word embeddings in unsupervised morphological segmentation

      Üstün, Ahmet; Can, Burcu (Cambridge University Press (CUP), 2020-07-10)
      We investigate the usage of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
    • Verbal multiword expressions for identification of metaphor

      Rohanian, Omid; Rei, Marek; Taslimipoor, Shiva; Ha, Le (ACL, 2020-07-06)
      Metaphor is a linguistic device in which a concept is expressed by mentioning another. Identifying metaphorical expressions, therefore, requires a non-compositional understanding of semantics. Multiword Expressions (MWEs), on the other hand, are linguistic phenomena with varying degrees of semantic opacity and their identification poses a challenge to computational models. This work is the first attempt at analysing the interplay of metaphor and MWE processing through the design of a neural architecture whereby classification of metaphors is enhanced by informing the model of the presence of MWEs. To the best of our knowledge, this is the first “MWE-aware” metaphor identification system, paving the way for further experiments on the complex interactions of these phenomena. The results and analyses show that the proposed architecture reaches state-of-the-art performance on two different established metaphor datasets.
    • Multimodal quality estimation for machine translation

      Okabe, Shu; Blain, Frédéric; Specia, Lucia (Association for Computational Linguistics, 2020-07)
      We propose approaches to Quality Estimation (QE) for Machine Translation that explore both text and visual modalities for Multimodal QE. We compare various multimodality integration and fusion strategies. For both sentence-level and document-level predictions, we show that state-of-the-art neural and feature-based QE frameworks obtain better results when using the additional modality.
    • Tweet coupling: a social media methodology for clustering scientific publications

      Hassan, SU; Aljohani, NR; Shabbir, M; Ali, U; Iqbal, S; Sarwar, R; Martínez-Cámara, E; Ventura, S; Herrera, F (Springer Science and Business Media LLC, 2020-05-18)
      We argue that classic citation-based scientific document clustering approaches, like co-citation or Bibliographic Coupling, fail to leverage the social usage of scientific literature that originates through online information dissemination platforms, such as Twitter. In this paper, we present the Tweet Coupling methodology, which measures the similarity between two or more scientific documents if one or more Twitter users mention them in their tweet(s). We evaluate our proposal on an altmetric dataset, which consists of 3081 scientific documents and 8299 unique Twitter users. By employing the clustering approaches of Bibliographic Coupling and Tweet Coupling, we find the relationship between the bibliographic and tweet coupled scientific documents. Further, using VOSviewer, we empirically show that Tweet Coupling appears to be a better clustering methodology for generating cohesive clusters, since it groups similar documents from the subfields of the selected field, in contrast to the Bibliographic Coupling approach, which groups cross-disciplinary documents in the same cluster.
    • A first dataset for film age appropriateness investigation

      Mohamed, Emad; Ha, Le An (LREC, 2020-05-13)
    • What matters more: the size of the corpora or their quality? The case of automatic translation of multiword expressions using comparable corpora.

      Mitkov, Ruslan; Taslimipoor, Shiva (John Benjamins, 2020-05-08)
      This study investigates (and compares) the impact of the size and the similarity/quality of comparable corpora on the specific task of extracting translation equivalents of verb-noun collocations from such corpora. The comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general and perennial question: what matters more – the quantity or quality of corpora?
    • Detecting semantic difference: a new model based on knowledge and collocational association

      Taslimipoor, Shiva; Corpas Pastor, Gloria; Rohanian, Omid; Corpas Pastor, Gloria; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
      Semantic discrimination among concepts is a daily exercise for humans when using natural languages. For example, given the words airplane and car, the word flying can easily be thought of and used as an attribute to differentiate them. In this study, we propose a novel automatic approach to detect whether an attribute word represents the difference between two given words. We exploit a combination of knowledge-based and co-occurrence features (collocations) to capture the semantic difference between two words in relation to an attribute. The features are scores that are defined for each pair of words and an attribute, based on association measures, n-gram counts, word similarity, and Concept-Net relations. Based on these features, we designed a system and ran several experiments on a SemEval-2018 dataset. The experimental results indicate that the proposed model performs better than, or at least comparably to, other systems evaluated on the same data for this task.