• Detecting semantic difference: a new model based on knowledge and collocational association

      Taslimipoor, Shiva; Corpas Pastor, Gloria; Rohanian, Omid; Colson, Jean-Pierre (John Benjamins Publishing Company, 2020-05-08)
      Semantic discrimination among concepts is a daily exercise for humans when using natural languages. For example, given the words airplane and car, the word flying can easily be thought of and used as an attribute to differentiate them. In this study, we propose a novel automatic approach to detect whether an attribute word represents the difference between two given words. We exploit a combination of knowledge-based and co-occurrence (collocational) features to capture the semantic difference between two words in relation to an attribute. The features are scores defined for each pair of words and an attribute, based on association measures, n-gram counts, word similarity, and ConceptNet relations. Based on these features, we designed a system and ran several experiments on a SemEval-2018 dataset. The experimental results indicate that the proposed model performs better than, or at least comparably with, other systems evaluated on the same data for this task.
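The association-measure features can be sketched with pointwise mutual information (PMI), a standard collocational association measure; the corpus counts below are hypothetical and only illustrate how an attribute's association gap can separate two words:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information computed from raw corpus counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the attribute "flying" co-occurs far more often with
# "airplane" than with "car", so the PMI gap marks it as discriminative.
total = 1_000_000
score_airplane = pmi(count_xy=120, count_x=500, count_y=800, total=total)
score_car = pmi(count_xy=4, count_x=2_000, count_y=800, total=total)
is_discriminative = (score_airplane - score_car) > 1.0
```

The threshold on the PMI gap is an illustrative assumption; the paper combines several such scores as features for a classifier rather than thresholding a single one.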
    • Domain adaptation of Thai word segmentation models using stacked ensemble

      Limkonchotiwat, Peerat; Phatthiyaphaibun, Wannaphong; Sarwar, Raheem; Chuangsuwanich, Ekapol; Nutanong, Sarana (Association for Computational Linguistics, 2020-11-12)
      Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only the input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and performs similarly to the transfer learning method.
    • Sarcasm target identification with LSTM networks

      Bölücü, Necva; Can, Burcu (IEEE, 2020-12-31)
      Earlier work on sarcastic texts mainly concentrated on detecting whether a given text is sarcastic. With the spread of cyberbullying through social media, it has become essential to identify the target of the sarcasm in addition to detecting it. In this study, we propose a deep learning model for target identification in sarcastic texts and compare it with similar work on English in the literature. The results show that our model outperforms the related work on sarcasm target identification.
    • Turkish music generation using deep learning

      Aydıngün, Anıl; Bağdatlıoğlu, Denizcan; Canbaz, Burak; Kökbıyık, Abdullah; Yavuz, M Furkan; Bölücü, Necva; Can, Burcu (IEEE, 2020-12-31)
      In this work, a new model is introduced for Turkish song generation using deep learning. The lyrics are generated automatically by a language model based on Recurrent Neural Networks, the notes that make up the melody are generated analogously by a neural language model, and song synthesis is performed by combining the lyrics with the melody. This is the first work on Turkish song generation.
    • Clustering word roots syntactically

      Ozturk, Mustafa Burak; Can, Burcu (IEEE, 2016-06-23)
      Distributional representations of words are used for both syntactic and semantic tasks. In this paper, two different methods are presented for clustering word roots. In the first method, the distributional model word2vec [1] is used for clustering word roots, whereas distributional approaches are generally applied to words. For this purpose, the distributional similarities of roots are modeled and the roots are divided into syntactic categories (noun, verb, etc.). In the second method, two different models are proposed: an information-theoretical model and a probabilistic model. Using a metric [8] based on mutual information and another metric based on Jensen-Shannon divergence, the similarities of word roots are calculated and clustering is performed with these metrics. Clustering word roots plays a significant role in other natural language processing applications such as machine translation and question answering, as well as in applications that involve language generation. We obtained a purity of 0.92 for the resulting clusters.
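The Jensen-Shannon-based metric can be sketched as follows; the context distributions below are hypothetical and stand in for the co-occurrence statistics of word roots:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i == 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical context distributions over four context words for three roots.
root_a = [0.70, 0.20, 0.10, 0.00]   # verb-like contexts
root_b = [0.60, 0.25, 0.10, 0.05]   # also verb-like
root_c = [0.05, 0.10, 0.25, 0.60]   # noun-like contexts
d_same = jensen_shannon(root_a, root_b)
d_diff = jensen_shannon(root_a, root_c)
```

Roots with similar syntactic behavior yield a small divergence, so clustering on this metric groups them into the same category.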
    • Context based automatic spelling correction for Turkish

      Bolucu, Necva; Can, Burcu (IEEE, 2019-06-20)
      Spelling errors are one of the crucial problems to be addressed in Natural Language Processing tasks. In this study, a context-based automatic spelling correction method for Turkish texts is presented. The method combines the Noisy Channel Model with Hidden Markov Models to correct a given word. This study deviates from other studies by also considering the contextual information of the word within the sentence. The proposed method is intended to be integrated into other word-based spelling correction models.
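A minimal sketch of the noisy-channel scoring, where the contextual part plays the role of the HMM transition probabilities; all words and probability values below are illustrative, not from the paper:

```python
# Score candidate corrections for an observed misspelling by combining a
# channel model P(observed | intended) with a contextual language-model
# prior P(word | previous word), mirroring an HMM emission/transition split.

channel = {            # P(observed "huose" | intended word), hypothetical
    ("huose", "house"): 0.8,
    ("huose", "hose"):  0.1,
}
bigram = {             # P(word | previous word) from a toy language model
    ("the", "house"): 0.02,
    ("the", "hose"):  0.001,
}

def score(prev, candidate, observed, floor=1e-9):
    """Noisy-channel score with a small floor for unseen events."""
    return (channel.get((observed, candidate), floor)
            * bigram.get((prev, candidate), floor))

best = max(["house", "hose"], key=lambda w: score("the", w, "huose"))
```

In the full method, the best sequence of corrections over a sentence would be found with Viterbi decoding rather than a word-by-word argmax.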
    • Modeling morpheme triplets with a three-level hierarchical Dirichlet process

      Kumyol, Serkan; Can, Burcu (IEEE, 2017-03-13)
      Morphemes are not independent units; they are attached to each other based on morphotactics. However, most models in the literature assume that morphemes are independent of each other to cope with complexity. We introduce a language-independent model for unsupervised morphological segmentation using a hierarchical Dirichlet process (HDP). We model morpheme dependencies in terms of morpheme trigrams in each word. Trigrams, bigrams, and unigrams are modeled within a three-level HDP, where the trigram Dirichlet process (DP) uses the bigram DP, and the bigram DP uses the unigram DP, as the base distribution. The results show that modeling morpheme dependencies improves the F-measure noticeably for English, Turkish, and Finnish.
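The three-level back-off structure can be illustrated with simple smoothed estimates, where the trigram estimate backs off to the bigram estimate and the bigram to the unigram; a real HDP infers the smoothing behavior with a nonparametric sampler, so the code below is only a sketch over hypothetical morpheme-segmented words:

```python
from collections import Counter

# Hypothetical morpheme-segmented Turkish words (morpheme trigrams per word).
words = [["oku", "yor", "um"], ["gel", "iyor", "um"], ["oku", "yor", "sun"]]

uni = Counter(m for w in words for m in w)
bi = Counter(tuple(w[i:i + 2]) for w in words for i in range(len(w) - 1))
tri = Counter(tuple(w[i:i + 3]) for w in words for i in range(len(w) - 2))
n_uni = sum(uni.values())

def p_uni(m, alpha=1.0, vocab=10):
    # Unigram level: smoothed toward a uniform base over a hypothetical vocab.
    return (uni[m] + alpha / vocab) / (n_uni + alpha)

def p_bi(m, prev, alpha=1.0):
    # Bigram level: backs off to the unigram estimate as its base distribution.
    return (bi[(prev, m)] + alpha * p_uni(m)) / (uni[prev] + alpha)

def p_tri(m, prev2, prev1, alpha=1.0):
    # Trigram level: backs off to the bigram estimate as its base distribution.
    return (tri[(prev2, prev1, m)] + alpha * p_bi(m, prev1)) / (bi[(prev2, prev1)] + alpha)
```

An observed morpheme trigram such as ("oku", "yor", "um") scores much higher than an unseen one, which is the dependency signal the HDP model exploits.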
    • Neural text normalization for Turkish social media

      Goker, Sinan; Can, Burcu (IEEE, 2018-12-10)
      Social media has become a rich data source for natural language processing tasks with its worldwide use; however, social media data is hard to process due to its informal nature. Text normalization is the task of transforming noisy text into its canonical form. It generally serves as a preprocessing step for other NLP tasks applied to noisy text. In this study, we apply two approaches to Turkish text normalization: a Contextual Normalization approach using distributed representations of words and a Sequence-to-Sequence Normalization approach using neural encoder-decoder models. Since the approaches previously applied to Turkish and other languages are mostly rule-based, new rules have to be added to the normalization model to detect new error patterns arising from changing language use on social media. In contrast to rule-based approaches, the proposed approaches can normalize different error patterns that change over time by training on a new dataset and updating the normalization model. Therefore, the proposed methods address language change in social media by updating the normalization model without defining new rules.
    • Stem-based PoS tagging for agglutinative languages

      Bolucu, Necva; Can, Burcu (IEEE, 2017-06-29)
      In agglutinative languages, words are made up of morphemes glued together, which makes part-of-speech tagging difficult for these languages due to sparsity. In this paper, we present two Hidden Markov Model-based Bayesian PoS tagging models for agglutinative languages. Our first model is word-based, and the second is stem-based, where the stems of the words are obtained from two unsupervised stemmers: the HPS stemmer and Morfessor FlatCat. The results show that stemming improves the accuracy of PoS tagging. We present results for Turkish as an agglutinative language and English as a morphologically poor language.
    • Using morpheme-level attention mechanism for Turkish sequence labelling

      Esref, Yasin; Can, Burcu (IEEE, 2019-08-22)
      With deep learning being used in natural language processing problems, there have been serious improvements in the solution of many problems in this area. Sequence labeling is one of these problems. In this study, we examine the effects of character, morpheme, and word representations on sequence labelling problems by proposing a model for the Turkish language by using deep neural networks. Modeling the word as a whole in agglutinative languages such as Turkish causes sparsity problem. Therefore, rather than handling the word as a whole, expressing a word through its characters or considering the morpheme and morpheme label information gives more detailed information about the word and mitigates the sparsity problem. In this study, we applied the existing deep learning models using different word or sub-word representations for Named Entity Recognition (NER) and Part-of-Speech Tagging (POS Tagging) in Turkish. The results show that using morpheme information of words improves the Turkish sequence labelling.
    • A cascaded unsupervised model for PoS tagging

      Bölücü, Necva; Can, Burcu (ACM, 2021-12-31)
      Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing (NLP); it assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g. dependency parsing) and thereby extract the meaning of the sentence (e.g. semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One of the widely used methods for the tagging problem is log-linear models, and the initialization of the parameters in a log-linear model is crucial for inference. Different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another, fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thus transfer knowledge between two different unsupervised models to improve the PoS tagging results, where the log-linear model benefits from the Bayesian model's expertise. We present results for Turkish as a morphologically rich language and for English as a comparatively morphologically poor language in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
    • Neural sentiment analysis of user reviews to predict user ratings

      Gezici, Bahar; Bolucu, Necva; Tarhan, Ayca; Can, Burcu (IEEE, 2019-11-21)
      The significance of user satisfaction is increasing in the competitive open source software (OSS) market. Application stores let users send feedback for applications in the form of reviews or ratings. Developers are informed about bugs or additional requirements through this feedback and use it to increase the quality of the software. Moreover, potential users rely on this information as a success indicator when deciding whether to download an application. Since it is usually costly to read all the reviews and evaluate their content, the ratings are taken as the basis for assessment. This makes the consistency of the review contents with the ratings important for a healthy evaluation of applications. In this study, we use recurrent neural networks to analyze the reviews automatically and thereby predict user ratings based on the reviews. We apply transfer learning from a large gold-standard dataset of Amazon Customer Reviews. We evaluate the performance of our model on three mobile OSS applications in the Google Play Store and compare the predicted ratings with the original ratings of the users. The predicted ratings reach an accuracy of 87.61% against the original user ratings, which is promising for obtaining ratings from reviews, especially when ratings are absent or their consistency with the reviews is weak.
    • Unsupervised morphological segmentation using neural word embeddings

      Can, Burcu; Üstün, Ahmet (Springer International Publishing, 2016-09-21)
      We present a fully unsupervised method for morphological segmentation. Unlike many morphological segmentation systems, our method is based on semantic features rather than orthographic features. In order to capture word meanings, word embeddings are obtained from a two-level neural network [11]. We compute the semantic similarity between words using the neural word embeddings, which forms our baseline segmentation model. We model morphotactics with a bigram language model based on maximum likelihood estimates by using the initial segmentations from the baseline. Results show that using semantic features helps to improve morphological segmentation especially in agglutinating languages like Turkish. Our method shows competitive performance compared to other unsupervised morphological segmentation systems.
    • Native language identification of fluent and advanced non-native writers

      Sarwar, Raheem; Rutherford, Attapol T; Hassan, Saeed-Ul; Rakthanmanon, Thanawin; Nutanong, Sarana (Association for Computing Machinery (ACM), 2020-04-30)
      Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k-nearest-neighbors classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
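The point-set idea with outlier handling can be sketched with a trimmed directed set distance; the distance definition and feature values are illustrative assumptions, not the paper's exact measure:

```python
import math

def trimmed_set_distance(a, b, trim=0.25):
    """Directed set distance from point set a to b that averages only the
    smallest (1 - trim) fraction of nearest-point distances, so a few
    outlying points in a cannot dominate the score."""
    nearest = sorted(min(math.dist(p, q) for q in b) for p in a)
    keep = max(1, int(len(nearest) * (1 - trim)))
    return sum(nearest[:keep]) / keep

# Hypothetical stylistic feature points (e.g. function-word rates per chunk);
# the last chunk of the query is an outlier.
query = [(0.10, 0.20), (0.12, 0.22), (0.11, 0.19), (0.90, 0.90)]
french_l1 = [(0.10, 0.21), (0.13, 0.20)]   # samples by French-L1 authors
german_l1 = [(0.50, 0.60), (0.55, 0.58)]   # samples by German-L1 authors
d_fr = trimmed_set_distance(query, french_l1)
d_de = trimmed_set_distance(query, german_l1)
```

Because the largest nearest-point distances are trimmed, the outlier chunk does not pull the query away from the stylistically similar French-L1 samples.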
    • StyloThai: A scalable framework for stylometric authorship identification of Thai documents

      Sarwar, R; Porthaveepong, T; Rutherford, A; Rakthanmanon, T; Nutanong, S (Association for Computing Machinery (ACM), 2020-01-30)
      Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. Applications of this task can be found in several domains, such as law enforcement and information retrieval, and these domains are not limited to a specific language, community, or ethnicity. However, most existing solutions are designed for English, and little attention has been paid to Thai. These existing solutions are not directly applicable to Thai due to the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale when the size of the candidate author set increases, and (iii) perform well when the number of writing samples per candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on our feature space, we present an authorship identification solution that uses the probabilistic k-nearest-neighbors classifier by transforming each document into a collection of point sets. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution overcomes the limitations of the existing solution and outperforms all competitors with an accuracy of 91.02%. Moreover, we investigate the effectiveness of each stylometric feature category with the help of an ablation study. We found that combining all categories of stylometric features outperforms the other combinations. Finally, we cross-compare the feature spaces and classification methods of all solutions. We found that (i) our solution scales as the number of candidate authors increases, (ii) our method outperforms all competitors, and (iii) our feature space provides better performance than the feature space used by the existing study.
    • Turkish lexicon expansion by using finite state automata

      Öztürk, Burak; Can, Burcu (Scientific and Technological Research Council of Turkey, 2019-03-22)
      Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation system, reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts of speech (i.e. noun, verb, or adjective), and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms from only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments are performed on the Turkish language, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
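A toy sketch of the FSA idea, where states are stem/suffix categories and transitions carry surface suffixes; the state names and the tiny fragment of nominal morphotactics below are illustrative, and phonological operations such as vowel harmony are ignored:

```python
# Transitions: state -> list of (surface suffix, next state). "END" is the
# accepting state; traversing the automaton generates surface word forms.
transitions = {
    "NOUN":   [("", "PLURAL"), ("", "CASE"), ("", "END")],
    "PLURAL": [("ler", "CASE"), ("ler", "END")],
    "CASE":   [("de", "END")],
}

def generate(stem, state="NOUN", suffixes=""):
    """Enumerate all word forms reachable from `state` for a given stem."""
    forms = []
    for suffix, nxt in transitions.get(state, []):
        if nxt == "END":
            forms.append(stem + suffixes + suffix)
        else:
            forms.extend(generate(stem, nxt, suffixes + suffix))
    return forms

forms = generate("ev")  # "ev" (house): bare, plural, locative, plural+locative
```

Running the generator for the stem "ev" yields "ev", "evler", "evde", and "evlerde"; the full model applies the same traversal to thousands of clustered stems and allomorph-aware suffix states.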
    • Unsupervised learning of allomorphs in Turkish

      Can, Burcu (Scientific and Technological Research Council of Turkey, 2017-07-30)
      One morpheme may have several surface forms that correspond to allomorphs. In English, ed and d are surface forms of the past tense morpheme, and s, es, and ies are surface forms of the plural or present tense morpheme. Turkish has a large number of allomorphs due to its morphophonemic processes. One morpheme can have tens of different surface forms in Turkish, which leads to a sparsity problem in natural language processing tasks. The detection of allomorphs has not been studied much because of its difficulty. For example, tü and di are Turkish allomorphs (i.e. of the past tense morpheme), yet all of their letters are different. This paper presents an unsupervised model to extract allomorphs in Turkish. We obtain an F-measure of 73.71% in the detection of allomorphs, and our model outperforms previous unsupervised models on morpheme clustering.
    • An effective and scalable framework for authorship attribution query processing

      Sarwar, R; Yu, C; Tungare, N; Chitavisutthivong, K; Sriratanawilai, S; Xu, Y; Chow, D; Rakthanmanon, T; Nutanong, S (Institute of Electrical and Electronics Engineers (IEEE), 2018-09-10)
      Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have only a small number of text samples, e.g., 5-10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios where the number of candidate authors is limited to 50, and they generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document and transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world datasets extracted from Project Gutenberg. Our dataset contains 3000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors; specifically, for both the closed-set and open-set authorship attribution problems, our method achieved higher than 95% accuracy.
    • A scalable framework for stylometric analysis query processing

      Nutanong, Sarana; Yu, Chenyun; Sarwar, Raheem; Xu, Peter; Chow, Dickson (IEEE, 2017-02-02)
      Stylometry is the statistical analysis of variations in an author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution that models authorship attribution as a set similarity problem to overcome these two limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.
    • A scalable framework for stylometric analysis of multi-author documents

      Sarwar, Raheem; Yu, Chenyun; Nutanong, Sarana; Urailertprasert, Norawit; Vannaboot, Nattapol; Rakthanmanon, Thanawin; Pei, Jian; Manolopoulos, Yannis; Sadiq, Shazia W; Li, Jianxin (Springer, 2018-05-13)
      Stylometry is a statistical technique used to analyze variations in authors' writing styles and is typically applied to authorship attribution problems. In this investigation, we apply stylometry to the authorship identification of multi-author documents (AIMD) task. We propose an AIMD technique called the Co-Authorship Graph (CAG), which can be used to collaboratively attribute different portions of documents to different authors belonging to the same community. Based on CAG, we propose a novel AIMD solution which (i) significantly outperforms the existing state-of-the-art solution; (ii) can effectively handle a larger number of co-authors; and (iii) is capable of handling the case where some of the listed co-authors have not contributed to the document as writers. We conducted an extensive experimental study comparing the proposed solution and the best existing AIMD method using real and synthetic datasets, and show that the proposed solution significantly outperforms the existing state-of-the-art method.