• Understanding the geographical development of social movements: a web-link analysis of Slow Food

      Hendrikx, Bas; Dormans, Stefan; Lagendijk, Arnoud (Radboud University Nijmegen); Thelwall, Mike (University of Wolverhampton) (Wiley-Blackwell, 2016-11-29)
      Slow Food (SF) is a global, grassroots movement aimed at enhancing and sustaining local food cultures and traditions worldwide. Since its establishment in the 1980s, Slow Food groups have emerged across the world and become embedded in a wide range of different contexts. In this article, we explain how the movement, as a diverse whole, is being shaped by complex dynamics existing between grassroots flexibilities and emerging drives for movement coherence and harmonization. Unlike conventional studies on social movements, our approach helps us to understand transnational social movements as being simultaneously coherent and diverse bodies of collective action. Drawing on work in the fields of relational geography, assemblage theory and webometric research, we develop an analytical strategy that navigates and maps the entire Slow Food movement by exploring its ‘double articulation’ between the material-connective and ideational-expressive. Focusing on representations of this connectivity and articulation on the internet, we combine methodologies of computation research (webometrics) with more qualitative forms of (web) discourse analysis to achieve this. Our results point to the significance of particular networks and nodal points that support such double movements, each presenting core logistical channels of the movement's operations as well as points of relay of new ideas and practices. A network-based analysis of ‘double articulation’ thus shows how the co-evolution of ideas and material practices cascades into major trends without having to rely on a ‘grand’, singular explanation of a movement's development.
    • Unsupervised joint PoS tagging and stemming for agglutinative languages

      Bolucu, Necva; Can, Burcu (Association for Computing Machinery (ACM), 2019-01-25)
      The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.
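      The abstract's central construction, a Hidden Markov Model whose hidden states are PoS tags and whose emissions are stems rather than full word forms, can be illustrated with a miniature sketch. This is not the paper's fully unsupervised Bayesian model: the tag set and all transition/emission probabilities below are hand-set toy values, and decoding is plain Viterbi.

```python
import math

# Toy first-order HMM: hidden states are PoS tags, and each state emits a
# *stem* rather than an inflected word form, so "walks" and "walked" both
# contribute mass to the single emission "walk". All numbers are invented.
transitions = {          # P(tag_i | tag_{i-1})
    ("<s>", "NOUN"): 0.6, ("<s>", "VERB"): 0.4,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2,
}
emissions = {            # P(stem | tag)
    ("NOUN", "dog"): 0.5, ("NOUN", "walk"): 0.1,
    ("VERB", "walk"): 0.6, ("VERB", "dog"): 0.05,
}

def viterbi(stems, tags=("NOUN", "VERB")):
    """Most likely tag sequence for a sequence of stems."""
    best = {t: (math.log(transitions.get(("<s>", t), 1e-9))
                + math.log(emissions.get((t, stems[0]), 1e-9)), [t])
            for t in tags}
    for stem in stems[1:]:
        new = {}
        for t in tags:
            score, prev = max(
                (best[p][0] + math.log(transitions.get((p, t), 1e-9)), p)
                for p in tags)
            new[t] = (score + math.log(emissions.get((t, stem), 1e-9)),
                      best[prev][1] + [t])
        best = new
    return max(best.values())[1]

# "dogs walked" reduced to stems "dog", "walk"
print(viterbi(["dog", "walk"]))  # ['NOUN', 'VERB']
```

      In the paper's joint setting, stems and tags are inferred together by Bayesian sampling; the sketch only shows why emitting stems shrinks the emission vocabulary that an agglutinative language would otherwise inflate.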
    • Unsupervised learning of allomorphs in Turkish

      Can, Burcu (Scientific and Technological Research Council of Turkey, 2017-07-30)
      One morpheme may have several surface forms that correspond to allomorphs. In English, ed and d are surface forms of the past tense morpheme, and s, es, and ies are surface forms of the plural or present tense morpheme. Turkish has a large number of allomorphs due to its morphophonemic processes. One morpheme can have tens of different surface forms in Turkish. This leads to a sparsity problem in natural language processing tasks in Turkish. Detection of allomorphs has not been studied much because of its difficulty. For example, tü and di are Turkish allomorphs (i.e. past tense morphemes), but all of their letters are different. This paper presents an unsupervised model to extract the allomorphs in Turkish. We are able to obtain an F-measure of 73.71% in the detection of allomorphs, and our model outperforms previous unsupervised models on morpheme clustering.
    • Unsupervised morphological segmentation using neural word embeddings

      Can, Burcu; Üstün, Ahmet (Springer International Publishing, 2016-09-21)
      We present a fully unsupervised method for morphological segmentation. Unlike many morphological segmentation systems, our method is based on semantic features rather than orthographic features. In order to capture word meanings, word embeddings are obtained from a two-level neural network [11]. We compute the semantic similarity between words using the neural word embeddings, which forms our baseline segmentation model. We model morphotactics with a bigram language model based on maximum likelihood estimates by using the initial segmentations from the baseline. Results show that using semantic features helps to improve morphological segmentation especially in agglutinating languages like Turkish. Our method shows competitive performance compared to other unsupervised morphological segmentation systems.
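      The core intuition above, that a valid stem stays semantically close to the full word while inflection barely changes meaning, can be sketched with cosine similarity over embeddings. The vectors below are tiny invented stand-ins for trained neural word embeddings, and the split heuristic is a simplification of the paper's baseline, not its full model.

```python
import math

# Hand-made 3-dimensional "embeddings" for illustration only. A real system
# would train these on a corpus (e.g. with a two-level neural network).
vectors = {
    "walking": [0.9, 0.1, 0.2],
    "walk":    [0.85, 0.15, 0.25],  # plausible stem: close to "walking"
    "walki":   [0.1, 0.9, 0.3],     # implausible stem: deliberately distant
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_split(word, min_stem=3):
    """Split the word where the candidate stem is semantically closest."""
    candidates = []
    for i in range(min_stem, len(word)):    # proper prefixes only
        stem = word[:i]
        if stem in vectors:
            candidates.append((cosine(vectors[word], vectors[stem]), stem))
    score, stem = max(candidates)
    return stem, word[len(stem):]

print(best_split("walking"))  # ('walk', 'ing')
```

      The paper then feeds such initial segmentations into a bigram language model over morphemes to capture morphotactics; the sketch covers only the similarity-based baseline step.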
    • Unsupervised quality estimation for neural machine translation

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Guzmán, Francisco; Fishel, Mark; Aletras, Nikolaos; Chaudhary, Vishrav; Specia, Lucia (Association for Computational Linguistics, 2020-09-01)
      Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
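      Two glass-box signals of the kind the abstract alludes to can be computed from information a decoder already produces: the mean log-probability of the output tokens, and the variance of sentence scores across stochastic (dropout-enabled) decoding passes. The functions below are illustrative stand-ins for the paper's measures, and the token probabilities are made-up values, as if logged at decode time.

```python
import math

def mean_token_logprob(token_probs):
    """Average log-probability the decoder assigned to its own output."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def dropout_variance(scores):
    """Variance of sentence scores over several dropout-enabled passes:
    high variance suggests the model is uncertain about its translation."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

confident = [0.9, 0.8, 0.95, 0.85]   # decoder fairly sure of each token
uncertain = [0.4, 0.3, 0.6, 0.2]     # hesitant decoder: likely poorer output

print(mean_token_logprob(confident) > mean_token_logprob(uncertain))  # True
```

      No supervision is involved: both quantities come free with translation, which is what lets the approach rival trained QE models without annotated data.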
    • Urdu AI: writeprints for Urdu authorship identification

      Sarwar, Raheem; Hassan, Saeed-Ul (Association for Computing Machinery, 2021-12-31)
      The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies have focused on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than those of existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, higher than all previous authorship identification work.
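      The stylometric-features-plus-nearest-neighbour pipeline can be sketched in miniature. The feature set below (average word length, type-token ratio, frequencies of a few function words) is an invented toy version of the paper's feature space, and plain Euclidean nearest neighbour stands in for its point-set retrieval scheme.

```python
import math
from collections import Counter

# A handful of high-frequency function words; stylometry typically uses
# such topic-independent features. This short list is illustrative only.
FUNCTION_WORDS = ["the", "of", "and", "in", "to"]

def features(text):
    """Map a text sample to a small stylometric feature vector."""
    words = text.lower().split()
    counts = Counter(words)
    avg_len = sum(len(w) for w in words) / len(words)
    ttr = len(counts) / len(words)                   # type-token ratio
    fw_rates = [counts[w] / len(words) for w in FUNCTION_WORDS]
    return [avg_len, ttr] + fw_rates

def nearest_author(sample, training):
    """training: list of (author, text) pairs; returns the closest author."""
    sv = features(sample)
    return min(training, key=lambda at: math.dist(sv, features(at[1])))[0]

training = [
    ("A", "the cat sat on the mat and the dog ran to the door"),
    ("B", "extraordinary circumstances necessitate comprehensive deliberation"),
]
print(nearest_author("the hen ran to the barn and the fox hid in the box",
                     training))  # A
```

      Real systems use many more samples and features, but the shape is the same: texts become points in a feature space and authorship is decided by proximity.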
    • USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box

      Aker, Ahmet; Blain, Frederic; Duque, Andres; Fomicheva, Marina; Seva, Jurica; Shah, Kashif; Beck, Daniel (Association for Computational Linguistics, 2016-06)
      Aker, A., Blain, F., Duque, A., Fomicheva, M. et al. (2016) USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box. In, Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Bethard, S., Carpuat, M., Cer, D., Jurgens, D. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 609-613.
    • USFD’s phrase-level quality estimation systems

      Logacheva, Varvara; Blain, Frédéric; Specia, Lucia (Association for Computational Linguistics, 2016-08)
      Logacheva, V., Blain, F. and Specia, L. (2016) USFD’s phrase-level quality estimation systems. In, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Bojar, O., Buck, C., Chatterjee, R., Federmann, C. et al. (eds.) Stroudsburg, PA: Association for Computational Linguistics, pp. 800-805.
    • Using gaze data to predict multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Yaneva, Victoria; Ha, Le An (INCOMA Ltd, 2017-09-01)
      In recent years gaze data has been increasingly used to improve and evaluate NLP models due to the fact that it carries information about the cognitive processing of linguistic phenomena. In this paper we conduct a preliminary study towards the automatic identification of multiword expressions based on gaze features from native and non-native speakers of English. We report comparisons between a part-of-speech (POS) and frequency baseline to: i) a prediction model based solely on gaze data and ii) a combined model of gaze data, POS and frequency. In spite of the challenging nature of the task, best performance was achieved by the latter. Furthermore, we explore how the type of gaze data (from native versus non-native speakers) affects the prediction, showing that data from the two groups is discriminative to an equal degree. Finally, we show that late processing measures are more predictive than early ones, which is in line with previous research on idioms and other formulaic structures.
    • Using linguistic features to predict the response process complexity associated with answering clinical MCQs

      Yaneva, Victoria; Jurich, Daniel; Ha, Le An; Baldwin, Peter (Association for Computational Linguistics, 2021-04-30)
      This study examines the relationship between the linguistic characteristics of a test item and the complexity of the response process required to answer it correctly. Using data from a large-scale medical licensing exam, clustering methods identified items that were similar with respect to their relative difficulty and relative response-time intensiveness to create low response process complexity and high response process complexity item classes. Interpretable models were used to investigate the linguistic features that best differentiated between these classes from a descriptive and predictive framework. Results suggest that nuanced features such as the number of ambiguous medical terms help explain response process complexity beyond superficial item characteristics such as word count. Yet, although linguistic features carry signal relevant to response process complexity, the classification of individual items remains challenging.
    • Using morpheme-level attention mechanism for Turkish sequence labelling

      Esref, Yasin; Can, Burcu (IEEE, 2019-08-22)
      With the adoption of deep learning in natural language processing, there have been substantial improvements in many problems in this area. Sequence labelling is one of these problems. In this study, we examine the effects of character, morpheme, and word representations on sequence labelling problems by proposing a deep neural network model for Turkish. Modelling the word as a whole in agglutinative languages such as Turkish causes a sparsity problem. Therefore, rather than handling the word as a whole, expressing a word through its characters, or through its morphemes and morpheme labels, gives more detailed information about the word and mitigates the sparsity problem. In this study, we applied existing deep learning models using different word or sub-word representations for Named Entity Recognition (NER) and Part-of-Speech Tagging (POS Tagging) in Turkish. The results show that using morpheme information of words improves Turkish sequence labelling.
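      The title's morpheme-level attention can be illustrated in a few lines: each morpheme vector receives a softmax weight, and the word representation is the weighted sum. In a real model both the morpheme embeddings and the scoring vector are learned; every number below is invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(morpheme_vecs, score_vec):
    """Compose a word vector as an attention-weighted sum of morpheme vectors."""
    scores = [sum(a * b for a, b in zip(v, score_vec)) for v in morpheme_vecs]
    weights = softmax(scores)
    dim = len(morpheme_vecs[0])
    word_vec = [sum(w * v[d] for w, v in zip(weights, morpheme_vecs))
                for d in range(dim)]
    return word_vec, weights

# Turkish "evlerde" ("in the houses") segmented as ev + ler + de
morphemes = [[1.0, 0.0],   # ev  (stem)
             [0.2, 0.8],   # ler (plural suffix)
             [0.1, 0.9]]   # de  (locative suffix)
word_vec, weights = attend(morphemes, score_vec=[1.0, 0.5])
print([round(w, 2) for w in weights])
```

      Because the word is built from its morphemes rather than stored as an atomic unit, rare inflected forms share parameters with their stem and suffixes, which is how sub-word modelling eases the sparsity problem the abstract describes.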
    • Using natural language processing to predict item response times and improve test construction

      Baldwin, Peter; Yaneva, Victoria; Mee, Janet; Clauser, Brian E; Ha, Le An (Wiley, 2020-02-24)
      In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information‐retrieval‐based automatic question‐answering system finds an item challenging, and (c) through dense word representations (word embeddings). Using a random forests algorithm, these data then are used to train a prediction model for item response times and predicted response times then are used to assemble test forms. Using empirical data from the United States Medical Licensing Examination, we show that timing demands are more consistent across these specially assembled forms than across forms comprising randomly‐selected items. Because an exam's timing conditions affect examinee performance, this result has implications for exam fairness whenever examinees are compared with each other or against a common standard.
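      A miniature stand-in for this pipeline: represent each item by a couple of text features (here just word count and mean word length, toy versions of the 113 linguistic features), then fit a small bagged ensemble of one-split regression stumps to predict response time. This sketches the random-forest idea in plain Python rather than reproducing the authors' model.

```python
import random

def item_features(text):
    """Tiny feature vector: [word count, mean word length]."""
    words = text.split()
    return [len(words), sum(map(len, words)) / len(words)]

def fit_stump(X, y):
    """Best single-feature threshold split minimising squared error."""
    best = None
    for f in range(len(X[0])):
        for x in X:
            t = x[f]
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = (sum((yi - ml) ** 2 for yi in left)
                   + sum((yi - mr) ** 2 for yi in right))
            if best is None or err < best[0]:
                best = (err, f, t, ml, mr)
    if best is None:                          # degenerate bootstrap sample
        m = sum(y) / len(y)
        return lambda x: m
    _, f, t, ml, mr = best
    return lambda x: ml if x[f] <= t else mr

def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap resample."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

# Invented items and response times: longer items take longer to answer.
texts = ["a b c", "a b c d e", "a b c d e f g h", "a b c d e f g h i j k l"]
times = [3.0, 5.0, 8.0, 12.0]
predict = fit_forest([item_features(t) for t in texts], times)
```

      Forms can then be assembled so that the summed predicted times are balanced across forms, which is what yields the more consistent timing demands the article reports.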
    • Using semi-automatic compiled corpora for medical terminology and vocabulary building in the healthcare domain

      Gutiérrez Florido, Rut; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (Université Paris 13, 2013-10-28)
      English, Spanish and German are amongst the most spoken languages in Europe. It is therefore likely that patients from one EU member state seeking medical treatment in another will speak or understand one of these. However, there is a lack of resources for teaching efficient communication between patients and medical staff. To address this, the TELL-ME project will provide a fully targeted package, including learning materials for Medical English, Spanish and German aimed at medical staff working in other countries or undertaking cross-border mobility. The learning process will be supported by computer-aided tools based on corpora. In this workshop, we therefore present the semi-automatic compilation of the TELL-ME corpus, whose function is to support the e-learning platform of the TELL-ME project, together with its self-assessment exercises, emphasising the importance of specialised terminology in the acquisition of communicative and language skills.
    • Verbal multiword expressions for identification of metaphor

      Rohanian, Omid; Rei, Marek; Taslimipoor, Shiva; Ha, Le (ACL, 2020-07-06)
      Metaphor is a linguistic device in which a concept is expressed by mentioning another. Identifying metaphorical expressions, therefore, requires a non-compositional understanding of semantics. Multiword Expressions (MWEs), on the other hand, are linguistic phenomena with varying degrees of semantic opacity and their identification poses a challenge to computational models. This work is the first attempt at analysing the interplay of metaphor and MWE processing through the design of a neural architecture whereby classification of metaphors is enhanced by informing the model of the presence of MWEs. To the best of our knowledge, this is the first “MWE-aware” metaphor identification system, paving the way for further experiments on the complex interactions of these phenomena. The results and analyses show that the proposed architecture reaches state-of-the-art results on two different established metaphor datasets.
    • The way to analyse ‘way’: A case study in word-specific local grammar

      Hanks, Patrick; Može, Sara (Oxford Academic, 2019-02-11)
      Traditionally, dictionaries are meaning-driven—that is, they list different senses (or supposed senses) of each word, but do not say much about the phraseology that distinguishes one sense from another. Grammars, on the other hand, are structure-driven: they attempt to describe all possible structures of a language, but say little about meaning, phraseology, or collocation. In both disciplines during the 20th century, the practice of inventing evidence rather than discovering it led to intermittent and unpredictable distortions of fact. Since 1987, attempts have been made in both lexicography (Cobuild) and syntactic theory (pattern grammar, construction grammar) to integrate meaning and phraseology. Corpora now provide empirical evidence on a large scale for lexicosyntactic description, but there is still a long way to go. Many cherished beliefs must be abandoned before a synthesis between empirical lexical analysis and grammatical theory can be achieved. In this paper, by empirical analysis of just one word (the noun way), we show how corpus evidence can be used to tackle the complexities of lexical and constructional meaning, providing new insights into the lexis-grammar interface.
    • Web citations in patents: Evidence of technological impact?

      Orduna-Malea, Enrique (EC3 Research Group, Universitat Politècnica de València); Thelwall, Mike; Kousha, Kayvan (Wiley Blackwell, 2017-07-17)
      Patents sometimes cite web pages either as general background to the problem being addressed or to identify prior publications that will limit the scope of the patent granted. Counts of the number of patents citing an organisation’s website may therefore provide an indicator of its technological capacity or relevance. This article introduces methods to extract URL citations from patents and evaluates the usefulness of counts of patent web citations as a technology indicator. An analysis of patents citing 200 US universities or 177 UK universities found computer science and engineering departments to be frequently cited, as well as research-related web pages, such as Wikipedia, YouTube or Internet Archive. Overall, however, patent URL citations seem to be frequent enough to be useful for ranking major US and the top few UK universities if popular hosted subdomains are filtered out, but the hit count estimates on the first search engine results page should not be relied upon for accuracy.
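      The first step of the article's method, extracting URL citations from patent text and counting them per cited website, can be sketched with a regular expression and a counter. The regex, the `www.` folding, and the patent snippets below are all illustrative assumptions, not the article's exact extraction rules.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simple pattern for http(s) URLs; real patent text needs more careful
# handling of line breaks and punctuation than this sketch attempts.
URL_RE = re.compile(r"https?://[^\s,;)\"']+")

def citing_domains(patent_texts):
    """Count URL citations per cited website across a set of patent texts."""
    counts = Counter()
    for text in patent_texts:
        for url in URL_RE.findall(text):
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]            # fold www.example.edu into example.edu
            counts[host] += 1
    return counts

# Invented patent snippets, not real patent text.
patents = [
    "Prior art is described at http://www.example.edu/paper1 and "
    "https://en.wikipedia.org/wiki/Widget.",
    "See background material at http://example.edu/overview.",
]
print(citing_domains(patents))   # example.edu cited twice, en.wikipedia.org once
```

      Aggregating such counts by the cited organisation's domain gives the candidate technology indicator that the article then evaluates, including the caveat about filtering out popular hosted subdomains.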
    • Web impact factors and search engine coverage

      Thelwall, Mike (MCB UP Ltd, 2000)
      Search engines index only a proportion of the web, and this proportion is not determined randomly but by algorithms that take into account the properties that impact factors measure. A survey was conducted in order to test the coverage of search engines and to decide whether their partial coverage is indeed an obstacle to using them to calculate web impact factors. The results indicate that search engine coverage, even of large national domains, is extremely uneven and would be likely to lead to misleading calculations.
    • Web issue analysis: an integrated water resource management case study

      Thelwall, Mike; Vann, Katie; Fairclough, Ruth (Wiley InterScience, 2006)
      In this article Web issue analysis is introduced as a new technique to investigate an issue as reflected on the Web. The issue chosen, integrated water resource management (IWRM), is a United Nations-initiated paradigm for managing water resources in an international context, particularly in developing nations. As with many international governmental initiatives, there is a considerable body of online information about it: 41,381 hypertext markup language (HTML) pages and 28,735 PDF documents mentioning the issue were downloaded. A page uniform resource locator (URL) and link analysis revealed the international and sectoral spread of IWRM. A noun and noun phrase occurrence analysis was used to identify the issues most commonly discussed, revealing some unexpected topics such as private sector and economic growth. Although the complexity of the methods required to produce meaningful statistics from the data is disadvantageous to easy interpretation, it was still possible to produce data that could be subject to a reasonably intuitive interpretation. Hence Web issue analysis is claimed to be a useful new technique for information science.
    • Web log file analysis: backlinks and queries

      Thelwall, Mike (MCB UP Ltd, 2001)
      As has been described elsewhere, web log files are a useful source of information about visitor site use, navigation behaviour and, to some extent, demographics. But log files can also reveal the existence of both web pages and search engine queries that are sources of new visitors. This study extracts such information from a single web log file and uses it to illustrate its value, not only to the site owner but also to those interested in investigating the online behaviour of web users.
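      The kind of extraction described, finding backlink pages and search queries in the referrer field of a log file, can be sketched against the Combined Log Format. The log lines are invented, and the `q`/`p`/`query` parameter names are assumptions about common search-box parameters, not an exhaustive list.

```python
import re
from urllib.parse import urlparse, parse_qs

# Match the request, status, size and quoted referrer of a Combined Log
# Format line; the named group captures the referrer URL.
LOG_RE = re.compile(r'"[A-Z]+ \S+ \S+" \d+ \S+ "(?P<ref>[^"]*)"')
QUERY_PARAMS = ("q", "p", "query")   # assumed search-box parameter names

def classify_referrer(line):
    """Return ('query', terms) or ('backlink', page), or None if no referrer."""
    m = LOG_RE.search(line)
    if not m or m.group("ref") in ("", "-"):
        return None
    ref = urlparse(m.group("ref"))
    params = parse_qs(ref.query)
    for name in QUERY_PARAMS:
        if name in params:
            return ("query", params[name][0])
    return ("backlink", ref.netloc + ref.path)

# Invented log lines in Combined Log Format.
log = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36] "GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.google.com/search?q=web+log+analysis" "Mozilla/4.08"',
    '1.2.3.5 - - [10/Oct/2000:13:56:02] "GET /page.html HTTP/1.0" 200 1042 '
    '"http://example.org/links.html" "Mozilla/4.08"',
]
for line in log:
    print(classify_referrer(line))
# ('query', 'web log analysis')
# ('backlink', 'example.org/links.html')
```

      Run over a whole log file, the first category surfaces the queries that bring new visitors and the second the web pages that link to the site, which is exactly the information the study mines.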
    • Web users with autism: eye tracking evidence for differences

      Eraslan, Sukru; Yaneva, Victoria; Yesilada, Yeliz; Harper, Simon (Taylor and Francis, 2018-12-11)
      Anecdotal evidence suggests that people with autism may have different processing strategies when accessing the web. However, limited empirical evidence is available to support this. This paper presents an eye tracking study with 18 participants with high-functioning autism and 18 neurotypical participants to investigate the similarities and differences between these two groups in terms of how they search for information within web pages. According to our analysis, people with autism are likely to be less successful in completing their searching tasks. They also have a tendency to look at more elements on web pages and make more transitions between the elements in comparison to neurotypical people. In addition, they tend to make shorter but more frequent fixations on elements which are not directly related to a given search task. Therefore, this paper presents the first empirical study to investigate how people with autism differ from neurotypical people when they search for information within web pages based on an in-depth statistical analysis of their gaze patterns.