    • $1 Internet - exploiting limited bandwidth in developing countries

      Heinz, Ignatz; Dennett, Christopher Paul (2008)
      INFONET-BioVision.org is a freely available internet-based knowledge-management system, funded by the Liechtenstein Development Service (LDS) and the BioVision Foundation for Environment and Development, that offers Kenyan farmers information on affordable, effective and ecologically sound technologies in crop and livestock production as well as environmental and human health. One of the challenges faced by the project is the secure provision of information to the rural areas that would most benefit from advice on crop pests and productivity [Avallain, 2008]. Bandwidth is sometimes available in these areas, but it is limited, unmanaged and relatively expensive. This paper discusses current work in the development of a novel system that brings together hardware and software to make better use of available bandwidth, whilst offering a financially viable and sustainable method of extending internet provision to these hard-to-reach areas, providing rural farmers with access to the INFONET-BioVision platform and other internet-based sources of information. The system currently in development is premised on the fact that some internet-based applications require more bandwidth than others. Moreover, their real-time requirements differ greatly. Although it is conceivable that a number of users can share a low-bandwidth connection, the multiple bandwidth requests created can easily overwhelm the connection because of the way in which these are managed by protocols developed for bandwidth-rich countries. This results in virtually no bandwidth availability for the user applications themselves. It is clear, therefore, that to maximise the number of users on one low-bandwidth connection, allocation should take place before applications actually make bandwidth requests. Indeed, similar bandwidth management exists on a larger scale, with domestic broadband providers controlling the amount of data provided through existing channels to a home user at the exchange, based on a tariff system. The system effectively applies a scaled-down version of this scenario to the available connectivity, be that GPRS, satellite or wired. An inexpensive single-board computer acts as a hub between users and the internet, allowing software management of bandwidth and connectivity to users' mobile devices through Wi-Fi, Bluetooth and wired LAN. The allocation of bandwidth to each user is based on a voucher system that effectively splits the cost of the connection. Users purchase these vouchers, which are priced according to usage, ranging from very low-bandwidth, non-real-time e-mail access to more expensive web browsing, prior to accessing the system. The proposed system is intended to provide communities with inexpensive connectivity through shared costs, and it is scalable: should there be a requirement for extra bandwidth or a greater number of users, further devices can be added or moved simply, easily and at low cost. The system is scheduled for testing later in 2008, at which point a full evaluation will be undertaken.
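
      The voucher mechanism described above amounts to per-user rate limiting decided before applications contend for the link. Below is a minimal Python sketch of that idea using one token bucket per voucher; the tier names, rates and the `forward_packet` helper are illustrative assumptions, not details from the paper.

```python
import time

# Hypothetical voucher tiers (illustrative, not from the paper): byte-per-second
# ceilings chosen so cheap vouchers suit store-and-forward e-mail and dearer
# ones allow interactive browsing.
VOUCHER_TIERS = {
    "email": 2_000,      # ~2 KB/s, non-real-time traffic
    "browsing": 16_000,  # ~16 KB/s, interactive web use
}

class TokenBucket:
    """Classic token bucket: tokens accrue at `rate` bytes/s up to `burst`."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

# One bucket per purchased voucher: allocation is fixed *before* applications
# start making bandwidth requests, as the abstract argues it should be.
buckets = {"user-17": TokenBucket(VOUCHER_TIERS["email"], burst=8_000)}

def forward_packet(user, packet):
    bucket = buckets.get(user)
    return bucket is not None and bucket.allow(len(packet))
```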
    • A Single Chip System for Sensor Data Fusion Based on a Drift-diffusion Model

      Yang, Shufan; Wong-Lin, Kongfatt; Rano, Inaki; Lindsay, Anthony (IEEE, 2017-09-07)
      Current multisensory systems face data-communication overhead when integrating disparate sensor data to build a coherent and accurate picture of a global phenomenon. We present here a novel hardware/software co-design platform for a heterogeneous data fusion solution based on a perceptual decision-making approach (the drift-diffusion model). It provides a convenient infrastructure for sensor data acquisition and integration and uses only a single-chip Xilinx ZYNQ-7000 XC7Z020 AP SoC. A case study of controlling the moving speed of a single ground-based robot, according to the physiological state of the operator based on heart rate, is conducted and demonstrates the feasibility of the integrated sensor data fusion architecture. The results of our DDM-based data integration show a better correlation coefficient with the raw ECG signal compared with a simple piecewise approach.
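
      The drift-diffusion model at the core of this platform is a simple evidence-accumulation process: integrate noisy evidence until a decision boundary is crossed. Below is a minimal NumPy sketch under assumed parameter values; the paper's on-chip implementation and tuning are not reproduced here.

```python
import numpy as np

def drift_diffusion(evidence_stream, drift_gain=1.0, noise=0.1,
                    threshold=1.0, dt=0.01, rng=None):
    """Integrate noisy evidence until a decision boundary is crossed.
    Returns (decision, steps): +1/-1 on boundary crossing, 0 if undecided."""
    rng = rng or np.random.default_rng(0)
    x, step = 0.0, 0
    for step, e in enumerate(evidence_stream, start=1):
        x += drift_gain * e * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if abs(x) >= threshold:
            return (1 if x > 0 else -1), step
    return 0, step

# Toy usage: fuse a stream of heart-rate deviations (beats/min above rest,
# rescaled) into a single speed-up / slow-down decision for the robot.
hr_deviation = np.r_[np.zeros(50), np.full(400, 0.8)]
print(drift_diffusion(hr_deviation))
```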
    • Adults with High-functioning Autism Process Web Pages With Similar Accuracy but Higher Cognitive Effort Compared to Controls

      Yaneva, Victoria; Ha, Le; Eraslan, Sukru; Yesilada, Yeliz (ACM, 2019-05-31)
      To accommodate the needs of web users with high-functioning autism, a designer's only option at present is to rely on guidelines that: i) have not been empirically evaluated and ii) do not account for the different levels of autism severity. Before designing effective interventions, we need to obtain an empirical understanding of the aspects that specific user groups need support with. This has not yet been done for web users at the high ends of the autism spectrum, as often they appear to execute tasks effortlessly, without facing barriers related to their neurodiverse processing style. This paper investigates the accuracy and efficiency with which high-functioning web users with autism and a control group of neurotypical participants obtain information from web pages. Measures include answer correctness and a number of eye-tracking features. The results indicate similar levels of accuracy for the two groups at the expense of efficiency for the autism group, showing that the autism group invests more cognitive effort in order to achieve the same results as their neurotypical counterparts.
    • Aggressive language identification using word embeddings and sentiment features

      Orasan, Constantin (Association for Computational Linguistics, 2018-06-25)
      This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that for texts similar to the ones in the training dataset Random Forests work best, whilst for texts which are different SVMs are a better choice. The evaluation also showed that despite its simplicity the method performs well when compared with more elaborate methods.
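
      The feature set described above concatenates embedding-derived information with a sentiment score before feeding a classifier. The scikit-learn sketch below is a hedged illustration: the `embed` and `sentiment` helpers are toy stand-ins for the pre-trained embeddings and the sentiment analyser the paper actually uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
_vocab = {}  # toy stand-in for pre-trained word embeddings

def embed(text, dim=50):
    """Average of per-word vectors; random vectors here purely for illustration."""
    vecs = [_vocab.setdefault(w, rng.standard_normal(dim)) for w in text.lower().split()]
    return np.mean(vecs, axis=0)

def sentiment(text):
    """Toy polarity score standing in for the paper's sentiment analyser output."""
    pos, neg = {"good", "great"}, {"hate", "stupid"}
    toks = text.lower().split()
    return sum(t in pos for t in toks) - sum(t in neg for t in toks)

texts = ["you are stupid and I hate you", "great post, thanks for sharing"]
labels = [1, 0]  # 1 = aggressive
X = np.array([np.r_[embed(t), sentiment(t)] for t in texts])

svm = SVC().fit(X, labels)                     # per the paper: better out of domain
rf = RandomForestClassifier().fit(X, labels)   # per the paper: better in domain
```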
    • All that Glitters is not Gold when Translating Phraseological Units

      Corpas Pastor, Gloria; Monti, Johanna; Mitkov, Ruslan; Corpas Pastor, Gloria; Seretan, Violeta (European Association for Machine Translation (EAMT), 2013-09-02)
      Phraseological unit is an umbrella term which covers a wide range of multi-word units (collocations, idioms, proverbs, routine formulae, etc.). Phraseological units (PUs) are pervasive in all languages and exhibit a peculiar combinatorial nature. PUs are usually frequent, cognitively salient, syntactically frozen and/or semantically opaque. Besides, their creative manipulations in discourse can be anything but predictable, straightforward or easy to process. And when it comes to translating, problems multiply exponentially. It goes without saying that cultural differences and linguistic anisomorphisms go hand in hand with issues arising from varying degrees of equivalence at the levels of system and text. No wonder PUs have been considered a pain in the neck within the NLP community. This presentation will focus on contrastive and translational features of phraseological units. It will consist of three parts. As a convenient background, the first part will contrast two similar concepts: multi-word unit (the preferred term within the NLP community) versus phraseological unit (the preferred term in phraseology). The second part will deal with phraseological systems in general, their structure and functioning. Finally, the third part will adopt a contrastive approach, with special reference to translators’ strategies, procedures and choices. For good or for bad, when it comes to rendering phraseological units, human translation and computer-assisted translation appear to share the same garden path.
    • An evaluation of syntactic simplification rules for people with autism

      Evans, Richard; Orasan, Constantin; Dornescu, Iustin (Association for Computational Linguistics, 2014)
      Syntactically complex sentences constitute an obstacle for some people with Autistic Spectrum Disorders. This paper evaluates a set of simplification rules specifically designed for tackling complex and compound sentences. In total, 127 different rules were developed for the rewriting of complex sentences and 56 for the rewriting of compound sentences. The evaluation assessed the accuracy of these rules individually and revealed that fully automatic conversion of these sentences into a more accessible form is not very reliable.
    • Arabic-SOS: Segmentation, stemming, and orthography standardization for classical and pre-modern standard Arabic

      Mohamed, Emad; Sayed, Zeeshan (ACM, 2019-05-31)
      While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and its orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the available cultural-heritage Arabic suffers from substandard orthography, we have also trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization reaches an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
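
      Morphological segmentation of this kind can be cast as per-character boundary classification. The gradient-boosting sketch below is a simplified illustration: the transliterated example word, the labels and the character-window features are assumptions, not the paper's actual feature set or data.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer

def char_features(word, i, window=2):
    """Characters in a window around position i (a simplification)."""
    return {f"c{k}": (word[i + k] if 0 <= i + k < len(word) else "_")
            for k in range(-window, window + 1)}

# Toy training data: label 1 marks "a morpheme boundary follows this character".
# Transliterated stand-in; the paper trains on the Al-Manar corpus.
words = [("wktbhm", [1, 0, 0, 1, 0, 0])]  # w+ktb+hm : "and they wrote them"
X_raw, y = [], []
for word, bounds in words:
    for i, label in enumerate(bounds):
        X_raw.append(char_features(word, i))
        y.append(label)

vec = DictVectorizer()
clf = GradientBoostingClassifier().fit(vec.fit_transform(X_raw).toarray(), y)
```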
    • Autism detection based on eye movement sequences on the web: a scanpath trend analysis approach

      Eraslan, Sukru; Yesilada, Yeliz; Yaneva, Victoria; Harper, Simon; Duarte, Carlos; Drake, Ted; Hwang, Faustina; Lewis, Clayton (ACM, 2020-04-20)
      Autism diagnosis is a subjective, challenging and expensive procedure that relies on behavioral, historical and parental report information. In our previous work, we proposed a machine learning classifier to be used as a potential screening tool, or in conjunction with other diagnostic methods, thus aiding established diagnostic procedures. The classifier uses the eye movements of people on web pages, but it considers only non-sequential data. It achieves its best accuracy by combining data from several web pages, and its accuracy varies across different web pages. In the present paper, we investigate whether it is possible to detect autism based on eye-movement sequences and to achieve stable accuracy across different web pages, so that the approach does not depend on specific pages. We used Scanpath Trend Analysis (STA), which is designed to identify the trending path of a group of users on a web page based on their eye movements. We first identify the trending paths of people with autism and of neurotypical people. To detect whether or not a person has autism, we calculate the similarity of his/her path to the trending paths of people with autism and of neurotypical people. If the path is more similar to the trending path of neurotypical people, we classify the person as neurotypical; otherwise, we classify her/him as a person with autism. We systematically evaluate our approach with an eye-tracking dataset of 15 verbal and highly independent people with autism and 15 neurotypical people on six web pages. Our evaluation shows that the STA approach performs better on individual web pages and provides more stable accuracy across different pages.
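
      The classification step reduces to comparing one scanpath against two group-level trending scanpaths and picking the closer one. A minimal sketch follows, assuming AOI sequences encoded as strings and a normalised edit-distance similarity; STA's own similarity computation may differ, and the sequences are illustrative, not from the study's data.

```python
def levenshtein(a, b):
    """Edit distance between two AOI-label sequences (strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(path, trend):
    return 1 - levenshtein(path, trend) / max(len(path), len(trend))

def classify(user_path, trend_asd, trend_nt):
    """Assign the label of the more similar trending scanpath."""
    if similarity(user_path, trend_asd) >= similarity(user_path, trend_nt):
        return "autism"
    return "neurotypical"

# Areas of interest encoded as letters, e.g. A = header, B = menu, C = body.
print(classify("ABCCD", trend_asd="ABCCCD", trend_nt="ABD"))
```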
    • Automatic question answering for medical MCQs: Can it go further than information retrieval?

      Ha, Le An; Yaneva, Viktoriya (RANLP, 2019-09-04)
      We present a novel approach to automatic question answering that does not depend on the performance of an information retrieval (IR) system and does not require training data. We evaluate the system performance on a challenging set of university-level medical science multiple-choice questions. Best performance is achieved when combining a neural approach with an IR approach, both of which work independently. Unlike previous approaches, the system achieves statistically significant improvement over the random guess baseline even for questions that are labeled as challenging based on the performance of baseline solvers.
    • Automatic translation of scientific documents in the HAL archive

      Lambert, Patrik; Schwenk, Holger; Blain, Frederic (European Language Resources Association (ELRA), 2012-05-31)
      This paper describes the development of a statistical machine translation system between French and English for scientific papers. This system will be closely integrated into the French HAL open archive, a collection of more than 100,000 scientific papers. We describe the creation of in-domain parallel and monolingual corpora, the development of a domain-specific translation system with the created resources, and its adaptation using monolingual resources only. These techniques allowed us to improve a generic system by more than 10 BLEU points.
    • Backtranslation feedback improves user confidence in MT, not quality

      Zouhar, Vilém; Novák, Michal; Žilinec, Matúš; Bojar, Ondřej; Obregón, Mateo; Hill, Robin L; Blain, Frédéric; Fomicheva, Marina; Specia, Lucia; Yankovskaya, Lisa; et al. (Association for Computational Linguistics, 2021-06-01)
      Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of the machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
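
      Backward-translation feedback of the kind studied here can be produced with any MT pair by round-tripping the output. The sketch below uses the public Helsinki-NLP Marian models on Hugging Face as stand-ins; these are not the systems used in the experiment.

```python
# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt")
    out = model.generate(**batch)
    return tok.decode(out[0], skip_special_tokens=True)

fwd_tok, fwd = load("Helsinki-NLP/opus-mt-en-cs")  # English -> Czech
bwd_tok, bwd = load("Helsinki-NLP/opus-mt-cs-en")  # Czech -> English

source = "The meeting has been moved to Friday morning."
target = translate(source, fwd_tok, fwd)
roundtrip = translate(target, bwd_tok, bwd)  # shown to the user as feedback
print(target, "|", roundtrip)
```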
    • BERGAMOT-LATTE submissions for the WMT20 quality estimation shared task

      Fomicheva, Marina; Sun, Shuo; Yankovskaya, Lisa; Blain, Frédéric; Chaudhary, Vishrav; Fishel, Mark; Guzmán, Francisco; Specia, Lucia (Association for Computational Linguistics, 2020-11-30)
      This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2, focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multilingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a lightweight alternative to the neural-based models.
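
      A representative glass-box indicator is the decoder's own confidence in its output, e.g. the mean token log-probability obtained by force-decoding the hypothesis. A sketch with a public Marian model follows; the language pair and model are illustrative assumptions, not the shared-task systems.

```python
# pip install transformers sentencepiece torch
import torch
from transformers import MarianMTModel, MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de").eval()

def avg_logprob(source, hypothesis):
    """Force-decode `hypothesis` and return its mean token log-probability,
    usable as an unsupervised quality indicator (higher = more confident)."""
    enc = tok([source], return_tensors="pt")
    labels = tok(text_target=hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss  # mean token NLL
    return -loss.item()

print(avg_logprob("The cat sat on the mat.", "Die Katze saß auf der Matte."))
```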
    • Bilexical embeddings for quality estimation

      Blain, Frédéric; Scarton, Carolina; Specia, Lucia (Association for Computational Linguistics, 2017-09)
    • Bilingual contexts from comparable corpora to mine for translations of collocations

      Taslimipoor, Shiva; Mitkov, Ruslan; Corpas Pastor, Gloria; Fazly, Afsaneh (Springer, 2018-03-21)
      Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
    • Bridging the gap: attending to discontinuity in identification of multiword expressions

      Rohanian, Omid; Taslimipoor, Shiva; Kouchaki, Samaneh; Ha, Le An; Mitkov, Ruslan (Association for Computational Linguistics, 2019-06-05)
      We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.
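
      The gating mechanism mentioned above can be realised as a learned sigmoid gate interpolating the two branch representations. Below is a minimal PyTorch sketch; the `GatedFusion` name and the dimensions are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Combine two token-level representations with a learned sigmoid gate,
    in the spirit of the paper's GCN + self-attention combination."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_gcn, h_attn):
        g = torch.sigmoid(self.gate(torch.cat([h_gcn, h_attn], dim=-1)))
        return g * h_gcn + (1 - g) * h_attn

# Toy shapes: batch of 2 sentences, 7 tokens, 64-dim encodings per branch.
h_gcn, h_attn = torch.randn(2, 7, 64), torch.randn(2, 7, 64)
fused = GatedFusion(64)(h_gcn, h_attn)
print(fused.shape)  # torch.Size([2, 7, 64])
```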
    • Characters or morphemes: how to represent words?

      Üstün, Ahmet; Kurfalı, Murathan; Can, Burcu (Association for Computational Linguistics, 2018)
      In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units affects the quality of the word representations positively. We introduce a morpheme-based model and compare it against word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on the different segmentations, which are weighted by an attention mechanism. We performed experiments on Turkish, as a morphologically rich language, and English, with a comparatively poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages than character-based and character n-gram level models, since the morphemes help to incorporate more syntactic knowledge in learning, which makes morpheme-based models better at syntactic tasks.
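
      The attention over candidate segmentations can be sketched as: embed each segmentation's morphemes, pool them, score the pooled vectors, and take the softmax-weighted sum. A toy PyTorch version follows; the morpheme inventory and pooling choice are hypothetical, not the paper's exact model.

```python
import torch
import torch.nn as nn

class SegmentationAttention(nn.Module):
    """Weight candidate segmentations of a word by attention and sum them."""
    def __init__(self, n_morphemes, dim):
        super().__init__()
        self.emb = nn.Embedding(n_morphemes, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, segmentations):
        # Each segmentation is a list of morpheme ids; embed and mean-pool it.
        segs = torch.stack([self.emb(torch.tensor(s)).mean(0) for s in segmentations])
        weights = torch.softmax(self.score(segs).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * segs).sum(0)

# Toy morpheme inventory: 0="kitap", 1="lar", 2="da", 3="kitapla", 4="rda"
word_vec = SegmentationAttention(5, 32)([[0, 1, 2], [3, 4]])  # "kitaplarda"
print(word_vec.shape)  # torch.Size([32])
```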
    • Classifying referential and non-referential it using gaze

      Yaneva, Victoria; Ha, Le An; Evans, Richard; Mitkov, Ruslan (Association for Computational Linguistics (ACL), 2018-10-31)
      When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper, we use eye-tracking data to learn how humans perform this disambiguation. We use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguistic-based approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
    • Clustering word roots syntactically

      Ozturk, Mustafa Burak; Can, Burcu (IEEE, 2016-06-23)
      Distributional representations of words are used for both syntactic and semantic tasks. In this paper, two different methods are presented for clustering word roots. In the first method, the distributional model word2vec [1] is used for clustering word roots, whereas distributional approaches are generally applied to words. For this purpose, the distributional similarities of roots are modelled and the roots are divided into syntactic categories (noun, verb, etc.). In the second method, two different models are proposed: an information-theoretical model and a probabilistic model. With a metric [8] based on mutual information and another metric based on Jensen-Shannon divergence, similarities of word roots are calculated and clustering is performed using these metrics. Clustering word roots plays a significant role in other natural language processing applications, such as machine translation and question answering, and in other applications that involve language generation. The resulting clusters achieve a purity of 0.92.
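
      The Jensen-Shannon-based metric compares roots via the divergence between their context distributions. A minimal SciPy sketch over toy co-occurrence counts follows; the vocabulary and counts are illustrative, not the paper's corpus statistics.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def context_distribution(root, cooccurrence, vocab):
    """Normalised counts of contexts seen next to `root` (a toy context model)."""
    counts = np.array([cooccurrence.get((root, w), 0) for w in vocab], dtype=float)
    return counts / counts.sum()

# Illustrative Turkish-like co-occurrence counts; verbal roots ("git", "gel")
# share suffix contexts, while the nominal root ("ev") differs.
vocab = ["-di", "-ler", "-im", "cok"]
cooc = {("git", "-di"): 9, ("git", "-im"): 3, ("gel", "-di"): 8, ("gel", "-im"): 4,
        ("ev", "-ler"): 7, ("ev", "-im"): 5}

roots = ["git", "gel", "ev"]
dists = {r: context_distribution(r, cooc, vocab) for r in roots}
for a in roots:
    for b in roots:
        if a < b:
            # SciPy returns the JS *distance* (sqrt of the divergence), base 2 here.
            print(a, b, round(jensenshannon(dists[a], dists[b], base=2), 3))
```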
    • Collaborative machine translation service for scientific texts

      Lambert, Patrik; Senellart, Jean; Romary, Laurent; Schwenk, Holger; Zipser, Florian; Lopez, Patrice; Blain, Frederic (Association for Computational Linguistics, 2012-04-30)
      French researchers are required to frequently translate into French the descriptions of their work published in English. At the same time, the need for French people to access articles in English, or for international researchers to access theses or papers in French, is poorly addressed through the use of generic translation tools. We propose the demonstration of an end-to-end tool integrated into the HAL open archive for enabling efficient translation of scientific texts. This tool can give translation suggestions adapted to the scientific domain, improving the BLEU score of a generic system by more than 10 points. It also provides a post-editing service which captures user post-editing data that can be used to incrementally improve the translation engines. Thus it is helpful for users who need to translate or to access scientific texts.
    • Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

      Yaneva, Victoria; Orăsan, Constantin; Evans, Richard; Rohanian, Omid (Association for Computational Linguistics, 2017-09-08)
      Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. At the same time, the available user-evaluated corpora are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.