• National Scientific Performance Evolution Patterns: Retrenchment, Successful Expansion, or Overextension

      Thelwall, Mike; Levitt, Jonathan M. (Wiley-Blackwell, 2017-11-17)
      National governments would like to preside over an expanding and increasingly high-impact science system, but are these two goals largely independent or closely linked? This article investigates the relationship between changes in the share of the world’s scientific output and changes in relative citation impact for 2.6 million articles from 26 fields in the 25 countries with the most Scopus-indexed journal articles from 1996 to 2015. There is a negative correlation between expansion and relative citation impact, but their relationship varies. China, Spain, Australia, and Poland were successful overall across the 26 fields, expanding both their share of the world’s output and their relative citation impact, whereas Japan, France, Sweden, and Israel saw decreases in both. In contrast, the USA, UK, Germany, Italy, Russia, the Netherlands, Switzerland, Finland, and Denmark all enjoyed increased relative citation impact despite a declining share of publications. Finally, India, South Korea, Brazil, Taiwan, and Turkey all experienced sustained expansion but a recent fall in relative citation impact. These results may partly reflect changes in the coverage of Scopus and the selection of fields.
    • Native language identification of fluent and advanced non-native writers

      Sarwar, Raheem; Rutherford, Attapol T; Hassan, Saeed-Ul; Rakthanmanon, Thanawin; Nutanong, Sarana (Association for Computing Machinery (ACM), 2020-04-30)
      Native Language Identification (NLI) aims to identify the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture variations in language-usage patterns within a text sample, and (iii) lack any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture variations in language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language: English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
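      The top-k SST voting idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature vectors, the cosine similarity measure, and the toy corpus below are all assumptions made for demonstration.

```python
from collections import Counter
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_native_language(query_vec, corpus, k=3):
    """Rank corpus samples by stylistic similarity to the query, then take
    a probabilistic vote among the top-k most similar samples (SSTs)."""
    ranked = sorted(corpus, key=lambda s: cosine(query_vec, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k  # predicted label and its vote share

# Toy corpus: (topic-independent style vector, native-language label).
corpus = [
    ([0.9, 0.1, 0.2], "French"),
    ([0.8, 0.2, 0.1], "French"),
    ([0.1, 0.9, 0.8], "German"),
    ([0.2, 0.8, 0.9], "German"),
]
label, prob = predict_native_language([0.85, 0.15, 0.15], corpus, k=3)
```

      With k=3, two of the three nearest samples are French, so the query is labeled French with probability 2/3.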
    • Natural language processing for mental disorders: an overview

      Calixto, Iacer; Yaneva, Viktoriya; Cardoso, Raphael; Bojar, Ondrej; Dash, Satya Ranjan; Parida, Shantipriya; Tello, Esaú Villatoro; Acharya, Biswaranjan (CRC Press, 2022-09-13)
    • Neural sentiment analysis of user reviews to predict user ratings

      Gezici, Bahar; Bolucu, Necva; Tarhan, Ayca; Can, Burcu (IEEE, 2019-11-21)
      The significance of user satisfaction is increasing in the competitive open source software (OSS) market. Application stores let users submit feedback for applications in the form of reviews or ratings. Developers are informed about bugs or additional requirements through this feedback and use it to increase the quality of the software. Moreover, potential users rely on this information as a success indicator when deciding whether to download an application. Since it is usually costly to read all the reviews and evaluate their content, the ratings are taken as the basis for assessment. This makes the consistency of review contents with their ratings important for a sound evaluation of applications. In this study, we use recurrent neural networks to analyze reviews automatically and thereby predict user ratings from the reviews. We apply transfer learning from a large gold-standard dataset of Amazon Customer Reviews. We evaluate the performance of our model on three mobile OSS applications in the Google Play Store and compare the predicted ratings with the users' original ratings. The predicted ratings achieve an accuracy of 87.61% against the original ratings, which is promising for obtaining ratings from reviews, especially when ratings are absent or inconsistent with the reviews.
    • Neural text normalization for Turkish social media

      Goker, Sinan; Can, Burcu (IEEE, 2018-12-10)
      Social media has become a rich data source for natural language processing tasks with its worldwide use; however, it is hard to process social media data due to its informal nature. Text normalization is the task of transforming noisy text into its canonical form. It generally serves as a preprocessing step for other NLP tasks applied to noisy text. In this study, we apply two approaches to Turkish text normalization: a Contextual Normalization approach using distributed representations of words, and a Sequence-to-Sequence Normalization approach using neural encoder-decoder models. Because existing approaches for Turkish and other languages are mostly rule-based, new rules must be added to the normalization model to detect new error patterns that arise as language use on social media changes. In contrast to rule-based approaches, the proposed approaches offer the advantage of normalizing different error patterns that change over time, simply by training on a new dataset and updating the normalization model. The proposed methods therefore address language-change dependency in social media by updating the normalization model without defining new rules.
    • New directions in the study of family names

      Hanks, Patrick; Boullón Agrelo, Ana Isabel (Consello da Cultura Galega, 2018-12-28)
      This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions.
    • New versions of PageRank employing alternative Web document models

      Thelwall, Mike; Vaughan, Liwen (Emerald Group Publishing Limited, 2004)
      Introduces several new versions of PageRank (the link-based Web page ranking algorithm), based on an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages by directory and domain gives promising alternatives, particularly when Web links are the object of study. The new algorithms based on these alternatives were used to rank four sets of Web pages, and the results were compared with rankings by human subjects. The outcomes were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites, but not for ranking pages from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked come from the same site or directory.
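      The core idea of the alternative document models can be sketched as a two-step process: collapse the page-level link graph to a coarser unit (directory or domain), then run standard PageRank on the collapsed graph. The sketch below is an illustration under those assumptions, with a made-up three-page graph; it is not the authors' code.

```python
def pagerank(links, d=0.85, iters=50):
    """Basic PageRank by power iteration. `links` maps each node to the
    list of nodes it links to."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, targets in links.items():
            if targets:
                share = d * rank[u] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += d * rank[u] / n
        rank = new
    return rank

def aggregate(page_links, unit):
    """Collapse a page-level graph to a coarser document model
    (e.g. directory or domain), dropping within-unit links."""
    agg = {}
    for src, targets in page_links.items():
        agg.setdefault(unit(src), set()).update(
            unit(t) for t in targets if unit(t) != unit(src))
    return {u: sorted(ts) for u, ts in agg.items()}

# Hypothetical page-level graph.
page_links = {
    "a.com/x/1": ["a.com/x/2", "b.com/y/1"],
    "a.com/x/2": ["b.com/y/1"],
    "b.com/y/1": ["a.com/x/1"],
}
domain = lambda url: url.split("/")[0]
ranks = pagerank(aggregate(page_links, domain))
```

      At the domain level the toy graph collapses to a symmetric two-node cycle, so both domains receive equal rank; the same pages ranked individually would not.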
    • News stories as evidence for research? BBC citations from articles, books and Wikipedia

      Kousha, Kayvan; Thelwall, Mike (John Wiley & Sons, 2017-07-17)
      Although news stories target the general public and are sometimes inaccurate, they can serve as sources of real-world information for researchers. This article investigates the extent to which academics exploit journalism, using content and citation analyses of online BBC News stories cited by Scopus articles. A total of 27,234 Scopus-indexed publications have cited at least one BBC News story, with a steady annual increase. Citations were more common in the arts and humanities (2.8% of publications in 2015) and social sciences (1.5%) than in medicine (0.1%) and science (<0.1%). Surprisingly, half of the sampled Scopus-cited science and technology (53%) and medicine and health (47%) stories were based on academic research rather than otherwise unpublished information, suggesting that researchers have chosen a lower-quality secondary source for their citations. Nevertheless, the BBC News stories most frequently cited by Scopus, Google Books, and Wikipedia introduced new information on many different topics, including politics, business, economics, statistics, and reports about events. Thus, news stories are mediating real-world knowledge into the academic domain, a potential cause for concern.
    • Nine terminology extraction Tools: Are they useful for translators?

      Costa, Hernani; Zaretskaya, Anna; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam (MultiLingual, 2016-04-01)
      Terminology extraction tools (TETs) have become an indispensable resource in education, research, and business. Today, users can find a great variety of terminology extraction tools of all kinds, each offering different features. Among many other areas, these tools are especially helpful in the professional translation setting. We do not know, however, whether the existing tools have all the features necessary for this kind of work. In search of an answer, we examined nine selected tools available on the market to find out whether they provide the features translators value most.
    • NLP-enhanced self-study learning materials for quality healthcare in Europe

      Urbano Mendaña, Míriam; Corpas Pastor, Gloria; Seghiri Domínguez, Míriam; Aguado de Cea, G; Aussenac-Gilles, N; Nazarenko, A; Szulman, S (Université Paris 13, 2013-10)
      In this paper we present an overview of the TELL-ME project, which aims to develop innovative e-learning tools and self-study materials for teaching vocationally-specific languages to healthcare professionals, helping them to communicate at work. The TELL-ME e-learning platform incorporates a variety of NLP techniques to provide an array of diverse work-related exercises, self-assessment tools and an interactive dictionary of key vocabulary and concepts aimed at medics for Spanish, English and German. A prototype of the e-learning platform is currently under evaluation.
    • Not all international collaboration is beneficial: The Mendeley readership and citation impact of biochemical research collaboration

      Sud, Pardeep; Thelwall, Mike; Statistical Cybermetrics Research Group; School of Mathematics and Computer Science; University of Wolverhampton; Wulfruna Street Wolverhampton WV1 1SB UK (Wiley Blackwell, 2015-05-13)
      Biochemistry is a highly funded research area that is typified by large research teams and is important for many areas of the life sciences. This article investigates the citation impact and Mendeley readership impact of biochemistry research from 2011 in the Web of Science according to the type of collaboration involved. Negative binomial regression models are used that incorporate, for the first time, the inclusion of specific countries within a team. The results show that, holding other factors constant, larger teams robustly associate with higher impact research, but including additional departments has no effect and adding extra institutions tends to reduce the impact of research. Although international collaboration is apparently not advantageous in general, collaboration with the USA, and perhaps also with some other countries, seems to increase impact. In contrast, collaborations with some other nations associate with lower impact, although both findings could be due to factors such as differing national proportions of excellent researchers. As a methodological implication, simpler statistical models would have found international collaboration to be generally beneficial and so it is important to take into account specific countries when examining collaboration.
    • NP animacy identification for anaphora resolution

      Orasan, Constantin; Evans, Richard (American Association for Artificial Intelligence, 2007)
      In anaphora resolution for English, animacy identification can play an integral role in the application of agreement restrictions between pronouns and candidates, and as a result, can improve the accuracy of anaphora resolution systems. In this paper, two methods for animacy identification are proposed and evaluated using intrinsic and extrinsic measures. The first method is a rule-based one which uses information about the unique beginners in WordNet to classify NPs on the basis of their animacy. The second method relies on a machine learning algorithm which exploits a WordNet enriched with animacy information for each sense. The effect of word sense disambiguation on the two methods is also assessed. The intrinsic evaluation reveals that the machine learning method reaches human levels of performance. The extrinsic evaluation demonstrates that animacy identification can be beneficial in anaphora resolution, especially in the cases where animate entities are identified with high precision.
    • Object and subject Heavy-NP shift in Arabic

      Mohamed, Emad (Research in Corpus Linguistics, 2014-12-31)
      In order to examine whether Arabic has Heavy Noun Phrase Shifting (HNPS), I have extracted from the Prague Arabic Dependency Treebank a data set in which a verb governs either an object NP and an Adjunct Phrase (PP or AdvP) or a subject NP and an Adjunct Phrase. I have used binary logistic regression where the criterion variable is whether the subject/object NP shifts, with the following predictor variables: heaviness (the number of tokens in the NP or adjunct), part-of-speech tag, verb disposition (i.e., whether the verb has a history of taking double objects or sentential objects), NP number, NP definiteness, and the presence of referring pronouns in either the NP or the adjunct. The results show that only object heaviness and adjunct heaviness are useful predictors of object HNPS, while subject heaviness, adjunct heaviness, subject part-of-speech tag, definiteness, and adjunct head POS tags are active predictors of subject HNPS. I also show that HNPS can in principle be predicted from sentence structure.
    • Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech

      Mandl, Thomas; Modha, Sandip; Shahi, Gautam Kishore; Madhu, Hiren; Satapara, Shrey; Majumder, Prasenjit; Schäfer, Johannes; Ranasinghe, Tharindu; Zampieri, Marcos; Nandini, Durgesh; et al. (Association for Computing Machinery, 2021-12-13)
      The HASOC track is dedicated to the evaluation of technology for finding offensive language and hate speech. HASOC is creating a multilingual data corpus, mainly for English and under-resourced languages (Hindi and Marathi). This paper presents one HASOC subtrack with two tasks. In 2021, we organized the classification task for English, Hindi, and Marathi. Task 1 comprises two classification subtasks: Subtask 1A is a binary classification of tweets into offensive and non-offensive, while Subtask 1B asks for a fine-grained classification of offensive tweets into hate, profane, and offensive. Task 2 consists of identifying offensive tweets given additional context in the form of the preceding conversation. During the shared task, 65 teams submitted 652 runs. This overview paper briefly presents the task descriptions, the data, and the results obtained from the participants' submissions.
    • Parsing AUC result-figures in machine learning specific scholarly documents for semantically-enriched summarization

      Safder, Iqra; Batool, Hafsa; Sarwar, Raheem; Zaman, Farooq; Aljohani, Naif Radi; Nawaz, Raheel; Gaber, Mohamed; Hassan, Saeed-Ul (Taylor & Francis, 2021-11-14)
      Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC), and their associated captions, from full-text documents. First, we classify the extracted figures and analyze them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1,000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of the specialized summaries using ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.
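      The trapezoidal rule mentioned in the abstract is a standard numerical technique: the area under a curve is approximated by summing trapezoid areas over each sampled interval. Below is a minimal sketch; the ROC sample points are hypothetical, standing in for values read off an extracted result-figure.

```python
def trapezoidal_auc(xs, ys):
    """Approximate the area under a curve from sampled (x, y) points
    by summing the trapezoid area of each interval."""
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))

# Hypothetical points read off an extracted ROC plot.
fpr = [0.0, 0.25, 0.5, 1.0]   # false positive rate (x axis)
tpr = [0.0, 0.6, 0.8, 1.0]    # true positive rate (y axis)
auc = trapezoidal_auc(fpr, tpr)  # → 0.7
```

      Finer sampling of the extracted plot (more intervals) yields a better approximation of the reported AUC.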
    • Patent citation analysis with Google

      Kousha, Kayvan; Thelwall, Mike (Wiley-Blackwell, 2015-09-23)
      Citations from patents to scientific publications provide useful evidence about the commercial impact of academic research, but automatically searchable databases are needed to exploit this connection for large-scale patent citation evaluations. Google covers multiple different international patent office databases but does not index patent citations or allow automatic searches. In response, this article introduces a semiautomatic indirect method via Bing to extract and filter patent citations from Google to academic papers with an overall precision of 98%. The method was evaluated with 322,192 science and engineering Scopus articles from every second year for the period 1996–2012. Although manual Google Patent searches give more results, especially for articles with many patent citations, the difference is not large enough to be a major problem. Within Biomedical Engineering, Biotechnology, and Pharmacology & Pharmaceutics, 7% to 10% of Scopus articles had at least one patent citation but other fields had far fewer, so patent citation analysis is only relevant for a minority of publications. Low but positive correlations between Google Patent citations and Scopus citations across all fields suggest that traditional citation counts cannot substitute for patent citations when evaluating research.
    • Phrase level segmentation and labelling of machine translation errors

      Blain, Frédéric; Logacheva, Varvara; Specia, Lucia; Calzolari, Nicoletta (conference chair); Choukri, Khalid; Declerck, Thierry; Grobelnik, Marko; Maegaard, Bente; Mariani, Joseph; Moreno, Asuncion; et al. (European Language Resources Association (ELRA), 2016-05)
      This paper presents our work towards a novel approach for Quality Estimation (QE) of machine translation based on sequences of adjacent words, so-called phrases. This new level of QE aims to provide a natural balance between word-level and sentence-level QE, which are either too fine-grained or too coarse for some applications. However, phrase-level QE poses an intrinsic challenge: how to segment a machine translation into sequences of words (contiguous or not) that represent an error. We discuss three possible segmentation strategies for automatically extracting erroneous phrases. We evaluate these strategies against phrase-level annotations produced by humans, using a new dataset collected for this purpose.
    • The Portrait of Dorian Gray: A corpus-based analysis of translated verb + noun (object) collocations in Peninsular and Colombian Spanish

      Valencia Giraldo, M. Victoria; Corpas Pastor, Gloria (Springer, 2019-09-18)
      Corpus-based Translation Studies have promoted research on the features of translated language by focusing on the process and product of translation from a descriptive perspective. Some of these features have been proposed by Toury [31] under the term laws of translation, namely the law of growing standardisation and the law of interference. The law of standardisation appears to be particularly at play in diatopy, and more specifically in the case of transnational languages (e.g. English, Spanish, French, German). In fact, some studies have revealed a tendency to standardise the diatopic varieties of Spanish in translated language [8, 9, 11, 12]. This paper focuses on verb + noun (object) collocations in Spanish translations of The Portrait of Dorian Gray by Oscar Wilde. Two different varieties have been chosen (Peninsular and Colombian Spanish). Our main aim is to establish whether the Colombian Spanish translation actually matches the variety spoken in Colombia or whether it is closer to general or standard Spanish. For this purpose, the techniques used to translate this type of collocation in both Spanish translations are analysed. Furthermore, the diatopic distribution of these collocations is studied by means of large corpora.
    • Predicting lexical complexity in English texts: the CompLex 2.0 dataset

      Shardlow, Matthew; Evans, Richard; Zampieri, Marcos (Springer, 2022-03-23)
      Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
    • Predicting literature's early impact with sentiment analysis in Twitter

      Hassan, SU; Aljohani, NR; Idrees, N; Sarwar, R; Nawaz, R; Martínez-Cámara, E; Ventura, S; Herrera, F (Elsevier, 2019-12-14)
      Traditional bibliometric techniques gauge the impact of research through quantitative indices based on citation data. However, due to the lag time involved in citation-based indices, it may take years to comprehend the full impact of an article. This paper seeks to measure the early impact of research articles through the sentiments expressed in tweets about them. We claim that articles cited in positive or neutral tweets have a more significant impact than those not cited at all or cited in negative tweets. We used the SentiStrength tool and improved it by incorporating new opinion-bearing words into its sentiment lexicon pertaining to scientific domains. We then classified the sentiment of 6,482,260 tweets linked to 1,083,535 publications covered by Altmetric.com. Using positive and negative tweets as independent variables and the citation count as the dependent variable, linear regression analysis showed a weak positive prediction of high citation counts across 16 broad disciplines in Scopus. Introducing an additional indicator to the regression model, i.e. the number of unique Twitter users, improved the adjusted R-squared value of the regression analysis in several disciplines. Overall, an encouraging positive correlation between tweet sentiments and citation counts showed that Twitter-based opinion may be exploited as a complementary predictor of literature's early impact.
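      The regression setup described in the abstract (tweet counts as predictor, citation counts as response) can be illustrated with a one-predictor least-squares fit. The numbers below are made up for demonstration; the actual study used multiple predictors and large-scale Altmetric data.

```python
def ols_fit(x, y):
    """Closed-form least-squares fit of y ≈ a + b*x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx  # intercept so the line passes through the means
    return a, b

# Hypothetical data: positive-tweet counts vs. later citation counts.
pos_tweets = [0, 2, 5, 8, 10]
citations  = [1, 4, 9, 15, 20]
a, b = ols_fit(pos_tweets, citations)  # b > 0: positive association
```

      A positive slope b here corresponds to the weak positive prediction the study reports; adjusted R-squared would then quantify how much of the citation variance the tweet-based predictors explain.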
      © 2019 Elsevier B.V. Traditional bibliometric techniques gauge the impact of research through quantitative indices based on the citations data. However, due to the lag time involved in the citation-based indices, it may take years to comprehend the full impact of an article. This paper seeks to measure the early impact of research articles through the sentiments expressed in tweets about them. We claim that cited articles in either positive or neutral tweets have a more significant impact than those not cited at all or cited in negative tweets. We used the SentiStrength tool and improved it by incorporating new opinion-bearing words into its sentiment lexicon pertaining to scientific domains. Then, we classified the sentiment of 6,482,260 tweets linked to 1,083,535 publications covered by Altmetric.com. Using positive and negative tweets as an independent variable, and the citation count as the dependent variable, linear regression analysis showed a weak positive prediction of high citation counts across 16 broad disciplines in Scopus. Introducing an additional indicator to the regression model, i.e. ‘number of unique Twitter users’, improved the adjusted R-squared value of regression analysis in several disciplines. Overall, an encouraging positive correlation between tweet sentiments and citation counts showed that Twitter-based opinion may be exploited as a complementary predictor of literature's early impact.