The Pros and Cons of the Use of Altmetrics in Research Assessment

Many indicators derived from the web have been proposed to supplement citation-based indicators in support of research assessments. These indicators, often called altmetrics, are available commercially from Altmetric.com and Elsevier’s Plum Analytics or can be collected directly. These organisations can also deliver altmetrics to support institutional self-evaluations. The potential advantages of altmetrics for research evaluation are that they may reflect important non-academic impacts and may appear before citations when an article is published, thus providing earlier impact evidence. Their disadvantages often include susceptibility to gaming, data sparsity, and difficulties translating the evidence into specific types of impact. Despite these limitations, altmetrics have been widely adopted by publishers, apparently to give authors, editors and readers insights into the level of interest in recently published articles. This article summarises evidence for and against extending the adoption of altmetrics to research evaluations. It argues that whilst systematically-gathered altmetrics are inappropriate for important formal research evaluations, they can play a role in some other contexts. They can be informative when evaluating research units that rarely produce journal articles, when seeking to identify evidence of novel types of impact during institutional or other self-evaluations, and when selected by individuals or groups to support narrative-based non-academic claims. In addition, Mendeley reader counts are uniquely valuable as early (mainly) scholarly impact indicators to replace citations when gaming is not possible and early impact evidence is needed. However, organisations using alternative indicators need to recruit or develop in-house expertise to ensure that they are not misused.

For example, job interviewers in some fields might check the names or impact factors of journals mentioned on applicants' CVs to get a quick impression of their ability to produce high quality work (Campbell, 2008). At the opposite extreme, national policy makers are likely to rely on purely quantitative citation and output indicators to assess country-wide performance in comparison to competitors and in terms of trends over time (e.g., Gurney & Boucherie, 2017). Whilst it seems reasonable to count citations on the basis that, on average, citation counts reflect the extent to which publications have proven useful for subsequent research (Merton, 1973; Van Raan, 1998), they do not reflect impacts outside of academia. In the current climate of increasing pressure on researchers to demonstrate the societal impact of their research (the "impact agenda": Eynon, 2012), this is an important limitation.
Historically, the first systematic attempt to quantify non-academic research impacts may be patent analysis in the 1970s (Narin, 1994). The rationale was that patents offer commercial protection to novel inventions and so counting patents granted to universities or citations to academic research from patents might give commercial value indicators for research. This initiative was only partially successful because patents are not widely used in many industries, many patents have little real value, they do not capture the complexity of the innovation process even in industries where they are widely used (e.g., Adelman & DeAngelis, 2006), and the individual citations are problematic (Oppenheim, 2000). In addition, commercial value is only one type of non-academic impact. Researchers might also generate societal benefits by adding to culture or the arts, by improving health outcomes, by helping non-governmental organisations, or by supporting the various services of the state in other ways (Holmberg, Bowman, Bowman, Didegah, & Kortelainen, 2019). Thus, in an ideal world, there would be a wide range of indicators for all the different types of societal impacts that academic research can have.
In the absence of any non-academic impact indicators becoming widely used, with the partial exception of patents and patent citations, two decades ago the web was recognised as a potential new source of evidence, estimating the impacts of academic research from citations to it in various types of webpages. These new webometric indicators counted citations either from the entire web (Vaughan & Shaw, 2003) or from specific parts, such as online syllabi (Kousha & Thelwall, 2008) and Google Books (Kousha & Thelwall, 2009).
The rise of the social web led a decade ago to a renewed call for creating new societal impact indicators (Priem, Taraborelli, Groth, & Neylon, 2010). For example, since Twitter was used by a substantial minority of the population, it was argued that counts of tweets about academic research might be used as a new indicator of public interest in research, such as the societal impact of paediatric dentistry research through tweet counts (Garcovich & Adobes Martin, 2020). Altmetrics might also reflect public engagement, which is similar (Schultz, McKeown, & Wynn, 2020). Other factors being equal, research that attracted the public's attention might be most likely to have a positive societal impact, as well as giving earlier impact evidence due to the fast-moving nature of Twitter. This led to a range of public interest indicators from the social web, including counts of tweets, blog posts, and Facebook posts mentioning research. These were called altmetrics in recognition that they were potentially (complementary) alternatives to citations. Two companies were created that systematically gathered altmetrics and packaged them for use in academia, Altmetric (Liu & Adie, 2013) and Plum Analytics (Ortega, 2018), with their results not being identical (Bar-Ilan, Halevi, & Milojević, 2019; Ortega, 2018). Altmetrics tended to be easier to gather than webometrics since they could often be gathered fully automatically through applications programming interfaces (APIs) offered by the social web sites (and now sometimes through Crossref Event Data: Ortega, 2018), making them commercially viable in a way that webometric indicators were not. Nevertheless, the altmetric-based companies have also harnessed and adapted some webometric indicators to add to their altmetrics.
Today (Ma 2020), those needing to evaluate academic research and finding citations to be inadequate can either purchase alternative indicators from one of the commercial sellers or collect them themselves using a range of known methods. This article summarises the current advantages and disadvantages of alternative indicators for both societal and early impact.

Evidence so far
The theories and hypotheses about the potential of alternative indicators to reflect societal impacts need to be evaluated before the indicators can be used in practice. This is important because citation analysis research has discovered substantial hidden complexity in the use of citations for evaluation (Moed, 2006) and it is likely that most alternative indicators have similar or more substantial issues because they are derived from sources that are not peer reviewed and do not derive from the relatively well understood scholarly publication process (Gamble, Traynor, Gruzd, Mai, Dormuth, & Sketris, 2020). More specifically, and taking the example of Twitter, the unknowns included how often typical academic research is tweeted, by whom, and why. In addition, it was not known whether human tweeting is dwarfed by Twitter bots, whether academic tweeters outnumber the public when citing academic research, and whether the fraction of the public that tweet about research gives meaningful insights into public engagement with research. All these questions are difficult to answer and are compounded by likely disciplinary differences in the use of Twitter to engage with academic research.
In the face of the above complex and interlinked issues, a set of standard strategies has been adopted to evaluate alternative indicators. These sacrifice depth for practicality and address relatively easily tested properties. The following five strategies, listed in descending order of popularity, are the most common.
• Correlation of alternative indicators with citation counts, with statistically significant positive values being taken as evidence of the value of the alternative indicator. This is almost a paradox because the point of an alternative indicator is to give different information from citation counts. Nevertheless, the main test to evaluate them is a correlation test for whether they give overlapping information. This test is justified on the grounds that (a) almost any genuine impact indicator ought to correlate with citation counts since, other factors being equal, more impactful research is more likely to attract citations from follow-up studies, and (b) statistically significant correlations are evidence that the alternative indicators are at least not random, which would otherwise be a distinct possibility.
• Prevalence of alternative indicators, with higher proportions of non-zero scores being taken as evidence of greater utility. Indicators that return a score of 0 for nearly all journal articles have little discriminatory power and so are not useful for many research evaluation tasks.
• Content analysis of citer motivations, with the prevalence of impact-type motivations giving evidence of face validity. This applies to indicators that have weak face validity, such as tweets, but not to indicators like syllabus mentions that have a clear interpretation (educational value in this case).
• Surveys of users, with impact-related motivations giving evidence of face validity.
• Predicting future citation counts with earlier alternative indicators, giving evidence of predictive power.
Some or all of the above have been used to assess a range of different indicators. The evidence is summarised below, from the strongest to the weakest indicator. Composite indicators, such as the Altmetric.com overall score, should not be used for formal evaluations because the individual component indicators can be selected instead for a more meaningful analysis.
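The correlation and prevalence tests listed above can be illustrated with a short sketch. It is not taken from any of the cited studies: the data and function names are hypothetical, and Spearman rank correlation is an assumption made here because count data of this kind are usually highly skewed.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def prevalence(scores):
    """Proportion of outputs with a non-zero indicator score."""
    return sum(1 for s in scores if s > 0) / len(scores)


# Hypothetical per-article counts for one field and publication year.
citations = [0, 3, 12, 1, 7, 25, 0, 4]
tweets = [0, 1, 5, 0, 0, 9, 2, 0]

print("Spearman correlation:", round(spearman(tweets, citations), 3))
print("Prevalence of tweets:", prevalence(tweets))
```

A positive coefficient here would only show that the indicator is not random with respect to citations. It says nothing about what type of impact the indicator reflects, which is why the content analysis and survey strategies are also needed.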

Mendeley readers
Mendeley is a social reference sharing site that allows users to record academic documents that they are interested in and then helps them to build reference lists from them (Gunn, 2014). The number of people that have registered a document in Mendeley is known as its Mendeley reader count on the basis that most users register documents that they have read or intend to read (Mohammadi, Thelwall, & Kousha, 2016), and is an altmetric (Li, Thelwall, & Giustini, 2012). About 1 in 20 researchers use Mendeley (Van Noorden, 2014) so its reader counts underestimate the number of readers for an article. These readers tend to be junior researchers or students (Mohammadi, Thelwall, Haustein, & Larivière, 2015) and so Mendeley reader counts reflect scholarly and partly educational impact (except for mathematics: Thelwall, 2017c) rather than societal impact. The value of Mendeley reader counts is as early indicators of academic impact because readers appear a year before citations (Thelwall & Sud, 2016). This is possible because Mendeley is unaffected by the citing article publishing delays that slow citations.
There is strong evidence in support of the use of Mendeley as an early impact indicator for journal articles in all academic fields. Mendeley reader counts correlate strongly or moderately with citation counts in all academic fields after a few years (so there are enough citations for comparisons) (Thelwall, 2017b) and are at least as common as citations (Thelwall, 2017b; Zahedi, Costas, & Wouters, 2014). Mendeley readers also have moderate positive correlations with expert judgements of the quality of research (HEFCE, 2015). Early Mendeley readers correlate positively with longer term citations so they can be used to predict eventual citation counts (Thelwall & Nevill, 2018; Thelwall, 2018). Mendeley readers can also be useful for conference papers in fields where they are important (Thelwall, 2020), and are useful, but less prevalent, for books and dissertations (Kousha & Thelwall, 2019).

Health website citations
Health and biomedical publications have the richest alternative indicators because of the proliferation of online health-related sites that cite academic research. Some of these can be mined for high quality citation information. High quality websites typically cite a small fraction of the literature, but each citation can give valuable direct evidence of societal benefits. These include websites for clinical trials, national guidelines for health professionals (Kryl, Allen, Dolby, Sherbon, & Viney, 2012), and directories of medical drug information (Thelwall, Kousha, & Abdoli, 2017). Post-publication peer review impact type labels in the F1000 biomedical website are also a potential source of evidence of societal impact for biomedical research (Bornmann & Leydesdorff, 2013; Mohammadi & Thelwall, 2013).

Google Books citations
Traditional citation indexes, including the Web of Science and Scopus, primarily index academic journal articles but also index some conference papers, magazines, books and other outputs. Research that is drawn upon by other books more than by journal articles will therefore have its impact underestimated by traditional citation counts. This problem can be resolved by using Google Books as an indirect citation index by combining searches for citation metadata with results filtering. In book-based fields, this gives robust results that are more numerous than those from Scopus and the Web of Science, and the procedure can also be used to capture citations to books (Kousha & Thelwall, 2015).

Online syllabus mentions
Academic research in some fields can attract a substantial audience of undergraduates or postgraduates if it provides accessible information about a topic that is taught in universities. A simple way to evaluate the educational value of an academic output would be to count how many course syllabi mention it. Whilst most syllabi are presumably private, a substantial minority are on the public web and citations from them to specific journal articles or books can be obtained by using appropriate search engine queries (Mas Bleda & Thelwall, 2018).

Wikipedia citations
The free public encyclopaedia Wikipedia is a repository of a wide range of academic and other information, and part of its function is to convey scholarly knowledge to a non-specialist public. It also seems to summarise many academic topics in ways that would be useful for academics in other fields. Citations from Wikipedia may therefore represent endorsements of the importance of research contributions from the perspective of the public or non-specialist researchers. Since a low proportion (5%) of recent academic articles have been cited in Wikipedia and correlations between Wikipedia citation counts and Scopus citation counts are low (but statistically significant and positive) (Kousha & Thelwall, 2017a), Wikipedia citations may have limited value for some types of impact assessment when there are large numbers of documents to evaluate. Wikipedia citation counts might be characterised as indicators of information impact, although this is a vague term.

Blogs
Science blogs often discuss journal articles and other public research to either critique it or to translate it for a non-scientific audience (Shema, Bar-Ilan, & Thelwall, 2015). Blog citations are rare, occurring for 6% of recent articles (estimate from combining: Thelwall, Haustein, Larivière, & Sugimoto, 2013; Zahedi, Costas, & Wouters, 2014). Citations from blogs have a weak positive correlation with citation counts (Thelwall, Haustein, Larivière, & Sugimoto, 2013) and blog citations in the year of publication of an article can be used to predict longer term citation counts (Shema, Bar-Ilan, & Thelwall, 2014), so blog citations are robust impact indicators. Like Wikipedia citations, their scarcity is a major drawback for many practical applications.

Patents
Patents contain citations to other patents and sometimes to academic research to help explain the invention or similar innovations. Since the role of a patent is financial, a citation from a patent to an academic output is an indicator of relationship to commercial value. The Derwent Patent Citation Index is an example of a citation index that can be used for patent citation analysis (Takano, Mejia, & Kajikawa, 2016). Whilst patent citations are not usually characterised as a type of altmetric, they can be gathered from the Google Patents website and so can be a webometric indicator. Patent citations are rare, however, with under 1% of journal articles receiving a patent citation in most fields, although the proportion may reach 7%-10% for Biomedical Engineering, Biotechnology, and Pharmacology & Pharmaceutics (Kousha & Thelwall, 2017b). Patent citation counts have a low but positive correlation with citation counts. Because they have reasonable face validity, this suggests that, when present, they reflect a dimension of commercial impact or value for academic research.

Grey literature citations
Research targeting commercial, government, or non-governmental organisations may be more likely to be cited by grey literature than by journal articles and their impact may therefore not be reflected by traditional citation counts. Grey literature seems to be often posted online as a free white paper, leaflet or report (especially in economics: Mili, 2000). Although it is possible to count citations from online grey literature to some extent by querying Google or Bing for PDF files citing academic research, the results can mix educational and academic documents with other literature and so do not have high face validity (Wilkinson, Sud, & Thelwall, 2014). Altmetric.com extracts citations from some government websites where they can reasonably be taken to represent governmental influence, however. These grey literature outputs are also themselves cited by academic research (Bickley, Kousha, & Thelwall, 2019).

Tweets
Twitter allows users to make frequent short posts, which were originally limited to 140 characters. These tweets can be used to post links to academic research. They typically include the article title or a brief summary but rarely include a judgement or an explanation of why an article might be useful (Holmberg & Thelwall, 2014; Thelwall, Tsou, Weingart, Holmberg, & Haustein, 2013). Two thirds of recent articles have been tweeted (Zahedi, Costas, & Wouters, 2014). However, tweets have low positive or negative correlations with citation counts and are therefore unreliable indicators of any type of impact (Haustein, Larivière, Thelwall, Amyot, & Peters, 2014; Thelwall, Haustein, Larivière, & Sugimoto, 2013). For example, a direct study found low correlations (0.09 overall) between tweet counts and expert judgements of the quality of UK-authored journal articles (HEFCE, 2015), which is too low for most practical uses. According to one survey, most users tweeting links to journal articles are not in academia (Mohammadi, Thelwall, Kwasny, & Holmes, 2018), with tweeting academics sometimes attempting to reflect a specialist authority through Twitter (Joubert & Costas, 2019). Overall, tweet counts are common and may reflect a combination of attention and publicity for articles but there is little evidence that they reflect general public interest or any other specific type of impact.

Thelwall: The Pros and Cons of the Use of Altmetrics in Research Assessment

Facebook wall posts
Whilst the majority of Facebook activity probably occurs in private groups, altmetrics have only been collected from public pages. Facebook wall posts are short news-like posts that may announce or briefly discuss academic publications. Public Facebook wall posts linking to academic articles, as collected by Altmetric.com, are relatively rare, occurring for about 12% of recent articles in Altmetric.com data (estimate from combining: Thelwall, Haustein, Larivière, & Sugimoto, 2013; Zahedi, Costas, & Wouters, 2014). Public Facebook wall posts have a very weak positive correlation with citation counts (0.05), suggesting that they may have little value, perhaps being mainly used for publicity. On the positive side, only 4% of a sample of Facebook accounts posting public links to health or medical journal articles were individual academics, with a majority (58%) not being related to academia (Mohammadi, Barahmand, & Thelwall, 2020). Thus, there is some evidence that Facebook wall posts might reflect non-academic interest in research, but public posts are rare and lack convincing evidence of their value as an indicator.

Others
A variety of other webometric and altmetric indicators have been proposed and investigated and more are likely to appear in the future. In addition, other alternative indicators can be used to estimate the reach or impact of non-standard academic outputs, such as blogs, videos, software and datasets (Konkiel, 2013;Piwowar, 2013). These are normally excluded in citation analyses but can be useful products of research. For example, TED Talks are high profile and sometimes translate academic research for a general audience (Romanelli, Cain, & McNamara, 2014), and some academics produce high quality popular YouTube videos to popularise science (Haran & Poliakoff, 2011).

Advantages of altmetrics
Early impact evidence: In practice, the most important advantage of many alternative indicators is that they give early impact evidence. Informally, they might be consulted by academics for their own recently published articles to check whether they are receiving any social media attention, whether for personal feedback or impact evidence for a CV (Piwowar & Priem, 2013). For formal research evaluations, early impact evidence can help to shorten the delay between conducting research and being able to evaluate it, whether evaluating individual researchers, departments, universities, or funding programmes. This allows more recent research to be evaluated and allows indicators to support decision-making at a stage when publications are too young to have attracted citations (Thelwall, Kousha, Dinsmore, & Dolby, 2016). The early advantage applies to altmetrics but not most webometrics, since these typically appear more slowly. In one innovative application, Mendeley was used for early impact evidence in a randomised controlled trial of the effect of publicity on interest in medical articles (Kudlow, Cockerill, Toccalino, Dziadyk, Rutledge, Shachak, & Eysenbach, 2017).
Wider impact evidence: All altmetrics and webometrics reflect impact that is at least partly different from citation impact. If all types of research impacts should be valued, then alternative indicators give the potential to access quantitative evidence about a wider range of impacts than citation counts alone.
Wider output types: Alternative indicators can also be used for quantitative evidence of the impact of non-standard outputs, such as YouTube videos and grey literature, for which citation counts are unavailable or not relevant.
Finer-grained impact context: A few alternative indicators can give fine-grained impact context, such as the nationalities, occupations and subject areas of interest of the readers of articles (Thelwall & Maflahi, 2015; Mohammadi & Thelwall, 2014).

Disadvantages of altmetrics
Difficulty collecting: Whilst altmetrics may be obtained from a commercial provider on a large scale, most webometrics are time-consuming to collect. Given that there are many different webometrics and data collection is not straightforward, this is probably the biggest obstacle to their use in practice. In parallel, a lack of people trained in altmetrics or webometrics affects the time needed to identify and gather them. There seems to be an increased awareness of altmetrics (Aung, Zheng, Erdt, Aw, Sin, & Theng, 2019) and this may lead to increased knowledge and willingness to learn how to use them effectively.
Low coverage: Many alternative indicators are non-zero for a small minority of articles, weakening their power to differentiate between the average impacts of sets of outputs. Thus, they may only be useful for large document sets. For example, patent citations are rare but common enough to be used to compare universities for a dimension of technological impact (Orduna-Malea, . Altmetrics seem to be most prevalent and most useful in health-related fields, but also relatively prevalent in the humanities, social sciences, and life sciences (Costas, Zahedi, & Wouters, 2015).
Difficulty with field normalisation: Alternative indicator scores are difficult to assess without benchmark values, such as those obtained through field normalisation (Thelwall, 2017a). Generating enough data for field normalisation or benchmarking against other groups increases the amount of data required. Field normalisation can use subject categories from traditional citation indexes but indicators for non-standard outputs are likely to need an alternative method to classify them by subject. In practice, field normalisation is probably rarely used for alternative indicators and so the influence of fields must be taken into account by evaluators, for example by not comparing health-related altmetrics to mathematics-related altmetrics.
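To make the idea of a benchmark value concrete, the following sketch divides each article's indicator score by the mean score of articles from the same field and year, so that a value above 1 indicates above-average attention for that field. This is one simple illustrative choice and not necessarily the approach of the cited work; the record layout, field labels, and counts are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def field_normalise(records):
    """Map article id -> score divided by the mean score of its field and year.

    Each record is a dict with 'id', 'field', 'year' and 'score' keys
    (a hypothetical layout). Returns None for an article when the
    field-year mean is zero, since no benchmark exists in that case.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["field"], r["year"])].append(r["score"])
    baselines = {key: mean(scores) for key, scores in groups.items()}

    normalised = {}
    for r in records:
        baseline = baselines[(r["field"], r["year"])]
        normalised[r["id"]] = r["score"] / baseline if baseline > 0 else None
    return normalised


# Hypothetical reader counts: health articles attract far more than mathematics.
records = [
    {"id": "h1", "field": "health", "year": 2018, "score": 40},
    {"id": "h2", "field": "health", "year": 2018, "score": 20},
    {"id": "h3", "field": "health", "year": 2018, "score": 0},
    {"id": "m1", "field": "maths", "year": 2018, "score": 4},
    {"id": "m2", "field": "maths", "year": 2018, "score": 2},
    {"id": "m3", "field": "maths", "year": 2018, "score": 0},
]

for article_id, value in field_normalise(records).items():
    print(article_id, value)
```

In this toy data, the health article with 40 readers and the mathematics article with 4 readers both normalise to 2.0, which illustrates why raw counts should not be compared across fields. The small group sizes used here would be far too sparse for a real benchmark, reflecting the data requirement noted above.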
Incomplete and biased coverage of impact areas: No alternative indicator is guaranteed to capture evidence of any type of impact. They all also have biases due to the method with which they are created or used. For example, tweet counts as an indicator of public interest are biased against people that don't use Twitter, including most of China. International biases can influence comparisons between countries (Fairclough & Thelwall, 2015; Orduna-Malea & López-Cózar, 2019), including international biases in terms of the data gathered by commercial altmetric providers (Ortega, 2020).
Incomplete coverage of impact types: Some types of societal impact are not captured by any alternative indicator and so a set of articles could have societal impact and still score zero on all altmetrics. For example, research designed to improve farming methods in developing nations seems extremely unlikely to leave an altmetric trace reflecting its uptake by local farmers.
Lack of quality control: Almost all alternative indicators are susceptible to deliberate or accidental manipulation and therefore cannot be used for evaluations where those evaluated are aware of the assessment method in advance (Wouters & Costas, 2012). Related to this and the above issues, researchers may feel that altmetric-related evaluations undermine them since they inadequately capture impact dimensions (Regan & Henchion, 2019).

Conclusions and recommendations
Whilst there are many advantages and disadvantages of altmetrics and webometrics, they cannot compete with peer review for assessing research quality, or with citation counts as a robust quantitative indicator to support peer review or to replace it in contexts where peer review is impractical or undesirable. Alternative indicators have most value in contexts when citations are insufficient, which is primarily when non-academic impacts need to be assessed, when early impact evidence is needed, or when non-standard outputs are to be assessed. In these contexts, evaluators need to consider the likely added value of altmetrics in terms of whether they are capable of giving enough of the type of evidence needed for the evaluation and, if so, whether the cost of obtaining them (from a commercial provider or gathering them) justifies the value that they provide. Given the above limitations, alternative indicators should only be used to inform human judgements and not replace them. In addition, the human judges need to be aware of their limitations when interpreting them. Inappropriate uses can potentially damage the research system that they are trying to measure, whether by generating unintended consequences or by demoralising those evaluated (Wilsdon, Allen, Belfiore, et al., 2015).
Organisations that may potentially get value from altmetrics include research funders, departments and universities, but altmetrics can also be applied to any collection of scholarly outputs for other purposes, such as to assess interest in an academic journal (Barbic, Tubman, Lam, & Barbic, 2016). Organisations that need alternative indicators may need to employ appropriately trained scientometricians to gather them or understand offerings from commercial providers and to safeguard against inappropriate interpretations of them (i.e., responsible use of metrics). Alternatively, organisations should ensure that a member of the evaluation team learns how to gather, process and evaluate alternative indicators so that they can be used, when relevant, but are not given too much weight.