Tools to support reproducibility and evaluation of research

An update on progress with recent projects

As research practice and communication move increasingly online, the scholarly record is expanding in content and scope. In addition to the traditional text-based outcomes of journal articles and books, it now includes data sets, software and materials generated in the process of research. A complete research record helps to ensure the integrity of the research process and, as recognised in the research integrity concordat, funders, researchers and those who employ researchers share the responsibility to support the highest standards of rigour and integrity.

In recent years there have been growing concerns about the completeness of the scholarly record, which affects the reproducibility of research. One way to find out whether new research findings are reliable is to repeat the original research that produced them. If this fails, further questions need to be asked about the validity of the original research.

At Jisc we have explored two aspects of these concerns to see what role tools might play in improving the reproducibility of research.

In partnership with the University of Edinburgh we explored how mining data from articles on animal-based research helps to detect which factors influencing reproducibility (such as sample size calculation, control group allocation and compliance with the ARRIVE guidelines for animal research) are reported in these articles. If researchers, funders, publishers and institutions had a dashboard showing them where these key factors are not being declared, they could design interventions, such as awareness raising and training, to help improve the situation. Together with our partners we will now consider next steps to develop the dashboard further and to ensure that it is used in a responsible way. We are also considering how the participatory design and development approach of Jisc analytics labs, which led this work, might be applied to a wide range of other questions of research policy and practice.
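A dashboard of this sort ultimately rests on per-article checks aggregated into counts. The sketch below illustrates the general idea with naive keyword patterns; the criterion names and patterns are illustrative assumptions, not the lab's actual text-mining models:

```python
import re

# Illustrative patterns for rigour criteria (not the lab's actual models).
CRITERIA = {
    "sample_size_calculation": re.compile(r"sample size (was )?calculat|power analysis", re.I),
    "randomisation": re.compile(r"randomi[sz]ed|randomly (allocated|assigned)", re.I),
    "arrive_guidelines": re.compile(r"ARRIVE guidelines"),
}

def reported_criteria(full_text):
    """Flag which rigour criteria appear to be reported in one article."""
    return {name: bool(pat.search(full_text)) for name, pat in CRITERIA.items()}

def dashboard_counts(articles):
    """Aggregate per-article flags into the counts a dashboard would plot."""
    counts = {name: 0 for name in CRITERIA}
    for text in articles:
        for name, flag in reported_criteria(text).items():
            counts[name] += flag
    return counts
```

In practice the interesting output is the gap: articles where a criterion is never mentioned at all, which is where interventions such as training could be targeted.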

Sharing the primary data collected during a research study supports reproducibility but, in many cases, the implementation of data sharing may be less effective than it appears. Together with the University of Wolverhampton and representatives from the UK Reproducibility Network (UKRN) we explored the extent of sharing of summary statistics from primary human genome-wide association studies (GWAS), as an example of data sharing in favourable circumstances, and whether such checks can be automated. Articles sharing GWAS summary statistics usually reported this in a data availability statement within the article. We found that, of 330 articles classified as GWAS in PubMed, only 10.8% reported sharing GWAS summary statistics in some form, although this increased substantially from 4.3% in 2010 to 16.8% in 2017. Information about whether data was shared can be extracted from data availability statements, but it is more problematic to identify the exact nature of the shared data. Data availability statements are vague about what is shared, and journals in a given field have no standard or policy regarding what should be included. Descriptions of the exact nature of the data available would help not only automation but also researchers looking for relevant data for a new study. We are now exploring, in a community meeting, in what contexts automated approaches to extracting data availability statements could be useful and how representative the GWAS example is.
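To give a flavour of why detecting whether data was shared is tractable while identifying the exact nature of the data is harder, here is a minimal sketch of reading data availability statements with heuristic patterns. The patterns and labels are assumptions for illustration, not the approach built in the project:

```python
import re

# Hypothetical heuristics for a data availability statement (DAS);
# the project's extraction approach may differ.
SHARED = re.compile(
    r"(summary statistics|data) (are|is|were|have been) (available|deposited|shared)"
    r"|available (at|from|in) (the )?[A-Za-z]", re.I)
WITHHELD = re.compile(r"available (up)?on (reasonable )?request|not publicly available", re.I)

def classify_das(statement):
    """Rough three-way label for a data availability statement."""
    if WITHHELD.search(statement):
        return "on-request"
    if SHARED.search(statement):
        return "shared"
    return "unclear"
```

Even this toy version shows the limits of the approach: it can say *that* something is available, but not whether it is full summary statistics, top hits only, or something else entirely.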

The completeness of research publications is also a necessary condition for effective peer review, both for journals and for funders, so that they can determine the robustness of research findings or proposals. While peer review continues to play a pivotal role in validating research results, it has come under strain due to a number of factors: it is perceived as slow and inefficient, it is seen as a potential source of bias, and retractions of peer-reviewed papers are increasing. Working with the University of Wolverhampton we’ve explored how technology could be used to support reviewers and editors. As part of the Jisc open metrics lab we have experimented with sentiment analysis of F1000 open peer review reports to build a tool which can detect positive and negative evaluations in these reports.
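As a rough illustration of sentiment analysis applied to review reports, the sketch below scores sentences against small positive and negative word lists. The lexicons here are invented for illustration; the project's tool is considerably more sophisticated:

```python
# Illustrative evaluative word lists, not the project's lexicon.
POSITIVE = {"clear", "rigorous", "convincing", "well-written", "novel", "sound"}
NEGATIVE = {"unclear", "flawed", "unconvincing", "weak", "incomplete", "misleading"}

def polarity(sentence):
    """Score one review sentence: >0 positive, <0 negative, 0 neutral."""
    words = {w.strip(".,;()").lower() for w in sentence.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)
```

A real tool would need to handle negation ("not convincing"), hedging, and the distinctive register of referee reports, which is why machine-learned approaches are usually preferred over bare lexicons.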
Transparency may influence the extent or nature of judgement biases, and there is now growing evidence about the implications of various open peer review models. The project also tested whether the national affiliations of article authors and of referees reading previously published open peer review reports have an influence on peer review judgements. The evidence from this work could help to build confidence in the open peer review process. We also discussed opportunities and ethical challenges around the use of AI technologies in peer review in a briefing paper. We are currently reviewing the outputs from this project and will publish them shortly on this blog.

Universities have an interest in most forms of peer review as they are influenced by its outcomes. Some universities are directly involved in peer review to prepare their submissions for the Research Excellence Framework (REF). We have been working with the University of Bristol on novel ways to support institutions preparing for funder research assessment. The aim of this work was to develop and evaluate a prediction market tool that universities can use to rank outputs for potential REF submissions as part of their internal REF planning. The idea behind this is that the probability of events can be measured in terms of the bets people are willing to make. Prediction markets have been used to forecast, for example, elections and film revenues, and are used here to predict ratings of research outputs. We expect this approach may be most valuable as part of an evaluation pipeline combining machine learning, prediction markets and close reading by reviewers and, with Bristol, we are continuing pilots to establish what factors influence the best composition of this pipeline in different contexts.
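One standard mechanism for running such a market is Hanson's logarithmic market scoring rule (LMSR), in which an automated market maker quotes prices that can be read directly as probabilities. The sketch below is a generic LMSR implementation, given as background; it is not necessarily the mechanism used in the Bristol tool:

```python
import math

class LMSRMarket:
    """Automated market maker using the logarithmic market scoring rule."""

    def __init__(self, outcomes, b=10.0):
        self.b = b                      # liquidity parameter
        self.shares = {o: 0.0 for o in outcomes}

    def _cost(self, shares):
        # LMSR cost function: C(q) = b * ln(sum_i exp(q_i / b))
        return self.b * math.log(sum(math.exp(q / self.b) for q in shares.values()))

    def price(self, outcome):
        """Current price of an outcome, interpretable as its probability."""
        denom = sum(math.exp(q / self.b) for q in self.shares.values())
        return math.exp(self.shares[outcome] / self.b) / denom

    def buy(self, outcome, quantity):
        """Buy shares in an outcome; returns the cost of the trade."""
        before = self._cost(self.shares)
        self.shares[outcome] += quantity
        return self._cost(self.shares) - before
```

In the REF setting, the outcomes could be the star ratings of an output; participants who believe an output will be rated 4* buy shares in that outcome, pushing its price (the market's collective probability estimate) upwards.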

Citations are another aspect of the scholarly record, and metrics derived from citation data continue to play a strong role in research assessment, at least for journal articles in most STEM disciplines. It is good to see that there has been a world-wide push back against inappropriate use of metrics to evaluate research, such as the use of the Journal Impact Factor (JIF). In the UK, the Forum for Responsible Research Metrics provides advocacy and leadership on the responsible use of research metrics. The Forum also supported an experiment around metrics for OA monographs as part of the Jisc open metrics lab. In HSS disciplines there is a whole range of issues with the use of citation metrics, including the coverage of these disciplines within the conventional databases commonly used for bibliometric analysis, diverse citation cultures and the importance of local-language publication. However, researchers in these disciplines do cross-reference the bibliographies of books to find out which works are cited, for example to understand new disciplinary spaces. At the moment, researchers would need to go to a library to do this. As open access books, and the policies and models that support them, are growing, there are now opportunities to help researchers do this more quickly. We are working with Birkbeck, University of London to develop a tool that extracts references from OA monographs and shows which items are cited in common among the selected titles. We have shown that it is feasible to build this type of tool and the next step will be to consider the results in a community meeting in October. The project report will be published together with the recommendations from the meeting.

The recent launch of the Research on Research Institute gives all stakeholders in the sector a huge opportunity to improve our understanding of the complexities in the research and innovation system. This is particularly important as both Science Europe and the European Universities Association survey their members about research assessment practices, reflecting a widespread view that those practices may not contribute as much as they could to a healthy research culture and to research integrity. Jisc looks forward to playing our part in this exciting movement. For more information about our work in this area, please contact or

Research Analytics Webinar

Monday, 1 July 2019, 10:00-11:30
Webinar Recording

The aim of this webinar is to highlight recent results and discuss the progress of our R&D work in the area of research analytics.

The webinar will consist of four short presentations including demonstrations of recent work and will outline future plans for a potential research analytics service. This will be followed by a Q&A session which will offer an opportunity for questions and feedback on the planned service.

Read an overview of Jisc’s work in the area of research analytics on the Jisc scholarly communications blog.

Who should attend?
Research administrators, managers and leaders
Researchers interested in the research process and management
Research librarians
Research funders


Chris Keene, Head of library and scholarly futures, Jisc
Overview of Jisc’s R&D work in the area of Research Analytics
Presentation slides

Analytics Labs, Adam Green, Senior data and visualisation officer, Jisc.
Jisc Analytics Labs is an approach to the development of decision-making tools underpinned by data. This presentation will briefly outline this approach and then focus on the results of the reproducibility lab, which used data from articles on animal-based research to assess the degree to which factors affecting research reproducibility are reported.
Presentation slides

Data availability study
Mike Thelwall, Professor of Data Science, University of Wolverhampton
Primary data collected during a research study is increasingly shared and may be re-used for new research. The aim of this project was to assess the extent of data sharing of summary statistics of primary human genome-wide association studies (GWAS), as an example of data sharing in favourable circumstances in a particular discipline, and whether such checks can be automated. This presentation will summarise the findings of the project and demonstrate a tool to extract information from data availability statements.
Presentation slides

Prediction market
Jackie Thompson, Research Associate, University of Bristol
The aim of this project was to develop and evaluate a prediction market tool that higher education institutions can use to rank outputs for potential REF submissions as part of their internal REF (Research Excellence Framework) planning. A prediction market is a bit like the stock market, except instead of investing in companies, participants invest in the outcomes of future events (in this case, ratings of research outputs). This presentation will give some background to the project and details of the prediction markets that have been tested with Units of Assessment at the University of Bristol. It will include a demo of the tool used and the lessons learned from the first round of markets.
Presentation slides

Research Analytics service
Rob Johnson, Research Consulting
Jisc’s plans for a potential new research analytics service have started with a discovery phase to help define the problems around research analytics as the starting point for possible solutions. At the end of this phase there will be a brief defining the work required to produce a research analytics service. We have been working with a number of institutions and stakeholders to explore the problems faced by institutional leaders, managers, professionals and academic staff concerning the planning, management and evaluation of research, and where better analytic insight could help to address these problems. This presentation will highlight the progress made in defining these problems, what we have learnt and plans for the next stage in the discovery process.
Presentation slides


Open-Access Monographs and Metrics: More than counting beans

Guest Post by Martin Paul Eve

If one were to create a ranking of terms feared by those working in the humanities, “bibliometrics” would have to be up there. Differing, non-comprehensive citation cultures, accompanied by long citation half-lives not usually seen in the natural sciences, mean that, when bibliometrics are used for assessment purposes, they simply don’t work well in the humanities disciplines. When a book takes five years to write, for example, one won’t see a citation network that reflects the current state of a field within the types of timescale that are useful to research funders. *

Yet, if those in the humanities do not want bibliometrics to be used for assessment, we are all actually already used to using the citation graph in another type of utilitarian exercise: cross-referencing in order to gain an understanding of a field. For example, whenever I need to get my head around a new field of scholarship, I have a tried and tested method. I will usually go to the British Library and order ten or so books that seem to have pertinent titles. I will then begin to cross-reference the bibliographies of these books. In other words, I want to know: what do these titles cite in common? What, exactly, are the key secondary works that are cited by all of these books? It is my gamble that the most-cited items will be good pieces to read in order to rapidly understand a new disciplinary space.

This is a labour-intensive process. It involves travelling to a physical space in the first place – our national research library – which on its own has implications for accessibility; as a disabled academic, I am not always in a brilliant state to make my way into a physical library space. This is then followed by a search of the catalogue, a wait for the delivery of the items, and then a laborious process of note-taking, observation and cross-referencing across hundreds of permutations of bibliographic entries.

What if, in the contemporary digital publishing landscape, there were a better way? For many years now, there has been a steady growth in the number of academic books that are published open access; that is, free of price and permission barriers. Free to read and free to re-use. Several thousand of these are listed in the Directory of Open Access Books (DOAB), providing an ever-expanding corpus of high-quality, peer-reviewed monographs that are openly and digitally accessible.

It is with great pleasure, then, that with funding from Jisc’s Open Metrics Lab, the Centre for Technology and Publishing at Birkbeck can today announce our experimental project to build a bibliographic intersect tool for open-access monographs. The project has three components that Jisc is planning to make available for anyone to re-use:

  1. A literature review of existing material on bibliometrics for open-access monographs and bibliographic intersection tools;
  2. A tool that will allow people to download a corpus from the DOAB;
  3. A tool that will parse references from open-access monographs and tell the user which items are cited in common among the selected titles.
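The heart of the third component is, in essence, a set intersection over bibliographies. A minimal sketch of that core step (the cited-work identifiers below are hypothetical placeholders; real input would come from parsed references):

```python
from collections import Counter

def common_citations(bibliographies, min_books=2):
    """Given one set of cited-work identifiers per monograph, return the
    items cited by at least `min_books` titles, most-shared first."""
    counts = Counter()
    for bib in bibliographies:
        counts.update(set(bib))        # count each work once per book
    return [(work, n) for work, n in counts.most_common() if n >= min_books]
```

The hard part, of course, is not the intersection but getting from arbitrary free-text references to comparable identifiers in the first place, which is why the project is starting with a small subset of citations.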

As this is a tool that I have wanted for some time in my own capacity as a researcher, it is excellent to have support from Jisc in beginning the development work on this. That said, we have had to impose some limitations. While there are excellent tools like and Crossref’s citation resolution service – which we intend to use – we are going to have to work with a small subset of citations to begin with. Guaranteeing the universal parsing of arbitrary free-text input from any publisher in any style is well beyond the scope of this experimental exercise.

However, this is an exciting start to show what a citation graph – essentially, metrics for monographs – might achieve within a positive research context for the humanities. Rather than counting beans in order to assess researchers, we are interested in using the quantitative and cumulative weight of citation evidence as a way to accelerate the research process, to help with disability access, and to think through the capabilities of open access for our understanding of new areas.

Martin Paul Eve

About the author: Martin Paul Eve is Professor of Literature, Technology and Publishing at Birkbeck, University of London. He is a founder of the Open Library of Humanities, a member of the UUK Open Access Monographs Working Group, and author of Open Access and the Humanities: Contexts, Controversies and the Future, published openly by Cambridge University Press.


*Featured image: “the future of books” by Johan Larsson, used under the terms of a Creative Commons Attribution license.

Data availability and feasibility of validation

Can we develop an automated way to assess the availability of research data for a collection of journal articles and assess the extent to which the data are being made available in a FAIR way? *

Data sharing is important for academic research, both for validation of results and for re-use to address new research questions. A growing number of policies encourage data sharing to varying degrees but, in many cases, the implementation of data sharing may be less effective than it appears. Thus, new insights into the pain points faced by researchers in sharing data, and into the needs of readers, could serve as a basis to promote good practice in data sharing. Can new ways of evaluating the effectiveness of data sharing help to improve practice?

To take an example, many publishers require the author to include a data availability statement in a publication explaining how the relevant data can be accessed. ‘Availability’, however, can be interpreted in different ways, leading to different results in terms of who can access the data and how. Ideally, the data underlying research should be findable, accessible, interoperable, and reusable (FAIR) so that other researchers can locate and reuse the data in a meaningful way.

To help answer this, we are working with researchers from the Universities of Wolverhampton and Bristol to carry out a study to explore how authors are sharing the data associated with their research. We will examine the full text and data availability statements from a collection of articles to assess the availability of the underlying data and then consider the extent to which the data meet certain quality criteria in terms of format, reuse etc. The study will also explore the possibility of creating a method or indicator for the evaluation of research data sharing practice to help understand what this means in a particular discipline, and to support the agenda around recognising data as valuable output from the research process.

The study will include the following steps:

1. Identify and then assemble a corpus of research articles from a research discipline for which a specific type of research data should be available (in certain disciplines community standards require sharing of a particular data type and have a common standard for reporting data).
2. Assess whether data that were reported to be available (e.g. in a repository) can actually be found there.
3. Consider the means by which the data is shared. For example, is it adequate in terms of format, metadata provision, for reuse?
4. Devise an approach for reporting on the above tests in a concise form (i.e. develop an indicator).
5. Investigate the feasibility of scaling up or building a generalizable pipeline for similar analysis in other disciplines.

The study will look to automate steps 2–4 for a given corpus of research articles (with full text available) within the selected research discipline.
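One hypothetical shape for the indicator in step 4 is a simple fraction of checks passed per article, averaged over the corpus for step 5. The field names below are illustrative assumptions, not the study's final criteria:

```python
# Illustrative per-article checks corresponding roughly to steps 2-3.
CHECKS = ("statement_present", "data_found_at_location",
          "standard_format", "metadata_provided")

def sharing_indicator(article_checks):
    """Fraction of data-sharing checks passed for one article, 0.0-1.0."""
    return sum(bool(article_checks.get(c)) for c in CHECKS) / len(CHECKS)

def corpus_report(corpus):
    """Average the per-article indicator over a corpus of articles."""
    scores = [sharing_indicator(a) for a in corpus]
    return sum(scores) / len(scores)
```

A real indicator would likely weight the checks differently (finding the data at the stated location matters more than the mere presence of a statement), but the principle of collapsing several boolean tests into one concise score is the same.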

We decided to focus on genome-wide association studies (GWAS) as the data type. A GWAS is a study of genetic variation across the entire human genome, designed to identify genetic associations with observable traits (e.g. smoking behaviour) or with the presence of a disease or condition. GWAS data are widely reused and there are strong community norms around sharing this type of data. There are also likely to be issues with their ‘availability’ and the format in which they are shared. Research involving GWAS data is often undertaken by large consortia, which means that data already need to be shared within the research group, making it a smaller step to share them more widely.

Project Team
University of Wolverhampton
– Mike Thelwall, Professor of Data Science
– Kayvan Kousha, Postdoctoral researcher
– Amalia Mas Bleda, Postdoctoral researcher
– Emma Stuart, Postdoctoral researcher
– Meiko Makita, Postdoctoral researcher
– Nushrat Khan, PhD student

University of Bristol
– Marcus Munafò, Professor of Biological Psychology
– Katie Drax, PhD student
Marcus and Katie are also representing the UK Reproducibility Network (@ukrepro)

The project runs from January 2019 to July 2019 and we will share updates on this blog along with other experiments as part of the open metrics lab.

*Featured image: “Share” by Carlos Maya, used under the terms of a Creative Commons Attribution license.

Leaving the gold standard

Guest Post by Cameron Neylon

See also a briefing paper written by Cameron Neylon for Jisc on the Complexities of Citation.

Citations, we are told, are the gold standard in assessing the outputs of research. When any new measure or proxy is proposed, the first question asked (although it is rarely answered with any rigour) is how this new measure correlates with the “gold standard of citations”. This is actually quite peculiar, not just because it raises the question of why citations came to gain such prominence, but also because the term “gold standard” is not without its own ambiguities. *

The original meaning of “gold standard” referred to economic systems where the value of currency was pegged to that of the metal; either directly through the circulation of gold coins, or indirectly where a government would guarantee notes could be converted to gold at a fixed rate. Such systems failed repeatedly during the late 19th and early 20th centuries. Because they coupled money supply – the total available amount of government credit – to a fixed quantity of bullion in a bank, they were incapable of dealing with large-scale and rapid changes. The Gold Standard was largely dropped in the wake of World War II and totally abandoned by the 1970s.

But in common parlance “gold standard” means something quite different to this fixed point of reference: it refers to the best available. In medical sciences the term is used to refer to treatments or tests that are currently regarded as the best available. The term itself has been criticised over the years, but it is perhaps more ironic that this notion of “best available” is in direct contradiction to the intent of the currency gold standard – that value is fixed to a single reference point for all time.

So are citations the best available measure, or the one that we should use as the basis for all comparisons? Or neither? For some time they were the only available quantitative measure of the performance of research outputs, the only other quantitative research indicators being naive measures of output productivity. Although records have long been made of journal circulation in libraries – and one-time UK Science Minister David Willetts has often told the story of choosing to read the “most thumbed” issue of journals as a student – these forms of usage data were not collated and published in the same ways as the Science Citation Index. Other measures, such as research income, reach, or even efforts to quantify influence or prestige in the community, have only become available for analysis relatively recently.

If the primacy of citations is largely a question of history, is there nonetheless a case to be made that citations are in some sense the best basis for evaluation? Is there something special about them? The short answer is no. A large body of theoretical and empirical work has looked at how citation-based measures correlate with other, more subjective, measures of performance. In many cases, at the aggregate level, those correlations or associations are quite good: as a proxy at the level of populations, citation-based indicators can be useful. But while much effort has been expended on seeking theories that connect individual practice to citation-based metrics, there is no basis for the claim that citations are in any way better (or, to be fair, any worse) than a range of other measures we might choose.

Actually, there are good reasons for thinking that no such theory can exist. Paul Wouters, developing ideas also worked on by Henry Small and Blaise Cronin, has carefully investigated the meaning that gets transmitted as authors add references, publishers format them into bibliographies, and indexes collect them to make databases of citations. He makes two important points. First, we should separate the idea of the in-text reference and bibliographic list – the things that authors create – from the citation database entry – the line in a database created by an index provider. Second, once we understand the distinction between these objects we see clearly how the meaning behind the act of the authors is systematically – and necessarily – stripped out by the process. While theorists may argue about the extent to which authors are seeking to assign credit in the act of referencing, all of that meaning has to be stripped out if we want citation database entries to be objects that we can count. As an aside, the question of whether we should count them, let alone how, does not have an obvious answer.

It can seem like the research enterprise is changing at a bewildering rate. And the attraction of a gold standard, of whatever type, is stability. A constant point of reference, even one that may be a historical accident, has a definite appeal. But that stability is limited and it comes at a price. The Gold Standard helped keep economies stable when the world was a simple and predictable place. But such standards fail catastrophically in two specific cases.

The first failure is when the underlying basis of trade changes, when the places work is done expands or shifts, when new countries come into markets, or when the kinds of value being created changes. Under these circumstances the basis of exchange changes and a gold standard can’t keep up. Similar to the globalisation of markets and value chains, the global expansion of research and the changing nature of its application and outputs with the advent of the web puts any fixed standard of value under pressure.

A second form of crisis is a gold rush. Under normal circumstances a gold standard is supposed to constrain inflation. But when new reserves are discovered and mined, hyperinflation can follow. The continued exponential expansion of scholarly publishing has led to year-on-year inflation of citation-database-derived indicators. Actual work and value become devalued if we continue to cling to the idea of a citation as a constant gold standard against which to compare ourselves.

The idea of a gold standard is ambiguous to start with. In practice citation data-based indicators are just one measure amongst many, neither the best available – whatever that might mean – nor an incontrovertible standard against which to compare every other possible measure. What emerges more than anything else from the work of the past few years on responsible metrics and indicators is the need to evaluate research work in its context.

There is not, and never has been, a “gold standard”. And even if there were, the economics suggests that it would be well past time to abandon it.

A briefing paper written for Jisc by Cameron Neylon – “The Complexities of Citation: How theory can support effective policy and implementation” – is available open access from the Jisc Repository.

Cameron Neylon


About the author: Cameron Neylon is an advocate for open access and Professor of Research Communications at the Centre for Culture and Technology at Curtin University. You can find out more about his work and get in touch with Cameron via his personal page Science in the Open.



*Featured image: “A real bag of gold” by cogdogblog@flickr, used under the terms of a Creative Commons Attribution license.

All citations are created equal

(Only some are more equal than others)

Today we introduce you to one of the Jisc-funded PhD students working at the Knowledge Media Institute (KMi), part of the Open University and located in Milton Keynes. David Pride is one of the team working on the joint Jisc/OU CORE project (COnnecting REpositories), which offers open access to over eight million research papers.
David Pride completed his MSc in Computer Science (with distinction) at the University of Hertfordshire in 2016 before starting his PhD at KMi in February of this year. David’s PhD supervisor is Dr Petr Knoth and his thesis topic is web-scale research analytics for identifying high performance and trends in academic research. In short, this involves using state-of-the-art text and data mining techniques to analyse datasets containing millions of academic papers to attempt to identify highly impactful and influential research. *

At KMi, all PhD students must complete a pilot project study within their first year. For his, David chose to undertake a review of several previous studies that have attempted to automatically categorise citations according to type, sentiment and influence. Current bibliometric methods, from the renowned Journal Impact Factor (JIF) to the h-index for individual authors, treat all citations equally. There is much empirical evidence demonstrating that treating all citations equally in this manner means that basic citation counts do not reflect the true picture of how a paper is being used. A piece of research may be highly cited because of its ground-breaking content or because it introduces a new methodology. However, it could also be highly cited because it is a survey paper that provides a rich background to a particular domain. Conversely, a paper may engender citations that refute or disagree with the original work. Whilst most citations are overtly neutral in sentiment, a certain percentage are negative. Yet, currently, all these citations are treated equally.

David’s work is also focused on developing new metrics that can leverage the full content of an academic paper to evaluate its quality, rather than relying on citation counts alone. He therefore continued the work of previous studies in using machine learning and natural language processing tools to automatically classify citations according to type and ‘influence’. Influence itself is an interesting concept and, in this case, refers to how influential the cited paper was on the citing paper: was the citation central to understanding the new work, or was it perfunctory, mentioned merely as part of the literature review, for example? If information about how a paper is being cited is available to academics, researchers and reviewers, this provides a much richer insight than is currently available from basic citation counts.
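A toy sketch of such a classifier, using features of the kind reported in this literature (how often a reference is cited in the text, and where). The features, thresholds and rules below are purely illustrative, not David's actual system, which uses machine learning rather than hand-written rules:

```python
def classify_influence(citation):
    """Toy rule-based label for a citation, from illustrative features."""
    score = 0
    if citation["in_text_mentions"] >= 3:      # cited repeatedly in the paper
        score += 1
    if citation["cited_in_methods"]:           # central to the new work
        score += 1
    if citation["only_in_related_work"]:       # likely perfunctory background
        score -= 1
    return "influential" if score >= 1 else "incidental"
```

A learned classifier replaces the hand-picked thresholds with weights fitted to annotated examples, but the underlying intuition, that location and repetition signal influence, is the same.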

Building on the work of Valenzuela et al. (2015) and Zhu et al. (2015), David developed a system to classify citations in a paper as either incidental or influential. Despite several difficulties along the way, the results of the experiments were overall extremely positive, and the resulting short paper was presented at the TPDL (Theory and Practice of Digital Libraries) 2017 conference and published in the Springer Lecture Notes in Computer Science. A full version of the paper was later accepted at the ISSI (International Society for Scientometrics and Informetrics) 2017 conference, where David presented his results in Wuhan, China.

Moving forward, David intends to address one of the major failings in this domain: the lack of a large-scale human-annotated dataset of citations to use when training classifiers for this task. It is believed that the results obtained previously can be significantly improved with a larger initial training set. Citation data is unbalanced in nature, with negative citations, for example, representing only about 4% of all citations. Training a classifier to accurately identify these citations requires a dataset of sufficient magnitude to contain enough examples of every class. A large-scale reference set containing citations annotated according to type, sentiment and influence would be an extremely valuable asset for researchers working in this domain.
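One common way to cope with such imbalance, alongside gathering more annotated data, is to weight classes by inverse frequency during training. The sketch below mirrors the widely used "balanced" weighting heuristic; it is an illustration of the general technique, not the project's method:

```python
def class_weights(labels):
    """Inverse-frequency weights so rare classes (e.g. negative citations,
    around 4% of the data) contribute as much to training as common ones."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    # weight = total / (n_classes * class_count), as in "balanced" weighting
    return {label: total / (len(counts) * n) for label, n in counts.items()}
```

With 96 neutral and 4 negative citations, the negative class gets a weight of 12.5 against roughly 0.52 for the neutral class, so misclassifying a rare negative citation costs the model far more during training.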

In the coming months, David will also be researching the peer review process and how well it correlates with current methodologies for tracking research excellence. He has some interesting data he is currently looking at and we’re looking forward to seeing what he produces in 2018!

*Featured image: “measurement” by flui., used under the terms of a Creative Commons Attribution license.