Data availability and feasibility of validation

Can we develop an automated way to assess the availability of research data for a collection of journal articles, and the extent to which those data are being made available in a FAIR way? *

Data sharing is important for academic research, both for validation of results and for re-use to address new research questions. A growing number of policies encourage data sharing to varying degrees but, in many cases, the implementation of data sharing may be less effective than it appears. Thus, new insights into the pain points faced by researchers in sharing data, and into the needs of readers, could serve as a basis to promote good practice in data sharing. Can new ways of evaluating the effectiveness of data sharing help to improve practice?

To take an example, many publishers require the author to include a data availability statement in a publication explaining how the relevant data can be accessed. ‘Availability’, however, can be interpreted in different ways, leading to different results in terms of who can access the data and how. Ideally, the data underlying research should be findable, accessible, interoperable, and reusable (FAIR) so that other researchers can locate and reuse the data in a meaningful way.

To help answer these questions, we are working with researchers from the Universities of Wolverhampton and Bristol to carry out a study exploring how authors are sharing the data associated with their research. We will examine the full text and data availability statements from a collection of articles to assess the availability of the underlying data, and then consider the extent to which the data meet certain quality criteria in terms of format, reuse etc. The study will also explore the possibility of creating a method or indicator for the evaluation of research data sharing practice to help understand what this means in a particular discipline, and to support the agenda around recognising data as a valuable output from the research process.

The study will include the following steps:

1. Identify and then assemble a corpus of research articles from a research discipline for which a specific type of research data should be available (in certain disciplines community standards require sharing of a particular data type and have a common standard for reporting data).
2. Assess whether data that were reported to be available (e.g. in a repository) can actually be found there.
3. Consider the means by which the data are shared. For example, are the format and metadata provision adequate for reuse?
4. Devise an approach for reporting on the above tests in a concise form (i.e. develop an indicator).
5. Investigate the feasibility of scaling up or building a generalizable pipeline for similar analysis in other disciplines.

The study would look to automate steps 2-4 for a given corpus of research articles (with full text available) within the selected research discipline.
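As a rough illustration of what automating part of steps 2-4 might involve, the sketch below sorts data availability statements into coarse categories and reports the share that point to a concrete location. It is not the project's method: the category names, the phrase list, the regular expressions and the functions `assess_statement` and `share_linked` are all assumptions made for this example, and it does not check whether the data can actually be found at the location given.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase list; a real study would need a validated coding scheme.
REQUEST_PHRASES = ("on request", "upon request", "from the corresponding author")

# Simple patterns for web links and DOIs (assumed, not exhaustive).
URL_PATTERN = re.compile(r"https?://[^\s)\]>,;]+")
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s)\]>,;]+")


@dataclass
class StatementAssessment:
    links: list
    dois: list
    category: str  # "linked", "on_request" or "unclear"


def assess_statement(statement: str) -> StatementAssessment:
    """Classify one data availability statement (illustrative only)."""
    # Strip a trailing full stop that belongs to the sentence, not the link.
    links = [u.rstrip(".") for u in URL_PATTERN.findall(statement)]
    dois = [d.rstrip(".") for d in DOI_PATTERN.findall(statement)]
    lowered = statement.lower()
    if links or dois:
        category = "linked"
    elif any(phrase in lowered for phrase in REQUEST_PHRASES):
        category = "on_request"
    else:
        category = "unclear"
    return StatementAssessment(links, dois, category)


def share_linked(statements) -> float:
    """Toy indicator: fraction of statements giving a link or DOI."""
    assessments = [assess_statement(s) for s in statements]
    if not assessments:
        return 0.0
    linked = sum(a.category == "linked" for a in assessments)
    return linked / len(assessments)
```

For instance, `assess_statement("Data are available from the corresponding author upon request.")` would fall into the `on_request` category here. A fuller pipeline would also need to follow each link or DOI and inspect the format and metadata of what it finds, which this sketch leaves out.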

We decided to focus on genome-wide association studies (GWAS) as the data type. A GWAS is a study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (e.g. smoking behaviour) or the presence of a disease or condition. GWAS data are widely reused and there are strong community norms to share this type of data. There are also likely to be issues with the ‘availability’ of these data and the format in which they are shared. Research involving GWAS data is often undertaken by large consortia, which means that the data already need to be shared within the research group, making it a smaller step to share them more widely.

Project Team
University of Wolverhampton
– Mike Thelwall, Professor of Data Science
– Kayvan Kousha, Postdoctoral researcher
– Amalia Mas Bleda, Postdoctoral researcher
– Emma Stuart, Postdoctoral researcher
– Meiko Makita, Postdoctoral researcher
– Nushrat Khan, PhD student

University of Bristol
– Marcus Munafò, Professor of Biological Psychology
– Katie Drax, PhD student
Marcus and Katie are also representing the UK Reproducibility Network (@ukrepro).

The project runs from January to July 2019 and we will share updates on this blog, along with other experiments, as part of the open metrics lab.

*Featured image: “Share” by Carlos Maya, used under the terms of a Creative Commons Attribution license.
