Blockchain technology in Open Science data infrastructures

Introduction

We believe that blockchain technologies (BCTs) can play an important role in developing open science (OS) data infrastructures. The main argument is that blockchains can help implement legal and ethical requirements, among them the FAIR principles of OS (Findable, Accessible, Interoperable and Reusable data), and in particular security functions such as the integrity, confidentiality and authentication of data, as well as help prevent falsification or misuse. Using encryption techniques, timestamping and hash functions (the hash value produced by a hash function is often described as a digital “fingerprint”), BCT offers various ways to protect and secure source data as well as program code that results from the research. Protecting program code or other digital representations of methods and procedures may be difficult with traditional database techniques, while BCT provides appropriate tools such as digital fingerprints, smart contracts and tokens.
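To make the “fingerprint” idea concrete, the following minimal sketch computes a hash value for a small piece of research data and shows that any later modification changes it. It assumes SHA-256 as the hash function; the sample data and the name `fingerprint` are purely illustrative and not taken from any particular blockchain platform.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative sample data (assumption): two versions of a tiny CSV file.
original = b"sample_id,value\n1,0.42\n2,0.57\n"
tampered = b"sample_id,value\n1,0.42\n2,0.75\n"

# The fingerprint that would be registered when the data is first published.
registered = fingerprint(original)

print(fingerprint(original) == registered)  # unchanged data matches
print(fingerprint(tampered) == registered)  # a modified copy does not
```

The fingerprint reveals nothing useful about the content itself, which is why it can be published even when the underlying data must remain confidential.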

In the short term, we do not expect OS data themselves to be stored on a blockchain. Rather, OS data will be kept in traditional databases, e.g. in an OS data cloud, or stored in a distributed system such as the InterPlanetary File System (IPFS) or similar solutions. We therefore argue that selected metadata may be stored on-chain, such as data descriptors (title, keywords, etc.), author identification credentials, possible licenses and conditions for use, etc. In addition, one may include a digital fingerprint for later verification and, if necessary, authentication. In this way, researchers can make their data accessible through an access key stored on-chain, creating a quasi-immutable record of initial ownership, and even encode ‘smart’ contracts[1] or tokens to license the use of data. In the case of program code, storing an access code and a checksum of the code on-chain makes it possible to prevent, or at least hamper, misuse or forgery (Smith & Sandbrink, 2022).[2]
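The division of labour described above (data off-chain, selected metadata and a fingerprint on-chain) can be sketched as follows. This is a toy illustration under stated assumptions, not an implementation of any real blockchain: `ToyLedger` is a hypothetical in-memory, hash-linked list standing in for the chain, SHA-256 is assumed as the hash function, and all field names and values in the record are invented for the example.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ToyLedger:
    """Append-only, hash-linked list of entries (a stand-in for a blockchain)."""
    entries: list = field(default_factory=list)

    def append(self, record: dict) -> dict:
        # Each entry stores the hash of the previous one, so changing an
        # earlier entry invalidates every later link.
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        body = {"record": record, "timestamp": time.time(), "prev_hash": prev_hash}
        entry_hash = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
        entry = {**body, "entry_hash": entry_hash}
        self.entries.append(entry)
        return entry

    def is_intact(self) -> bool:
        # Recompute every hash and check that the links are unbroken.
        prev_hash = "0" * 64
        for entry in self.entries:
            body = {k: entry[k] for k in ("record", "timestamp", "prev_hash")}
            recomputed = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != recomputed:
                return False
            prev_hash = entry["entry_hash"]
        return True


def verify_dataset(ledger: ToyLedger, index: int, data: bytes) -> bool:
    """Check off-chain `data` against the fingerprint registered at `index`."""
    registered = ledger.entries[index]["record"]["fingerprint"]
    return ledger.is_intact() and registered == sha256_hex(data)


# The dataset itself stays off-chain (e.g. in a database or on IPFS).
dataset = b"sample_id,value\n1,0.42\n2,0.57\n"

ledger = ToyLedger()
ledger.append({
    "title": "Example dataset",                      # hypothetical descriptor
    "keywords": ["example"],
    "author_id": "author-0001",                      # hypothetical credential
    "licence": "CC-BY-4.0",                          # hypothetical licence
    "location": "https://example.org/datasets/1",    # hypothetical off-chain location
    "fingerprint": sha256_hex(dataset),
})

print(verify_dataset(ledger, 0, dataset))            # data matches the record
print(verify_dataset(ledger, 0, dataset + b"x"))     # altered data is rejected
```

Only the small metadata record is written to the ledger; anyone who later obtains the data from the off-chain location can recompute the fingerprint and compare it with the registered value.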

Mapping metadata linked to open science data

When discussing which types of metadata may be relevant to store on-chain, we need a systematic mapping and classification of the relevant metadata along with the research data itself. In addition to the necessary descriptive and identifying metadata, various metadata related to organizational and legal matters are required. The context is given by the GDPR requirements, the FAIR principles, and other relevant legal, research ethics and research integrity requirements, e.g. intellectual property rights (IPR) legislation.

One fruitful way to categorise metadata may be to follow the structure used in the EU EOSC Interoperability Framework, which distinguishes four layers:

  1. Technical: Metadata describing security and privacy requirements, formats, syntax, software details, etc.
  2. Semantic: Description of concepts, metadata and data schemas in standardized ways, such as Linked Data expressed in the W3C standards RDF (Resource Description Framework), OWL (Web Ontology Language) and SKOS (Simple Knowledge Organisation System), and other standards
  3. Organisational: Descriptive metadata such as title, authors, research field/discipline, source, publisher, license information, managerial issues…
  4. Legal: GDPR compliance and license requirements in machine-readable format, restrictions on data access

Dublin Core is one example of a standard for metadata of types 1 and 2, while metadata of types 3 and 4 will depend on the research area. As stated above, we also need metadata that describe the relevant GDPR requirements and how the FAIR principles are complied with.
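As an illustration of how the four layers might be represented in practice, the sketch below groups a dataset's metadata into one structure per layer and reports which fields are still empty. All class and field names are assumptions chosen for the example; they are not taken from the EOSC Interoperability Framework or from Dublin Core.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TechnicalMetadata:
    data_format: str
    software: str = ""
    security_requirements: list = field(default_factory=list)


@dataclass
class SemanticMetadata:
    vocabulary: str                      # e.g. "SKOS" or "OWL"
    concept_uris: list = field(default_factory=list)


@dataclass
class OrganisationalMetadata:
    title: str
    authors: list
    discipline: str = ""
    publisher: str = ""
    licence: str = ""


@dataclass
class LegalMetadata:
    contains_personal_data: bool         # relevant for GDPR compliance
    licence_uri: str = ""                # machine-readable licence reference
    access_restrictions: list = field(default_factory=list)


@dataclass
class DatasetMetadata:
    technical: TechnicalMetadata
    semantic: SemanticMetadata
    organisational: OrganisationalMetadata
    legal: LegalMetadata

    def missing_fields(self) -> list:
        """List 'layer.field' names whose value is an empty string or list."""
        missing = []
        for layer, values in asdict(self).items():
            for name, value in values.items():
                if value in ("", []):
                    missing.append(f"{layer}.{name}")
        return missing


# Hypothetical, deliberately incomplete record.
record = DatasetMetadata(
    technical=TechnicalMetadata(data_format="text/csv"),
    semantic=SemanticMetadata(vocabulary="SKOS"),
    organisational=OrganisationalMetadata(title="Example dataset", authors=["A. Author"]),
    legal=LegalMetadata(contains_personal_data=False),
)

print(record.missing_fields())
```

Keeping the layers separate makes it easier to decide, layer by layer, which elements are suitable for on-chain storage and which should remain with the data off-chain.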

Starting from these categories, the following classification scheme can be used:

[Table: Categories of metadata relevant for OS data, with examples of data elements and the requirements for each type of metadata (legal, organizational, semantic and technical).]

Notes

[1] Smart contracts are simple programs stored on a blockchain that run when predetermined conditions are met. They are typically used to automate the execution of an agreement so that all participants can be immediately certain of the outcome, without any intermediary’s involvement.

[2] Smith, J. A., & Sandbrink, J. B. (2022). Biosecurity in an age of open science. PLoS Biology, 20(4): e3001600. doi: 10.1371/journal.pbio.3001600


This passage is part of D6.3: Comparison of existing blockchain technologies to safeguard responsible OS written by Arild Johan Jansen & Svein Ølnes.