Provenance - Cross Domain Interoperability Framework Handbook

Describing the provenance of data and other FAIR resources is potentially a very broad topic, covering a wide range of metadata of different types. For the purposes of CDIF, we have chosen to limit the topic in the same way that the widely used PROV recommendations from W3C do: “Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness” (from the PROV Overview at http://www.w3.org/TR/2013/NOTE-prov-overview-20130430/). Provenance is a critical topic for the sharing of FAIR resources across domain boundaries, because users' knowledge of data and sources from outside their own domain may not be as thorough as their knowledge of resources from within it, which makes the determination of reliability and trust more difficult.

The most important aspect of provenance to record, and the one on which the existing standards focus, is the process whereby data or another resource has been created and processed. There are several standards for describing historical processes. The W3C PROV family of recommendations is widely used, but it is less a single standard model than a framework for describing any kind of provenance. PROV descriptions typically require, and are designed for, a degree of specialization to make them relevant within a domain, which makes them potentially problematic for cross-domain use. Other popular standards that support the description of historical process include Schema.org, the Common Workflow Language (CWL), and the process portion of DDI-CDI. For the purposes of CDIF, CWL is not ideal because its only syntax representation is YAML, while all the others have at a minimum an RDF vocabulary which can be expressed as JSON-LD, like all of CDIF (DDI-CDI has both standard XML and RDF syntax representations). In almost all cases, the weakness found with PROV, namely the need for specialization before process descriptions are broadly meaningful outside a specific domain, was also present in the other standards. As a result, it is not practical for CDIF to select a single standard and simply recommend a subset of its available fields for implementation.

The CDIF Working Group has been actively working on provenance for several years, and a recommended profile is expected in the near term. The work has been conducted in a bottom-up fashion, looking at examples of data collection and production across many different domains. From these, a more generic description of typical activities, resources, and actors has been developed. Implementing this “cross-domain” description of process appears possible across several of the standards mentioned above, and the Working Group is exploring the idea that the CDIF provenance model will be essentially the same regardless of the standard vocabulary used for describing the historical process (e.g., PROV, Schema.org, or DDI-CDI). Further, reference models of data production such as UNECE’s Generic Statistical Business Process Model (GSBPM) are being explored in implementation projects which are using CDIF (such as Climate-Adapt4EOSC).

Some basic fields for capturing provenance exist within the Discovery profile, taken from the W3C PROV vocabulary, but these are minimal. A more complete set of recommendations regarding provenance can be expected once the CDIF WG completes its work on this topic.
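As a rough illustration of what a minimal PROV-based description might look like when expressed as JSON-LD, the following Python sketch links a dataset to the activity that generated it, the source it was derived from, and the responsible agent. It uses only core terms from the W3C PROV vocabulary (prov:Entity, prov:Activity, prov:Agent, prov:wasGeneratedBy, prov:wasDerivedFrom, prov:used, prov:wasAssociatedWith). It is not the CDIF provenance profile, and it does not claim to reproduce the fields in the Discovery profile; the function name and all example.org identifiers are hypothetical.

```python
import json

# Namespace of the W3C PROV vocabulary.
PROV = "http://www.w3.org/ns/prov#"


def build_provenance_record(dataset_id, activity_id, source_id, agent_id):
    """Return a JSON-LD document (as a dict) describing how a dataset was made.

    Hypothetical helper for illustration only; not part of CDIF.
    """
    return {
        "@context": {"prov": PROV},
        "@graph": [
            {
                # The resource whose provenance is being described.
                "@id": dataset_id,
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": activity_id},
                "prov:wasDerivedFrom": {"@id": source_id},
            },
            {
                # The process that produced the resource.
                "@id": activity_id,
                "@type": "prov:Activity",
                "prov:used": {"@id": source_id},
                "prov:wasAssociatedWith": {"@id": agent_id},
            },
            # The input resource and the responsible actor.
            {"@id": source_id, "@type": "prov:Entity"},
            {"@id": agent_id, "@type": "prov:Agent"},
        ],
    }


# All identifiers below are placeholders.
record = build_provenance_record(
    dataset_id="https://example.org/dataset/cleaned",
    activity_id="https://example.org/activity/cleaning-run",
    source_id="https://example.org/dataset/raw",
    agent_id="https://example.org/agent/data-team",
)
print(json.dumps(record, indent=2))
```

A description this generic says only that some activity used a source and produced a dataset. Conveying what kind of activity it was is where domain specialization, or a shared cross-domain model of typical activities, becomes necessary.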
