Data Integration
Data integration is a task performed on multiple datasets to produce a single, unified dataset that can be used in analysis. In a FAIR scenario, the users of the data are not known in advance, and they decide for themselves how to integrate the data they are working with. In other cases, data are integrated beforehand so that harmonised data can be presented to users. In either case, the way in which the data have been integrated from their various sources needs to be described, and the process steps recorded. This might include a detailed description of the processing and transformations performed, along with information about the methods used, which crosses into what is typically considered ‘provenance’ metadata. In this initial profile, we do not attempt to specify approaches for this more complete description.
This discussion is focused on data that can be represented in tabular formats with simple literal values. While this does not cover all the data we wish to describe, it represents a significant portion of it. Much of the data made available for reuse is expressed in CSV or similar text-based formats, so these provide the initial focus for describing data and a useful starting point. This coverage will be expanded in future.
The general intention is that domain-specific metadata can be mapped into an ‘integration-ready’ metadata scheme for exchange across domain and infrastructure boundaries, and that this mapping can be done with an acceptable degree of effort, in some cases completely automated. This integration-ready metadata needs to describe the data types, the controlled vocabularies that provide the semantics, and the codes used in individual fields in the data, possibly supplemented with separate expressions of the mappings between the controlled vocabularies used. Given this granular description of the data, it is possible to describe the mappings used to merge datasets, which enables automation of data integration functions and reduces the high cost of ‘data wrangling’. While it is likely that data integration will always require some attention and input from the researcher, many of the necessary tasks are routine and can be automated if sufficient information is known about the data themselves. CDIF attempts to provide a sufficient level of metadata to support such automation.
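As a rough illustration of the idea, the following Python sketch shows how field-level descriptions (data type, codes, and a mapping onto a shared variable name) might drive the merging of two small CSV files without hand-written wrangling for each source. The dictionary layout, the names `SOURCE_A`, `SOURCE_B`, `harmonise` and `integrate`, and the sample data are all hypothetical and chosen for illustration only; they are not the CDIF metadata model.

```python
import csv
import io

# Hypothetical field-level descriptions (NOT the CDIF schema).
# Each source column states its target variable, its data type,
# and, where relevant, how its local codes map to shared terms.
SOURCE_A = {
    "columns": {
        "sex": {"target": "sex", "datatype": str,
                "codes": {"1": "male", "2": "female"}},
        "age": {"target": "age_years", "datatype": int},
    }
}
SOURCE_B = {
    "columns": {
        "gender": {"target": "sex", "datatype": str,
                   "codes": {"M": "male", "F": "female"}},
        "age_yrs": {"target": "age_years", "datatype": int},
    }
}


def harmonise(csv_text, description):
    """Rewrite one CSV source into the shared variable names and codes."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        out = {}
        for name, field in description["columns"].items():
            raw = record[name]
            codes = field.get("codes")
            # Translate local codes to shared terms when a code list is given.
            value = codes[raw] if codes else raw
            # Cast the literal to the declared data type.
            out[field["target"]] = field["datatype"](value)
        rows.append(out)
    return rows


def integrate(sources):
    """Concatenate harmonised rows from every (csv_text, description) pair."""
    merged = []
    for csv_text, description in sources:
        merged.extend(harmonise(csv_text, description))
    return merged


# Illustrative sample data, invented for this sketch.
CSV_A = "sex,age\n1,34\n2,51\n"
CSV_B = "gender,age_yrs\nF,29\nM,62\n"

unified = integrate([(CSV_A, SOURCE_A), (CSV_B, SOURCE_B)])
```

Here the only source-specific input is the description itself: adding a third dataset would mean supplying another description rather than writing new transformation code. A real description would refer to published controlled vocabularies by identifier rather than embedding code lists inline.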
Some examples of how these transformations can be achieved are provided. A further section explains how the production of integrated data sets can be described, although CDIF does not provide recommendations at this point for doing so.