Standards for Data Description

Many domain-specific standards have excellent descriptions of the data used within their domains, and some of these separate the semantic aspects of data description from the structural ones. CSV on the Web is one candidate, as is the W3C’s Model for Tabular Data and Metadata on the Web. Frictionless Data provides a similar lightweight description of data. More metadata-rich models also exist, such as the SDMX Information Model and the RDF DataCube Vocabulary, which is based on it, but these are largely limited to describing multi-dimensional data. Although the DDI Codebook and DDI Lifecycle specifications come very close to the CDIF requirements, they are inherently bound to their intended use for describing social, behavioural, and economic (SBE) data. Only the related DDI Cross-Domain Integration (DDI-CDI) specification meets all of the CDIF requirements, largely because it was designed specifically to describe cross-domain data for the purposes of integration.

CSV on the Web is designed to add metadata to CSV tables, a use case not dissimilar to the one we address in CDIF. The problems arise from the flexibility of the CSV format itself, which places no constraints on the logical organisation of a table's contents. Is each row a unit being observed, and each column a characteristic of that unit? Or are the columns instead the units? CDIF requires certainty on this point. Although CSV on the Web allows us to enrich the meanings of the columns and rows of a table, it is in this sense too flexible to make a good foundation for the scenario addressed by CDIF. We can know which concepts are in play, and how the rows and columns are laid out, but we cannot tell how the presentation of the data (the table organisation) maps to its logical structure. What roles do the concepts play? While this could perhaps be reverse-engineered, it is easier to state explicitly how the concepts logically relate, and how these relationships are in turn presented in tabular form. A similar critique applies to Frictionless Data’s data description and to the Model for Tabular Data and Metadata on the Web. These specifications do not cover all the needed logical relationships, because they are restricted to tabular descriptions that combine the logical and presentational aspects of the data.
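The ambiguity can be made concrete with a small sketch. The two CSV snippets below are invented for illustration (the station names, variable names, and the `units_in` parameter are all assumptions, not part of CSV on the Web or CDIF). Both tables are equally valid CSV, yet a reader can recover the same observations from them only when told explicitly whether the units sit in the rows or in the columns.

```python
import csv
import io

# Two hypothetical CSV files holding the same four values.
ROWS_AS_UNITS = "station,temp_c,humidity\nA,21.5,40\nB,19.0,55\n"
COLS_AS_UNITS = "measure,A,B\ntemp_c,21.5,19.0\nhumidity,40,55\n"


def read_table(text):
    """Parse CSV text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


def to_observations(table, units_in):
    """Return {(unit, variable): value} for a table with one label column.

    `units_in` states the logical organisation explicitly:
    "rows" if each row is a unit, "columns" if each column is a unit.
    """
    if units_in not in ("rows", "columns"):
        raise ValueError("units_in must be 'rows' or 'columns'")
    header, *body = table
    observations = {}
    for row in body:
        row_label, *values = row
        for column_label, value in zip(header[1:], values):
            if units_in == "rows":
                unit, variable = row_label, column_label
            else:
                unit, variable = column_label, row_label
            observations[(unit, variable)] = value
    return observations


by_rows = to_observations(read_table(ROWS_AS_UNITS), units_in="rows")
by_columns = to_observations(read_table(COLS_AS_UNITS), units_in="columns")

# With the roles stated, both layouts yield identical observations.
assert by_rows == by_columns
```

Reading the second table with `units_in="rows"` would still succeed, but it would treat `temp_c` and `humidity` as the observed units. Nothing in the CSV itself signals that this reading is wrong, which is the gap an explicit logical description is meant to close.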

SDMX and the RDF DataCube Vocabulary have a different issue. They have stronger formal models for how concepts intersect with data structure, and do not combine the presentational and logical aspects of data description in the same way, but they are limited because they insist on a multi-dimensional description of data sets. While SDMX 3.0 has introduced a ‘microdata’ description feature, this is still new, and is not yet the version adopted by most implementations, nor is it the version that is the basis for the RDF DataCube Vocabulary.

SDMX demands that the metadata description be provided using a model that might not be supported by disseminators’ systems. Many data repositories do not manipulate data as multi-dimensional cubes, and lack the information needed to describe their data in this fashion. Further, SDMX uses a very disciplined cube definition: all data has regular dimensionality. Many systems based on multi-dimensional models have irregular dimensionality and ‘sub-cubes’, which are not permitted in SDMX. Given that SDMX is an exchange model for official statistics, this disciplined approach is very reasonable, but it is not appropriate for CDIF, where existing data stores must be described according to metadata that already exists.
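As a rough illustration of what regular dimensionality means, the sketch below treats a cube as regular when every observation is keyed by exactly the same set of dimensions. This is a simplified reading adopted for the example, not a definition taken from the SDMX specification, and the dimension names and values are invented.

```python
def is_regular(observations):
    """Return True if every observation key uses the same set of dimensions.

    `observations` is a list of (key, value) pairs, where each key is a
    dict mapping dimension name to dimension value.
    """
    dimension_sets = {frozenset(key) for key, _ in observations}
    return len(dimension_sets) <= 1


# Every observation is identified by area, sex and year.
regular_cube = [
    ({"area": "NO", "sex": "F", "year": 2020}, 10),
    ({"area": "NO", "sex": "M", "year": 2020}, 15),
]

# A hypothetical 'sub-cube': one observation is a total that omits `sex`.
irregular_cube = regular_cube + [
    ({"area": "NO", "year": 2020}, 25),
]

assert is_regular(regular_cube)
assert not is_regular(irregular_cube)
```

A repository that already stores totals alongside their breakdowns, as in the second list, would have to restructure its data before it could describe it with a strictly regular cube.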

Both DDI Codebook and DDI Lifecycle are excellent models in terms of being generic-but-concept-rich data descriptions, but they lack the range of support needed for CDIF. They were designed for the SBE realm, where data is overwhelmingly stored and processed as ‘wide’ data files: unit record data, in which each row is a record holding a set of measurements or values about a single unit.

The DDI Cross-Domain Integration specification was developed exactly because it is increasingly common for other types of data to be combined in research projects with this traditional form of SBE data. Four different data structure types are identified: wide, long, multi-dimensional, and key-value (the kind of data commonly found in ‘big data’ systems). Because all four are described in a single model intended to support transformations between these structural types, separating the logical and presentational aspects of the data was a necessity. DDI-CDI might seem to be a complex model, but experience has shown that attempts to produce simpler models based on existing W3C specifications produce equal complexity and are effectively new specifications in and of themselves. The framework recommends the DDI-CDI specification for data description as the right approach for CDIF.
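To give a feel for how the same logical content can appear in different structures, the sketch below expresses two invented records as wide, long, and key-value layouts using plain Python. The field names, the `unit/variable` key format, and the helper functions are assumptions made for this example; they are not DDI-CDI classes or terminology, and the multi-dimensional case is omitted for brevity.

```python
# Wide: one record per unit, one field per variable.
wide = [
    {"person": "p1", "age": 34, "income": 52000},
    {"person": "p2", "age": 51, "income": 61000},
]


def wide_to_long(records, id_field):
    """Long: one row per (unit, variable) pair."""
    return [
        {id_field: record[id_field], "variable": name, "value": value}
        for record in records
        for name, value in record.items()
        if name != id_field
    ]


def long_to_key_value(rows, id_field):
    """Key-value: a flat mapping from a composite key to a value."""
    return {f"{row[id_field]}/{row['variable']}": row["value"] for row in rows}


def long_to_wide(rows, id_field):
    """Rebuild wide records from long rows, grouping by the unit identifier."""
    merged = {}
    for row in rows:
        record = merged.setdefault(row[id_field], {id_field: row[id_field]})
        record[row["variable"]] = row["value"]
    return list(merged.values())


long_rows = wide_to_long(wide, "person")
key_value = long_to_key_value(long_rows, "person")

# The round trip works only because the role of `person` as the unit
# identifier is stated explicitly rather than inferred from the layout.
assert long_to_wide(long_rows, "person") == wide
```

Each helper needs to be told which field identifies the unit. That requirement is a small-scale version of the point made above: transformations between structural types depend on a logical description that is kept separate from any one presentation.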