Schema.org implementation of CDIF metadata

JSON-LD has been chosen as the recommended serialization format for CDIF metadata, following our principle of using existing mainstream technology. The JSON format is widely used for data serialization and popular with developers. JSON-LD adds syntax for the representation of linked data that is compatible with existing JSON implementations, so integration with existing applications is relatively frictionless. Many metadata providers already use the schema.org vocabulary with JSON-LD serialization for metadata publication and interchange, so use of this format provides a low barrier to entry for data providers.

The JSON syntax is defined by the ECMA JSON specification, and JSON-LD is specified in the JSON-LD 1.1 recommendation from the World Wide Web Consortium (W3C). This serialization is designed for linked data applications that will translate the JSON into a set of {subject, predicate, object} triples that can be loaded into an RDF database for processing. The JSON-LD context binds JSON keys to URIs for more precise semantics, and the use of URIs to identify entities and property values in the metadata will maximize the linkage with resources on the wider web to build an ever-expanding global knowledge graph.
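
The binding of JSON keys to URIs can be illustrated with a deliberately simplified sketch. This is not a JSON-LD processor: the function name `expand_term`, the prefix table, and the default vocabulary URI below are assumptions chosen for illustration, and a real application would normally rely on a conformant JSON-LD library.

```python
def expand_term(term: str, prefixes: dict, vocab: str) -> str:
    """Expand a JSON key or compact IRI to a full URI (simplified illustration)."""
    if term.startswith(("http://", "https://")):
        return term  # already an absolute URI
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local  # compact IRI such as dcterms:conformsTo
    return vocab + term  # plain key resolved against the default vocabulary


# Assumed prefix bindings, mirroring the kind of context used in CDIF examples.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "https://example.com/99152/",
}
VOCAB = "http://schema.org/"  # assumed default vocabulary URI

for key in ("name", "dcterms:conformsTo", "ex:URIforNode1"):
    print(key, "->", expand_term(key, PREFIXES, VOCAB))
```

Each expanded key becomes the predicate of a triple, which is what allows metadata from different providers to be merged in one graph.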

Metadata about a resource includes properties such as title, description, responsible parties, and spatial or temporal extent (as outlined in the Metadata Content Requirements section).

In a harvesting or federated catalog system, some metadata about the metadata is useful for tracking where the metadata came from, what format or profile it uses (harvesters need this to process the record), and when it was last updated (see Metadata Content Requirements). Unambiguous expression of this information requires making statements about a metadata record distinct from the thing in the world that the metadata describes. In an RDF framework, this requires a distinct identifier for the metadata record object that will serve as the subject for these triples.

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata (sdDatePublished, sdLicense, sdPublisher), but it lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself.

In the RDF serialization, Schema.org metadata records are JSON-LD node objects, and include an “@id” keyword with a value that identifies the node, analogous to a primary key in a relational database. This identifier can be interpreted to represent a thing in the world that the metadata record (the ‘node’) is about, or to represent the metadata record (a JSON object) itself.

To avoid this ambiguity, CDIF adopts the convention that the schema.org identifier property is used to identify a thing in the world that is the subject of the JSON-LD node. The identified thing might be physical, imaginary, abstract, or a digital object. The JSON-LD @id property identifies a node in a graph, which is an abstract object. As a URI, the @id value is expected to dereference to a JSON-LD object containing the properties that are attached to the graph node. Given this convention, when the metadata record is processed, the processor should use the schema:identifier as the subject of triples about the subject of the metadata record. In addition, this convention suggests that if a schema:identifier property is present, the @id property should be interpreted to identify the JSON object that is the representation of the node in the knowledge graph.
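
A minimal sketch of how a processor might apply this convention when choosing the subject for triples about the described thing. The function name `triple_subject` is hypothetical, and the sketch assumes the identifier is given as a simple string; other encodings of the identifier are not handled.

```python
from typing import Any, Optional


def triple_subject(node: dict) -> Optional[Any]:
    """Pick the subject URI for statements about the thing a node describes.

    Follows the convention that schema:identifier names the thing in the
    world, while @id names the graph node (the JSON object) itself.
    """
    identifier = node.get("identifier")
    if isinstance(identifier, str) and identifier:
        return identifier
    # No usable identifier: fall back to the node's own @id.
    return node.get("@id")


record = {
    "@id": "ex:URIforNode1",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
}
print(triple_subject(record))
```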

Statements about the metadata record (the JSON object) as a distinct entity should be made using a separate identified node object. This node object can be embedded in the metadata record about the resource in the world (Example 1 below), or published as a separate node (Example 2 below). Note that this second approach is like the DCAT CatalogRecord.

{
    "@context": [
        "https://schema.org",
        {
            "dcterms": "http://purl.org/dc/terms/",
            "ex": "https://example.com/99152/"
        }
    ],
    "@id": "ex:URIforNode1",
    "@type": "appropriate schema.org type",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforNode2",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "identifier": "ex:URIforNode1",
        "description": "metadata about documentation for ex:URIforDescribedResource",
        "dcterms:conformsTo": {"@id": "CDIF_basic_1.0"}
    }
}

Example 1. Metadata about the metadata embedded.

{
    "@context": [
        "https://schema.org",
        {
            "dcterms": "http://purl.org/dc/terms/",
            "ex": "https://example.com/99152/"
        }
    ],
    "@graph": [
        {
            "@id": "ex:URIforNode1",
            "@type": "Dataset",
            "identifier": "ex:URIforDescribedResource",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "identifier": "ex:URIforNode1",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "CDIF_basic_1.0"}
        }
    ]
}

Example 2. Metadata about metadata as a separate graph node.

The ex namespace in the examples above is only included so that the examples are valid; actual metadata would likely have its own namespace for resource and metadata URIs. The distinct identifier for the metadata record (ex:URIforNode1) allows statements to be made about the metadata separately from statements about the resource it describes.
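
A harvester needs to locate the statements about the metadata record in either layout. The following is a sketch under stated assumptions: the function name `metadata_record_nodes` is hypothetical, identifiers are assumed to be plain strings, and the input is assumed to be compacted JSON-LD shaped like Examples 1 and 2.

```python
from typing import Any


def metadata_record_nodes(doc: dict) -> list:
    """Return the node(s) that make statements about the metadata record."""
    nodes = doc.get("@graph", [doc])
    node_ids = {n["@id"] for n in nodes if isinstance(n.get("@id"), str)}
    found = []
    for node in nodes:
        node_id = node.get("@id")
        # Embedded form (Example 1): subjectOf points back at the enclosing node.
        sub = node.get("subjectOf")
        if isinstance(sub, dict) and node_id is not None and sub.get("identifier") == node_id:
            found.append(sub)
        # Separate-node form (Example 2): identifier is the @id of a sibling node.
        ident: Any = node.get("identifier")
        if isinstance(ident, str) and ident in node_ids and ident != node_id:
            found.append(node)
    return found


EMBEDDED = {
    "@id": "ex:URIforNode1",
    "identifier": "ex:URIforDescribedResource",
    "subjectOf": {"@id": "ex:URIforNode2", "identifier": "ex:URIforNode1"},
}
GRAPH = {
    "@graph": [
        {"@id": "ex:URIforNode1", "identifier": "ex:URIforDescribedResource"},
        {"@id": "ex:URIforNode2", "identifier": "ex:URIforNode1"},
    ]
}
print([n["@id"] for n in metadata_record_nodes(EMBEDDED)])
print([n["@id"] for n in metadata_record_nodes(GRAPH)])
```

Both layouts yield the same node, which is the point of using the identifier convention rather than document position to find the metadata about the metadata.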

Note that the @type for the metadata node (root node) is ‘DigitalDocument’. This is a schema.org type that corresponds broadly to the concept of DigitalObject as used by the FAIR Digital Object (FDO) community (Bonino et al., 2022), recognizing that the metadata record is a digital object.

JSON keys prefixed with ‘@’ are keywords defined in the JSON-LD specification (see table below).

| Keyword | Description |
|---|---|
| @context | The value of the context is an object that specifies a set of rules for interpreting the JSON-LD document. The rules can be specified inline or via a URI that identifies a context object containing a set of rules. |
| @id | A string that identifies the subject of the assertions in the JSON object that contains the @id key. |
| @type | An identifier for the definition of the structure of the JSON object that contains the @type key. The type determines what keys or values should be expected in the JSON object that contains the key. Values are types defined in the schema.org vocabulary. In the CDIF framework (and for compatibility with FDOF digitalObjectType), the schema:additionalType property should be used to specify more specific types (see the implementation table below). |

Implementation of metadata content items

The following table maps the metadata content items described in the Metadata Content Requirements section to the schema.org JSON-LD keys to use in metadata serialization. Some example metadata documents follow. The ‘Obl.’ column specifies the cardinality obligation for the property: ‘1’ means exactly one value is required; ‘1..*’ means at least one value is required; ‘0..1’ means the property is optional and at most one value can be provided; ‘0..*’ means the property is optional and more than one value can be provided. Properties with a path from “subjectOf” describe the metadata record.

| CDIF content item | Obl. | Schema.org implementation | Scope note |
|---|---|---|---|
| Metadata identifier | 1 | `"subjectOf"/"@id": {URI}`, or `"@id": {URI}` in a separate node whose `"identifier"` is the `@id` of the node containing the resource description | The URI for the metadata record should be the `@id` value of the 'subjectOf' element in the JSON instance document tree, or the `@id` of a separate graph node whose `"identifier"` is the `@id` of the node containing the resource description. |
| Resource identifier | 1 | `"identifier": {URI}` | The URI for the resource that is the subject of the metadata record should be the `"identifier"` value at the root of the JSON instance document tree. |
| Title | 1 | `"name": {string}` | A set of words that should uniquely identify the described resource for human use, in the scope of the metadata catalog containing this metadata record. |
| Distribution | 1 | `"url": {URL}` | Use if the metadata is about a single digital object. |
| | | `"distribution": {"@type": "DataDownload", "contentUrl": {URL}, ...}` | Use if the metadata is about an abstract, non-digital, or physical resource that has multiple distributions with different URL, encodingFormat, or conformsTo properties. Each distribution is considered a distinct digital object. The DataDownload MUST include the contentUrl, and SHOULD include encodingFormat and dcterms:conformsTo to specify the media type and the specification or profile documenting the specific serialization conventions for the download content. |
| Rights | 1..* | `"license": {text or URI}` or `"conditionsOfAccess": {text or URI}` | URL to a license document or a text explanation of restrictions on use. There might be multiple links to documents specifying related security, privacy, usage, sharing, or other concerns. |
| Metadata profile identifier | 1 | `"subjectOf"/"dcterms:conformsTo": {identifier}` | Use the Dublin Core terms property. The value for base CDIF metadata is 'CDIF_basic_1.0' [tbd; this should be a PID]. Profiles that extend it must define unique identifier strings to use here. Note that the schema.org schemaVersion property indicates the version of the schema.org vocabulary; in general this is not needed for CDIF. |
| Metadata date | 0..1 | `"subjectOf"/"dateModified": {Date or DateTime}` | Use ISO 8601 format. The most recent update date for the metadata content. Harvesters use this to determine whether they have already harvested and processed this record. |
| Metadata contact | 0..1 | `"subjectOf"/"maintainer": {Person or Organization}` | Should include a name and contact point (an institutional e-mail is best) for the agent responsible for metadata content. This is the contact point to report problems with metadata content. Person and Organization are Agent objects with various properties. |
| Resource type | 1 | `"@type": {schema.org type}` | Use the most specific [Schema.org resource type](https://schema.org/docs/full.html) that is applicable. Multiple values can be provided, but they must be logically consistent. |
| | 0..* | `"additionalType": [{DefinedTerm or URI}, ...]` | If a more specific resource type needs to be specified, add a text or URI value here that identifies the type. MUST be consistent with the `@type`. To simplify parsing, always encode as an array. |
| Description | 0..1 | `"description": {string}` | Free text, with as much detail as is feasible. |
| Originators | 0..* | `"creator": [{Person or Organization}, ...]` | The value is a schema.org Person or Organization. To simplify parsing, always encode as an array. Use an ORCID or other PID to identify the person or organization where possible. |
| Publication Date | 0..1 | `"datePublished": {date time}` | Date on which the resource was made publicly accessible. Use ISO 8601 format. |
| Modification Date | 1 | `"dateModified": {date time}` | Date of the most recent update to resource content. If the Publication Date is not provided, it defaults to the Modification Date. Use ISO 8601 format. |
| Keyword | 0..* | `"keywords": [{string}, {"@type": "DefinedTerm", "name": "OCEANS", "inDefinedTermSet": "gcmd:sciencekeywords", "identifier": "gcmd:concept/916b....6167d"}, ...]` | Implement with text for tags, and schema:DefinedTerm for keywords from a controlled vocabulary. The DefinedTerm approach is used to represent concepts. |
| GeographicExtent | | | Required if the resource has a geographic extent for its subject: a bounding rectangle, line, or point. To support cross-domain searches based on geospatial location, location coordinates must be given in decimal degrees using the WGS 84 datum. There are various other systems for describing location; these can be provided as alternate location descriptions, recognizing that they might not be meaningful to some metadata harvesting agents. |
| Named place | 0..* | `"spatialCoverage": {"@type": "Place", "name": {string} or {schema:DefinedTerm}}` | To specify location with place names. If the names are from a gazetteer, use schema:DefinedTerm to provide a name, identifier, and inDefinedTermSet to fully document the concept. |
| Bounding box | 0..1 | `"spatialCoverage": {"@type": "Place", "geo": {"@type": "GeoShape", "box": "39.3280 120.1633 40.445 123.7878"}}` | For bounding box specification of the spatial extent of resource content. See [ESIP SOSO for details](https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#bounding-boxes). Recommend including only one bounding box; the behavior of harvesting clients when multiple geometries are specified is unpredictable. |
| Curvilinear trace | 0..1 | `"spatialCoverage": {"@type": "Place", "geo": {"@type": "GeoShape", "line": "39.33 120.77 40.44 123.96 41.00 121.34"}}` | For a resource related to a linear trace such as a ship track or airplane flight line. |
| Point location | 0..1 | `"spatialCoverage": {"@type": "Place", "geo": {"@type": "GeoCoordinates", "latitude": 39.3280, "longitude": 120.1633}}` | For a point location specification of the spatial extent of resource content. Recommend including only one point; the behavior of harvesting clients when multiple geometries are specified is unpredictable. |
| Other serialization | 0..* | `"geosparql:hasGeometry": {"@type": "sf#Point", "geosparql:asWKT": {"@type": "#wktLiteral", "@value": "POINT(-76 -18)"}, "geosparql:crs": {"@id": "CRS84"}}` | Optional geographic extent using other, more interoperable geometries. GeoSPARQL is recommended; see Ocean InfoHub. (Note that the URIs in the example are truncated.) Other geometry schemes might be specified in a specific domain profile, e.g. for atmospheric data, subsurface data, or local coordinate systems. |
| Distribution | | | |
| Distribution Agent | 0..* | `"provider": {Person or Organization}` | Contact point for the provider of a distribution. For a simple digital object with a download URL, or a resource with multiple distributions all from the same provider. |
| | 0..* | `"distribution": [{"@type": "DataDownload", "provider": {Person or Organization}}, ...]` | If there are multiple distributions with different providers, each distribution can have a separate provider. |
| Variables in the data | | | Required for datasets. The metadata about a dataset should include a list of variables that the dataset contains. Variable metadata should minimally specify the name of the variable as it appears in the dataset. That name should ideally be qualified by a controlled vocabulary or other semantic resource (e.g. represented by a resolvable URI), or minimally by some descriptive text. |
| Variable (PropertyValue) | 0..* | `"variableMeasured": [{"@type": "PropertyValue", "@id": "astm:var0011", "propertyID": ["pato:PATO_0000025", "astm:prop/0405"], "name": "hostMineral", "description": "...."}, ...]` | Follow the ESIPfed Science on Schema.org recommendation; see also the discussion of representing more complex data structures in ESIPfed Experimental and the Data Integration module of CDIF. A variable must have a name and description, and should have a propertyID with a URI for the represented concept. The URI in the propertyID provides the semantic linkage for the meaning of the variable. |
| Variable (StatisticalVariable) | 0..* | `"variableMeasured": [{"@type": "StatisticalVariable", "@id": "astm:var0011", "measuredProperty": {"@type": "Property", "identifier": "astm:id/305978", "name": "Average age"}}]` | StatisticalVariable offers properties useful for describing social science statistical variables, such as populationType and statType. Use of StatisticalVariable is preferred for variables with values calculated from some aggregation process. |
| Temporal coverage | 0..* | | Temporal coverage can be expressed in several ways: a calendar or clock dateTime or date-time interval using ISO 8601 serialization, a named time ordinal era, an interval bounded by time ordinal eras, or a numeric coordinate in a temporal reference system. |
| | | `"temporalCoverage": "2018-01-22"` | Calendar date or clock time instant; use ISO 8601 encoding. |
| | | `"temporalCoverage": "2012-09-20/2016-01-22"` | Calendar date or clock time interval; use ISO 8601 encoding. |
| | | `"temporalCoverage": [{"@type": "time:ProperInterval", "time:intervalStartedBy": "isc:LowerDevonian", "time:intervalFinishedBy": "isc:LowerPermian"}]` | Time ordinal era interval; use the OWL-Time namespace, time: http://www.w3.org/2006/time#. This example uses the International Chronostratigraphic Chart, isc. See PeriodO for identifiers for many other named time intervals. |
| | | `"temporalCoverage": [{ "time:ProperInterval- 345/298 Ma" }]` | For a time interval specified using geologic ages, in Ka, Ma, or Ga. The text string is an abbreviated OWL-Time interval (proposal, under discussion). |
| Related agents (contributor role) | 0..* | `"contributor": [{Person or Organization}, ...]` | Recognition for others who have contributed to the production of the resource but are not recognized as authors/creators. |
| Related agent (other role) | | `"contributor": {"@type": "Role", "roleName": "Principal Investigator", "contributor": {"@type": "Person", "@id": "https://orcid.org/...", "name": "John Doe", "affiliation": {"@type": "Organization", "@id": "https://ror.org/...", "name": "..."}, "contactPoint": {"@type": "ContactPoint", "email": "john.chodacki@ucop.edu"}}}` | To assign roles to contributors such as editor, maintainer, publisher, point of contact, or copyright holder (e.g. DataCite contributor types), use the rather convoluted Role construction defined by schema.org. |
| Related resources | 0..* | `"relatedLink": [{"@type": "LinkRole", "linkRelationship": "...", "target": {"@type": "EntryPoint", "encodingType": "text/html", "name": "...", "url": "https://example.org/data/stations"}}]` | Use schema.org relatedLink with a LinkRole value, and the link URL in a 'target' EntryPoint object. These properties expect WebPage and Action as their domain, so the schema.org validator will issue a warning (not an error). Related resource links are useful for evaluation and use of data but, because of the wide variety of possible relationships, are difficult to use in general search scenarios. Use a soft-type implementation, with a link relationship type given as a schema:DefinedTerm and a resolvable identifier for the relationship target. |
| Funding | 0..* | `"funding": {"@id": "URI for grant", "@type": "MonetaryGrant", "identifier": "grant id", "name": "grant title", "funder": {"@id": "ror for org", "@type": "Organization", "name": "org name", "identifier": ["other identifiers"]}}` | Use the schema.org encoding and the Science on Schema.org pattern. Other organization properties can be included in the funder/Organization. |
| Policies | 0..* | `"publishingPrinciples": [{"@type": "CreativeWork"}, ...]` | FDOF digitalObjectMutability, RDA digitalObjectPolicy, FDOF PersistencyPolicy. Policies related to maintenance, update, and expected time to live. |
| Checksum | 0..1 | `"distribution": [{"@type": "DataDownload", "spdx:checksum": {"spdx:algorithm": "string", "spdx:checksumValue": "string"}, ...}, ...]` | A string value calculated from the content of the resource representation, used to test whether content has been modified. There is no schema.org property; follow the DCAT v3 adoption of the [Software Package Data Exchange (SPDX)](https://spdx.org/rdf/terms/) property. The [spdx Checksum object](https://spdx.org/rdf/spdx-terms-v2.1/classes/Checksum___-238837136.html) has two properties: algorithm and checksumValue. The checksum is a property of each distribution/DataDownload. |
| Provenance | | | Provenance for discovery is limited to documenting technology used in the creation of the dataset and documenting other datasets that were inputs to the content of the described resource. |
| Provenance (instruments, software etc.) | 0..* | `"prov:wasGeneratedBy": {"@type": "prov:Activity", "prov:used": ["nerc:collection/L05/current/134", "nerc:collection/B76/current/B7600031"]}` | Identify sensors, instruments, platforms, software, algorithms, etc. used in the creation of the described resource. |
| Provenance (input datasets) | 0..* | `"prov:wasDerivedFrom": ["http://doi.org/10.547/347848", "http://doi.org/10.3578/h5ls", "http://doi.org/10.547/93578"]` | |
| Quality information | | | Quality information for discovery: a text statement documenting the quality of the resource should be included in the sdo:description. If there are quality policies or certificates that apply, these should be specified in the sdo:policies. Quality measurement or assessment protocols that have an output result specific to this resource can be specified using dqv:hasQualityMeasurement. |
| Quality measure | 0..* | `"dqv:hasQualityMeasurement": [{"@type": "dqv:QualityMeasurement", "dqv:isMeasurementOf": "nerc:collection/L27/current/ARGO_QC", "dqv:value": "good"}, {"@type": "dqv:QualityMeasurement", "dqv:isMeasurementOf": "imf:dsbb/2003/eng/dqaf.htm", "dqv:value": "http://linkToASpecificQualityReport"}]` | Quality assessment or measurement conducted using the procedure or protocol specified by the dqv:isMeasurementOf property, with the result value specified in the dqv:value property. The result might be numeric, a categorical term, or a link to a document describing the quality assessment. |
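
The items with obligation ‘1’ or ‘1..*’ in the table above can be checked mechanically before a record is published or harvested. The following is a minimal sketch, not a CDIF validator: the function name `missing_required` is hypothetical, and it assumes compacted JSON-LD with the metadata-about-metadata embedded under "subjectOf" and the "dcterms" prefix used as in the examples.

```python
def missing_required(doc: dict) -> list:
    """List the required CDIF content items that a record does not provide."""
    meta = doc.get("subjectOf")
    if not isinstance(meta, dict):
        meta = {}  # no embedded metadata-about-metadata node
    checks = {
        "Metadata identifier": bool(meta.get("@id")),
        "Resource identifier": bool(doc.get("identifier")),
        "Title": bool(doc.get("name")),
        # Either a direct url or at least one distribution satisfies the item.
        "Distribution": bool(doc.get("url") or doc.get("distribution")),
        "Rights": bool(doc.get("license") or doc.get("conditionsOfAccess")),
        "Metadata profile identifier": bool(meta.get("dcterms:conformsTo")),
        "Resource type": bool(doc.get("@type")),
        "Modification Date": bool(doc.get("dateModified")),
    }
    return [item for item, present in checks.items() if not present]


record = {
    "@id": "ex:URIforNode1",
    "@type": "Dataset",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
    "url": "https://example.com/99152/data.csv",  # hypothetical download URL
    "license": "https://example.com/99152/license",  # hypothetical license URL
    "dateModified": "2017-05-23",
    "subjectOf": {
        "@id": "ex:URIforNode2",
        "dcterms:conformsTo": {"@id": "CDIF_basic_1.0"},
    },
}
print(missing_required(record))
print(missing_required({"name": "title only"}))
```

A presence check like this says nothing about whether the values are well formed; it is only a first filter a harvester might apply.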

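The bounding box value is a single space-separated string, so a harvester has to parse it before it can be used in a spatial query. The sketch below assumes the latitude-first "south west north east" ordering described in the ESIP Science on Schema.org guidance linked from the table; the function name `parse_box` is hypothetical, and boxes that cross the antimeridian are not treated specially.

```python
def parse_box(box: str) -> tuple:
    """Parse a GeoShape box string into (south, west, north, east) in WGS 84 degrees."""
    parts = box.replace(",", " ").split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}: {box!r}")
    south, west, north, east = (float(p) for p in parts)
    if not (-90.0 <= south <= north <= 90.0):
        raise ValueError(f"latitudes out of range or out of order: {box!r}")
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        raise ValueError(f"longitudes out of range: {box!r}")
    return south, west, north, east


print(parse_box("39.3280 120.1633 40.445 123.7878"))
```

Rejecting out-of-range values early is one way to catch records that give coordinates in longitude-first order or in a projected coordinate system.
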
Service-based distribution

An API builds on a basic communication protocol (e.g. HTTP) by defining functionality and formatting that enable it to provide the specific data a user requires. This might involve filtering, subsetting, or various transformations, e.g. for schema mapping, aggregating, or anonymizing data. The focus here is on Web APIs that provide data using a URL for the endpoint location (the server that implements the data access protocol), with parameters to specify the particular data requested. The query parameters might be appended to this base URL as part of the URL, or provided as a message with the request. The implementation is based on the schema.org Action patterns, and WebAPI is added as a type for the value of sdo:distribution, analogous to dcat:accessService/dcat:DataService.

Implementation of metadata to describe a service-based (API) distribution:

| CDIF content item | Obl. | Schema.org implementation | Scope note |
|---|---|---|---|
| Service type | 1 | `"distribution"/"WebAPI"/"serviceType": "string"` | Specify the kind of service. Ideally this should be a resolvable identifier. Currently there is no widely adopted registry for serviceType identifiers, in large part because services might be defined at different levels of granularity, and classifications might focus on function, data formats, thematic content, security, or other aspects of the service definition. For interoperability, there must be an external arrangement between data providers and consumers on the strings that will be used to specify service types. |
| Service description document | 0..1 | `"distribution"/"WebAPI"/"documentation": "string"` OR CreativeWork | Document that provides a machine-actionable description of a service instance. Examples include OpenAPI documents and OGC Capabilities documents. Software designed to utilise a particular service type will typically include functionality to parse such a description document and engage with the service endpoint. |
| Endpoint URL | 1 | `"distribution"/"WebAPI"/"potentialAction"//"target"//"urlTemplate"` | Web location to invoke the service; if there are parameters on the URL, the URL template construct enables description of the parameters. |
| Access constraints | 1 | `"distribution"/"WebAPI"/"termsOfService": "string"` OR CreativeWork | Description of access privileges required to use the API, e.g. registration, licensing, payments. Note that access constraints applying to any distribution of the resource should be specified in the access constraints for the resource description as a whole. |
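
One possible reading of the paths in the table, expressed as a small builder. This is a sketch only: the function name `web_api_distribution`, the intermediate "Action" and "EntryPoint" node types, and all example values (service type string, URLs) are assumptions made for illustration and are not prescribed by the table.

```python
import json
from typing import Optional


def web_api_distribution(
    service_type: str,
    url_template: str,
    terms_of_service: str,
    documentation: Optional[str] = None,
) -> dict:
    """Build a WebAPI node for use as a value of "distribution"."""
    api = {
        "@type": "WebAPI",
        "serviceType": service_type,
        "termsOfService": terms_of_service,
        "potentialAction": {
            "@type": "Action",  # assumed intermediate node type
            "target": {
                "@type": "EntryPoint",  # assumed intermediate node type
                "urlTemplate": url_template,
            },
        },
    }
    if documentation is not None:
        api["documentation"] = documentation  # optional (0..1) item
    return api


# All values below are hypothetical.
api = web_api_distribution(
    service_type="example-station-query-service",
    url_template="https://example.org/api/stations{?name,start,end}",
    terms_of_service="Free registration required.",
    documentation="https://example.org/api/openapi.json",
)
print(json.dumps({"distribution": api}, indent=2))
```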

Implementation patterns

  • DefinedTerm. {label, schemename, conceptURI, schemeURI}. This is a pattern used for property values that are concepts defined in a controlled vocabulary, ontology, or similar semantic artefact. Values have a label, which is a string that will be meaningful to a human user; a ‘schemename’, which is a label that similarly identifies the source semantic resource in which the concept is defined; a conceptURI, which is a globally unique, resolvable identifier for the concept value; and a schemeURI, which is a globally unique identifier for the semantic resource in which the concept is defined.

  • Identifier. Identifiers can be inserted as simple string literals. If the identifier can be provided as a string literal that is resolvable and for which the identifier scheme is evident, that is all that is required. If the identifier scheme is not well known, or the address of a separate resolver must be used to resolve the identifier, use the schema.org PropertyValue to provide additional information. The propertyID specifies the identifier scheme. CDIF recommends using scheme identifiers from https://registry.identifiers.org/registry/. The sdo:value provides the identifier as a string value. If the identifier can be resolved on the web, the sdo:url provides a resolvable URL.

  • Agent. This pattern is for specifying an Agent in the PROV sense: an agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity. Agents can be persons, organizations, or software-defined actors. Agents have a name for human recognition, a type (Person, Organization), an identifier, a contactPoint, and an affiliation. The contact point for a machine agent should be the accessible human who operates the environment running the machine agent. This pattern is used for hard-typed roles in the CDIF implementation: creator, maintainer, contributor, provider. Other roles can be documented using the schema.org Role pattern in the sdo:contributor property.

  • DistributionObject {contentUrl, encodingFormat, dcterms:conformsTo, distributionAgent}. This pattern specifies information for implementing machine access to a DigitalObject. It includes a URL (contentUrl) for the web location at which the DigitalObject can be accessed, the specifications or profiles to which the serialization and content of the object conform (using the Dublin Core conformsTo property), the format of the digital object content (sdo:encodingFormat), and the Agent responsible for the distribution platform (provider). This agent is the contact point if there are problems accessing the distributed DigitalObject.
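
The Identifier and DefinedTerm patterns above can be sketched as small helper functions. The function names are hypothetical. For DefinedTerm, the mapping of {label, schemename, conceptURI, schemeURI} onto schema.org properties shown here (name, identifier, and an inDefinedTermSet node) is an assumption made for illustration, and the DOI and vocabulary values are invented.

```python
from typing import Optional, Union


def identifier_value(
    value: str, scheme: Optional[str] = None, url: Optional[str] = None
) -> Union[str, dict]:
    """Encode an identifier as a plain string, or as a PropertyValue when
    the scheme or a resolvable URL needs to be stated explicitly."""
    if scheme is None and url is None:
        return value  # resolvable string with an evident scheme
    node = {"@type": "PropertyValue", "value": value}
    if scheme is not None:
        node["propertyID"] = scheme  # identifier scheme
    if url is not None:
        node["url"] = url  # web-resolvable form of the identifier
    return node


def defined_term(label: str, concept_uri: str, scheme_name: str, scheme_uri: str) -> dict:
    """Encode a controlled-vocabulary concept (assumed property mapping)."""
    return {
        "@type": "DefinedTerm",
        "name": label,
        "identifier": concept_uri,
        "inDefinedTermSet": {
            "@type": "DefinedTermSet",
            "@id": scheme_uri,
            "name": scheme_name,
        },
    }


# Hypothetical DOI and vocabulary, for illustration only.
doi = identifier_value(
    "10.1234/abcd",
    scheme="https://registry.identifiers.org/registry/doi",
    url="https://doi.org/10.1234/abcd",
)
term = defined_term(
    "OCEANS",
    "https://example.org/vocab/concept/oceans",
    "Example science keywords",
    "https://example.org/vocab",
)
print(doi)
print(term)
```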