Controlled Vocabularies: Codelists and Concept Schemes

Terminology-based semantic resources are a key element in information systems, establishing the binding between the symbols (strings) manipulated by computers and human-intelligible meaning of properties, types, values, or any other element in a volume of data. These are a critical component in scenarios involving (but not limited to) data integration and harmonisation.

For the purposes of this document, we use the term ‘controlled vocabularies’ to cover the whole family of terminology-based semantic resources, even though this may not be technically exact (ontologies, for instance, are often seen as a different kind of resource). Our use of the term potentially includes codelists, classifications, thesauri, taxonomies, glossaries, ontologies, etc. Where we specifically mean ontologies, classifications, or other types, those terms are used explicitly.

We envision two primary scenarios for FAIR usage of controlled vocabularies. In the first (data-centric) scenario, an agent (human or machine) encounters a term, code, or symbol in a data set and needs to understand its meaning, or to determine whether its meaning is the same as that of some other term, code, or symbol. Navigation is from the individual symbol (code, term) to its meaning within the system of which it is part. In the second (vocabulary-centric) scenario, a vocabulary as a whole is published as a reusable resource for use outside the context of a particular field in a data set. Navigation is from the vocabulary resource to a component part (term).

Both scenarios require that the description support navigation between the controlled vocabulary as a managed whole and its component parts.
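
The two navigation directions can be illustrated with a small in-memory index. This is only a sketch: the `Vocabulary` class, the `build_term_index` helper, and the `example.org` identifiers are hypothetical and are not part of any CDIF profile.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    """A vocabulary as a managed whole (hypothetical structure, for illustration)."""
    uri: str
    title: str
    # Maps each member term's identifier to its human-readable label.
    terms: dict[str, str] = field(default_factory=dict)


def build_term_index(vocabularies):
    """Map every term identifier to the vocabulary that contains it."""
    index = {}
    for vocab in vocabularies:
        for term_uri in vocab.terms:
            index[term_uri] = vocab
    return index


# Hypothetical identifiers; real ones are chosen by each community.
units = Vocabulary(
    uri="https://example.org/vocab/units",
    title="Units of measure",
    terms={
        "https://example.org/vocab/units/m": "metre",
        "https://example.org/vocab/units/kg": "kilogram",
    },
)
index = build_term_index([units])

# Scenario 1 (data-centric): from a symbol found in data to its vocabulary.
found_in_data = "https://example.org/vocab/units/kg"
owner = index[found_in_data]
label = owner.terms[found_in_data]

# Scenario 2 (vocabulary-centric): from the vocabulary to its component terms.
members = sorted(units.terms.values())
```

In a published vocabulary the same two lookups would typically be answered by resolving identifiers on the Web rather than by a local dictionary.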

To meet these use cases, each member of the vocabulary must have its own globally unique, web-resolvable identifier. Machines use these identifiers to detect when the same concept is used in two different data sets.

Persistent, resolvable identifiers (PIDs) are required, with a globally unique mapping from the identifier to a concept. The identifier must be resolvable on the Web to obtain a useful representation. How these identifier strings are formulated will vary widely across user communities; CDIF only recommends that they be included in the definition of a controlled vocabulary. The goal is that identified concepts can be reused to ease the burden of data harmonisation: if two data sets use the same concept, by referencing the same PID, then there is no ambiguity.
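
As a minimal sketch of how shared PIDs remove ambiguity, the snippet below compares the concept identifiers referenced by two data sets. The data set layout (a mapping from local field name to concept PID), the `shared_concepts` helper, and the `example.org` identifiers are assumptions made for illustration only.

```python
def shared_concepts(dataset_a, dataset_b):
    """Return the concept identifiers (PIDs) referenced by both data sets."""
    return set(dataset_a.values()) & set(dataset_b.values())


# Hypothetical PIDs; each data set maps a local field name to a concept PID.
AIR_TEMPERATURE = "https://example.org/concept/air-temperature"

station_a = {
    "temp_c": AIR_TEMPERATURE,
    "rh": "https://example.org/concept/relative-humidity",
}
station_b = {
    "T_air": AIR_TEMPERATURE,
    "wind": "https://example.org/concept/wind-speed",
}

# The local field names differ ("temp_c" vs "T_air"), but the PID is identical,
# so a machine can tell that both fields record the same concept.
common = shared_concepts(station_a, station_b)
```

The comparison is a plain string match on identifiers; it says nothing about concepts that are equivalent but published under different PIDs, which would need explicit mappings.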

CDIF recommends profiles for two kinds of controlled vocabulary resources: Codelists and Concept Schemes. A Codelist is a resource that maps short strings (codes) to meaning. At the simplest level, meaning can be conveyed by another, longer string: a ‘label’ that is more informative for users. Concept Schemes are collections of information objects that represent concepts, each with a human-intelligible label, a definition that specifies the concept, and auxiliary information, typically including relationships between concepts and information about the source of the definition. Codelists are intended for use in constructing user interfaces with pick lists for populating fields in data sets. Concept Schemes are more broadly applicable to any situation in which the meaning of some information entity (e.g. class, property, property value) needs to be made clear to avoid misunderstanding in the interpretation of data. Requirements for the Codelist and Concept Scheme profiles are as follows.

Codelist Requirements

Concept Scheme Requirements

Implementation

CDIF recommends the use of the Simple Knowledge Organisation System (SKOS) for representing concept vocabularies. SKOS is an RDF vocabulary that includes predicates to assign an identifier to a concept, provide a definition, and assign preferred, language-localized labels (strings) that humans use to identify the concept. A vocabulary service exposing the SKOS content on the Web is necessary to make the identifiers resolvable.
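
To make the SKOS elements named above concrete, the following sketch builds a JSON-LD description of one concept within a concept scheme using only the Python standard library. The `skos_concept` helper, the `example.org` identifiers, and the sample labels are hypothetical; the use of `skos:notation` to carry a short code is an illustrative choice, not a stated CDIF requirement.

```python
import json

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
SCHEME_URI = "https://example.org/vocab/observed-properties"  # hypothetical


def skos_concept(uri, scheme_uri, pref_labels, definition, notation=None):
    """Build a JSON-LD node for one skos:Concept.

    pref_labels maps a language tag to the preferred label in that language.
    """
    node = {
        "@id": uri,
        "@type": "skos:Concept",
        "skos:inScheme": {"@id": scheme_uri},
        "skos:prefLabel": [
            {"@value": text, "@language": lang}
            for lang, text in sorted(pref_labels.items())
        ],
        "skos:definition": {"@value": definition, "@language": "en"},
    }
    if notation is not None:
        # Optional short code, as might appear in a codelist.
        node["skos:notation"] = notation
    return node


document = {
    "@context": {"skos": SKOS_NS},
    "@graph": [
        {
            "@id": SCHEME_URI,
            "@type": "skos:ConceptScheme",
            "skos:prefLabel": {"@value": "Observed properties", "@language": "en"},
        },
        skos_concept(
            uri=SCHEME_URI + "/air-temperature",
            scheme_uri=SCHEME_URI,
            pref_labels={"en": "air temperature", "fr": "température de l'air"},
            definition="Temperature of the air near the surface.",
            notation="AT",
        ),
    ],
}

serialized = json.dumps(document, indent=2, ensure_ascii=False)
print(serialized)
```

The `skos:inScheme` link supports navigation from a concept to its vocabulary. Publishing this description at the concept's own URI is what a vocabulary service would need to do for the identifier to resolve.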

This use of SKOS materially aligns with that described in the document ‘Modelling of Eurostat’s Statistical Classifications in ShowVoc’ for classification items.

CDIF recommends following the guidance provided by Cox et al. (2021), ‘Ten Simple Rules for making a Vocabulary FAIR’. The CDIF recommendation to use SKOS (as described in this section) aligns with Rule 6 (Cox et al., 2021) regarding machine-readable formats for controlled vocabularies.

Note on formal statistical classifications

Documentation of formal statistical classifications includes additional information, but a detailed profile for CDIF has not been developed. CDIF recommends following the style used at Eurostat and FAO. These descriptions include additional properties, and can include tables documenting mappings between versions of classifications. This information is represented using XKOS; see the XKOS specification and user guide.

References
  1. Cox, S. J. D., Gonzalez-Beltran, A. N., Magagna, B., & Marinescu, M.-C. (2021). Ten simple rules for making a vocabulary FAIR. PLOS Computational Biology, 17(6), e1009041. https://doi.org/10.1371/journal.pcbi.1009041