The pace of change in AI technologies has been rapid, and there are many questions about how traditional data and metadata management approaches will be affected. Large language models (LLMs) require huge amounts of data for training, and the quantitative data used for research is a significant part of that. Unlike textual data, however, quantitative data is not always directly meaningful - a file full of measurements may not be useful when separated from the metadata which describes those measurements. (For example, a user may need to ask, “What is in the column labelled ‘Signif. Quant.’ and how is it defined?” The header alone does not fully explain what the numbers in that column actually are, even if it is understood to mean “significant quantity”.)
To provide data to LLMs in an optimal way, the necessary metadata must be supplied in a form which allows them to benefit fully from what is known about the data and already provided to human users. To this end, ML Commons has developed a specification known as Croissant ML. Version 1.0 of the specification did not provide a great deal of descriptive metadata regarding the contents of files, but this has been expanded significantly in version 1.1. Members of the CDIF WG are also active within the Croissant community, to make sure that the approaches used in providing metadata to LLMs are aligned with those for other FAIR purposes. Consequently, there is a high degree of complementarity between these initiatives. From the CDIF perspective, the goal is to ensure the existence of what we term “Semantic Croissant” - a set of metadata for quantitative data, provided as a rich knowledge graph optimized for consumption by AI agents - and to ensure that CDIF will always be capable of serving as the basis for high-quality Croissant metadata.
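As a rough illustration of the kind of column-level description discussed above, the following Python sketch pairs a terse file header with a fuller description. It is loosely modelled on the RecordSet/Field pattern used in Croissant, but the property names should be checked against the specification itself. The file name, the field name, and the helper `describe_column` are assumptions made for this example, and the description text is a placeholder, not a real definition.

```python
# Hypothetical column description, loosely following a RecordSet/Field pattern.
# All names and values are illustrative assumptions, not taken from a real dataset.
field = {
    "@type": "cr:Field",
    "name": "signif_quant",
    # Placeholder: the actual definition must come from the data producer.
    "description": "<definition of 'significant quantity' supplied by the data producer>",
    "dataType": "sc:Float",
    "source": {
        "fileObject": {"@id": "measurements.csv"},  # hypothetical file
        "extract": {"column": "Signif. Quant."},    # header as it appears in the file
    },
}

record_set = {
    "@type": "cr:RecordSet",
    "name": "measurements",
    "field": [field],
}


def describe_column(record_set, header):
    """Return the description attached to a file header, or None if undocumented."""
    for f in record_set["field"]:
        if f["source"]["extract"]["column"] == header:
            return f["description"]
    return None


print(describe_column(record_set, "Signif. Quant."))
```

A consumer, human or machine, can then look up the header it encounters in the file and retrieve the documented meaning; a header with no matching field is reported as undocumented and need not be guessed at.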
The emergence of generative AI as a huge force for change in how data is consumed and used has potentially negative effects for data producers, however, and these must also be considered. When it comes to data access, there is insufficient standardization today to ensure that AI agents have the needed information to behave responsibly with the data they consume. Some data providers have reacted to this by removing their data from the Web altogether, so that it is no longer directly available to any intelligences, human or artificial. The situation is problematic, and can only be solved by having a standard expression of conditions of use and licensing, couched in terms which can be read and understood by both humans and machines. Further, there needs to be some mechanism for enforcement, such that access is practically controlled by the owners and stewards of data.
Toward this end, the CDIF WG has been exploring how different systems of access, based on the Open Digital Rights Language (ODRL) and other mechanisms, can be combined. This has raised interest in the use of Decentralized Identifiers (DIDs). This remains an area of focus.
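To make the idea of machine-readable conditions of use more concrete, the sketch below expresses a small ODRL-style policy as a Python dictionary and applies a deliberately simplified check to it. The identifiers under `example.org` and the function `is_allowed` are hypothetical. The check ignores constraints, duties, action hierarchies, and conflict strategies, so it is not a conformant ODRL evaluator and does not represent an approach adopted by CDIF.

```python
# Hypothetical ODRL-style policy; the URIs are placeholders for illustration.
DATASET = "https://example.org/dataset/42"

policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "https://example.org/policy/1",
    "permission": [{"target": DATASET, "action": "read"}],
    "prohibition": [{"target": DATASET, "action": "derive"}],
}


def is_allowed(policy, target, action):
    """Simplified check: a matching prohibition always wins;
    otherwise an explicit matching permission is required."""

    def matches(rule):
        return rule.get("target") == target and rule.get("action") == action

    if any(matches(rule) for rule in policy.get("prohibition", [])):
        return False
    return any(matches(rule) for rule in policy.get("permission", []))


print(is_allowed(policy, DATASET, "read"))    # permitted explicitly
print(is_allowed(policy, DATASET, "derive"))  # prohibited explicitly
```

An agent that consults such a policy before acting has the information it needs to behave responsibly. Enforcement, as noted above, still requires a separate mechanism controlled by the owners and stewards of the data.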
AI topics go beyond what is discussed here: generative AI tools can be extremely helpful in the production of needed metadata and documentation, so long as they are applied correctly, and can also play a significant role in harmonization of concepts and the integration and analysis of data. As it is further developed, CDIF will be looking at how such applications impact the core set of metadata needed for FAIR exchange of data across domains and infrastructures.