Federated analytics

Federated analytics is an approach to reusing data without moving it from the place where it is stored, so that sensitive data can be reused in its original format. It is a privacy-by-design approach intended to prevent the reverse engineering of individual data centres, datasets, and data subjects. It is also useful when data are too large to move.

Federated analytics enables real-time use of a living dataset, in contrast to approaches where a time-fixed version of a dataset is submitted to a repository or trusted research environment and subsequently shared. It is an important tool for the analysis of time-varying sensitive data, for example in the health data space. Federated analytics can allow participant-level data to be analyzed across hospitals even though the data cannot be transferred from one hospital to another because of local data protection legislation. This depends on interoperable, high-quality data described by rich, interoperable, high-quality metadata.
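The cross-hospital case above can be illustrated with a minimal sketch, assuming the simplest possible analysis (a pooled mean). Each site computes a local summary and only that summary travels to a coordinator. The names `LocalSummary`, `summarize_locally`, and `pooled_mean` are hypothetical, and the sketch deliberately omits the disclosure controls (for example minimum cell counts) that a real deployment would need.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalSummary:
    """Aggregates a site releases; contains no participant-level rows."""
    count: int
    total: float


def summarize_locally(values: List[float]) -> LocalSummary:
    # Intended to run inside the hospital: only the aggregate is returned.
    return LocalSummary(count=len(values), total=sum(values))


def pooled_mean(summaries: List[LocalSummary]) -> float:
    # Intended to run at the coordinator, which sees per-site aggregates only.
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no observations reported by any site")
    return sum(s.total for s in summaries) / n


# Hypothetical measurements held at two hospitals (illustrative values only).
hospital_a = [5.0, 7.0, 9.0]
hospital_b = [4.0, 6.0]

summaries = [summarize_locally(hospital_a), summarize_locally(hospital_b)]
result = pooled_mean(summaries)
```

The pooled mean equals the mean that would be obtained if the rows were combined in one place, yet the coordinator never receives the rows themselves.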

Federated analytics generally requires two levels of permissions. The first operates at the level of the research question and initiative: a specific group of data assets can participate in initiatives to answer a given research question. The second specifies access conditions at the level of the individual asset. Although humans assign permissions for data assets, federated analytics depends on machine-to-machine communication of the types of actions possible for the data assets queried in the federated analysis. The data user is a software agent, and machine-actionable access policies are required during a machine-to-machine transaction to verify the identity of the agent, allow access to the target data asset(s), and check that the actions performed are those that are allowed.
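The two permission levels and the three checks in a machine-to-machine transaction might be sketched as follows. This is an illustration under stated assumptions, not a prescribed design: `Initiative`, `AssetPolicy`, and `authorize` are hypothetical names, and identity verification is reduced to a set-membership test where a real system would validate credentials.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Initiative:
    """Level 1: assets allowed to take part in a given research question."""
    name: str
    participating_assets: Set[str]


@dataclass
class AssetPolicy:
    """Level 2: access conditions attached to one individual asset."""
    allowed_agents: Set[str]
    allowed_actions: Set[str]


def authorize(agent_id: str, asset_id: str, action: str,
              initiative: Initiative,
              asset_policies: Dict[str, AssetPolicy]) -> bool:
    # Level 1: the asset must participate in this initiative.
    if asset_id not in initiative.participating_assets:
        return False
    # Level 2: the asset must carry its own policy.
    policy = asset_policies.get(asset_id)
    if policy is None:
        return False
    # Stand-in for identity verification (assumption: agent is pre-registered).
    if agent_id not in policy.allowed_agents:
        return False
    # The requested action must be one the asset's policy allows.
    return action in policy.allowed_actions


# Hypothetical initiative and asset policies.
initiative = Initiative("readmission-study", {"hospital-a/cohort"})
asset_policies = {
    "hospital-a/cohort": AssetPolicy({"agent-42"}, {"aggregate"}),
    "hospital-b/cohort": AssetPolicy({"agent-42"}, {"aggregate"}),
}

allowed = authorize("agent-42", "hospital-a/cohort", "aggregate",
                    initiative, asset_policies)
```

Here `hospital-b/cohort` has an asset-level policy but is not part of the initiative, so a request against it is refused at the first level regardless of what its own policy says.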

Federated analysis of sensitive data needs to be accompanied by machine-actionable permissions for common workflows, so that machines can authorize actions on data assets based on rules defined for the specific interactions contributing to the analysis. The current ODRL 2.2 list of Actions for Rules does not include any Actions relevant to analytics. This Scenario highlights the need to extend this vocabulary with actions relevant to federated learning or other analytical processes. ODRL coupled with a structured, machine-readable representation for workflow execution is a necessary part of federated approaches to data reuse. An extended typology of Actions needs to be developed for distributed analytics (as part of a CDIF profile).
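To make the gap concrete, the sketch below shows what an ODRL policy using extended analytics Actions could look like, written as a Python dictionary in ODRL's JSON-LD shape. The `https://example.org/analytics-profile#` namespace and the `computeAggregate` and `exportRecords` actions are invented for illustration; they are not part of ODRL 2.2 or of any published CDIF profile. The `is_permitted` helper is likewise a simplified assumption and not a full ODRL evaluator: it ignores constraints, duties, and assignees.

```python
from typing import Any, Dict

# Hypothetical profile namespace; these actions do not exist in ODRL 2.2.
EX = "https://example.org/analytics-profile#"

policy: Dict[str, Any] = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "https://example.org/policy/cohort-a-analytics",
    "profile": "https://example.org/analytics-profile",
    "permission": [
        {"target": "https://example.org/asset/cohort-a",
         "action": EX + "computeAggregate"},
    ],
    "prohibition": [
        {"target": "https://example.org/asset/cohort-a",
         "action": EX + "exportRecords"},
    ],
}


def is_permitted(policy: Dict[str, Any], target: str, action: str) -> bool:
    """Simplified check: a matching prohibition wins, else a permission is needed."""
    def matches(rule: Dict[str, Any]) -> bool:
        return rule.get("target") == target and rule.get("action") == action

    if any(matches(r) for r in policy.get("prohibition", [])):
        return False
    return any(matches(r) for r in policy.get("permission", []))


asset = "https://example.org/asset/cohort-a"
can_aggregate = is_permitted(policy, asset, EX + "computeAggregate")
can_export = is_permitted(policy, asset, EX + "exportRecords")
```

A workflow engine could consult such a policy before each step, which is why the policy language and the workflow representation need to share a vocabulary of Actions.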

Data access policies and rules might constrain software Agents to accessing only parts of a dataset for particular users or analytic workflows, for example to querying only a limited set of variables. ODRL permissions and prohibitions need to consider the processing involved in data analysis and model building, as well as privacy preservation concerns, as part of the policy action. This requires Policy makers to specify subsets or sub-structures of their data Assets with interoperable, machine-actionable representations. In a federated learning approach to data sharing, refinement of Asset descriptions, e.g. using the DDI-CDI variable model, is necessary to support these more granular access constraints for distributed analytics.
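Variable-level constraints of this kind could be enforced along the following lines. The sketch assumes that each variable has a stable identifier (plain strings stand in here for identifiers that a DDI-CDI description might supply) and that grants are keyed by agent and workflow. `VariableGrants` and `check_query` are hypothetical names.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# (agent id, workflow id) -> identifiers of variables that may be queried.
VariableGrants = Dict[Tuple[str, str], FrozenSet[str]]


def check_query(agent: str, workflow: str, requested: Iterable[str],
                grants: VariableGrants) -> List[str]:
    """Return the requested variables if all are granted, else refuse the query."""
    permitted = grants.get((agent, workflow), frozenset())
    wanted = set(requested)
    denied = sorted(wanted - permitted)
    if denied:
        # Refuse the whole query instead of silently dropping variables.
        raise PermissionError(f"variables not permitted: {denied}")
    return sorted(wanted)


# Hypothetical grant: one agent, one workflow, two variables.
grants: VariableGrants = {
    ("agent-42", "survival-model"): frozenset({"age_group", "length_of_stay"}),
}

approved = check_query("agent-42", "survival-model",
                       ["length_of_stay", "age_group"], grants)
```

Rejecting the whole request is a design choice made for this sketch; a policy could instead return only the permitted subset, provided the agent is told which variables were withheld.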