The following article, written by Woodway’s CEO and Founder, Khaled El Emam, was originally published by OneTrust Data Guidance.
Summary
Key Takeaways
- Anonymization Guidance: The Ontario IPC provides detailed guidelines for anonymization, addressing uncertainties in data use for AI.
- Public vs Non-Public Data: Differentiates between public and non-public data releases, affecting risk assessment and safeguards.
- Re-identification Risk: Risk assessment should consider the recipient's controls and capabilities, following a relative approach.
- Quantitative Methods: Sophisticated methods for evaluating re-identification risk are necessary and should be automated.
- Inference Assessment: Updated interpretation of inferences helps quantify and compare risks to determine acceptability.
Things to Consider
- Implement structured anonymization practices: Organizations should adopt the IPC guidelines to ensure responsible data use and compliance with regulatory expectations.
The Ontario IPC provides detailed guidelines for anonymization, addressing uncertainties in data use for AI and emphasizing structured risk assessment.
Access to data is critical for training artificial intelligence (AI) models, and such access has taken on a new urgency as organizations mobilize to leverage AI technologies and remain competitive. At the same time, many organizations face a practical problem: They want to use data for AI and other secondary purposes, but they are uncertain about what anonymization requires in practice, how to assess risk, and what regulators expect. Given that it is not always known how AI models will be trained and used, prior and current consents did not and do not capture the specific purposes for using these data. This makes the concept of anonymization a critical one to enable the use and disclosure of these data for secondary purposes.
Recent guidelines published by the Office of the Information and Privacy Commissioner of Ontario (IPC) provide a detailed process for anonymization (called ‘de-identification’ in the IPC guidance). This guidance also directly addresses some of the uncertainties that have troubled anonymization implementation efforts.
For privacy, data protection, and compliance professionals, it offers a practical and defensible basis for making decisions about data use and sharing.
In this Insight article, Dr. Khaled El Emam, Founder & CEO of EviData by Woodway Assurance, describes these uncertainties and how the IPC guidance addresses them. By reducing such uncertainties, it should be easier for organizations to implement best anonymization practices and have greater confidence that they will meet regulatory expectations.
Why public and non-public data releases must be treated differently
As will become clear in the rest of this article, it is important to distinguish between data that is being released publicly and data that is being shared in a non-public manner. For example, when an organization reuses its own anonymized data or shares anonymized data with an external business partner or with an academic institution, these would be examples of non-public data releases.
In non-public contexts, the data recipient is known. It is also possible to have contractual controls, such as a data sharing agreement between the custodian and the recipient. The agreement can prescribe specific security and privacy controls that the recipient needs to put in place to manage any residual re-identification risks. So a re-identification risk assessment does not need to assume an unknown recipient or an absence of controls.
Re-identification risk depends on the anticipated recipient
The IPC guidance makes clear that the risk of re-identification should be assessed from the perspective of the data recipient. This means that the analyst should take into account the information to which the recipient has access, as well as the recipient's controls and capabilities. This also means that for any specific dataset, the risk of re-identification may be different depending on who the recipient is, and therefore, the risk is a function of the data and the characteristics of the recipient. In other words, risk cannot be assessed in the abstract – it must be assessed in context.
This approach, sometimes referred to as the ‘relative’ approach to re-identification risk assessment (as opposed to the ‘absolute’ approach), is not new. It is part of the Expert Determination de-identification method under the Health Insurance Portability and Accountability Act (HIPAA) in the US, which has been used for more than two decades and represents well-established practice. A significant amount of knowledge and expertise has accumulated over that time on how to perform these types of assessments efficiently and at scale, and the IPC guidance provides a methodology for doing exactly that.
In the situation where data is released publicly, for example, in the context of open data or open government initiatives, the recipient is not known in advance. Here, more conservative assumptions can be made about the data recipient by assuming that they do not have meaningful security and privacy controls in place and that they would not be expected to adhere to contractual provisions or terms of service. This is a scenario where one can take a more absolute approach to risk assessment. But as noted above, it is important to make a distinction between public and non-public data releases and data reuse, as the parameters are quite different. For organizations, this can affect both how risk is assessed and what safeguards may be relevant.
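To make the contrast between the relative and absolute approaches concrete, the following is a minimal Python sketch. It assumes one common formulation from the re-identification risk literature, in which overall risk is the probability of re-identification given an attempt multiplied by the probability that an attempt occurs in the recipient's context. The `Recipient` class, the `contextual_risk` function, and all numeric values are hypothetical illustrations, not figures or formulas taken from the IPC guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipient:
    """Hypothetical description of a data recipient."""
    name: str
    # Assumed probability (0 to 1) that a re-identification attempt occurs,
    # reflecting the recipient's controls, contracts, and capabilities.
    prob_attempt: float


def contextual_risk(data_risk: float, recipient: Recipient) -> float:
    """Overall risk = P(re-identification | attempt) * P(attempt)."""
    for value in (data_risk, recipient.prob_attempt):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return data_risk * recipient.prob_attempt


# Illustrative values only: the same dataset, two different recipients.
data_risk = 0.2
partner = Recipient("partner under a data sharing agreement", prob_attempt=0.1)
public = Recipient("unknown public recipient", prob_attempt=1.0)

partner_risk = contextual_risk(data_risk, partner)
public_risk = contextual_risk(data_risk, public)
print(f"{partner.name}: {partner_risk:.3f}")
print(f"{public.name}: {public_risk:.3f}")
```

Setting the attempt probability to 1 for the public case mirrors the conservative assumption that no controls can be relied upon, which is why the same data carries a higher assessed risk when released publicly.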
Risk thresholds are well established
As data complexity increases, with multiple datasets linked and integrated, more sophisticated quantitative methods for evaluating re-identification risk are needed. Simple approaches to anonymization, such as lists of variables to remove or simple heuristics, may have been reasonable in the past, but they are difficult to justify in today's environment. They either are not protective enough or are too conservative and have a large negative impact on the value of the data.
A large body of quantitative techniques has been developed over the years to assess re-identification risk, and over time these have become quite accurate at modeling and measuring risk.
These quantitative methods require extensive computation, so they are not intended to be applied manually and need to be automated. The level of automation available today is high, so there is no reason not to apply them. This is especially important for organizations trying to assess risk consistently and at scale.
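As a small illustration of why these calculations lend themselves to automation, the sketch below groups records by their quasi-identifiers and derives two commonly used metrics from the group sizes. It assumes the widely used equivalence-class approach, where each record's risk is taken as one divided by the number of records sharing its quasi-identifier values. The function name, column names, and toy records are hypothetical, and production tools use more sophisticated estimators than this.

```python
from collections import Counter


def equivalence_class_risks(records, quasi_identifiers):
    """Return maximum and average per-record risk based on group sizes."""
    if not records:
        raise ValueError("records must not be empty")
    # Each record's key is its combination of quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    # A record in a group of size k is assigned a risk of 1/k.
    per_record = [1.0 / sizes[k] for k in keys]
    return {
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
    }


# Toy data: three records share one combination, two share another.
records = [
    {"age_band": "30-39", "region": "East", "diagnosis": "A"},
    {"age_band": "30-39", "region": "East", "diagnosis": "B"},
    {"age_band": "30-39", "region": "East", "diagnosis": "A"},
    {"age_band": "40-49", "region": "West", "diagnosis": "C"},
    {"age_band": "40-49", "region": "West", "diagnosis": "A"},
]
risks = equivalence_class_risks(records, ["age_band", "region"])
print(risks)
```

Even on five records the bookkeeping is tedious by hand. On linked datasets with many quasi-identifiers it is only practical in software.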
Once quantitative methods are applied, the next question is what the acceptable level of risk in the data would be. The IPC guidance provides very precise thresholds for acceptable risk, and these are consistent with international standards and values used in North America and Europe for data sharing.
Furthermore, the manner in which these thresholds are calculated will depend on whether the release is public or non-public. Therefore, there is a clear and prescriptive articulation of risk metrics and risk thresholds.
These thresholds do not have to be static and may change over time. At this point, however, the IPC guidance gives unambiguous direction about what threshold values should be used. Whenever a reason for change emerges and the guidance is updated, practices can be adjusted accordingly.
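Once a risk value has been measured, comparing it against a threshold is a simple step. The sketch below assumes, as a common convention rather than a quotation from the IPC guidance, that maximum risk is the relevant metric for public releases and average risk for non-public releases. The function name, the input risk values, and the threshold of 0.1 are placeholders; the actual metrics and values should be taken from the guidance itself.

```python
def is_release_acceptable(max_risk, avg_risk, release_type, threshold):
    """Compare the metric assumed for the release type against a threshold."""
    if release_type == "public":
        # Assumption: public releases are judged on the worst-case record.
        measured = max_risk
    elif release_type == "non_public":
        # Assumption: non-public releases are judged on average risk.
        measured = avg_risk
    else:
        raise ValueError("release_type must be 'public' or 'non_public'")
    return measured <= threshold


# Placeholder numbers for illustration only.
placeholder_threshold = 0.1
public_ok = is_release_acceptable(0.5, 0.08, "public", placeholder_threshold)
non_public_ok = is_release_acceptable(0.5, 0.08, "non_public", placeholder_threshold)
print("public release acceptable:", public_ok)
print("non-public release acceptable:", non_public_ok)
```

Keeping the threshold as an explicit argument reflects the point that threshold values may be updated over time without changing the assessment logic.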
A more practical way to assess inferences
One of the more challenging concepts under the heading of anonymity has been the operationalization of the three criteria published by the Article 29 Working Party in 2014 in their Anonymization Techniques opinion: singling out, linkability, and inferences. The community has developed good practices and metrics to evaluate the first two. The concept of inferences, however, has proven difficult. It is defined as “the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.” This definition describes almost all types of (non-univariate) data analyses that can be performed on a dataset. Taken literally, the implication is that the ability to perform useful analysis on a dataset means that it is personal information.
Over time, the concept of inferences has evolved, and its understanding has become more nuanced. The IPC guidance provides one updated interpretation: if an adversary is able to learn the value of an attribute more accurately when a target individual is in the data than when they are not, then that would be an inference. This is an example of what is called attribute disclosure. The definition is relative, in that it considers the incremental information gain from being in the data versus not being in the data. With this definition, a more practical interpretation of inferences emerges, and it becomes a concept that can be quantified and compared to a threshold to determine whether it is acceptable or not.
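One way to picture this relative definition is a leave-one-out comparison. The sketch below is a simplified illustration, not the metric specified in the IPC guidance. It assumes a hypothetical adversary who guesses a target's sensitive value as the most common value among records with matching quasi-identifiers, and it compares how often that guess is correct when the target is in the data versus when the target's record is removed. The function names, column names, and toy records are all assumptions made for this example.

```python
from collections import Counter


def majority_guess(records, key, quasi_identifiers, sensitive):
    """Most common sensitive value among records matching the key, or None."""
    values = [
        r[sensitive]
        for r in records
        if tuple(r[q] for q in quasi_identifiers) == key
    ]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


def inference_gain(records, quasi_identifiers, sensitive):
    """Accuracy with the target in the data minus accuracy with it removed."""
    hits_in = hits_out = 0
    for i, target in enumerate(records):
        key = tuple(target[q] for q in quasi_identifiers)
        others = records[:i] + records[i + 1:]
        truth = target[sensitive]
        hits_in += int(majority_guess(records, key, quasi_identifiers, sensitive) == truth)
        hits_out += int(majority_guess(others, key, quasi_identifiers, sensitive) == truth)
    return (hits_in - hits_out) / len(records)


# Toy data: the last record is unique on its quasi-identifier.
records = [
    {"age_band": "30-39", "diagnosis": "A"},
    {"age_band": "30-39", "diagnosis": "A"},
    {"age_band": "30-39", "diagnosis": "A"},
    {"age_band": "30-39", "diagnosis": "B"},
    {"age_band": "40-49", "diagnosis": "C"},
]
gain = inference_gain(records, ["age_band"], "diagnosis")
print(f"incremental accuracy from being in the data: {gain:.2f}")
```

In this toy data the adversary learns the same thing about most individuals whether or not they are present, so the gain comes entirely from the one unique record. A value like this could then be compared to whatever threshold the applicable guidance sets.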
Conclusion
The IPC guidance covers more ground than we had space to discuss in this article. However, we covered some of the key clarifications on issues that have caused angst among privacy and technology professionals when implementing anonymization. The approaches described in the guidance are based on existing practices, standards, recommendations from the expert community, and lessons learned from white-hat re-identification attacks (known as motivated intruder attacks). They provide a practical approach to protecting individual privacy while still enabling access to useful data that drives innovation, especially in the context of training and validating AI models that can be beneficial to society.
Anonymization does not have to remain a matter of uncertainty or ad hoc judgment. The guidance points toward a more structured and operational approach, grounded in context, quantitative assessment, and clear thresholds. That, in turn, creates a stronger foundation for responsible data use and disclosure. As expectations evolve, organizations will increasingly need to apply – and demonstrate that they are applying – rigorous and scalable ways to assess risk, and new technologies are helping to make that possible.