Here are the main guarantees to aim for. For each business data type, the anonymization process, which can rely on one or more of the anonymization techniques mentioned above, must be:
- Irreversible: this is mandatory for personal data. More generally, the process should be irreversible, otherwise anonymization serves no purpose!
- Failing that, reversible under conditions: it can sometimes be useful to recover the original information, but this should be avoided as far as possible because it introduces an often unnecessary risk (a flaw). In any case, reversal should only be possible for the owner of the data (for example by using encryption functions, which are reversible, and keeping the secret key private).
- Deterministic: guarantees that a given value (of a given type) is always anonymized to the same output value. This is useful when the anonymization process is run several times on data of which only some elements change between runs.
- Surjective (used here in the sense of many-to-one): several values can be anonymized to the same output value, but in a deterministic way (example: “Pierre” and “Paul” will always be anonymized into “Jacques”). Bijectivity (which guarantees that each value is anonymized to a distinct output value, also in a deterministic way) may seem more relevant, for example to preserve the statistical characteristics of the data, but it is often difficult to obtain. Surjectivity is a good and usually sufficient compromise.
- Intelligible (human-readable): produces an anonymized value that a human being can understand. Some anonymization techniques produce an unintelligible output value (hash, encryption, random replacement, etc.), but an intelligible anonymized value can be obtained by combining them with another technique (such as substitution tables).
In the case of a database (the most common case), anonymization can be applied in different ways:
- In the source database itself: to be avoided, as the original information is lost and the anonymized data is directly accessible (via the application interfaces) without going through an acceptance phase.
- On a copy of the source database, directly in the target environment: also to be avoided, because non-anonymized information is accessible in the target environment until anonymization has been carried out.
- On a copy of the source database: this copy must be in a protected environment (since it will contain non-anonymized information) and must then be accepted before delivery to production or outsourcing. This is by far the best solution.
The same approach applies when anonymizing documents or any other source of information.
Business and technical constraints
We cannot end this overview of anonymization without taking into account the constraints that can be encountered. They are mainly of two types:
- Technical constraints: these include, for example, referential integrity constraints in databases (foreign keys) and other constraints (in relational databases, for example, constraints expressed through checks, certain triggers, stored procedures, etc.).
- Business constraints: these can be expressed at the database level (in which case they also become technical constraints), but they are sometimes only implemented in the application that uses the database; they may not even be formalized (users follow them without any verification by the application). Examples of business rules to be respected: the anonymized database must contain the same proportion of men and women as the source database, the average age must be the same, customer identifiers must respect the in-house construction rule, amounts and sums must be realistic, currencies (or first names) must be consistent with countries, etc.
As you can imagine, the number of possible constraints is practically unlimited, so we can only outline ways to solve the problems they pose:
- Technical constraints can generally be handled by anonymization tools, and therefore by technical solutions. This is not always easy, and the difficulty depends above all on the complexity of the physical database model. In the typical case, referential integrity constraints can be temporarily disabled and then re-enabled at the end of the anonymization process. This of course implies that the anonymization rules respect these constraints. For example, if a (primary) key made up of one or more fields must be anonymized, then it must be anonymized in the same way wherever it is used as a foreign key.
- Business constraints are more complex to handle, because they are often not formalized, or cannot be retrieved or interpreted by software:
- First and foremost, they must be identified and listed,
- Then, for each constraint, work out how to anonymize the data while still satisfying that constraint,
- Finally, implement the mechanism.
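As an illustration of the referential integrity case described above, here is a minimal sketch using Python's standard `sqlite3` module. The schema (`customer`, `purchase`), the sample rows and the `anonymize_key` rule (a truncated SHA-256 digest) are assumptions made up for this example, not something prescribed by the article. The point is only that the same deterministic function is applied to the primary key and to every foreign key that refers to it, while constraint checking is switched off and then back on.

```python
import hashlib
import sqlite3


def anonymize_key(value: str) -> str:
    """Deterministic pseudonym for a key (hypothetical rule: truncated SHA-256).

    Truncation makes collisions theoretically possible; a real tool would
    have to detect and handle them.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Autocommit mode, so that the PRAGMA statements are never issued inside
# an open transaction (SQLite ignores "PRAGMA foreign_keys" in that case).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, first_name TEXT)")
conn.execute(
    "CREATE TABLE purchase ("
    " id INTEGER PRIMARY KEY,"
    " customer_id TEXT REFERENCES customer(id))"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("C001", "Paul"), ("C002", "Pierre")],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [(1, "C001"), (2, "C001"), (3, "C002")],
)

# 1. Temporarily disable foreign key enforcement.
conn.execute("PRAGMA foreign_keys = OFF")

# 2. Apply the same rule to the primary key and to each foreign key.
old_ids = [row[0] for row in conn.execute("SELECT id FROM customer")]
for old_id in old_ids:
    new_id = anonymize_key(old_id)
    conn.execute("UPDATE customer SET id = ? WHERE id = ?", (new_id, old_id))
    conn.execute(
        "UPDATE purchase SET customer_id = ? WHERE customer_id = ?",
        (new_id, old_id),
    )

# 3. Re-enable enforcement and verify that no reference was broken.
conn.execute("PRAGMA foreign_keys = ON")
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
```

If `violations` is empty, every anonymized foreign key still points at an existing anonymized primary key. Other database engines have their own ways of disabling and re-enabling constraints, so this sequence would need to be adapted.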
Example of using anonymization
To end this article, let’s take the first name “Paul”; after SHA-2 hashing of this first name, we obtain a completely unintelligible (but irreversible!) value:
Let’s suppose we have a table of first name substitutions:
If we take the last character of the hash of “Paul”, that is to say 2, the table gives us the corresponding entry “Oliver”, an anonymized yet intelligible first name (unlike the SHA-2 hash).
Moreover, this value is irreversible, even though it is possible to find another first name that gives the same anonymized value (the operation we have just performed is surjective). This does not matter, however: the objective is to ensure that nobody can determine that the first name “Paul” produced the value “Oliver”, even when knowing the technique used and the substitution table. Because several first names map to each entry of the table, and because the cryptographic hash function is itself irreversible, the output does not identify the original value.
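The hash-then-substitute technique described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 is used as the SHA-2 variant, and the substitution table is invented for the example, except for the entry “2” giving “Oliver”, which is taken from the text. No particular output is claimed for any given input.

```python
import hashlib

# Illustrative substitution table: one first name per hexadecimal digit.
# Only the entry "2" -> "Oliver" comes from the article; the rest is invented.
SUBSTITUTION_TABLE = {
    "0": "Emma", "1": "Liam", "2": "Oliver", "3": "Sophia",
    "4": "Noah", "5": "Mia", "6": "Lucas", "7": "Chloe",
    "8": "Ethan", "9": "Zoe", "a": "Hugo", "b": "Lea",
    "c": "Adam", "d": "Nina", "e": "Jules", "f": "Ines",
}


def anonymize_first_name(first_name: str) -> str:
    """Hash the first name, then map the last hex digit to a readable name."""
    digest = hashlib.sha256(first_name.encode("utf-8")).hexdigest()
    return SUBSTITUTION_TABLE[digest[-1]]


# Deterministic: the same input always yields the same substitute.
first = anonymize_first_name("Paul")
second = anonymize_first_name("Paul")
```

With only sixteen entries, many different first names necessarily share the same substitute, which is the many-to-one behaviour the text calls surjective. A larger table, indexed by more characters of the hash, would spread the values more evenly.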
This type of algorithm can easily be adapted to other types of business data:
- For a date, you can hash the day, month and year, then take the last (or first, as you prefer) characters of each hash and convert them into numbers. Apply a modulo 31 to the day value and a modulo 12 to the month value (adding 1 to each result, so that they fall within 1 to 31 and 1 to 12), and a modulo 2015 to the year value. Then check that the date is valid (for example, that there is no such date as 04/31, and that leap years are taken into account for dates in February) and that it falls within a defined range (for example between 1900 and the current year). This makes it easy to obtain an anonymized, irreversible and realistic date.
- If for example we must maintain consistency between the date of birth and the age of each individual, one only needs to anonymize the date of birth (as above) before calculating the corresponding age (the age will therefore be implicitly anonymized).
- For a social security number (or credit card, or driver’s license, or bank account, etc.): these numbers follow fairly simple rules (value ranges for certain groups of digits, checksum at the end) which are easy to respect. A hash that is followed by value reduction operations as well as calculation of a valid checksum makes it possible to produce an anonymized number while following business constraints.
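The adaptations listed above can be sketched as follows. This is a simplified variant, not the article's exact recipe: the whole date is hashed once and the day is reduced modulo the actual length of the month, so the result is always valid and no rejection step is needed. The year range (1900 to 2015) reuses the figures from the text. For the numeric identifier, the Luhn checksum used by payment card numbers is assumed; social security numbers, bank accounts and other identifiers follow different rules that would have to be substituted. All function names are hypothetical.

```python
import calendar
import datetime
import hashlib


def anonymize_date(d: datetime.date, min_year: int = 1900,
                   max_year: int = 2015) -> datetime.date:
    """Derive a valid date within [min_year, max_year] from a hash of d."""
    digest = hashlib.sha256(d.isoformat().encode("ascii")).digest()
    year = min_year + int.from_bytes(digest[0:4], "big") % (max_year - min_year + 1)
    month = 1 + digest[4] % 12
    # monthrange gives the real month length, leap years included.
    days_in_month = calendar.monthrange(year, month)[1]
    day = 1 + digest[5] % days_in_month
    return datetime.date(year, month, day)


def age_on(birth_date: datetime.date, today: datetime.date) -> int:
    """Age computed from the (anonymized) birth date, so both stay consistent."""
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - int(before_birthday)


def luhn_check_digit(payload: str) -> str:
    """Luhn check digit for a string of digits that excludes the check digit."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if i % 2 == 0:  # double every second digit, starting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_is_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == number[-1]


def anonymize_card_number(number: str) -> str:
    """Hash the number, reduce it to the right length, append a valid checksum."""
    digest = hashlib.sha256(number.encode("ascii")).digest()
    payload_len = len(number) - 1
    payload = str(int.from_bytes(digest, "big") % 10 ** payload_len).zfill(payload_len)
    return payload + luhn_check_digit(payload)


birth = anonymize_date(datetime.date(1980, 2, 29))
age = age_on(birth, datetime.date(2020, 1, 1))
card = anonymize_card_number("4111111111111111")
```

A real implementation would also have to respect the value ranges of specific digit groups (issuer prefixes, department codes, and so on), which this sketch ignores.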
An anonymization project should not be underestimated or, even worse, neglected. The legal, economic, security and image consequences can be significant.
At present, there is no solution on the market that offers all the anonymization techniques (for example for digital documents AND database data) with all the desired guarantees (see above), complete customization of anonymization rules and the ability to respect business constraints. It is unlikely that such a solution will ever be available. In computer security, the worst situation is not knowing, and being unable to verify, the algorithms that are used (here, the anonymization mechanisms). Some solutions are free (or at least “open source”), but market leaders generally only offer closed solutions, the security of which cannot be verified.
Full control is only possible by developing a specific tool (which can be based on generic data manipulation tools, such as ETL tools), which entails costs and lead time. It also requires software development resources, along with genuine cryptography and security skills; not all companies have these (mainly for cost reasons, or simply because it is not their core business).
With an anonymization software solution found on the market, one has to weigh the license and training costs as well as the implementation time before making a decision.