Is It Possible to Anonymize Open Data?
Although data is highly useful, how do you protect the people whose personally identifying information it contains? Read on to learn whether it really is possible to anonymize data so that you can reap the value of open data.
While the benefits of open data have been clear for some time, there’s a major concern preventing more municipalities and agencies from utilizing it. That big worry is privacy.
The Current State of Data Anonymization
Data anonymization isn’t a new development. You’ll see it in use in any country that has a census. Take US census data as an example: every ten years, the US government collects information from hundreds of millions of residents, yet it never reveals personally identifying data about individual respondents. Instead, it publishes the information in aggregate.
However, data gathered from individuals is quite valuable. Researchers have spent years trying to figure out how to preserve anonymity (and by extension, privacy) while still utilizing data gathered at the micro level.
Anonymization Techniques in Use Today
There are a few data anonymization techniques in use today that are effective at protecting people’s identities. One technique is noise addition: deliberately making published figures imprecise. For example, a census taker might personally interview Jane Smith, who is 65 years old. In the published data, her age might be nudged by a small random amount (appearing as, say, 63 or 67), or reported only as a range such as 60 to 69 years.
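To make the idea concrete, here is a minimal Python sketch of noise addition, assuming ages are stored as plain integers. The function names, the noise scale, and the bucket width are illustrative choices for this sketch, not details from the census example.

```python
import random

def add_noise_to_age(age: int, scale: float = 2.0) -> int:
    """Perturb an exact age with a small zero-mean random offset before publication."""
    noisy = age + random.gauss(0, scale)   # random.gauss(mu, sigma) from the stdlib
    return max(0, round(noisy))            # keep the published value a plausible age

def generalize_age(age: int, bucket: int = 10) -> str:
    """Alternative: publish only a range (e.g. 60-69) instead of the exact value."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

print(add_noise_to_age(65))   # e.g. 63 or 67, varies on every run
print(generalize_age(65))     # "60-69"
```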
A second method is substitution. As its name implies, substitution works by exchanging one identifying value for another, arbitrary one. To return to the example of Jane Smith: instead of listing her age as 65 years, the data set might record it as the color red (as it would for anyone else of the same age).
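Below is a minimal sketch of substitution in Python. The token list (color names) and the sample ages are assumptions for illustration; the key point is that the lookup table stays with the data holder, so published tokens can’t easily be traced back to the original values.

```python
# Hypothetical token list; any arbitrary symbols would do.
TOKENS = ["red", "blue", "green", "yellow", "purple"]

def build_substitution_table(values):
    """Assign one token per distinct value, so everyone sharing a value shares a token."""
    mapping = {}
    for value in sorted(set(values)):
        # In a real system you would need at least as many tokens as distinct values.
        mapping[value] = TOKENS[len(mapping) % len(TOKENS)]
    return mapping

ages = [65, 42, 65, 31]
table = build_substitution_table(ages)   # the table itself is kept private by the data holder
print([table[a] for a in ages])          # ['green', 'blue', 'green', 'red']; both 65s share a token
```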
The third approach to data anonymization is differential privacy. With this approach, the organization that gathered the information keeps the original records and gives third parties access only to query results or derived data sets, with enough carefully calibrated randomness added that no single individual’s data can be inferred from what is released. Noise addition is the usual mechanism for providing that guarantee, and techniques such as substitution are often applied alongside it.
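As a rough illustration, here is a sketch of the Laplace mechanism, one standard way a data holder can answer counting queries with differential privacy. The epsilon value, the sample records, and the query itself are all assumptions made for this example.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon: float = 1.0) -> float:
    """Answer 'how many records satisfy predicate?' with epsilon-differential privacy.
    A counting query changes by at most 1 when one person is added or removed,
    so Laplace noise with scale 1/epsilon is enough to mask any individual."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [65, 42, 65, 31, 58, 70]                 # the raw data never leaves the data holder
print(dp_count(ages, lambda age: age >= 60))    # close to 3, but randomized on each call
```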
A fourth means of de-identifying data is known as geomasking. Where geocoding matches street addresses to map coordinates, geomasking deliberately blurs those coordinates. There’s more than one method of geomasking, but in “donut” geomasking each geocoded address is relocated in a random direction, at least a minimum distance from its original location but no more than a certain maximum distance. Researchers have determined that donut geomasking consistently provides a higher level of privacy protection than other geomasking techniques.
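A minimal sketch of donut geomasking might look like the following, assuming coordinates on a locally flat patch of the Earth and illustrative minimum and maximum displacement distances; none of these parameter values come from the article.

```python
import math
import random

METERS_PER_DEG_LAT = 111_320   # rough conversion; adequate for a local-scale sketch

def donut_geomask(lat: float, lon: float, r_min: float = 100.0, r_max: float = 500.0):
    """Relocate a geocoded point in a random direction, at least r_min and at most
    r_max meters from its true location (the 'donut' keeps it out of the inner hole)."""
    # Sample the radius so displaced points spread uniformly over the donut's area.
    distance = math.sqrt(random.uniform(r_min ** 2, r_max ** 2))
    bearing = random.uniform(0.0, 2.0 * math.pi)
    d_lat = distance * math.cos(bearing) / METERS_PER_DEG_LAT
    d_lon = distance * math.sin(bearing) / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat + d_lat, lon + d_lon

print(donut_geomask(40.7128, -74.0060))   # only the masked coordinates are published
```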
Open Data and Privacy: Not Mutually Exclusive
When it comes to turning people’s personal information into open data, concerns about privacy are perfectly valid. Putting precautions in place, such as donut geomasking, helps ensure that identifying data remains safe and secure.
You can use open data responsibly as well as derive the greatest possible value from it while still protecting individual privacy. To learn more about how that’s possible, contact us.