Talking Tech with Josh Kroll on The NIST Privacy Framework
CDT Tech Talks | 2024-07-19 | 28:05
New internet-based technologies have boomed with unprecedented access to data and data management tools. While this has facilitated innovation, it has also left many personal users and companies alike with limited knowledge about the uses and potential harms of their data. Balancing innovation and data privacy often requires tailored approaches, which is what the National Institute of Standards and Technology, more commonly known as NIST, attempted to address with their now highly-relied upon voluntary Privacy Framework, which offers guidance for organizations to voluntarily implement to protect data privacy and security.
Top Keywords
- data 0.017
- nist 0.015
- privacy 0.014
- framework 0.013
- risk 0.012
- inferences 0.011
- information 0.010
- risk management 0.009
- privacy framework 0.009
- netflix 0.008
- incidents 0.007
- might 0.007
Transcript
Speaker 0
0:10 – 0:12
Welcome to Tech Talk by
Speaker 1
0:13 – 1:23
CDT. Welcome to CDT's Tech Talk, where we dish on tech and Internet policy while also explaining what these policies mean to our daily lives. I'm Jamal Magby, and it's time to talk tech. New Internet-based technologies have boomed with unprecedented access to data and data management tools. While this has facilitated innovation, it has also left many personal users and companies alike with limited knowledge about the uses and potential harms of their data. Balancing innovation and data privacy often requires tailored approaches, which is what the National Institute of Standards and Technology, more commonly known as NIST, attempted to address with their now highly relied upon voluntary Privacy Framework, which offers guidance for organizations to voluntarily implement to protect data privacy and security. Here to talk about the NIST Privacy Framework and its potential shortcomings in addressing data inferences is Joshua Kroll, CDT nonresident fellow and assistant professor of computer science at the Naval Postgraduate School. Josh, welcome to the show. We're glad to have you. Thanks, Jamal. Glad to be here. So to get us started, what is the NIST Privacy Framework, and why is it important?
Speaker 0
1:24 – 3:04
The NIST privacy framework is one of a number of risk management frameworks that the National Institute of Standards and Technology, or NIST, has produced for understanding risks in computer systems. So for a long time, NIST was in the business of creating cybersecurity standards that apply to US government agencies, and they still do that. These days, probably the most important of those standards is the NIST Cybersecurity Framework, which is a risk management framework that describes how to assess cybersecurity posture for an organization. And the NIST Cybersecurity Framework is required for use, by policy, by all US government agencies. And it is adopted as well by a large number of private companies. I mention the cybersecurity framework because the privacy framework is structurally and content-wise very similar to the cybersecurity risk management framework, sometimes called the RMF by people who use it a lot. But the privacy framework, in addition to being one of NIST's risk management frameworks for computing, is one of a handful of privacy engineering frameworks. And those are not, I think, as widely used as the cybersecurity frameworks are, but they're an important tool for organizing thoughts around how to do privacy and data protection work and understand organizational risk management.
Speaker 1
3:04 – 3:15
Your work is focused on data inferences. What are they and what does the NIST privacy framework have to say about them? For many years now, it's been the case
Speaker 0
3:15 – 10:31
that data scientists have been able to determine how to use data in a rich context to understand linkages between data items that might seem at first blush to be unrelated to each other. So one good historical example of this is about location data. It might be simple to think that there isn't anything particularly personal about location data: where you go, the places that show up, let's say, on your phone, where you are traveling. Your phone captures location data in a variety of ways. Your phone has a GPS chip in it, and the GPS information can be made available to apps. That's how your maps app works. But also the phone is in contact with the cell network, and the cell network knows where its towers are and which tower you're connected to. And your phone is probably in contact with more than one cell tower at once. And if your phone is in contact with, let's say, two or three cell towers, the network can get a pretty good guess at where you're located. Some of that information is available to the phone. Some of it is only available to the network operator. So you might think, well, none of that is particularly personal. It's just saying that someone is there, or that a phone with a particular hardware ID is there, and it would take identifying that hardware ID to link it back to me personally. The truth is, and there's a famous paper called Unique in the Crowd that makes this argument, that the number of points in someone's location history necessary to get a reasonably high probability that you've singled out a specific individual person is just three points. And the intuition behind that is that the number of people who live where you live and work where you work is very small, even if you live in a big apartment building and work in a big building with a lot of other people.
You know, the number of people who live in the same place and work in the same place is usually small, and you spend most of your time either at home, asleep, or at work during the day. And the result of that is, if you pick two points at random from someone's location history, they're very likely to be either the person's home or the person's workplace. And then the third point provides just sort of additional certainty or additional differentiation if you do happen to have two people who live and work in the same place and you would confuse them, or you happen to draw the same home location or the same work location twice in your first random choices. When we talk about data inferences, we mean both that kind of data linkage, or that kind of singling out, and association of different kinds of data together. So you might think that by erasing names or other identifying information from data, you would make it impossible to learn other things from the same data. And that kind of thing has been shown for a long time not to be true. You know, there was a time when people were concerned about the privacy of certain healthcare data in Massachusetts. And to demonstrate the safety of the availability of that data, the governor at the time, William Weld, made a point of saying that he would release not just a bunch of state worker healthcare records, but also make sure that his own record was in there. And a graduate student by the name of Latanya Sweeney very famously did some analysis to determine that if you had the ZIP code, age, and sex of a person, which were available in the records that were released, that was enough to guarantee that the number of people you had to distinguish between was very small, you know, two or three, I think, at most. Maybe those three may have made people unique.
I can't remember the details, but it was certainly enough to know, you know, because the governor's age was known publicly and the Governor's Mansion ZIP code was known publicly and she knew that the governor was a man, that was enough to narrow it down to just a few records. And of those records, only one had treatment at a hospital on a day that the governor was known to have visited a particular hospital for some sort of documented, reported-in-the-media ailment, a broken arm, I think. And from that, she was able to say this is the governor's health record with high certainty, and sent it off to the governor's office and said, this is the governor's health record and you should take the privacy of these record releases more seriously. For a long time after that, people thought, oh, this is the sort of thing that, you know, can be done by smart, determined grad students or smart, determined data analysts. But there have also been studies that identify or reassociate information across contexts at scale, so for large groups of people. And the first, and still one of the most famous, examples of that is a study that looked at data that were released by Netflix, which were meant to help support a competition where Netflix would seek the improvement of its movie recommendation algorithm. So it was the previous movies people had watched, and the goal was to recommend additional movies for them to watch based on that. And researchers discovered that if you had watched particular movies on Netflix, it was likely that you might have also done something like review those same movies on IMDb. And by comparing the relationship between IMDb usernames and the movies that had been reviewed or starred and the watch histories of these unnamed Netflix users, they were able to correlate, with high certainty, the IMDb usernames to the Netflix users. And sometimes that's innocuous, right?
For some people, their IMDb username might be a pseudonym, might not have anything to do with their normal name; they might take great pains to separate their persona on IMDb from their real-life persona. Some people may use their real names as their IMDb usernames, and now their entire Netflix rental history (at the time, Netflix just did DVD rentals) would be released. And that's maybe not necessarily all that sensitive. It does happen to violate a law called the Video Rental Privacy Protection Act, though, and Netflix probably didn't think, when they released these de-identified records for their recommendation competition, that they were in violation of this law, because if they thought they were in violation of it, they probably wouldn't have released the data.
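[Editor's illustration] The narrowing-down described in this answer (filter on ZIP code, age, and sex, then use one publicly known event to pick out a single record) can be sketched in a few lines. This is a minimal sketch, not the analysis Sweeney actually ran: the records, field names, and dates below are all made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
# Every value here is invented for illustration.
records = [
    {"record_id": 1, "zip_code": "00001", "age": 51, "sex": "M", "visit_date": "2000-01-15"},
    {"record_id": 2, "zip_code": "00001", "age": 51, "sex": "M", "visit_date": "2000-03-02"},
    {"record_id": 3, "zip_code": "00001", "age": 34, "sex": "F", "visit_date": "2000-01-15"},
    {"record_id": 4, "zip_code": "00002", "age": 51, "sex": "M", "visit_date": "2000-01-15"},
    {"record_id": 5, "zip_code": "00002", "age": 29, "sex": "F", "visit_date": "2000-02-20"},
]

QUASI_IDENTIFIERS = ("zip_code", "age", "sex")


def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def candidates(rows, **known):
    """Keep only the records consistent with everything the analyst already knows."""
    return [row for row in rows if all(row[k] == v for k, v in known.items())]


sizes = group_sizes(records)
smallest_group = min(sizes.values())  # a group of size 1 is already unique

# Step 1: publicly known attributes narrow the field to a handful of records.
step1 = candidates(records, zip_code="00001", age=51, sex="M")

# Step 2: one extra public fact (a reported hospital visit date) singles one out.
step2 = candidates(step1, visit_date="2000-01-15")

print(len(step1), "candidates after quasi-identifiers;", len(step2), "after the visit date")
```

The point of the sketch is that no field in the data is a name, yet combining a few ordinary attributes with one outside fact leaves a single candidate record.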
Speaker 1
10:31 – 10:37
Can you walk us through how your team evaluated whether the NIST framework is sufficient for addressing these inferences?
Speaker 0
10:38 – 16:56
Yeah. So one of the things we thought was, if the NIST privacy framework really is acting as a risk management framework that helps organizations understand privacy risk better and focus their attention on reducing privacy risks (you know, say, Netflix making the decision to release de-identified records for the purpose of the competition, not recognizing that this put them at risk of violating the Video Rental Privacy Protection Act), then it would be good if a tool that says, hey, this is a risk management framework that helps you reduce your privacy risk, actually does that. So what we did is we picked some well-known, well-studied incidents that map onto these sorts of examples that I've given. And we picked four incidents based on their similarity to broad categories of incidents that exist, if you look at the history of incidents in these kinds of privacy problems that have come up over time. We looked at four incidents based on two classes of inference, because we decided you can taxonomize inferences into inferences that re-identify individuals, so the thing you're inferring is an actual identity, like a name or an identifier, something that would count as personally identifiable information, or what we call operational inferences, which are inferences about organizations or properties of groups of people. And for the re-identification inferences, we looked at an incident where an online education platform called edX, which is jointly run by Harvard and MIT, released some data about student performance in the online education programs. And in this case, steps had been taken to make it difficult to recover the identities of the students. But nonetheless, data analysts were able to recover a large number of student identities from the data. We also looked at a release of data under a Freedom of Information Law request.
The city of New York released the history of taxi rides in New York City. The NYC Taxi and Limousine Commission was obligated under the Freedom of Information Law to release this information, and they again took steps to reduce the ability of analysts to correlate those rides together or re-identify people. But analysts were again able to make a whole bunch of sensitive inferences about where people lived and what other rides they had taken, and whether a specific individual customer, or a person living in a particular location, had, you know, visited businesses that might be unseemly in certain contexts, or maybe were at other residential addresses at times of night when it wouldn't make sense for them to be there as the place they lived. So all sorts of inferences that people might find sensitive, or might at least believe that they're not revealing by, let's say, taking a taxi and paying for it in cash. And from the operational inference side, we looked at a couple of other incidents where the harms are more inferences about what an organization is doing. So things that might ordinarily fall under the heading of operational security. A big, kind of famous example of this is an app called Map My Run, which takes fitness tracker data and shows you where you ran, and you can share this map with people on a website to demonstrate, hey, I ran this cool route, this is my running route, and you can get rewarded for, you know, running every day or a certain number of days a week, or a certain number of miles per month or year. When you take lots of people's data and aggregate it together over that running app, you get a situation where that aggregate data reveals patterns of, let's say, one of the popular running routes in your city. And if you live in a big city, that's maybe interesting, right? You might look at this map and you might say, oh, you know, I wanna find a running route in this new city I've never been to.
This is the popular one. That must be a good one. I'll try that out. On the other hand, if you look at this, let's say, in a war zone in Afghanistan, what you might find is that the popular running route at Bagram Air Force Base is along the edges of the runways, and that the map of the popular running route gives a high-fidelity picture of where the edges of the runways are, which is a dangerous thing to reveal, or a thing that might be protected in certain contexts in a formal way. Right? There might be a security policy or an operational security policy that says, hey, don't give this information away. And a similar operational inference we looked at was a story in which an operator of several franchises of pizza delivery restaurants in the Washington, DC area said that he had what he called the Washington pizza index, where he could tell how negotiations were going in Congress based on how many pizza orders were being delivered to congressional buildings at certain hours, or, you know, whether the Pentagon was preparing for some sort of military operation by looking at the number of pizza orders at odd hours delivered to buildings around the Pentagon. And when you think about these things, they're not terribly surprising. And so what you would hope is that something like the NIST framework would identify these risks and bring focus at the organizational level to their existence, and also to what needs to be done to mitigate the problem or to, you know, control the amount of data or the kind of data that's flowing out of an organization or away from an individual who, you know, might want to think about how they're protected or not protected by data protection laws. Continuing down this road, I would love to hear what specific issues within this framework
Speaker 1
16:57 – 16:59
you isolated using those case studies.
Speaker 0
16:59 – 20:48
Yeah. So one of the things that we find in this study is, when you apply the NIST framework to these four incidents, you don't actually learn a lot about the risk of the inferences that later came out. I should have mentioned, by the way, that we chose incidents that all happened prior to the creation of the NIST framework, such that none of the organizations involved could have used the NIST framework to think about their risk profile. They had to think about their risk profile in an ad hoc way or in a proprietary way internal to their own organization. So the NIST framework does not really identify any of these risks that later came to pass, suggesting that it's not doing a great job of capturing that risk information. And from my perspective, that's a problem, because what we would like is for something that stands up as a risk management or a threat modeling framework to actually perform that function, or help us do that activity in a way that makes us better at it. And if we apply it to these four incidents, we find that it doesn't really capture much of the information that was later found through other means. And we think probably it should. And I would argue that the reason for this is that, because the privacy framework is structurally and content-wise very similar to the cybersecurity framework, its focus is really on who has access to what information, but inferences sort of circumvent those controls. So in the case of the Netflix Prize or the edX dataset release or the New York Taxi and Limousine Commission dataset release, all of those were datasets that were released to the public. And so when you do that, you say, okay, well, this data is out there, it's supposed to provide benefit to the world, right?
If you release the data about historical rides in New York City, people might be able to learn about, you know, what are the popular routes that lots of people take taxis on, and maybe that's a place where we should have a bus route, and advocate for the city to stand up a bus route, because it would save a lot of taxi rides, which would save a lot of traffic in the city, and that would be a better way to organize transit. And so there are civil society groups that might actually care very much about having that data, and there might be a lot of value in having that data. But when you release it, you don't expect it to be the case that you're also releasing a lot of sensitive data about people's personal lives. And in fact, later on, people discovered that actually you are doing that, and so it would be nice if we had a tool that helped. But just to track back to what I was saying a moment ago, the reason that the privacy risk management framework does not in fact identify this risk, we hypothesize, is that it's very much focused on following a security-like policy, right: who has access to the data, is the access controlled sufficiently. But if I release data to the public, then I'm not really controlling access to that data anymore. And there are techniques and methods that would allow you to make claims about what information you remove, and whether you should add noise, and at what level you should add noise. And those techniques are not easy to apply by any means, but if they exist and the framework is supposed to help you manage risk, then it should probably help you think through that set of trade-offs, or at least identify that the trade-offs exist. And in its current version, it doesn't do that. And as NIST moves toward a revision of the privacy framework from version 1.0 to version 1.1, I would sort of hope that they would consider how it might identify these things, or not claim that the framework is a sort of general-purpose privacy risk calculus
Speaker 1
20:48 – 21:00
tool. I wanna ask, how can practical risk management and data protection policy better connect to privacy goals? Yeah. So I think that I'll go back to the last thing I said, which is,
Speaker 0
21:00 – 26:35
if we want to claim that something is a tool for managing risk, then we should want it to at least identify risks and maybe recommend good ways to reduce those risks. I think the other aspect of this that's important is to move away from the idea that we can protect data by redacting certain, quote unquote, identifiable information. So lots of data contain, you know, information that we probably agree is not sensitive. You know, someone watched Flight of the Condor on a specific day. Well, lots of people probably watched Flight of the Condor on that day. But the linkage between that information and an identifier, you know, user 12345 watched Flight of the Condor on June 5, is more sensitive, not so much because you know the specific date or because you know the specific user information, but because you can start to link that information together with other things. And so by redacting portions of the database, you make it harder for an analyst to make these linkages. And there are ways of quantifying that difficulty and making stronger and more scientifically coherent claims about how hard it is for an analyst to make claims, or what it would take to make it impossible for an analyst to make certain kinds of linkages. But at present, data protection law and data protection policy and, as a result, company policies and things like the privacy policies that you see when you visit a website or when you sign up for some sort of service will tend to separate information into personally identifiable information, like names and addresses and IP addresses, and non-personally identifiable information, like the movies you watched. Nonetheless, in some of these incidents, we see the non-sensitive information is actually enough to single out an individual, right?
There's no one else in the world who's watched the exact same set of movies that you have, and so if you were to write down all of the movies you've ever seen, no one else would have that exact same list. And what that means is that list is identifying information for you. We may know that it singles out an individual, but not be able to figure out who that individual is. But over time, using information that may not even be in the data, that can become possible, right? In the Netflix example, people didn't use only data from what Netflix released. They also used data they were able to gather from the public Internet, from IMDb. But unless you really know all the other information that's out there, it's hard to make a judgment. And so when we decide that it's good enough to protect privacy by removing personally identifiable information, then we're sort of handicapping our ability to think about risk management in a holistic way. To close this out, any final thoughts? Yeah. Just to expand a little bit on the last point, to manage risk in a holistic way, a thing that I think we see is not just brittleness from the use of personally identifiable information, but brittleness from not thinking in terms of a more holistic approach to privacy protection. And what I mean by that is, when you think about data protection compliance or privacy compliance as a box-checking activity, where there are specific requirements that you have to meet, you know, you must delete these fields from the database, and then any other use after that is fine, then you move toward a situation where your posture ends up being brittle.
So a colleague of mine recently published a paper looking at how encrypted messaging apps use third-party services to create notifications on your phone. And if you send the content of the message through the third-party service to generate the notification, because you want the content to show up in the notification, then that's easy to do, but it also reveals the content of the message to the third-party service. But the whole point of someone using the secure messaging app was to avoid anyone being able to learn the content of their message except inside the phone of the recipient. And there are secure ways to do this, or ways to do this that protect the privacy of that information. But if you think of the problem as being a problem of just tightly tying information down to the places it's allowed to go, and you decide that the third-party messaging service is really like an extension of my own infrastructure, and I don't need to think of it as separate, and I don't need to worry about information I'm putting into it as, you know, being disclosed to someone else, then your view of how much risk you're taking on in your organisation, or how much risk you're pushing out to the general public who use a product, is gonna be wrong. And that's, I would say, problematic, because we all would like to live in a world where we can treat these tools as safe and believe the representations that are made, let's say, in the privacy policies or in the marketing materials. If an app says, we're a secure messaging app, and the messages you give us will not be displayed to anyone but the recipients, then that should be how the app behaves. I completely agree.
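[Editor's illustration] Earlier in this answer, the point was made that a person's list of watched movies can single them out, and that outside data such as public reviews can then attach a name to that list. A minimal sketch of that kind of linkage follows. It uses plain Jaccard overlap as a stand-in scoring rule, which is an assumption made here for simplicity and not the method used in the published Netflix study; the user IDs, usernames, and titles are all hypothetical.

```python
# Hypothetical "anonymous" watch histories, keyed by an opaque ID.
anonymous_histories = {
    "user_001": {"Movie A", "Movie B", "Movie C", "Movie D"},
    "user_002": {"Movie A", "Movie E"},
    "user_003": {"Movie F", "Movie G", "Movie H"},
}

# Hypothetical public review profiles gathered from somewhere else entirely.
public_reviews = {
    "film_fan_42": {"Movie A", "Movie B", "Movie C"},
    "quiet_viewer": {"Movie F", "Movie H"},
}


def jaccard(a, b):
    """Overlap between two sets of titles: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(profile, histories):
    """Return (best ID, its score, margin over the runner-up) for one public profile.

    Assumes at least two histories, so that a runner-up exists.
    """
    scored = sorted(
        ((jaccard(profile, titles), uid) for uid, titles in histories.items()),
        reverse=True,
    )
    (top_score, top_id), (second_score, _) = scored[0], scored[1]
    return top_id, top_score, top_score - second_score


# Every history in this toy dataset is distinct, so each one acts as a fingerprint.
all_unique = len({frozenset(t) for t in anonymous_histories.values()}) == len(anonymous_histories)

links = {name: best_match(titles, anonymous_histories) for name, titles in public_reviews.items()}

for name, (uid, score, margin) in links.items():
    print(f"{name} -> {uid} (score {score:.2f}, margin {margin:.2f})")
```

A large margin between the best and second-best match is what lets an analyst claim a link with some confidence. The release itself contains no usernames; the risk comes from data the releasing organization never sees.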
Speaker 1
26:36 – 26:45
Well, Josh, it's been a pleasure having you today. Thank you so much for joining us, and, you have to come back when version 1.1 comes out.
Speaker 0
26:47 – 27:41
Yeah. Well, I'd be glad to, and there are several other privacy risk management and threat modeling tools. We studied the NIST framework because at the time we did this work, or started this work, the other tools weren't quite as well developed as they are now. But there's one that is increasingly widely used in the privacy research community called LINDDUN, and there's also one from the MITRE Corporation that seems to be gaining some interest. We could certainly repeat this study using those tools, or something like it, to understand comparatively how well they work. And there are probably also other ways to explore the question of whether something that measures risk is in fact measuring
Speaker 1
27:42 – 28:04
risk correctly. So that's the thing we think about. Well, we look forward to it. Thank you again for coming. And for all of our listeners, if you'd like to keep up with any of the work that CDT is doing, please visit us at cdt.org and follow us on Facebook, Mastodon, LinkedIn, and the social media platform formerly known as Twitter at @CenDemTech. Thank you all for talking tech. Talk soon.