Talking Tech with Rachel Cummings & Daniel Susser on Differential Privacy
CDT Tech Talks | 2024-04-04 | 29:49
In recent years, differential privacy has emerged as a promising solution for enhancing privacy protections in data processing systems. However, beneath its seemingly robust framework lie certain assumptions that, if left unquestioned, could inadvertently undermine its efficacy in safeguarding individual privacy.
Here to discuss their recent papers on differential privacy are Rachel Cummings, Associate Professor of Industrial Engineering and Operations Research at Columbia University and CDT Non-Resident Fellow, and Daniel Susser, Associate Professor for the Department of Information Science at Cornell University and CDT Non-Resident Fellow.
Top Keywords
- privacy 0.045
- differential privacy 0.042
- differential 0.036
- data 0.012
- rachel 0.010
- epsilon 0.009
- guarantees 0.008
- mathematical 0.007
- parameter 0.007
- privacy guarantees 0.006
- different 0.005
- practice 0.005
Transcript
Speaker 0
0:10 – 0:12
Welcome to Tech Talk by
Speaker 1
0:13 – 0:13
CDT.
Speaker 2
0:15 – 1:14
Welcome to CDT's Tech Talk, where we dish on tech and Internet policy while also explaining what these policies mean to our daily lives. I'm Jamal Magby, and it's time to talk tech. In recent years, differential privacy has emerged as a promising solution for enhancing privacy protections in data processing systems. However, beneath its seemingly robust framework lie certain assumptions that, if left unquestioned, could inadvertently undermine its efficacy in safeguarding individual privacy. Here to discuss their recent papers on differential privacy are Rachel Cummings, associate professor of industrial engineering and operations research at Columbia University and CDT nonresident fellow, and Daniel Susser, associate professor for the Department of Information Science at Cornell University and also a CDT nonresident fellow. Rachel and Dan, thank you both for being here today. Thanks for having us. It's a pleasure. To kick us off, can you both explain exactly what differential privacy is?
Speaker 1
1:15 – 2:58
Yeah. I can jump in first. Rachel is the technical expert here, but I'll say from a nontechnical vantage point, and without getting too into the weeds, I think of differential privacy as one of many relatively new privacy-enhancing technologies, which is to say it's a technical approach to protecting sensitive personal information. Of course, the most straightforward way to protect that information would be to just destroy it or encrypt it or otherwise ensure that no one could access it. I think what's important for people, especially nontechnical folks, to understand about differential privacy and related tools is that they're designed with a kind of dual mandate: on one hand to protect people's personal information while, on the other hand, at the same time, allowing for data to be shared and analyzed. So differential privacy tools are trying to help us navigate these two competing goals of sharing information while simultaneously protecting privacy. And it's becoming very popular, both in government and industry. One, because it promises to help us do both of these things at once, to allow data collectors to analyze personal data and learn from it while also helping to protect data subjects from the risk of exposure. Two, it has very precise controls for quantifying and helping us balance the trade-offs between these goals. And it offers very strong mathematical guarantees. So for all of these reasons, it's a really interesting and useful set of technologies, but those are the things I think it's most important for non-experts to understand.
Speaker 0
2:59 – 4:01
Sure. I can add to that, and I will say, Dan, I'm very sure you too are a technical expert here. And before I respond, I always like to warn all my listeners and my audiences in talks that I do have a minor speech impairment. And so if you hear a brief pause in my response, it isn't a glitchy recording. I'm still here. I think it's important to say. So to that, I would like to add something about the nature of the types of privacy guarantees differential privacy gives. It ensures that if I learn something based on analysis of a dataset containing your data, I'm not learning about you. I'm learning about the entire population. And in particular, it guarantees that if I performed the same analysis on the same dataset without you, I would have learned about the same thing. And so it's saying, I'm not learning from your data. I could have learned the same thing even if I had never seen your data.
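For readers who want the formal statement behind Rachel's description, the standard definition can be written roughly as below. The symbols are notation introduced here for illustration and do not appear in the conversation: M is the analysis, D and D' are two datasets that differ only in one person's data, S is any set of possible outputs, and ε is the privacy parameter.

```latex
% M is \varepsilon-differentially private if, for every pair of datasets
% D and D' that differ in one person's data, and every set S of outputs:
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\]
```

When ε is close to zero, the factor e^ε is close to 1, so the two probabilities are nearly equal. That is the sense in which the analysis learns "about the same thing" with or without any one person's data.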
Speaker 2
4:01 – 4:17
Daniel, your paper highlights certain assumptions associated with differential privacy that could potentially shield data collectors from criticism rather than protect data subjects from privacy harms. Could you delve into some of these assumptions and how they might impact privacy protection?
Speaker 1
4:17 – 8:26
Yeah. Differential privacy theorists usually describe it in more abstract terms than I did a minute ago. It's often described as a definition of privacy, which is to say it's a way of conceptualizing privacy that is particularly amenable to formal mathematical analysis. As humanities and social science scholars often point out, how we define problems often shapes the way that we try to solve them. And privacy is a really rich and multifaceted concept in, you know, ethics and policy debates. It's connected with a whole range of goals and values, like giving people space to autonomously develop their own beliefs and ideas, protecting people from state or corporate interference in their decision making, shielding people from having information about them used in ways that could harm them. And what Jeremy Seeman, a statistician at the University of Michigan, and I try to do in our paper is think through how differential privacy's particular understanding of or approach to privacy might frame the way people think about privacy: which of the concerns about privacy it brings into view and tackles head on, and which concerns it might place outside its frame and therefore might lead people to ignore. So for example, you know, differential privacy defines privacy in a very specific way. Another gloss on what Rachel just said is that differential privacy begins by saying, look, there's no such thing as perfect anonymization. Any useful statistic is going to leak information about the dataset it describes. So if we're going to do statistical data analysis, if we are going to analyze people's personal information, then the question can't be, will this technology perfectly protect my information? Instead, the question has to be, how much additional risk of disclosure am I willing to tolerate by contributing my information to this particular dataset or database? And that's a really good question.
It's a really important question to ask and an important problem to solve. But what we argue is that this frames privacy considerations in a primarily forward-looking direction. So it asks, how much more risk am I willing to accept moving forward? And when it does that, it potentially shifts attention away from other equally important privacy questions. So questions like, should this information have been collected in the first place? Was it collected appropriately? Should it be stored in this database, subject to, you know, these particular parties' control? Who decides who gets access to the data and on what terms? Whose interests does the existence of this database serve? All of these are questions that are sort of outside of differential privacy's frame. And if we focus too much on the tool, we might not ask these other, broader privacy questions or give them as much attention. So differential privacy, we argue, if not implemented very carefully, could be used to shift attention away from harmful surveillance practices, for example, away from questions about why and how data is collected in the first place, which could shield data collectors from criticism. And unless we're really clear about this, it's possible that, you know, the adoption of differential privacy could lead to more data collection rather than less data collection. And so we should be really explicit about that and evaluate that possibility collectively.
Speaker 2
8:27 – 8:33
And that leads me to my next question, over to Rachel. Your paper focuses on the challenges of communicating the implications of the privacy budget parameter to individuals contributing their data. How crucial is communication in avoiding privacy theater and enabling more informed data sharing decisions?
Speaker 0
8:33 – 11:31
That is an excellent question, and I'm so glad that you asked it because, in fact, it is really important. I think that what we are saying here today is that differential privacy is great. It is a very strong guarantee. It provides guarantees against all kinds of adversaries, all kinds of bad things that might happen to your data. But there are some important nuances in it. And in particular, differential privacy has a privacy parameter. It's called epsilon, and it really controls how much information is leaked about the individual in this analysis. And that parameter can range anywhere from zero to infinity. And so it's really important to communicate what this privacy parameter is. And so back to my original differential privacy description, where I said we're guaranteeing I'm gonna learn about the same thing if I had never seen your data, and it's about the same instead of exactly the same, that's really where this epsilon parameter appears. And so if I'm saying I'm gonna learn something very, very, very similar even if I had never seen your data, then it's saying your data has sort of no impact, and so there's approximately nothing bad that can happen to you as a result of this. But on the other hand, if I say I can learn something that can be arbitrarily close to or far from what I would have learned if I hadn't seen your data, well, now you're sort of opening the door for the possibility of some bad things or some harms coming as a result of the analysis, as a result of sharing your data. And so describing what the value of this epsilon parameter is is extremely important in terms of the transparency of a differentially private system.
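One way to see how epsilon controls the trade-off described here is the Laplace mechanism, a standard way to release a differentially private count. The sketch below is illustrative only: the conversation does not name a specific mechanism, and the function names, the true count of 1000, and the epsilon values are assumptions chosen for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-DP via the Laplace mechanism.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 10_000, seed: int = 0) -> float:
    """Average distance between the noisy release and the true count."""
    rng = random.Random(seed)
    true_count = 1000  # hypothetical value, for illustration only
    errors = [abs(private_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    # Smaller epsilon: stronger privacy, noisier answer (error is about 1/epsilon).
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

The average error shrinks roughly in proportion to 1/epsilon, so a very large epsilon adds almost no noise and therefore offers almost no protection, which is why reporting the value matters.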
However, one challenge is that there is no intuitive explanation for what a privacy parameter value means. How should you feel if it's 0.5? How should you feel if it's one? We don't really have any good answer for this. And so this is a part of what my work is. It's sort of thinking about how we translate some abstract mathematical parameter that has no units and no inherent meaning into something that really does mean something as a privacy guarantee for people. And so I think finding ways, first of all, to just ensure companies report the privacy parameters that they're using is important, to make sure it's not infinity, where I can do anything arbitrarily bad I want to your data. Then also, on top of that, making sure that if you want to communicate this to the people who are going to be sharing their data, we can't just tell them, I chose epsilon equals one and therefore you should feel happy, but rather communicating what the implications of this choice are in terms of the other privacy concerns you might have. How does this parameter value relate to what you think of in your head as being privacy?
Speaker 2
11:32 – 11:47
We mentioned a mathematical equation or framework, and Daniel, I want to come back to you because you introduced differential privacy as both a mathematical framework and a socio-technical system, illustrating differences between the two through a hypothetical case study. Can you discuss some of the key differences highlighted in the study and their implications for governing differential privacy systems?
Speaker 1
11:48 – 15:34
So as we've discussed, you know, differential privacy is not one concrete thing. It's an approach. It's a way of thinking about and mathematically tackling privacy problems. It comes in all sorts of varieties that rely on different mathematical strategies for realizing its goals. And importantly, it's a theoretical approach. To use it in practice, that theory has to be translated into real world systems. As Rachel was just describing, the math describes this variable called epsilon, which has to be set at some point in order to actually concretely translate the theory into practice. And whenever we move from theory to practice in any domain, certain things get shifted around or lost in the process. You know, assumptions that theorists make sometimes don't hold up, or we end up with approximations rather than perfect realizations of our plans. DP theorists know this, of course. One goal of Jeremy's and my paper was to help non-experts understand that just because a dataset is, quote, unquote, differential privacy protected, that does not in itself mean very much. And to evaluate the real world protections a differentially private system offers, you need to know a lot about the implementation details. So as Rachel was just saying, for example, one thing you might wanna know about is what the value of epsilon is. So in the paper, to illustrate this, we describe a hypothetical scenario, I think a fairly realistic but hypothetical scenario, where researchers at a hospital want to study patient data. So to encourage patients to consent to contributing their data, the hospital assures them that their risk of exposure or harm from releasing their data will be minimal because differential privacy is going to be applied. But for a variety of reasons, it turns out at the end of our story that all the parties end up unsatisfied.
Applying differential privacy creates too much statistical noise for the researchers' findings to be well supported scientifically, and the patients end up harmed in unexpected ways. And, you know, for the details, you'll have to read the paper. The whole paper sort of unspools the details. But our overarching point is that it's in this move from theory to practice, from, you know, what we call differential privacy as math to differential privacy as real world socio-technical systems, where a lot of the important issues get decided, where the real world meaning, as Rachel was saying, of differential privacy guarantees is actually determined. And that translation process involves a range of different parties with different interests embedded in different institutions. They often value and prioritize privacy and statistical data utility differently from one another. There are significant disparities of knowledge and expertise. So if we're going to rely on differential privacy and related technologies for our privacy, we need to pay really close attention to the process through which all of these different parties help translate theory into actual concrete systems to ensure that everyone's interests and values are recognized and accounted for.
Speaker 2
15:34 – 15:56
I wanna move us forward just a bit. And Rachel, this is to you. Your paper proposes three methods to convey probabilistic DP guarantees to end users, drawing on best practices in risk communication and usability. Can you provide an overview of these methods and how they contribute to a better user understanding of privacy guarantees?
Speaker 0
15:58 – 20:08
Absolutely. And this is really about translating between the mathematical guarantees in some abstract sense and what they really, concretely mean for the end users. So one thing is that it is just an observed fact empirically: humans are pretty bad at reasoning about probabilities of events, especially rare events. So for one thing, if I were to tell you something will happen with probability 0.35, it's really hard to sort of intuit, is it likely, is it unlikely? But if I say there's a 35% chance of it happening, now we can relate to that. We're like, okay, so let's say there's a 35% chance of rain. I'm used to thinking about that in my daily life. Even more, if I were to say, 35 out of 100 times, something would happen. And these are all the same, and they're all describing the same underlying mathematical thing, but there are ways to make these probabilities much more concrete and much more interpretable to humans. Anything involving physical objects or visualization, for example. We all have a really good intuitive understanding of probability one half because of a coin flip. We saw it in the Super Bowl recently. But we have an understanding of that. We have an understanding of a one out of six probability because we roll dice. And so tying a mathematical abstraction to something physical like a coin flip or a die roll or even a roulette wheel spinning makes it much more understandable because we're used to thinking about these things in our daily life. And so back to our epsilon question, how do we translate privacy guarantees into something interpretable? So we developed three different methods in our paper for explaining what the implications of a privacy guarantee are in a particular scenario. And I'll say first, we began with a very concrete scenario.
So we said, imagine you are at work and your company is doing a survey about your manager and whether they're doing a good job, and you think they're not doing a good job and you want them to receive feedback about that, but you're also worried about the repercussions of this. And so already we are framing something concrete, and I understand why there's a value in the information sharing, but I also understand why there is a privacy need here. And then we tell them that their manager is going to receive a differentially privatized version of the responses, and then we can vary epsilon in different scenarios. And so we explained the impact of the epsilon parameter, which again is sort of abstract and mathematical, in terms that are important: how likely is it that their manager will think you personally wrote that they're doing a bad job? No one wants that. And so, our three different methods. Our first one was using a text-based description of the odds. If you say yes, then in 35 out of 100 possible reports, your manager will think you said yes. Our second one paired this with a visualization. That was called an icon array, but it's effectively a dot diagram where there are 100 little boxes, 35 of them are filled in, and this helps users visualize. And our final one was just some examples that we sampled from the underlying distribution. And so if it says yes 35% of the time, then we drew a handful of samples so that users can just observe, what are the likely outputs of this? Do I feel okay with this? And then we asked them several questions, including things like, you know, do you feel confident in making this decision? Do you feel like you have enough information? Do you want to share data?
And we found that our second one, pairing this text-based explanation with a visualization, improved users' comprehension of the privacy guarantees as well as their confidence in their own decision making process.
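The three explanation styles described above (odds text, icon array, sampled examples) can be sketched in a few lines. This is a rough illustration, not the materials used in the paper: it assumes binary randomized response as the underlying mechanism, which the conversation does not specify, and all function names are hypothetical.

```python
import math
import random


def truthful_probability(epsilon: float) -> float:
    """Chance that binary randomized response reports the true answer.

    Assumes the standard scheme: answer truthfully with probability
    e^eps / (1 + e^eps), otherwise report the opposite answer.
    """
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def odds_text(p: float, total: int = 100) -> str:
    """Method 1: express a probability as a frequency, e.g. '35 out of 100'."""
    return f"{round(p * total)} out of {total}"


def icon_array(p: float, total: int = 100, per_row: int = 10) -> str:
    """Method 2: a grid of boxes with round(p * total) of them filled in."""
    filled = round(p * total)
    cells = ["#"] * filled + ["."] * (total - filled)
    rows = [" ".join(cells[i:i + per_row]) for i in range(0, total, per_row)]
    return "\n".join(rows)


def sample_reports(p: float, n: int = 10, seed: int = 0) -> list[str]:
    """Method 3: draw a handful of example reports, 'yes' with probability p."""
    rng = random.Random(seed)
    return ["yes" if rng.random() < p else "no" for _ in range(n)]


if __name__ == "__main__":
    p = 0.35  # the figure used in the conversation
    print(odds_text(p))
    print(icon_array(p))
    print(sample_reports(p))
    # Linking back to epsilon under the randomized response assumption:
    print(odds_text(truthful_probability(1.0)))
```

All three outputs describe the same underlying probability. The difference is only in presentation, which is the point made in the conversation about frequencies and visuals being easier to reason about than a bare decimal.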
Speaker 2
20:09 – 20:24
Both papers address critical aspects of DP and its application in real world settings. How might the findings and recommendations from these papers impact the ongoing discourse on privacy regulations and the development of privacy enhancing technologies?
Speaker 0
20:24 – 22:21
I can jump in with this because I feel like this is really where we are going. And as Dan mentioned, we're seeing differential privacy move from being something theoretical and conceptual into something applied and practical, used by major organizations and the US government. And so we want to encourage organizations to use differential privacy because it is a current best practice. However, one major challenge is that our current privacy laws in the US do not incorporate differential privacy or even most other advanced privacy technologies that we think, as a technical best practice, companies and government organizations should be using. And so one challenge is, if the lawyers are not sure if doing this is legal, then they won't do it. So I would love to see privacy laws, as they are evolving, incorporate differential privacy as well as other advanced privacy-enhancing technologies. And we saw recently, coming out of the White House, there was an executive order that in fact explicitly mentioned differential privacy as a desirable privacy technology. And so I think that we're seeing pressure and interest from the government, acknowledging we should incorporate differential privacy into our laws. But I think also, as we have been saying, nuance really matters. And so it isn't enough to just say you should do differential privacy, but rather things like the privacy parameters, some of the other security protections that are in place, the ethical value decisions around data collection in the first place that Dan mentioned, all of those are important. And so the laws shouldn't just say, definitely use differential privacy, the end, but rather should include a lot of these nuances. And I think this is the space that we're in now, trying to formalize what we might like regulations around privacy to say about differential privacy.
Speaker 1
22:21 – 26:02
I totally agree. I think, you know, privacy is so hard in a world of ubiquitous data collection. It can feel impossible to do anything about it. And there is so much excitement about the potential to use all of the data out there about us to solve real problems in science and medicine and public health. And I think regulators are under pressure both to do something about the privacy problems, but also not to stymie that potential to use all of this data about us to do good and interesting things in the world. So they're understandably and, I think, rightfully excited about potential technical solutions like differential privacy and other privacy-enhancing technologies that promise to help us navigate these tensions or these countervailing pressures. But to some extent, we've seen this movie before. You know, earlier anonymization methods made similar kinds of promises. And so regulators created carve-outs for so-called anonymized data in the few privacy laws that they passed. And then technical experts showed how brittle those earlier methods were, how easily they could be circumvented in practice. And in part, differential privacy was developed to respond to those weaknesses in older methods. And it is truly an amazing advance over them. It's a really incredible set of technical approaches. But as we've been discussing, you know, unless we're really careful about how we deploy this new generation of privacy-enhancing technologies, I think we run the risk of doing the same thing all over again. So we need to educate people, both individuals and lawmakers, about these tools. And, you know, the work Rachel was describing that she and her colleagues have been doing to help translate these very unintuitive mathematical concepts into frames that people can wrestle with more intuitively is really crucial for doing that.
I think we need to think carefully about in which cases or contexts these are appropriate tools for helping us solve problems and when they're not the right tools for helping us solve them. When we do decide to use differential privacy, we should pay close attention to the kinds of implementation details that we've been talking about throughout this conversation. And importantly, I think, you know, Rachel alluded to this a little bit, we need to think about and pay attention to who is implementing these tools. So are we talking about institutions like the census, for example, that we have reason to trust will act in our interests? Or are we talking about differential privacy being put into practice by private firms like Apple or Google that, you know, might be acting in our interests, or might be pursuing their own goals that run at cross purposes to our interests? And, you know, we've seen cases where companies have made privacy promises using the language of differential privacy that in practice turned out not to be particularly meaningful. So I think we should be, you know, skeptical of those claims and not take them at face value. But in general, just be really sensitive not only to this tool, the set of tools, and how it's being used, but also to who's using the tools and who's making decisions about how they're implemented.
Speaker 2
26:03 – 26:18
So I wanna look ahead a bit, and I would like to ask, what are the potential challenges and opportunities in furthering our understanding and implementation of DP, particularly in the light of ever evolving privacy concerns and technological advancement?
Speaker 1
26:19 – 27:47
So I'll jump in first and quickly just say, you know, I think I've been sounding very careful and maybe even skeptical notes. I wanna emphasize that I think differential privacy is amazing. It offers a really incredible set of tools that can help us meaningfully advance our privacy interests if implemented carefully and thoughtfully. But because it's so technically complex, and because, as I've been saying, you know, so much of its real world promise depends on how it's implemented in practice, how the math is translated into real world socio-technical systems, there are just real challenges both around helping individuals understand it, but also, as Jeremy and I emphasize in our paper, in understanding and shaping the institutional contexts that these tools are embedded in. So, you know, if we fail to do that, if we fail to deploy these tools carefully, like I said before, I think we run the risk in practice of differential privacy creating more privacy theater and less actual privacy. So I'm really, you know, thrilled that groups like CDT are trying to help people and policy makers and regulators understand these tools so that we can figure out how to deploy them in ways that really meaningfully advance people's privacy.
Speaker 0
27:48 – 29:03
And I'll just build on that. Dan, I love the fact that you pointed out deployments look different in different contexts, and I think this is so valuable as well. I think one of the challenges in implementing differential privacy in practice, in the wild, broadly, and looking at privacy regulations to accompany that, is that it is not necessarily one size fits all, but rather a lot of these implementations are different, and they should be different across different contexts. It makes sense that the Census would do things differently from places like, you know, Apple and Google, which should do things differently from startups, which should do things differently from maybe medical research groups. All these things have different privacy needs, have different accuracy needs, and they have different infrastructural abilities, and they have different levels of trust from their users, and that should be reflected in the appropriate privacy tool and privacy implementation in that context. And so this is what makes it challenging, because you can't just say everybody should use differential privacy in the following way, but in fact, it really should depend on the context in a sort of nuanced, complex way. And so then extending it into future deployments can be a slower process, even though we think it has a lot of promise and a lot of value add.
Speaker 2
29:04 – 29:25
Wow. Well, thank you both so very much for being here today. I think this is the perfect place to stop. We will, of course, add a link to both of your papers within the description of this podcast. But I wanna take a second out to say thank you to Rachel and Dan for joining us here on Tech Talks. We really appreciate you joining us and lending your voice to our platform.
Speaker 0
29:25 – 29:31
Thanks so much for having us. Absolutely, Jamal. Thank you.
Speaker 2
29:31 – 29:47
Of course. And for all of our listeners, to keep up with the work of CDT's policy teams, please visit us at cdt.org and follow us on Facebook, Mastodon, LinkedIn, and the social media platform formerly known as Twitter at CenDemTech. I'm Jamal Magby, and thank you for talking tech.