Data Leverage Metagov

Speaker 1 0:00 – 0:00

Wonderful. So, at today's Metagov seminar, I'd like to welcome Nicholas Vincent. He is very soon to be a PhD holder in, let's see, Technology and Social Behavior at Northwestern. He has a really large portfolio, a lot of it around the management of data as it trains artificial intelligence, and a couple of other prongs in HCI. So with that, Nick?

Speaker 2 0:15 – 0:15

Awesome. Sounds great. Thanks so much for having me. Super excited to be here. So this is gonna be a talk on data leverage, a framework for empowering the public to mitigate harms of artificial intelligence. Just a couple notes about this talk. This was originally a very academic-y talk aimed at a human-computer interaction audience, but I have modified it a little bit to talk more about governance and things that I think will be interesting to folks here. Please feel free to ask questions throughout, I really like that, or leave questions in the chat. There are five distinct sections, so I'll try to check the chat really briefly as I transition between those sections in case there are things you wanna ask about. I'm also gonna breeze through some of the papers' technical-detail sections, but we can go back to those at the end if you want, just to try to keep things on the discussion-y side and not the me-talking side too much. So with that, I will dive right in.
So to start off, I just wanna share a little bit about the methods I use and the disciplines where the work I'm gonna talk about lives. I've done some observational and descriptive analyses of large datasets, looking at things like Wikipedia, Reddit, and the BookCorpus dataset that's used to train a lot of natural language processing models. I've done some algorithm auditing, especially looking at search engines and scraping search engines to investigate where the data that is giving search engine users their answers is coming from. And I've done some machine learning experiments, especially around recommender systems and data valuation. And then finally, I'm also interested broadly in content analysis and building tools for users, which is where I think there's gonna be some exciting overlap with the stuff that folks here are working on. A little bit about the motivation for the work.
So ever since I started grad school, I've been generally excited about the potential of AI, or maybe data-dependent systems is a better term, but concerned about the potential negative impacts, especially the likelihood of major inequalities in power and wealth, and the fact that these data-dependent systems seem to be really good at compounding those inequalities. At the same time, I was interested in the underappreciated value on the data side. For instance, there was growing evidence at the time that Wikipedia was a really critical dependency in most of AI and a lot of computer science disciplines. And when discussing search and recommendation, I felt people were undervaluing the role of explicitly created data like Wikipedia and also trace data provided by users just clicking search links. This untapped value has an upside, though. It's a potential new source of collective power. So a key hypothesis for me is that with better-designed AI systems, maybe people can use the collective power that emerges from their data contributions to work towards mutually beneficial relationships, where the data creators and the data subjects are enjoying the prosperity alongside the people operating these so-called AI technologies.
Right now, the choices that people make about which AI systems they support with data, and which systems they don't support, are made under heavy information asymmetry. Most people don't really have information about the individual value of their contributions, let alone the many possible combinatorial valuations. So what's the value of my data alone versus the value of my data combined with everyone on the Zoom call right now? And when I started this research, the discussion about negative impacts of computing and concerns around big tech was really just getting started.
Even going back to 2018, there was a lot of pushback to talking about negative impacts. But the nice thing is that since then, this discussion has become a lot more mainstream. And, actually, there's been quite a bit of media attention in this area, including around the work that I've been a part of. So if you wanna read more about my work in a journalistic piece, I think this MIT Tech Review piece from Karen Hao is really good and does a really great job describing the work. I would recommend that.
So a prominent concern in computing is that computing is and will continue to be an amplifier of economic inequality via automation, superstar effects, and the concentration of wealth. However, there are also other harms stemming from computing systems and tech company practices. These are things like systems that amplify historical inequalities, changing notions of privacy, threats to democratic processes, and maybe even new long-term environmental threats. And just to foreshadow a little bit how this is all going to connect to the things that I think are interesting to folks on the call: I'm pretty confident that we can frame addressing a lot of these AI harms as governance questions, which means that we can take advantage of the findings, the tools, the positive energy, the public attention, all these things that I think you all's work is helping to garner and build and coalesce around.
So the overarching goals of my research are, one, to measure the value of existing data sources using things like causal inference, audit studies, and dataset documentation. And if you're curious and you're in the academic space, these are the venues where I tend to publish work and hang out, so CHI, ICWSM, CSCW, maybe not interesting to everyone here. Then the second thing is supporting collective action around data value.
And so this means identifying the types of collective action that people who contribute data to data-dependent systems can take. And by the way, my view on that is that basically everyone is contributing. Anyone who is using the Internet or modern computing devices is a data contributor. And then I think there's a lot of value in simulating possible collective action outcomes, so simulating what would happen if people withhold their data from a recommender system, things along those lines, and I'll talk a little bit more about that later. Okay. Cool. So as promised, I will look at the chat right now. Okay. Cool. Nothing yet. But, again, feel free to interrupt.
So a big idea behind all of this is data as labor, and this may be familiar. I think this is something that I have taken a lot from the RadicalxChange community, which I think some folks here are aware of. The idea here is that the public is providing a huge amount of the data fueling these technologies. It's been called data labor. And, basically, this means that there are two distinct connections from the public to tech company profits. There's the classic consumer relationship: you consume products and ads. But then there's also the data laborer relationship, where you perform data labor: you click links, you label things, you contribute to Wikipedia, you post on social media. And this is also a direct causal link to company revenues. This owes much to the books Who Owns the Future? and Radical Markets, and the RxC group is really pushing this space forward. I've been really lucky to get in some mentoring from them. And there are also, for instance, some podcasts on this topic if you are more curious after listening to this talk. So just a quick example of what I mean by data labor. Consider a search engine.
We can actually say that search engines are relying on two distinct classes of data labor. There are the people creating content, like Wikipedia articles or answers on Stack Exchange, that is populating the search results. This is explicit, active data labor. But then there are also just the people clicking things: the behavioral data, the trace data, the passive data, or the implicit data, lots of names. This is things like the decision to dwell on a search result or distinct patterns in browsing. This is naturalistic, in-the-wild behavior, but it's also very, very important for these search engines to be successful and serve user needs.
Another example here. Oops. I am frozen. Oh, there we go. Let's consider a language model, which has dominated a lot of the AI discourse recently. People need to write text to create a training set for a language model. So this could be posts on Reddit, Twitter, books that are published by authors to sell, articles and blogs. But then there are also things like sampling choices, where the data-dependent technology operators need to choose: we're gonna include Reddit, or we're gonna include posts with a certain score, or we're gonna include long comment chains. And there are interesting implications here. In one language model, this was Facebook's, or Meta's, they basically decided to use the length of comment chains, which means that anyone who responded to a comment on Reddit was basically voting for that comment to be fed to an AI system. So if you go on Reddit and you're really angry and you wanna tell someone, "I don't like this," you're performing data labor to help get that text into the training set. And this is the idea that a lot of these data contributions could be viewed through a voting lens, which I'll return to as well.
So the first part of my work, basically the first three years of my PhD, was really focused on measuring the value of data, and I'm gonna go a little faster through this section. I think this will hopefully be exciting, but maybe a little less interesting than the second half. The idea here is that we knew beforehand that online platforms and search engines rely on user-generated content, but can we get some quantitative estimates of just how much? And so this involved observational data analysis and also algorithm auditing, looking at things like Wikipedia, Reddit, Stack Exchange, and search engines.
In an early study in 2018, we looked at how Wikipedia links generate engagement, and thus revenue, for Stack Overflow and Reddit. We did this as an observational causal analysis. The upper-bound scenario is: let's imagine that Wikipedia links don't exist. So we find all the Wikipedia links on the platform and imagine they disappeared. How many page views disappear? How many points, you know, Reddit score, Reddit karma, disappear? And then to get a lower-bound estimate, we use propensity score matching to measure the treatment effect of adding a Wikipedia link to a post on these platforms. And the key point here, and we can go back into the details at the end if folks are interested, is that it does seem like Wikipedia is providing tons of value outside Wikipedia. Numbers-wise, we saw that this was on the order of 0.5 to 0.7% of all the points awarded on Reddit, and more on Stack Overflow. And then, doing some back-of-the-napkin estimates about how much ad revenue both of the platforms make, we estimated it's around $100K per year, order of magnitude, for basically the past ten years, which is not a ton of money. It's not all of the revenue, but it's also not $50.
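The upper-bound scenario described above can be illustrated with a minimal sketch. This is not the study's code: the `Post` type, the substring-based link check, and the toy scores are all assumptions made up for illustration, and the lower-bound estimate via propensity score matching is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical post record: a score (e.g. Reddit points) and its text."""
    score: int
    body: str


def has_wikipedia_link(post: Post) -> bool:
    # Crude assumption: any mention of the domain counts as a Wikipedia link.
    return "wikipedia.org" in post.body.lower()


def upper_bound_share(posts: list[Post]) -> float:
    """Share of all points that would vanish if every post containing a
    Wikipedia link disappeared (the 'links don't exist' scenario)."""
    total = sum(p.score for p in posts)
    if total == 0:
        return 0.0
    linked = sum(p.score for p in posts if has_wikipedia_link(p))
    return linked / total


# Toy data, invented for this sketch.
posts = [
    Post(120, "Background: https://en.wikipedia.org/wiki/NBA"),
    Post(300, "A post with no links at all"),
    Post(80, "Another plain post"),
]
share = upper_bound_share(posts)
print(f"Upper-bound share of points tied to Wikipedia links: {share:.1%}")
```

This counts every point on a linked post as caused by the link, which is why it is only an upper bound; a matching-based estimate would compare linked posts against otherwise similar unlinked ones.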
So a second study in this domain was looking at Wikipedia and search results. The idea here is just that we're gonna collect a bunch of search results and ask how much of these SERPs, the search engine results pages, is being filled by user-generated content. And the spoiler here is that it's lots of UGC, mainly Wikipedia. We also did some follow-up work looking at many search engines, so Google, Bing, DuckDuckGo, looking at mobile devices, and looking at what we called spatial incidence rates, or just where these links are in the page. Here's an example: you have a query for NBA, and there's this fancy widget, but there's also a Wikipedia article at the very top of the page. And this is a pretty common trend that you'll see in lots of queries. So what we can do is do this across a lot of queries, count up where all the Wikipedia articles are showing up, and try to broadly understand how much of Google is just sending people to Wikipedia, same for Bing, same for DuckDuckGo.
Results-wise, we saw that it was really, really large. These bar plots are showing that for a variety of query categories, in some cases something like 67 to 84% of the pages have a Wikipedia article. And then as we look at mobile devices, or at how much it's appearing just above the fold, these go down a little bit. But long story short: lots of Wikipedia articles in lots of search engine results pages. So in a certain sense, are Wikipedia editors already governing AI systems, in the sense that decisions in the Wikipedia community have a really large, quote, unquote, causal effect on AI system behavior? Should the editors also have a say regarding tech company practices if these tech companies are really so reliant?
I think this is an interesting open question, and there's been some movement in this space as well. But this is just to give you a sense of where the connections to governance are happening. Okay. Now we are at another section breaking point, and I will just check the chat really quickly. And if not, I will just delve forward so we can have more discussion at the end.
Okay. So in the second part, I wanna talk about supporting data leverage collective action, and I'll introduce the idea of data leverage, this framework that we've been developing to think about collective action that relates to doing things with data. So in a FAccT paper, this is a big framework paper, we set out to categorize and evaluate the different ways that members of the public can exert leverage through their data-contributing behaviors. And we wanted to pull together an array of concepts that can all be usefully applied in this space of responsible AI. Chronologically, I'll note that we did this work after we'd already started doing the empirical simulation work. But to make everything a bit easier to follow, I'm gonna start by talking about this framework first and then talk a little bit about a specific example of a study where we're deep diving with simulations and real data. So the key goal here is to categorize this, and we're drawing on things like data scaling, learning curves, data poisoning, use and non-use, and data portability as well. To reiterate the key idea here: because data-dependent technologies rely on the public, the public has the ability to make these technologies worse or make a competitor's technologies better. And this is a potential source of leverage over the system's operators, because these are outcomes that the people in power might listen to.
Basically, the public has their hands on a lever that affects technology performance, which affects revenue, and that is in the incentive system of the decision makers in pretty much most companies. So with a long enough lever, you and your broader network might be able to move large firms. That's the key idea here. And so a formal definition now: data leverage is the power to influence a company held by those who implicitly or explicitly contribute data on which that company relies.
What does it actually look like? Well, at a high level, we can think of two buckets of data leverage. Either a group of people bring down a data-dependent system, say a machine learning recommender system, by withholding, deleting, or poisoning their data contributions, or a group of people boost up a target's competitor, perhaps a competitor that is better in some ethical sense or advances the values of a group more than the incumbent. And they could do that by redirecting data or generating new data for that competitor.
I think the data leverage idea handles some thorny questions in this space. Regardless of whether someone thinks a particular data-generating activity should be called labor or not, maybe you take issue with me describing you clicking search links as labor, and you think your personal definition of labor does not fit that. And regardless of arguments about which people deserve credit for a piece of data, so if you think that you deserve more credit for your DNA data than your parents, or vice versa, I guess you could take a stance either way. Regardless of all of that, we can always just ask who has the agency to impact the outcomes of downstream technologies.
You know, what data contributions do you have the ability to withhold, to delete, to modify, to transfer, to create anew? And this gives us a very consequence-driven way of bypassing these questions about: is it labor? Is it not labor? Who deserves credit? It's: who can actually hurt or help the technologies?
So we've set the stage a bit. In our FAccT paper, we talk about three clear categories and call them data levers. And we argue that these are all tools in the public's tool belt. Another key point is that these data levers are tools that can be strengthened. New laws, research artifacts, and changes to how technology is designed can boost the power of each particular data lever, and different policies will do so in different ways. I'll talk about that a little bit. So we can think about data leverage as either lowering a technology's performance or boosting up a competitor. And a point that I'll return to later is that data leverage can really be seen as a distributed way of aligning AI systems with the public's values. If people have meaningful opportunities to engage in data leverage behaviors, we'd expect over time a shift towards more technologies that people agree with at a large scale.
So within these two categories, over in the lower-performance bucket, we have two levers: data strikes and data poisoning. And then in the boost-performance bucket, we just have one right now. We called it conscious data contribution. You could also think of this as data donation or data giving. We picked this acronym before the CDC was in a lot of the headlines, and so we realized it's a bad acronym now, or an overloaded acronym at the very least. But the terminology is not so important. So a little bit about these three. Data strikes involve a group of people withholding or deleting their data to lower a technology's performance.
So you could imagine people using fake accounts to watch YouTube videos so that the view data can't be tied to their real account. You might also think of tracking blockers, deletion requests, and privacy-enhancing technologies. There are some important questions about whether companies can just delete training data but keep model weights, something like data laundering. But it actually seems like the FTC, at least in the US, is not going to really let that stand too much. This will be an open research question going forward, though, for sure.
Data poisoning is the second lever, and the idea here is to provide data that is deceptive or trains a technology to perform badly. This is an idea that's been studied in machine learning, in obfuscation, and in protest for quite some time. Examples here could range from clicking like on a video that you actually hate, that's the easy side, all the way to making pixel-level manipulations based on assumptions about how a particular model is training or optimizing or what the underlying data looks like. So there's a big range of data poisoning behaviors that we could consider here that all allow people to do collective action.
And then finally, conscious data contribution is giving old or new data to a competitor. We think of it as the data version of conscious consumerism, voting with your data. Of course, I do know that conscious consumerism is a fraught topic as well, and some people think that it, in some cases, does more harm than good. But in the CDC case, we're pretty hopeful that this is gonna do a lot of good, because data is very cheap to replicate and share. So you could do things like download your purchase history to help a startup make recommendations, or report content and help do moderation on a social media platform that you wanna support.
And so in terms of the barrier to entry, in our FAccT paper, we spent some time thinking about how easy all of these things are. In short, we think that poisoning has the greatest barrier to entry. You need to know about who you're targeting. You probably need to have some tech skills. You probably need some sustained effort. Strikes have a more moderate barrier to entry. It can be challenging to stop using tech. You may depend on data deletion regulation or on someone building you a good tracking blocker tool, but it's not too hard. And then data contribution has a low barrier to entry. You can keep using tech A while giving data to tech B. You may need some regulatory support, but it's generally the easiest.
In terms of the legal and ethical considerations of data leverage, we think that poisoning has some serious challenges to solve. It's morally fraught. It's potentially illegal in some cases. It's still a good option in other cases, but it requires the most consideration for sure. Data contribution has a moderate amount of concern. On the one hand, if you're mad at a tech company because you think their technologies are harmful, it's probably not a good idea to go boost up that same technology for another company. And then there are also major privacy concerns. It's probably not responsible for academics to be widely saying, everyone just do collective action where you give all your data out. That could definitely lead to some major privacy concerns down the line. So some nuance is needed here. Data strikes, we actually think, are pretty easy to support. They're aligned with privacy initiatives broadly. Privacy laws will help people do data strikes, and the privacy argument can justify data strikes as well. Even if you don't wanna change a particular behavior of a company, you might wanna do a data strike purely on privacy grounds.
And so to connect these with the lever imagery: if we have tech company practices, or negative impacts, over on the right here that we're trying to move, we can think of data strikes as this short lever with a big mass, because it's easier to get that big mass, but the maximum impact is not as big. Data contribution is in the middle. And then data poisoning is this tippy-toe on the end, because eventually tech companies will hire data scientists, and they will beat the data poisoning, and then it will fall off and reduce back to a data strike again. So just a little bit of fun imagery there.
So let me look at the chat again. Oh: would greater interoperability help or hurt data leverage? I think, broadly, it will really help on the data contribution side for sure. And I'll talk a little bit, after this next paper that I'm gonna go through really briefly, about why data contribution might be especially exciting from a theory-of-collective-action perspective. But, basically, I think that if it lowers the barrier to entry, the cost in time and money for me to go from platform A to platform B without having to download 10 browser extensions or be a super machine learning geek, that would broadly help make data leverage happen more often. More people will do more data leverage behaviors if that's possible.
So, okay. I've walked through the framework. I'm gonna zoom in a bit and talk about our earlier work simulating data strikes. This is the first work that started to really seriously think about collective action around data. At the time, many people thought the concept was very implausible, but now this kind of collective action seems a lot more plausible. In this paper, we set out to simulate data strikes and data boycotts, where a boycott is basically just that you both delete your data and stop using the system entirely.
In a data strike, you don't have to stop using the system entirely. You just use privacy-enhancing technologies to create fewer or no observations that are usable for the company. And we use MovieLens data here, which is a really popular recommender systems dataset that's actually been just wildly influential on the research area that studies these recommender systems. So the key overlap with governance, again, just to bring that back, is that each data contribution can be viewed as a vote to make the model behave a certain way. Model behavior impacts company profits. So data contributors, the public, are mostly unwittingly voting on company success. Each time you click a YouTube link or a search link, or write a review of a restaurant on Yelp, you're voting on how a downstream model is going to serve other users, which will impact the company's success.
In our experiments, we were aiming for a standard, very cookie-cutter recommender system approach. So we use an implementation of the singular value decomposition recommender system approach. We can talk about that more at the end if that's interesting. And yeah, basically just trying to be very standard. Nothing crazy here. The task that we're thinking about is where we have a bunch of data like: Anne likes movie one, Bob likes movie four, Chen likes movie seven. So we know Anne gave movie one a four-star rating, and we wanna know what Anne thinks about movie two. This is the core problem. This is the machine learning system that we're looking at here. And there's an interesting problem where we have Anne and Bob. If we imagine that Anne is gonna boycott the system, she's gonna get no recommendations. We can't compute any performance metrics for Anne, because Anne disappeared entirely, and Anne buys no products and watches no ads. So that's Anne in this example.
But Bob does a data strike, keeps using the system, and gets these bad recommendations because Bob has no data anymore. All of Bob's performance metrics go down, but Bob is still buying products and watching ads. So there's a big problem, which is that Bob the striker looks like a more effective protester than Anne the boycotter in terms of these system-level performance metrics. So when we do these simulations, we need a metric, some kind of proxy, that accounts for the fact that a boycott means no direct revenue at all. What we did here is we used surfaced hits, which, if you're into recommender systems, is a variant of precision at k, but each user's k is based on the total number of items they rated. And we looked at a lot of other stuff too. If you're not into recommender systems, the point here is just that when you try to simulate these collective action problems, some interesting little quirks arise where you can't just use the off-the-shelf machine learning evaluation techniques, because normal machine learning simulations don't even allow the possibility that users are gonna do these complicated protest behaviors. So we have to adapt to them, which is cool and interesting, and also leads to some funny outcomes.
So in this study, just to go quickly again here, because we can come back to the technical details and I wanna have lots of time to talk: we saw that, in terms of this revenue proxy, obviously the best thing possible is to just leave the platform entirely. In fact, that's the green line here. Leaving the platform entirely is gonna do a lot more damage than if you keep using the platform but just withhold your data, because when you're using the platform, you're still consuming ads or buying products. That being said, we did see that strikes can be quite powerful here. So do I have the zoomed in one?
Yes. I do. This is showing basically the drop in performance. At the top is the best-case performance, using the full dataset and the most up-to-date algorithm. And the bottom red dotted line is a baseline where you just recommend the popular stuff. So this is, you know, just tell everyone to watch Star Wars or Indiana Jones. I guess those are dated now, but back in the day it would have been something like that. And looking at this curve, about 30% of users doing a data strike brings us halfway towards that unpersonalized, unintelligent baseline. We can also think of that as setting us back twenty years in progress, because here is the state of the art from 2001 shown on this plot. A data strike can really have a lot of effect if you get a lot of people.
Okay. Building off this work, we also did some simulations of data leverage more broadly, where we look at many more systems than just recommender systems. We also consider data strikes versus conscious data contributions. So do you want to delete data from one company or give data to another? And I'm just going to walk through one plot here that will hopefully summarize this very quickly. So, illustrating data leverage: how conscious data contribution and data strikes allow the public to impact machine learning. We have a learning curve here. This is a computer vision system using the popular CIFAR-10 dataset. On the x-axis, we have how much data is available, which in our scenario we're imagining is a function of collective action, not other things. And on the y-axis, we have the test accuracy. So if we travel the x-axis from full data to 0.8, or 80% of the data, we can think of that as a data strike with 20% of users.
And in this case, this has very minimal effect on the large company's image classifier, because we're already in this flat region of performance, the diminishing-returns region, where it doesn't make a big difference. So just getting 20% of users has almost no effect at all. On the other hand, if we imagine that 20% of users engage in data contribution, so we're traveling from zero to 0.2 on the x-axis now, this brings the image classifier of an imaginary small company from around 10% to 80% accuracy, because we're in this vertical region of the curve. So conscious data contribution is hugely reducing the performance gap between the two companies.
So here, I'm just gonna look at the clock. I'm gonna skip through these for now, and we can come back later. Basically, we did a bunch of simulations, used a bunch of papers, and we can come back to this if folks are interested. I'm very excited about it, but I do want to make sure that we have lots of time to talk. So I'm going to skip to just a key point from this. Okay. Sorry. So there is one key point, which is that, based on that slide I was just walking through with the performance curve, we see this idea that data strikes start out weak and accelerate. If I just get 20% of people, no effect. But if I get 50% of people, okay, there's a big effect now. On the other hand, data contribution starts out strong and decelerates, because data contribution matches the learning curve exactly. 20% of the people can get you 80% of the way to the best-case scenario. But as you keep accruing more users into your campaign, the added impact drops off. And this has some really interesting connections with collective action theory from Oliver and Marwell and Teixeira and some other folks as well. There's a big long line of scholars who've worked on this.
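The "strikes start weak and accelerate, contributions start strong and decelerate" point can be sketched with a toy learning curve. Everything here is an assumption for illustration: the saturating-exponential shape, the `floor`, `ceiling`, and `rate` values, and the function names are made up and are not the CIFAR-10 results shown in the talk.

```python
import math


def accuracy(data_fraction: float, floor: float = 0.10,
             ceiling: float = 0.95, rate: float = 12.0) -> float:
    """Toy learning curve: test accuracy as a saturating function of the
    fraction of data available (steep near 0, flat near 1)."""
    return floor + (ceiling - floor) * (1.0 - math.exp(-rate * data_fraction))


def strike_loss(participation: float) -> float:
    """Accuracy the incumbent loses when this fraction of users withholds
    data, i.e. moving left from full data on the curve."""
    return accuracy(1.0) - accuracy(1.0 - participation)


def contribution_gain(participation: float) -> float:
    """Accuracy a data-poor competitor gains when this fraction of users
    contributes data, i.e. moving right from zero data on the curve."""
    return accuracy(participation) - accuracy(0.0)


for p in (0.2, 0.5, 0.8, 0.95):
    print(f"participation={p:.2f}  "
          f"strike loss={strike_loss(p):.4f}  "
          f"contribution gain={contribution_gain(p):.4f}")
```

Under these assumed parameters, a 20% strike barely moves the incumbent while a 20% contribution closes most of the gap for the competitor, which is the same qualitative shape as the plot described above.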
And they basically talk about how, if you're trying to support collective action, figuring out whether the production function, the kind of returns to scale that you're gonna get, is accelerating or decelerating should really impact what kind of interventions you use to support that collective action. And so these learning curves look exactly like the production function curves that these folks study in this old mathematical sociology work. So I'm really excited about exploring these connections in particular. I think this is a really rich space. So just to walk through some next steps really quickly. Here's again the overarching goals of my research, and here's three things that I'm thinking about. So one is data valuation and data dividends. Another is using leverage for AI values alignment. And then the third is data leverage assistants. So a data leverage assistant, you can imagine as a website that allows you to go search for ongoing movements, so drawing on social and search technologies and studies of social movements. Perhaps you download a browser extension, and then it gives you a data strike option, a data poisoning option, a data contribution option. And really importantly, I think it should tell you about the likely impact of your actions. And this is where there's also lots of overlap with governance tools, where I think of the governance tools that tell you how, if you vote a certain way, you might be able to achieve this kind of action, or if you, you know, amass a critical mass of other people to join your coalition, you can achieve an outcome. So one thing that we can imagine here is tools that visualize or communicate the impacts of data leverage. So this is a screenshot of a prototype using Observable notebooks that lets users play around with a very simplified view of data leverage.
So these are curves based on results from scaling papers, and you can drag the sliders around or click play and see how different size groups of people could have different impacts on a machine learning technology. Another ripe area for HCI work is telling people about how much impact they have. And, again, yeah, I think that there's really rich opportunities to connect this with sociological knowledge about what makes collective action work. Data valuation and data dividends. This is an idea where there's been a bunch of work recently in machine learning that is trying to estimate the value of data, of individual observations, to AI systems. So predicting, you know, what will be the impact of removing a particular training data observation. And something we could do with that is use it to pay people. Just pay people for data. Maybe this could solve the economic inequality concern. But this means we have to figure out how much value each person contributed, and there's lots of confusing implications here. If you're curious about this topic, we actually have a report that we published with Berggruen that is trying to propose a really pragmatic way of doing this. There's tons of questions, you know: who's gonna fund it? What are the tasks that we're considering? What's the time period of data contributions considered? How is it disbursed? And how's the valuation done? Just a really quick example. So, for instance, just imagine we have just three people and they have data values of negative one, zero, and one. And we want to transform these into positive monetary amounts. We're assuming that there's no data debt in the first version of a data dividend. That would probably be a non-starter. Do we just shift these values over? Do we take the absolute value? Do we clip at zero? Do we do some kind of binning procedure?
Well, it turns out that just these seemingly innocuous choices can lead to either very low inequality or very high inequality. So this might seem very arbitrary. There's a lot of open questions around data valuation and dividends. And I guess my summary point, my takeaway here, is that I think the way to go is to do collective data valuation, where we think about the value of groups of people, and, ideally, these would correspond to actual coalitions of people, so something like data unions or data intermediaries, as they've been called. The point here is there's no one true data value definition. The designers have to make a choice, and to the extent that we can give some power to the data dividend recipients, all of this assuming a data dividend ever happens, that would be ideal. And finally, one last point, and then I will stop talking, and I'm really excited to hear what you all think, is that I think data leverage is a really promising path for AI values alignment. So if people have more agency to withhold and redirect their data contributions, they can, over time, push AI systems in line with their values. Perhaps they'll push for data dividends. Perhaps they'll push for more privacy-friendly models. That's something that I spent a lot of time working on with folks at Snap, which is very exciting. And then some alignment may happen naturally as new data laws and practices come into play, but maybe we can accelerate this process by providing goal posts as standards for how much data agency organizations should be giving their data contributors. And so these goal posts or best practices can help to share decision-making power with data contributors, which ultimately, in my opinion, makes it a community governance question. And so just to end with a summarizing point, data leverage provides a new way of thinking about negative impacts of computing systems and working towards positive-sum futures.
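
As a rough illustration of the three-person data dividend example in the talk (data values of negative one, zero, and one), the sketch below applies three candidate transforms and compares the resulting inequality with a Gini coefficient. The choice of the Gini coefficient, the shift offset of 1, and all function names are assumptions made for illustration; this is not the procedure from the Berggruen report, and the binning option is omitted.

```python
def gini(amounts):
    """Gini coefficient of non-negative payouts (0.0 means perfectly equal)."""
    total = sum(amounts)
    n = len(amounts)
    if total == 0:
        return 0.0
    ordered = sorted(amounts)
    # Standard rank-weighted formula on ascending values, ranks starting at 1.
    weighted = sum((rank + 1) * x for rank, x in enumerate(ordered))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Data values for the three hypothetical people from the talk.
values = [-1.0, 0.0, 1.0]

# Candidate ways to turn raw data values into non-negative payouts.
transforms = {
    # Assumption: shift so the lowest value becomes 1, keeping everyone positive.
    "shift": lambda vs: [v - min(vs) + 1.0 for v in vs],
    "absolute": lambda vs: [abs(v) for v in vs],
    "clip": lambda vs: [max(v, 0.0) for v in vs],
}

payouts = {name: transform(values) for name, transform in transforms.items()}
inequality = {name: gini(amounts) for name, amounts in payouts.items()}

for name in transforms:
    print(f"{name:>8}: payouts={payouts[name]} gini={inequality[name]:.3f}")
```

Even with three people, the same data values produce noticeably different inequality depending on the transform, which is the point made in the talk about seemingly innocuous design choices.
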
It is, of course, all a collective effort. Here's a bunch of groups, you know, here's institutions that I've been really lucky to work with and have support from, and then a classic academic headshot slide with lots of pictures of people who have also, you know, been really helpful mentors and collaborators. And with that, I will put my information up in case you'd like to contact me, and I'm super excited. I hope we have, yeah, some time to chat at the end here. Thanks so much.
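
The learning-curve argument from the talk (data strikes start out weak and accelerate, while conscious data contribution starts out strong and decelerates) can be sketched with a saturating curve. The exponential form and the `floor`, `ceiling`, and `rate` parameters below are assumptions chosen so that 20% of the data gives roughly 80% accuracy, as in the talk's example; they are not fitted to the CIFAR-10 results shown on the slide.

```python
import math


def accuracy(fraction, floor=0.10, ceiling=0.95, rate=9.0):
    """Hypothetical learning curve: test accuracy given the fraction of data available."""
    return floor + (ceiling - floor) * (1.0 - math.exp(-rate * fraction))


def strike_loss(participation):
    """Accuracy a large company loses when this share of users withholds data."""
    return accuracy(1.0) - accuracy(1.0 - participation)


def contribution_gain(participation):
    """Accuracy a small company gains when this share of users donates data."""
    return accuracy(participation) - accuracy(0.0)


for p in (0.2, 0.5, 0.8, 0.95):
    print(
        f"{p:.0%} participation: "
        f"strike costs {strike_loss(p):.3f}, "
        f"contribution adds {contribution_gain(p):.3f}"
    )
```

Under these assumed parameters, the strike's effect stays small until participation is very high, while most of the contribution's benefit arrives with the first 20% of participants. The exact thresholds depend entirely on the shape of the real learning curve.
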

Speaker 1 0:30 – 0:30

Great work. Thank you so much. We'll have... oh, you know, we can do... no, we'll do a plug at the end. We can dive right into questions. I'd love to start with Bobby.

Speaker 3 0:45 – 0:45

Hi. So good to see you, Nick, and hear about your work in this context too. And, yeah, I was really curious about your vision around, like, enforcement. And I'm sure it's, of course, really challenging in many of these contexts, but I've been reading a lot about the FTC's recent proposal, and they're looking for help and public input on what they're calling an Advance Notice of Proposed Rulemaking. And they're seeking public comment on that. And it actually goes into really a breadth of examples, and depth in how they're defining harm and what they're calling commercial surveillance. And I think it has a lot of overlap with what you're talking about in many ways. So, yeah, I'm just wondering whether you imagine, like, new kinds of social and computational and legal rules that could come out of your work that the FTC could enforce? And I guess, yeah, how are you broadly thinking about enforcement?

Speaker 2 1:30 – 1:30

Yeah. That's super exciting. Thank you for sharing that. I'll definitely check that out, and maybe I can write a comment. I think that there's huge overlap. My ideal vision is that the data leverage research agenda is best served by pushing equally on the tool-making front and the policy front, kind of simultaneously. I think that policy has the potential to really, really amplify the power here. We've seen some examples of, I think, the FTC specifically enforcing model weight deletion. And I think that if that became widespread, that would really change the power calculus of users and tech companies overnight. If a lot of, you know, the larger tech companies thought that they were gonna have to delete the model weights for their search engine or something like that, I think that would really put a lot of power into people's hands. I think that there's gonna be an uphill battle here. In part because, of course, I did talk about, you know, negative impacts and harms in computing, but my understanding is that lots of people are basically broadly happy with the status quo. And so I think there's a lot of, you know, awareness building and coalition building and fostering support that would still be needed for these things to happen at really large scales. To see, you know, 50% of the population engaging in a data strike against a particular company, that's still on the horizon and hard to organize. But yeah, I think that these can totally support each other. I think that the simulation work is actually pretty valuable because it could tell a regulatory body the potential impact of some of these things, and I think that could be quite motivating. So, yeah, I'm super excited about that.

Speaker 1 1:45 – 1:45

Any follow-up, Bobby?

Speaker 3 2:00 – 2:00

I guess I'm wondering. I think there's a lot of attention right now on large language models. I wonder if you could use that in a way to, kind of, prove certain impacts might happen.

Speaker 2 2:15 – 2:15

Yeah. I really hope so. I've been looking into that a lot. There is a pretty interesting conundrum here, though, which is that basically the bigger a dataset gets, the harder it is to do any of this empirical work at an academic level. Right? So here's the problem. I can take the MovieLens dataset and run a thousand simulations where I try out all the different configurations of users who might be contributing data or deleting data or poisoning data. But I can never do a simulation where I try to see what GPT-3 would do if the social media posts of everyone on this call were removed from the training set. And I don't think that they'll do it either. Right? So, yeah, there's this tension that I'm still grappling with, which is that as these models get bigger and bigger, it actually becomes harder to make really well-calibrated, you know, quantitative statements that I'm confident about, which is tricky. But on the other hand, there's so much attention that we have to channel that somehow, I think.
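
The kind of repeated simulation described here, where many random configurations of striking users are tried on a small dataset, can be sketched as follows. This is a toy under stated assumptions, not the MovieLens experiments from the talk: the data are synthetic one-dimensional points, the "model" is a midpoint threshold between class means, and every name (`make_users`, `train_threshold`, `evaluate`, `simulate_strike`) is hypothetical.

```python
import random


def make_users(n_users, points_per_user, rng):
    """Each hypothetical user contributes a few labeled 1-D points from two overlapping classes."""
    users = []
    for _ in range(n_users):
        points = []
        for _ in range(points_per_user):
            label = rng.randint(0, 1)
            points.append((rng.gauss(3.0 * label, 1.0), label))
        users.append(points)
    return users


def train_threshold(points):
    """Fit a midpoint-of-class-means threshold; return None if either class is missing."""
    by_label = {0: [], 1: []}
    for x, label in points:
        by_label[label].append(x)
    if not by_label[0] or not by_label[1]:
        return None
    mean0 = sum(by_label[0]) / len(by_label[0])
    mean1 = sum(by_label[1]) / len(by_label[1])
    return (mean0 + mean1) / 2.0


def evaluate(threshold, test_points):
    """Share of test points classified correctly; 0.5 stands in for having no usable model."""
    if threshold is None:
        return 0.5
    correct = sum((x > threshold) == (label == 1) for x, label in test_points)
    return correct / len(test_points)


def simulate_strike(strike_fraction, n_trials=200, seed=0):
    """Average test accuracy over random groups of striking users."""
    rng = random.Random(seed)
    users = make_users(100, 2, rng)
    test_points = [p for user in make_users(200, 5, rng) for p in user]
    n_strikers = round(strike_fraction * len(users))
    total = 0.0
    for _ in range(n_trials):
        strikers = set(rng.sample(range(len(users)), n_strikers))
        remaining = [p for i, user in enumerate(users) if i not in strikers for p in user]
        total += evaluate(train_threshold(remaining), test_points)
    return total / n_trials


for fraction in (0.0, 0.3, 0.6, 0.9, 1.0):
    print(f"strike fraction {fraction:.1f}: mean accuracy {simulate_strike(fraction):.3f}")
```

Each trial retrains the model from scratch, which is cheap here and is exactly the step that becomes impractical for an outside researcher once the model is the size of GPT-3.
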

Speaker 3 2:30 – 2:30

Yeah. Thanks.

Speaker 1 2:45 – 2:45

Next, Amy. And then, Tucker.

Speaker 3 3:00 – 3:00

I thought you had a question before me, Seth, or I'm happy to go. But

Speaker 1 3:15 – 3:15

Yeah. Go ahead.

Speaker 3 3:30 – 3:30

Okay. Yeah. You kinda talked a bit about it already, but I had a question about data strikes at the community level. So, like, as a subreddit, we decide to go private, which was an example in the case of subreddits protesting against Reddit the platform, or maybe you as a subreddit could decide to auto-delete your old content. Could that have benefits over an individual approach?

Speaker 2 3:45 – 3:45

100%. Yeah. I think that it's pretty much always better when there already exist these governance or hierarchical structures where, for instance, a group of moderators or leaders or core contributors to a subreddit can facilitate or kick-start it. That will always be more effective, just because data has this value that grows with scale, this combinatorial value problem, where individuals pretty much never have any marginal effect at all at the scale of most tech company models. So, yeah, I think that putting effort towards community-level engagement actually has a much better ROI than trying to support the individual level. The building blocks at the individual level are not as strong, but still very important. And, ultimately, I think that the building blocks do come from the same place. So, for instance, imagine not a subreddit protesting against Reddit itself, but maybe a subreddit protesting against the inclusion of its discussions in a model. I think a pretty realistic use case here would be a subreddit being upset about researchers doing, like, person-level classification tasks for mental health disorders or other health conditions, where they're using people's discussions in a public subreddit, and the community wants to get those discussions out of the training set. Doing it at the community level is great and better in general.

Speaker 3 4:00 – 4:00

Awesome. Thank you.

Speaker 1 4:15 – 4:15

Tucker?

Speaker 2 4:30 – 4:30

Yeah. Thanks. Nick, I was just curious to hear a little bit more about contexts where data poisoning could be illegal, just because you mentioned that at one point.
Yes. So I think here's the counterexample against data poisoning. Say I'm really mad at Amazon, you know, acquiring, what is it? One Health. And so I want to go to them and lie to my physician about my health history, because I think that Amazon is gonna train some, you know, health condition classifiers in the future. I think that if my physician found out about this somehow, that would maybe be illegal. And, again, I should have put a big, you know, "I'm not a lawyer" sticker on here. Other examples would include trying to obfuscate with a lot of government agencies. I think, you know, there's a big form that you sign that says I'm not gonna lie to you. Even the census, for instance, which is something that you could think about doing data poisoning against. So basically the thing is that a lot of the intake forms for data involve, you know, entering into some sort of contract with the party that you're helping to record an observation. The flip side of this, where I think it will never be illegal or can never be conclusively proved, is a lot of the passive trace data, the implicit data, like search engines and recommender systems. So here's my favorite example of data poisoning that I think we can always get away with till the end of time, which is you go on YouTube and you listen to your favorite song, and then you, you know, mechanically turn off your headphones, or you just leave the room, and you put on a playlist of your least favorite song ever.
And the goal here is to try to get someone else out there who loves your genre of music and also hates the same genre of music as you to get recommended that hated genre of music. And YouTube can never figure this out unless they send an agent to your house to knock on your door and say, well, explain to me what happened on, you know, the day of 01/01/2022, when you listened to this genre A and then genre B one after the other. In these cases, you can never prove it. And so I think you can always get away with data poisoning in that context, but there's also a lot of contexts where I think you can't get away with it.

Speaker 1 4:45 – 4:45

So Zee's got a question that I kinda resonate with. It seems like a lot of these measures are a response to the status quo and in that way a little on the reactive side. So on the proactive side, in terms of not assuming the status quo, but assuming, like, what we would actually want, Zee's got a question, just like what corporate model, you know, would be the right way to do this if we're starting from scratch. Zee, do you wanna actually ask your question for yourself?

Speaker 4 5:00 – 5:00

Yeah. Sure. I mean, basically, the question is, how would you imagine a good-actor corporation sort of leaning into these dynamics and having a healthier relationship with their data dependencies, sort of viewing the data providers, both individually and as a collective, as partners of the organization rather than simply resources to be mined?

Speaker 2 5:15 – 5:15

Yeah. So I'm increasingly thinking that there are just two criteria. Here's my proposal, and I'm very curious to get pushback or critique on this. I think the first is that an organization can reasonably inform all of their data dependencies, their data laborers, about how the data will be used. And I say reasonably here because, of course, you can never conclusively see what the future will hold for data. You know, what is the observation going to look like, at least at an abstract level, and what are all the downstream things that are going to be fed by this fuel, so to speak? And then the second is to give a reasonable ability to opt out. Basically, there are actual market dynamics going on here if I really know what's going on, and I can pull my data out, or I can withdraw it, without, you know, excessive time and energy cost, or being super into privacy-enhancing technologies and downloading 20 browser extensions and wearing a mask when I go out in public that is specifically designed to fool computer vision or whatever. If you can meet these two criteria, I think that you're actually well on your way. And I would say that the status quo is really, really far off on both ends. And so it sounds simple. Yeah. I don't know. I'm curious. Do you think that there are other criteria that might be at play here too? I think just these two alone, basically the use awareness and the reasonable ability to withdraw or redirect, would actually get us, like, 90% of the way there.

Speaker 4 5:30 – 5:30

I agree, but I would put two refinements on each of them. So on the one related to being able to opt out, I actually think you've highlighted very well the fact that this needs to be done at the group level because of the sort of marginal value problem. So you have to facilitate collective action on that front. Otherwise, you're not really helping. And then in terms of the... sorry. Shoot. I lost the thread on the other one. Can you repeat the other one?

Speaker 2 5:45 – 5:45

Oh, yeah. Basically, the users or the data contributors should have some use awareness, so that they actually know what the data is being used for, so that they can make an informed decision about, do I want to be supporting this thing or not supporting this thing?

Speaker 4 6:00 – 6:00

Yeah. And so the refinement on that one was that there are challenges around maintaining boundaries on use because they tend to be, like, multistage. So if it's just like, oh, I'm gonna sell it into some process, then you don't actually know the end use. So maintaining or providing some transparency of the downstream provenance is something that I'm pretty sure we need to be better about before you can meaningfully enforce that one.

Speaker 2 6:15 – 6:15

Yeah. That makes sense. Thanks. I just sorry. I'm using the chat as my own here.

Speaker 1 6:30 – 6:30

Maybe Toby and then Ashka?

Speaker 5 6:45 – 6:45

Yeah. I guess pretty, like, concretely, I'm just wondering if you have kind of an inspiring, like, cause or use case here. You know what I mean? There's a lot of public energy about privacy, and so I think it's pretty easy to hypothetically imagine how you could organize a movement around that. But I'm wondering if you have any other sort of near-term causes in your head, or visions of, you know, what people might actually be interested in?

Speaker 2 7:00 – 7:00

Yeah. So okay. So there's a couple. Going back to the original motivation, I kind of group this in my head into two categories. And so one is just this broad idea that data-dependent technologies will serve as concentrators of power or concentrators of wealth if left unchecked. And the answer there is this data dividends idea, or not necessarily data dividends, but just pushing for alternate economic models where more credit is given to data contributors, or actual paychecks are sent out, or, and this is one that we're pretty excited about, and this is basically in that Berggruen report where we lay out a very concrete vision of this, some portion of data-dependent funds should be earmarked for public good spending, basically. So, you know, if a tech company figures out this really cool new optimization algorithm that makes search engines five times better and starts making five times more ad revenue, then maybe 50% or 25% of that should be used to, you know, build roads or fund education or fund rural broadband or things like these. So that's one. And then the second one is about solving specific AI harms. So this is, like, if a model is being deployed that is amplifying historical biases, in the insurance space or in the medical space, a goal could just be getting that model shut down, which, you know, could vary a lot. But I guess, yeah, in terms of really concrete examples, I think that there's been a lot of energy around facial recognition and especially, like, police use of facial recognition in the US context. This is something where you could imagine trying to withhold data specifically from a model because you don't want to directly support that model, but also just withholding data from a firm because they're supporting a cause.
So you could imagine even something along the lines of people withholding code from Microsoft and their attempts to, you know, do Copilot and sell code generation tools because they're unhappy with Microsoft's facial recognition division. And maybe this is both a good and bad example, because I think that the big tech companies have already started to pull back on the facial recognition side a little bit because of the public pushback. But, yeah, those are a couple examples. Is that what you had in mind, or were there other things that you had in mind?

Speaker 5 7:15 – 7:15

Yeah. Totally. So basically the idea is, once you have, like, some sort of concrete political proposal, like, you can think of a lot of ideas for redistributing funds, since these things are super profitable.

Speaker 3 7:30 – 7:30

Exactly. Then it's

Speaker 5 7:45 – 7:45

just like a tool in the toolbox.

Speaker 2 8:00 – 8:00

And that's the big one, the one I've thought about the most, but I do think that this kind of secondary version, where you use data leverage to shut off a particular practice, like stop running this model that is, you know, performing really badly, could be a really good use case as well.

Speaker 1 8:15 – 8:15

And I think we got time for one more question from Ashka.

Speaker 6 8:30 – 8:30

Hi, Nick. I really liked your presentation. And I just want to ask, and this sort of relates to what you were talking about just now. How do you think users can better convey their motivation for any kind of action that they pick? Because even if you're voting or you're withdrawing your data or you stop using a platform, the big mass organization doesn't really know why you're doing it. So they can't really change their actions that way. So how do you think that could be fixed?

Speaker 2 8:45 – 8:45

Yeah. Super good question. And I think this is gonna circle back to the answer to a lot of these questions, which is that we need to have institutional support for data sharing happening at a collective level. And so here's the ideal way that I would see this working out: most people, broadly, do their data sharing through a data coalition, and they have, you know, a menu of different data coalitions to choose from that have different visions. Like, maybe there's a data coalition that's specifically all about paying you money. Maybe there's a data coalition that's specifically all about, you know, helping to solve issues with rare diseases. Maybe there's one that's about environmentalism. And, basically, if the firms were getting their big pools of data through the coalitions, I can use the coalition as my mouthpiece. First of all, just the kind of coalitions that I join already says something about my motivations. So if data coalitions were to exist, the size of different data coalitions would reflect the public's broad interest. Right? It's kind of like a continuous poll that's just being run all the time. But then, secondarily, the data coalition can actually negotiate directly with the data user or the data consumer, the tech company. And this is the best way to do it. If we don't have data coalitions and we have to do it at the individual level, I think it's still possible, but basically any kind of solution would be recreating a lot of the conditions of data coalitions or data unions, whereby people would need to, you know, go on a social media platform or some kind of online platform and share their concerns, and other people would need to share the same concern, and they would need to coalesce around a concern.
And you'd be building an unstructured sort of data union, which could work as well. But I think that having the formal support is, like, a key goal that we should really be working towards to solve a lot of the problems that I talked about.

Speaker 1 9:00 – 9:00

So I'm sorry. Did you have a follow-up question?

Speaker 6 9:15 – 9:15

No. No. I was just saying that the idea about the unions especially and, like, decentralized unions is especially interesting.

Speaker 2 9:30 – 9:30

Oh, yeah. 100%. Yeah. I think that could be a really great way to go. Right. It doesn't have to be exactly like the existing kinds of unions or coalitions that we already have in other domains.

Speaker 1 9:45 – 9:45

So this is a wonderful discussion. Thanks, Nick, so much. If you wanna dawdle for a sec, we can chat for a little moment. But in the meantime, what I'm gonna do is ask everyone to unmute on three so that we can all give him an actual, physically audible applause. One, two, three. Alright. Thanks, everybody, and see you next week. When we have... who do we have?
