Tech Talk: Automated Content Analysis — Talking Tech W/ Jasmine McNealy & Dhanaraj Thakur

CDT Tech Talks | 2021-09-30 | 28:40

We have another exciting show for you this week! Earlier this year, CDT released a new report, Do You See What I See? The Capabilities and Limitations of Automated Multimedia Content Analysis. This report explores a variety of machine learning techniques for analyzing images, video, and audio media, and explains what automated tools can, and cannot, tell us about digital content. Here to help us understand more about the capabilities and limitations of automated content analysis are Jasmine McNealy, CDT Non-residential Fellow and associate professor in the Department of Telecommunication, College of Journalism and Communications at the University of Florida, and Dhanaraj Thakur, Research Director for CDT. https://cdt.org/insights/do-you-see-what-i-see-capabilities-and-limits-of-automated-multimedia-content-analysis/ https://www.jou.ufl.edu/staff/jasmine-mcnealy/ https://cdt.org/staff/dhanaraj-thakur/ More on our host, Jamal: https://bit.ly/cdtjamal Attribution: sounds used from Psykophobia, Taira Komori, BenKoning, Zabuhailo, bloomypetal, guitarguy1985, bmusic92, and offthesky of freesound.org.

Top Keywords

tools 0.017
jasmine 0.015
kinds 0.013
automated 0.012
data 0.010
automated tools 0.009
prediction 0.009
machine 0.009
automated content 0.008
explainability 0.008
content 0.007
risks 0.007

Transcript

Speaker 0 0:00 – 0:26

Hi. I'm Riddhi Shetty. I work on the privacy and data project here at CDT. Recently, we've been advocating for stronger federal and state guidance and regulations against consumer data harms that limit economic opportunity. You can support this and all we do here at CDT by going to cdt.org/techtalk and donating. Every donation matters. Thank you for enhancing civil rights and civil liberties in the digital age.

Speaker 1 0:36 – 0:38

Welcome to Tech Talk

Speaker 2 0:39 – 0:40

by CDT.

Speaker 3 0:42 – 1:29

Welcome to CDT's Tech Talk, where we dish on tech and Internet policy while also explaining what these policies mean to our daily lives. I'm Jamal Magby, and it's time to talk tech. Here to help us understand more about the capabilities and limitations of automated content analysis are Jasmine McNealy, CDT non-residential fellow and associate professor in the Department of Telecommunication, College of Journalism and Communications at the University of Florida, and Dhanaraj Thakur, research director for CDT. Jasmine and Dhanaraj, thank you so much for joining us today. Alright. So to kick us off, can you both explain what automated content analysis is? Yeah. Maybe I could start if that's okay. And,

Speaker 2 1:30 – 2:40

thank you, Jamal, for hosting this talk. I think the topic is very relevant. Right? Automated content analysis has become particularly relevant in recent times. I mean, there are statistics that suggest that, you know, there are, like, 3 billion images uploaded every day, and on YouTube, maybe even as much as 500 hours of video a minute are created. Right? So, basically, because of the sheer scale of content that's being uploaded, in addition to increased calls among governments and policymakers to restrict particular kinds of content, there's been an increased use of these kinds of tools to automatically detect content of a particular nature to inform moderation decisions on social networks and on other services. So understanding what these tools are and their capabilities and limitations is something that we are particularly concerned with. And we argue that it's really important that other stakeholders like, you know, policymakers and companies themselves understand what's happening here as

Speaker 1 2:41 – 3:43

well. Right. So I agree with Dhanaraj. I think it's a collection of techniques that allow for the analysis of text, of images, of video. So all different kinds of media that can be found in a digital, so to speak, or digitized format, that can then be used for things like content moderation, but also things that marketers wanna know about, like sentiment analysis and other kinds of ways of predicting or attaching meaning to the expression. So the visual, the text, the video, all kinds of things. So being able to label it or attach meaning to it automatically, using a set of tools or techniques, and then being able to make that data available for whatever purpose an organization or a government or group wants to use it for.

Speaker 3 3:43 – 3:51

Now from what I understand, there are both matching and predictive models. Can you explain what these are and how they're used?

Speaker 2 3:52 – 5:16

Yeah. Sure. I think, you know, just following what Jasmine just said in terms of the range of techniques, matching and predictive models are a way of generally grouping these different techniques. Most of these techniques rely on some form of machine learning, and that's essentially a means to parse through and analyze large amounts of data to identify characteristics or relationships or correlations within that data that are relevant to the objective of the model. Right? So that could mean identifying images or particular sounds or video that, you know, the developers are interested in. And practically speaking, machines can make these kinds of identifications or labelings in two broad ways. One is matching: essentially recognizing something that's identical or similar to something that's been seen before. Right? And the other is prediction, which is recognizing the characteristics or features of a piece of content based on the machine's prior learning, right, learning from a large amount of data. So matching is basically, have I seen this image or audio before? And prediction is really, to what extent does it fit the characteristics of an image or audio that I want to identify?
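
Illustration (not from the recording): the two approaches described above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not how any production moderation system works. The hash set, feature names, weights, and bias are all hypothetical. Matching here is exact-match only, via a SHA-256 hash from the standard library, so it would miss the "similar" case the speaker mentions; prediction is a hand-weighted logistic score standing in for a trained model.

```python
import hashlib
import math

# Hypothetical database of content that was flagged before (exact-match hashes).
KNOWN_HASHES = {hashlib.sha256(b"previously flagged image bytes").hexdigest()}


def matches_known(content: bytes) -> bool:
    """Matching: have I seen exactly this content before?"""
    return hashlib.sha256(content).hexdigest() in KNOWN_HASHES


# Hypothetical weights standing in for what a model would learn from training data.
WEIGHTS = {"has_fur_texture": 2.0, "has_pointed_ears": 1.5, "has_wheels": -3.0}
BIAS = -1.0


def predict_score(features: dict) -> float:
    """Prediction: how well do these features fit the learned pattern?

    Returns a value between 0 and 1 (a logistic score), not a yes/no answer.
    """
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Changing a single byte of the input defeats the exact match, while the prediction function always returns some score, even for content it has never seen. That difference is the one the speakers draw between "have I seen this before?" and "to what extent does this fit?"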

Speaker 1 5:16 – 6:01

Right. And I would add, the thing about prediction is that the machine can infer values for missing values. So if there's enough of one kind of characteristic, or a set of characteristics, that reminds it of something it's seen before, then the predictive model can say, this will probably end up more like this past thing even if I don't have the value for these certain characteristics. The other characteristics I do have the value for tell me it's leaning more towards this label or this category or this kind of inference about

Speaker 2 6:02 – 7:17

whatever it is it's making an inference about. Yeah. You know, what Jasmine is saying makes me think of a practical example. Right? Like, say you want a model to tell you whether an image contains a cat or not. Right? And so you could have a classifier that's developed that way. Essentially, it's making a prediction. Right? Based on previous data it has learned from, does the image contain a cat? And it often presents that analysis in the form of a prediction. When it comes to these predictive techniques, in a very simple way, that's how it often works. So it's looking at different kinds of shapes and textures and colors that are relevant. But as Jasmine is saying, sometimes a model might not have all the information for a particular variable, like, you know, what's the color that it's looking for or what's the shape and so on. And so it will kind of try and fill the gap, but ultimately, it's making a prediction. And that's what we have to be clear about. Right? That it's making, in effect, an educated guess as to whether this is

Speaker 3 7:17 – 7:24

what we are looking for. Jasmine, let me go back to you and ask, what are some of the limitations of automated content analysis?
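
Illustration (not from the recording): the "fill the gap" and "educated guess" points from the cat example can be made concrete with a small Python sketch. Everything here is an assumption for illustration: the feature names, the averages used to fill missing values (simple mean imputation), and the weights are hypothetical, and a real classifier would learn them from data rather than have them written by hand.

```python
import math

# Hypothetical averages taken from past examples, used to fill in missing values.
TRAINING_MEANS = {"fur_texture": 0.6, "pointed_ears": 0.5, "whiskers": 0.4}

# Hypothetical learned weights and bias for a "contains a cat" score.
WEIGHTS = {"fur_texture": 2.0, "pointed_ears": 1.5, "whiskers": 1.0}
BIAS = -2.0


def cat_probability(features: dict) -> float:
    """Return an estimated probability that the image contains a cat.

    Any feature that is absent or None is replaced with its historical
    average, so the model still answers even with incomplete information.
    """
    z = BIAS
    for name, weight in WEIGHTS.items():
        value = features.get(name)
        if value is None:
            value = TRAINING_MEANS[name]  # fill the gap from past data
        z += weight * value
    return 1.0 / (1.0 + math.exp(-z))
```

Calling `cat_probability({})` with no information at all still returns a number, because every gap is filled from historical averages. The output is a degree of belief shaped by past data, which is why the speakers stress treating it as a guess rather than a fact.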

Speaker 1 7:26 – 11:12

Yeah. So, you know, there are many limitations, but one of the important limitations deals with just the fact that, for the most part, we're making predictions or attempting to match, and then categorize and put labels on, human communication and human behavior. The problem is humans are not rational. Humans don't act normatively. We don't behave as mathematical models, even though, you know, organizations and researchers have attempted to put us into mathematical models. We just don't behave that way. And so when we lack context, when machines cannot suss out context, then the predictions and the matches that they make are very normative: if I had all the information or all the data, if people behaved rationally, if all things were as I wanted them to be to make this very good educated guess, then this is how humans would behave. But that's not it. A researcher, Desmond Patton, at Columbia University, as an example, looks at how law enforcement attempts to use social media data to make predictions about gang activity. The problem is, the young people who they're looking at, who they're surveilling online, sometimes, you know what? They throw up language, but they throw up language that is rap lyrics. Right? Yeah. And so, without that context, you'd be like, oh, these kids are just super violent. They're just, you know, they're in a fight. But, no, these are song lyrics. Without knowing that, without that context, you have one prediction and one understanding of the message that they've thrown up without understanding the entirety of the context, and then you are able to label them in such a way that it hurts them. Right? It places them in a category, criminal, right, or gang member or, you know, about to commit an assault or whatever the case may be, which is harmful.
It's particularly harmful to groups that are used to being surveilled: people of color, religious groups, religious minorities, particularly in the United States, protesters, members of what are called, like, radical movements or whatever by the US government or certain members of the government. Right? So these groups are hyper-surveilled already, and then you're using data, which is obviously historic. All data is historic, but it's based in a system that has historically marginalized people. So you're just reifying further the predictive effects of that data that has already placed criminal or other kinds of labels, deviant labels, on them. And these models are filling in gaps or making predictions about people that further marginalize them. Right? So context is really important. Can we understand what people are talking about? Can we understand what messages are being communicated and to what audiences? And can we more adequately and accurately understand all of the things that are being communicated and by whom? And then really revise or revamp or get rid of some of these predictions or inferences

Speaker 2 11:12 – 12:37

that are made about marginalized communities. Yeah. I completely agree with you, Jasmine. I think there are huge challenges in getting machines to understand context. And without recognizing that limitation, we might jump ahead and put too much value on these kinds of predictions. Right? That could have, like, tremendous negative consequences on people. I think a related kind of limitation here is the data that's being used to train and develop a lot of these machine learning tools, data that often may disproportionately represent some groups over others in society. And what would happen then is that you're making, you know, poor predictions based on flawed data. And there are lots of examples of how this plays out, for example, in facial recognition systems, which can misclassify people, right, because of a lack of representation of some groups in that original dataset. And so there are lots of these limitations, like data quality, like lack of context, that point to the potential discriminatory outcomes of using these kinds of tools. And in our report,

Speaker 3 12:37 – 12:57

Do You See What I See?, we actually lay out several of these as well. So we touched on it a little bit in the last answer, but I wanna just ask, what are some of the risks in implementing some of these automated content moderation tools at large scale? And what risks are specific to historically marginalized communities? I'm sure there are some.
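
Illustration (not from the recording): one simple way to look for the data-representation limitation raised just before this question is to break a tool's error rate down by group instead of reporting a single overall number. The Python sketch below assumes evaluation records of the form (group, true label, predicted label); the group names and the sample records are made up purely for illustration and are not results from any real system.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the share of wrong predictions for each group.

    records: iterable of (group, true_label, predicted_label) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Made-up evaluation records, for illustration only.
SAMPLE_RECORDS = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0),
]
```

In this toy sample the overall error rate looks modest, but `error_rate_by_group(SAMPLE_RECORDS)` shows that every error falls on the smaller group, which is the pattern the speakers warn about when some groups are underrepresented in training data.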

Speaker 1 12:58 – 16:34

Yeah. You know, I think an inherent risk in using machine learning prediction is that, unfortunately, many people think that automated tools and machines are infallible and also that they are neutral. And so any prediction or guidance given by a computer, people think is the guidance that should be used. Unfortunately, as we just talked about with some of the risks of implementing automated content, you know, decision tools or moderation or other kinds of machine learning predictive tools, we know that there are huge risks. And these risks are being borne out in the news, I think, on a weekly basis, right, particularly with facial recognition and other kinds of tools. And in almost every place where machine learning or automated tools are being adopted, there are risks. So, there's now the use of social media data in connection with things like whether or not you're a good risk for loans. Like, what kind of images and what kind of text are you putting on your social media? Right? It's part of your kind of credit score in a certain way. What kind of person are you being predicted to be by automated tools? For, again, historically marginalized communities who already face, you know, discrimination in certain ways for things like bank loans, for things like the criminal justice system, these risks are then amplified, because many people say, well, you know, we've taken away the bias because humans aren't there, so the machine says this. But we know that tools and machines and software are not just neutral diviners of data, but that they are reflective of the system, and also of the models and values of the creators. And so that is a really important thing to consider, I think: that machines aren't neutral. Machines work in systems. As Dhanaraj said earlier, machines use data, but machines also use models that are created by somebody.
And these are really important things because the risks are actually real. Risk of jail, risk of longer sentencing, risk of lack of probation, risk of not getting loans, which then has further impact. Right? So it's not just that you don't get a loan. It's that you don't buy a house in a certain neighborhood. We know in the United States that different neighborhoods have even different life expectancies. Different neighborhoods have different quality of schools. Right? So these are long-term impacts that affect marginalized communities.

Speaker 2 16:34 – 18:30

Yeah. You know, this is such an important point that Jasmine has raised here. We have to recognize that these limitations exist, and they really are going to be amplified when you're talking about using these tools at scale. One of the limitations we didn't get to, but I'll mention now, is explainability. Right? How can you explain, you know, how a machine comes up with a particular decision to a human, to a person? So just to follow Jasmine's example about, you know, getting a loan or access to finance or things like that, real life-changing decisions that, you know, we sometimes rely on these predictive models to inform: what happens when they make decisions that are detrimental? So, for example, if I cannot get access to the kind of finance I want or buy the house I want or things like that because of these, you know, machine-led decisions, what recourse do I have? Right? Can I really understand why the decision was made unfavorably for me? Like, does anyone understand? So that lack of explainability, I mean, you know, there are a lot of efforts trying to unearth that, right, and shed more light on explainability. But right now, it's a huge problem. And if you're then talking about using these machine tools at a very large scale, you're amplifying a lot the problem around lack of explainability. And if it is gonna be that, you know, explainability becomes a premium for people that can afford it and have access to records and access to understand why things happen to them, as opposed to those that cannot, then you're gonna be exacerbating existing problems of discrimination. So that's a big problem.

Speaker 3 18:31 – 18:36

So what should companies and governments be doing to minimize these risks? I mean, so I think,

Speaker 1 18:37 – 21:51

currently, some of the methods that governments are attempting to implement are legislation related to explainability, as Dhanaraj mentioned earlier, but also transparency, you know, requiring organizations to disclose certain things: what kind of techniques are being used, how data is collected and then used, and by whom, or who can access certain kinds of data that they collect. The problem is, as Dhanaraj said earlier, we talk about explainability. We talk about transparency. You can be as transparent as you want, but if nobody understands what you're talking about, it's very hard for people to self-govern or make decisions about whether or not they really want a certain organization to collect data about them and then use data about them. Right? But, also, there's, you know, I wouldn't say newer, I would say a more assertive call for legislation that just straight up bans the use, or some uses, of automated tools because of the risks associated with them. Those risks are related to, like we noted earlier, prediction and errors in prediction and errors in matching. It's still a little bit of a wild west with respect to some of these tools and how they're being used, but we also know the impacts. We know that automated tools being used in health care, for example, are making really bad decisions about African Americans and what kind of health care they should get. We know that automated tools, like you mentioned earlier, in the criminal justice system, in all different aspects, are making really bad predictions or offering really bad guidance about certain kinds of people, and it is having devastating effects. And so there are calls by advocacy groups that say, look, companies should stop using these tools and stop promoting them to governments, but also that our legislators must step in.
They're the only people that could possibly ban the use and creation of these kinds of techniques and tools. We've seen those kinds of requests for bans coming to fruition in several different municipalities. So Oakland and then Berkeley and, you know, Cambridge and Somerville up in the Boston area. Right? But that's very local. On a more national level in the United States, people are calling for the feds to do something because this is already out of control, but even more so as more tools are being created and techniques are being used. Yeah. And, you know, Jasmine, I worry about the

Speaker 2 21:52 – 24:19

like, opposite kind of rhetoric that comes even at the federal level, maybe in some state governments or local, where there are calls to do some kind of wholesale banning of different types of content online. Right? Which then brings the assumption that we need some kind of automated tool. Right? So, to carry it even further, something that, for example, we would definitely oppose is some kind of automated content filtering law. Right? Given all of the kinds of risks that you were pointing out earlier. So that is something where policymakers, yes, should really consider restrictions on these things, as opposed to going in the other direction and saying, you know what, these companies need to ban some of these kinds of content on a large scale and therefore use these kinds of tools, without really being careful about it. That is something I think we need to be concerned about. And with companies, I think, you know, based on what we've said here, we'll have to go back to basics, and companies need to include opportunities for human review and bring people back into the equation, basically, when it comes to, like, you know, moderating content, and, as you mentioned, be more transparent about how human moderators work with these automated tools to protect people's rights. You know? I also think there are different narratives here. We have these conversations around the risks involved in facial recognition technologies, AI-based tools like that, and so on. But within, like, private sector or company forums, if you will, there is this kind of pitching or selling of machine learning tools, these technologies, as being really, really accurate and, you know, going to solve a lot of problems, without any kind of hesitation.
So I worry about these kinds of conversations that are happening, you know, in industry conferences and so on, where there's less criticism, if you will. So, yeah, I think companies and governments need to be way more critical of these tools.

Speaker 3 24:19 – 24:26

So to close this out, and Jasmine, I wanna start with you, just any final thoughts on anything we discussed today?

Speaker 1 24:27 – 26:35

Sure. I think, you know, one of the underlying things we've been talking about is how we communicate about these kinds of tools, whether it's in the automated content analysis realm or machine learning predictive tools as a whole. But the conversation has to be such that just regular people understand what's going on. Because it's one thing to say, oh, we're using artificial intelligence. And, you know, for many of us, the rhetoric or framing surrounding artificial intelligence has been connected to, you know, science fiction and Star Trek and all these kinds of things, or maybe even KITT. Right? Knight Rider. And those all seem good, but the actual uses of machine learning tools, the outcomes, are quite different from being able to beam somebody up. Right? Or the outcomes are quite different from, you know, your car talking to you and it being no big deal. This is actually affecting lives. And so being able to adequately communicate about what's happening, maybe that's a media and tech press or just regular press kind of thing, but also an educator kind of thing and a researcher kind of thing and an advocacy organization kind of thing, which is that we need to be able to communicate adequately and accurately about what the implications of these tools are for everyday people, or just regular folks, and then what we could possibly do about this and how we could make sure that the impacts are mitigated or stopped completely in connection to these, you know, very powerful tools that are continuously, you know, being innovated and moving into several different realms. And so I think that's a really important thing that has to happen with respect to these kinds of techniques and tools.

Speaker 3 26:35 – 26:38

And, Dhanaraj, any final thoughts on your end?

Speaker 2 26:39 – 28:19

Yeah. You know, I completely agree with what Jasmine is saying. Maybe what I'll add is going back to this point about the perceived objectivity of these kinds of tools, and I suppose it links to this larger problem of how we view technologies as being very helpful, you know, for social, economic, or other purposes. And they can be. Right? We have made a lot of advances based on different kinds of technologies. But I think what we should keep in mind when we talk about these kinds of automated tools is what they're intended, you know, to do. And if it's simply to replicate what already exists, which is often the case based on the kinds of data that these models are trained on or the ways they're applied, then that really falls short. But if they're intended to actually, you know, promote some kind of positive change, then that's a different thing. But just designing a technology to solve the problem without considering, like, what's the ultimate impact here, or intention, won't really lead to any kind of positive change. I think we have to consider that, you know: what are the purposes of these tools? And that comes into that critical conversation, particularly for governments and companies and others, to consider, you know, with a critical mind, what is gonna be the impact of these kinds of automated tools.

Speaker 3 28:24 – 28:36

Jasmine and Dhanaraj, thank you so much for being here today. If you would like to find out more about CDT's work, please feel free to visit us at cdt.org. I'm Jamal Magby, and thank you for talking tech.