Big Data Skepticism w/ Kate Crawford
Enrico BertiniHi, everyone. Data stories number 27. Hi, Moritz.
Moritz StefanerHi, Enrico. How you doing? You sound very energetic.
Enrico BertiniYou think so? Well, you know, I go with the weather. The weather is not that good.
Moritz StefanerYeah, yeah, I know what you mean.
Enrico BertiniYeah. Hey, it's been a long time, man.
Kate CrawfordYeah.
Enrico BertiniSince our last episode.
Moritz StefanerThat's true.
Enrico BertiniYeah.
Moritz StefanerHow was your September?
Enrico BertiniGood. Busy, busy, busy, busy. But good. I'm excited.
Moritz StefanerSemester has started.
Enrico BertiniSemester started. I have a few students working with me and it's fun. It's lots of fun. Just yesterday I gave a very interesting exercise in class where I asked my students to sketch visualizations by hand. And I gave them actually a dataset and a few questions coming from a research paper. And after they provided their solutions, I showed them the solution from the researchers.
Moritz StefanerOkay. The original graphic.
Enrico BertiniYeah, the original graphics. And that was fun. That was fun. Yeah. My students are good, actually. I'm surprised.
Moritz StefanerDid they do better than the original researchers? Hopefully.
Enrico BertiniI don't think they did better, but there were very interesting and stimulating discussions. So this kind of stuff, it's always good. I mean, sometimes I have the feeling that I'm learning more than what I teach them. And it's fun. It's a lot of fun.
Moritz StefanerYeah. I was also just teaching, so I went back to my old university to teach the cognitive science students. Wow, some database. Yeah, that was really good. Yeah, yeah.
Enrico BertiniBack to the mothership.
Moritz StefanerExactly. Yeah. Spent a day talking about rainbow color scales, roughly, because, you know, they always have the fMRI data. And also the machine learning people, they always have these magic carpets, you know, these error landscapes, and it's always rainbows. So I think we spent a whole day just discussing these.
Enrico BertiniHave you checked this? Have you seen this very nice series from the guy from NASA?
Moritz StefanerYeah. Fantastic. We should link that. It's really.
Enrico BertiniI love that one.
Moritz StefanerAbsolutely.
Big Data: Criticism and Skepticism AI generated chapter summary:
We invited Kate because we wanted to talk about big data criticism and skepticism. She has a paper out titled Critical Questions for Big Data. She also has lots of jobs, is very busy, and spends a lot of time on planes.
Enrico BertiniI love it. Yeah, yeah. Well, anyway, we have another fantastic guest today. It's my pleasure to introduce Kate Crawford here today. Hi, Kate.
Kate CrawfordHi, guys. How you going?
Enrico BertiniWe are doing great. So we invited Kate because we wanted to talk about big data criticism and skepticism. She's one of the leading figures out there talking about big data, what its limitations are, and what kind of big questions we should also ask on top of big data. Right? So everyone seems to be super excited, everything seems to be super, super good, but of course there are also some weird aspects and maybe limitations and maybe too much hype around it. Right. So we invited her because she's really prominent. She has a paper out titled Critical Questions for Big Data. She's been invited to speak in a number of different places. I think you've been at Strata last year or this year, right? Yeah. And another interesting article in the Harvard Business Review, The Hidden Biases in Big Data, and so on. She also has lots of jobs. Principal researcher at Microsoft Research, visiting professor at MIT, senior fellow at the Information Law Institute at NYU here, where I am, and associate professor at the University of New South Wales. Wow, Kate, how can you manage that?
Kate CrawfordVery busy, and I spend a lot of time on planes.
In the Elevator With Big Data AI generated chapter summary:
Kate Crawford's PhD is in critical media studies. She is a principal researcher at Microsoft Research and a visiting professor at MIT in the Center for Civic Media. Crawford talks about how social media can transform certain kinds of engagements and communication. These are the questions that are really driving her research at the moment.
Enrico BertiniSo we normally ask our guest to introduce themselves. Can you tell us a little bit more about your background, how you got here to the point of how did you get interested in big data and all the rest?
Kate CrawfordOh, it's a long story. Once upon a time in a land far, far away. So I'm a professor, and I've been a professor for over a decade now. Gosh, it's actually twelve years. And I was originally based in Australia, and I was heading up my own research institute at the University of New South Wales, working with Catherine Lumby, who's a media researcher there. And I had been researching a lot of issues to do with how the Internet can be thought of as both a social and cultural technology. And I've been doing that for a long time. And seeing the emergence of social media in the early days struck me as being very interesting. So I took that very seriously at a time when people weren't necessarily taking things like Facebook and Twitter seriously. And I found an enormous amount of potential and also a lot of questions around how social media was going to transform particular kinds of engagements and communication. And as that grew, I connected with a whole lot of other researchers internationally who are also interested in these kinds of spaces, including danah boyd, who was based at Microsoft Research. And she invited me over here as a visiting professor. So I spent some time collaborating with her here, which is where we wrote our first paper together, which is Six Provocations for Big Data. That was back in 2011, which was a paper that we gave at the Oxford Internet Institute for their 10th anniversary. And then we started doing more and more research and published papers since, including the critical questions paper that you mentioned. And they invited me to come over here and to stay. And so in addition to being at Microsoft Research, I'm also a professor over at MIT in the Center for Civic Media, working with Ethan Zuckerman and others, which is a fantastic group of people as well. And in terms of what draws me to some of these big data questions, I think it's the shift that happened really just in the last five years in terms of the kinds of human life that are now quantifiable, the kinds of data that we can scrape, that we can study and analyze, that represent the social graph, that represent the way we communicate with people, that represent our tastes and our preferences, and the way that we see the world. That's a pretty extraordinary shift. That's a different way of visualizing human interaction and human communication. And it also has a lot of flow-on effects and a lot of serious questions, as well as an enormous amount of potential. I have also been doing a lot of work in crisis informatics, looking at how the way people communicate changes during a crisis event. And that's also been absolutely extraordinary to see how much of that we can now gather and analyze from a sort of data perspective. So through these kinds of engagements, I became somebody who was doing big data studies, as they're loosely called. We can talk about that term; I have my own issues with it, but it seems to have become the shorthand. So while I'm doing these studies, I've been seeing a set of recurring problems, and I started to notice these problems, and I started to write about them and started to talk to other researchers about them. And we all realized that there were these patterns that weren't being talked about publicly, problems in the way that these kinds of datasets are being used. Some of those problems were essentially methodological, some were epistemological. They're about how we understand knowledge. Some were ethical. 
Like, should we be using these data sets the way that we use them? So these are the questions that are really driving my research at the moment.
Enrico BertiniThat's really interesting. I actually didn't realize that your questions came from your own practice. That's really interesting.
Moritz StefanerOne question. What did you study originally? Or what was your PhD in? What's your background from that end?
Kate CrawfordYeah, well, I have degrees in law and in philosophy, and my PhD is in critical media studies.
Post-Big Data: The Critical View AI generated chapter summary:
Big data fundamentalism is very much at its absolute zenith at the moment. People are very excited about data cracking all of the great problems we face. But there are serious limitations. Thinking about those limitations, along with the hype, is actually really important.
Moritz StefanerNice. Yeah, it's a perfect combination. So before we talk about, let's say, the critical view, can we maybe sketch first what a data-optimistic or data-positivistic view looks like? I mean, we all work a lot with data, we're all surrounded by data, and we find it very exciting. And this article came back to my mind by Chris Anderson from 2008. It's called The End of Theory, and he basically writes that every special science will be made obsolete by just data science, more or less. I think that's his main claim: that you don't need doctors, just people who are really good at crunching medical documents.
Kate CrawfordHe's got a great line in that article where he says, who knows why people do the things they do? The point is that they just do it and we can track it, and that's all we need, because basically, with enough data, the numbers speak for themselves. And that has been one of those incredibly famous phrases that I think has galvanized this view that somehow numbers are enough and that correlation is just as important as causation. And that really, ultimately, data is more important than the why questions: why we do things, why that data might be there, and what the context might be. And I guess I put those kinds of perspectives under the title of, like, big data fundamentalism.
Moritz StefanerLike a religion? Yeah, religion.
Kate CrawfordIt's like the articles of faith are that, you know, more data is always better, and that the bigger the data, the closer to objective truth you become. So I think that kind of fundamentalism is very much at its absolute zenith at the moment; it's at its height. And people are very excited about data cracking all of the great problems that we face, from climate change through to improving health systems, through to simply understanding why people do what they do. And I think there's a lot of questions that we need to ask about that enthusiasm. The enthusiasm is great, don't get me wrong. I mean, in the lab that I work in, I sit next to people who are experts in machine learning, I sit next to people who are doing extraordinary work in algorithmic game theory, I talk to people who are in theoretical physics. And the reason why a lab like this is so extraordinary is that we can talk about the strengths and weaknesses of the absolutely cutting-edge emerging science in this field. And there are serious limitations. So thinking about those limitations, along with the hype, I think is actually really important, because that's where we're going to produce better data research and ultimately better big data science.
What are the main problems in big data analysis? AI generated chapter summary:
The top three for me are the myth of objectivity, the possibility of algorithmic discrimination, and the enormous difficulty of anonymization. The questions relate to method, ethics and privacy, which is actually a really big problem, both on the academic side of big data research and in the technology sector.
Moritz StefanerSo what are the main problems you see at the moment?
Kate CrawfordInteresting.
Enrico BertiniHow many hours do we have?
Kate CrawfordLook, I'll just give you three that I think have been animating a lot of my thoughts and a lot of the papers and talks that I've been giving recently.
Enrico BertiniKate, before you go through these examples, through these items, I'm just curious: are the ones that you're going to mention those that come from your own practice originally, like what you said at the beginning, that you basically started discovering that there were problems while you were doing big data analysis?
Kate CrawfordI think that is certainly the basis of the first issue, which is.
Enrico BertiniOk, I was just curious about that.
Kate CrawfordYeah, yeah. I mean, certainly in the last couple of years, I've spent a lot of time with people who are doing what is loosely called big data research. And a lot of the work that's happening and a lot of the ways we're trying to make it better sit under these three categories. So some of it is animated directly by work that I've done and collaborators that I work with, and some of it is raised by things that I see in the industry much more broadly. And I'm referring both to the academic side of big data research as well as to what I'm seeing in the technology sector. Yeah. So it's personal, and I think it's also sectoral in that sense. So, yeah, look, the top three for me are, first of all, this myth of objectivity. The second one is the possibility of algorithmic discrimination, and the third one is the enormous difficulty of anonymization. And somewhere between those three, you can see there are questions that relate to method, there are questions that relate to how we think about fairness and justice in big data, and there are questions that relate to ethics and privacy and how we're actually going to think about keeping these data sets secure, which is actually a really big problem. So you can see that there's a set of concerns there that have a spectrum from the very nuts-and-bolts pragmatics of what kind of methods we're applying and what kinds of assumptions are behind those methods, through the epistemological questions of what this data actually represents, to the ethical questions of how we're using it, who it's for, and how we protect it.
Crisis data and the bias of Twitter AI generated chapter summary:
Crisis informatics uses a lot of social media data. But these data have an inherent skew or bias that leads towards a kind of urban, privileged experience of an event. How do you deal with that?
Moritz StefanerSo let's dive into the first one. Yeah, absolutely. So isn't it like that if you have more data, you have a better approximation of reality? I mean, has to be like that, right?
Kate CrawfordYeah. And I think for me, the real contact point for this came from when I first started looking at the way that social media data was being used during crisis events. So this was when I was based in Australia, and in 2010 and 2011, there was the largest ever flood in the history of Queensland, which is a northern state. And the floodplain was extraordinary. It was about the size of France. I mean, it was a huge floodplain and it was affecting large numbers of communities. And we were gathering the tweets and looking at the tweets and how people were sharing information, and there were some really fascinating things to be found in that data set. First of all, we were seeing how people were helping each other try to find and access the resources that were still available. So that was things like, which road is open? How do I get from this town to the next town? Which areas are passable? Through to which shops have milk and bread, where can I actually get food, through to some very interesting kinds of rumor crushing. So when people were sending around things like a picture of a shark supposedly in the waters of Brisbane, others were saying, okay, this is obviously a fake. So all of these kinds of engagements were happening on Twitter in really interesting ways. But there was a problem at the same time, which is that the vast majority of the tweets were coming from the capital city of Brisbane, which is actually not where the greatest damage was being experienced in the state. So even though in the Twitter data you have an enormous number of tweets about what's happening during the floods, they have an inherent skew or inherent bias that leads towards this kind of urban, privileged experience of the event, and certainly not the event at its most extreme. And this is important to recognize because, of course, crisis informatics uses a lot of social media data. It's a growing and important field, I'd say. So thinking about what that data can tell us, because it can tell us some very useful and productive things, and what it can't tell us is, I think, an important academic task right now.
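For listeners who work with this kind of data, a minimal sketch of the check being described might look like the following. The field names are hypothetical; the point is simply to make the geographic skew of a crisis data set explicit before drawing conclusions from it.

```python
from collections import Counter

def region_share(tweets, region_key="user_region"):
    """Report what share of a crisis tweet set comes from each region.

    `tweets` is any iterable of dicts; `region_key` is whatever field your
    collection pipeline stored (a hypothetical name here).
    """
    counts = Counter(t.get(region_key, "unknown") for t in tweets)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.most_common()}

# Toy example: a flood data set dominated by the capital city.
tweets = [
    {"text": "Which roads are still open?", "user_region": "Brisbane"},
    {"text": "Shops out of milk here", "user_region": "Toowoomba"},
    {"text": "Stay safe everyone", "user_region": "Brisbane"},
]
print(region_share(tweets))  # roughly {'Brisbane': 0.67, 'Toowoomba': 0.33}
```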
Moritz StefanerYeah, yeah, yeah. And I'm aware of a few studies that look at the typical demographics for the typical platforms. For Twitter, we know it's more people in the cities and more males than in the general population, and so on. The question is, how do you deal with that? Are you aware of ways of sort of counterbalancing this bias, or should we just not use certain data sets in certain ways? Like, what's the best way out of that?
Kate CrawfordI think there are two ways out of it. One way that people like to talk about is sampling. Do we have a sample that's representative? And I actually think that's extremely difficult to do, partly because of the fact that services like Twitter still have quite a small user base in terms of the overall population of any particular country you might like to point to. And it skews young, it skews white, it skews more affluent, and these are difficult things to try and offset. What I tend to think is a good way to deal with that is to be very, very clear and upfront and explicit about what that data set does represent. So it's about, in our research papers and also in the way those papers are reported on, being very, very clear about, okay, these are the kinds of things we're getting. This data set is incredibly useful, but it's pertaining to these groups and these kinds of people. And that would be a huge improvement, because right now what we see is a lot of reporting, and sometimes even, although fortunately a lot less often, research, when people say, well, this is what.
Moritz StefanerPeople, what the world thinks about, because.
Kate CrawfordThis is what we see on Twitter, and it couldn't be further from the truth. Because in addition to the fact that the Twitter population, the people who are using Twitter, are a very kind of small subset, even within that subset you can ask a whole bunch of really mind-bending questions. One of those mind-bending questions is: what percentage of that data is bots? Because we know that there are millions of bot accounts on Twitter and they're all sending messages, and those messages are all getting into these data sets. And I speak to research scientists about this all the time. It's actually very difficult to extract those messages from your Twitter data set. So therefore, how do we even know what percentage of that data is human and what percentage is non-human? Right. So that actually raises a whole lot of really interesting questions. I don't know if you saw that amazing sentiment study that came out recently that said, looking at Twitter, we can see that people are saddest on Thursday night. I looked at that and I thought, well, does that also mean bots are saddest on Thursday?
Moritz StefanerRight? Yeah. The machines had a long week.
Kate CrawfordYeah. Is that true for bots too? Is Thursday their saddest night? The minute you start asking these questions, you can actually really see some of these claims as being really broad and sometimes unfounded.
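As an aside for practitioners: there is no reliable way to separate bots from people with a simple rule, which is exactly the point being made here, but a rough sketch of the kind of heuristic estimate researchers fall back on might look like this. Thresholds and field names are made up for illustration.

```python
def looks_like_bot(account):
    """Crude heuristic flags, not a real bot detector.

    `account` is a dict of per-account statistics; fields and thresholds
    are illustrative assumptions.
    """
    flags = [
        account.get("tweets_per_day", 0) > 100,       # implausibly high posting rate
        account.get("followers", 0) < 10
            and account.get("following", 0) > 1000,   # follow-spam pattern
        account.get("default_profile_image", False),  # never customized the account
    ]
    return sum(flags) >= 2

def estimated_bot_share(accounts):
    """Fraction of accounts in a sample that trip the heuristic."""
    accounts = list(accounts)
    if not accounts:
        return 0.0
    return sum(looks_like_bot(a) for a in accounts) / len(accounts)
```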
Moritz StefanerBut I can really relate to that problem because, I mean, first of all, if you do data visualization, whenever you make a world map or a United States map, people read it as the whole. It's very hard to express in a map or a diagram that it's just a small sample of things we're looking at, because it always looks complete and it always looks like the bird's eye view. Right. And the second thing is, whenever you publish something on the web, the editor will want a nice, catchy headline. They don't want a relative clause in a headline explaining what exactly was being measured. And I think that's a deep problem at the moment, that everything has to be blown up to be the super surprising statement about everything at once.
Enrico BertiniYeah. And I think on top of that, whatever kind of information is supported by some data has this kind of aura of truth and objectivity. Right. I think that's another problem. I think people are more persuaded when you show hard numbers to them. But we know that having numbers doesn't really mean that this is the truth. Right. So that's a big issue here as well.
Kate CrawfordWell, Enrico, that's exactly right. You said it brilliantly. That is absolutely the problem of this myth of objectivity. And once you take that kind of underlying perception that somehow numbers equal truth and multiply it by a factor of n, you get to big data, and then you can see how easy it becomes to just feel like, well, if it's big data, it's obviously true, because you have a very large data set attached to it. Right. So it's completely understandable that we are in awe of big data studies, because they seem to have so much data that, how can that possibly be wrong? Unfortunately, they are quite often wrong, as we've seen in cases like the Google Flu Trends example earlier this year, where they wildly mispredicted the number of Americans who would experience flu in a season. Now, they had enormous data, and if you think about the amount of search engine data that Google has, it's just extraordinary. And yet they can still make these really large errors. So having that skepticism, I think, is powerful as researchers, as thinkers, as designers, because it makes us able to create more nuanced pictures, more nuanced designs, more nuanced studies. Now, I completely agree that we can't change the way the media is going to report on that. I actually think reporters are getting more and more literate about data, too. But, you know, to be honest, we can't change the way a headline is going to be put on an article that reports on our study. But our studies can be very, very careful and we can actually be extremely nuanced. And I think that's a responsibility that we have as scientists. Yeah.
Moritz StefanerAnd at least if you put out a graphic, you know, you can have a catchy headline, but at least have a sub headline that explains exactly what we're looking at and how it was gathered in which timeframe, from which platform, and not just say, the mood of the nation or something like that. Sorry to whoever put that out, maybe.
On the Problem of Science's Objectivity AI generated chapter summary:
As new tools emerge in history, science responds, and it becomes connected to how science depicts objectivity. Big data is starting to shift our idea of what good science looks like and what objectivity looks like. To recognize the limits of that tool is going to be essential, or we're going to make a lot of mistakes.
Enrico BertiniYeah. I think this nicely introduces another problem that I always see, which is the problem of literacy, right? The data literacy that people have, or even the statistical literacy that people have. I mean, I'm sure that most people just take things for granted when they see some numbers and just don't realize that these are just numbers, right? To some extent, we are not even trained at school to criticize numbers, because numbers are assumed to be the truth, right? And we are taught that science is the objective truth, which is actually not totally true in my view of the world. I mean, even removing all the biases we've been talking about, all the sampling problems, a scientific experiment doesn't necessarily tell the truth, right? It's just one single experiment. And if you look at how science progresses, you make progress only when you have a series of positive results. Right? And even then, it's a very long and complicated process.
Kate CrawfordI think that's right. And I think it's actually fascinating to me to look at the history of this concept of objectivity. There's a beautiful book about objectivity written by Lorraine Daston and Peter Galison, which looks at how recent the idea of objectivity is in science. It really isn't a very old idea, this idea that we could actually be completely objective. It's far more recent than you might imagine. And it comes hand in hand with a set of technologies like the camera and like the microscope, where we started to say that these kinds of instruments allow us to remove the human and produce this kind of mechanical objectivity. And what's so interesting about that is that as new tools emerge in history, science responds, and it becomes connected to how science depicts objectivity. And with what's now being called the computational turn, this turn to the capacity to have very large computations and very large data sets, that is a tool as well, one that is starting to shift our idea of what good science looks like and what objectivity looks like. And that's actually a radical shift in what research is and how we understand truth. And this is why epistemology is really at the heart of this. It's about how we understand the very definition of knowledge. And as part of this computational turn, we're starting to see big data as being how we get to that sort of next level of knowledge and reflection. And I think it certainly can be a very powerful tool. But to recognize the limits of that tool is going to be essential, or we're going to make a lot of mistakes.
Enrico BertiniGood point. I was actually thinking, just going back to one of the things that you were saying before about asking the right questions when you are confronted with a new data set. I'm sure many of us, including myself, have been going through new data sets without even asking ourselves what kind of biases are there. So, do you know if there is any resource somewhere that actually tells you, look, if you have a new data set in front of you, you should at least ask yourself these kinds of questions? That would be kind of like a cookbook or a checklist. Like, you have a new data set, the first thing to do is this, this, this and this, right? I would love to have something like that.
Kate CrawfordI love this idea. I think we should design it like a color wheel, that we could just kind of. Do you have this kind of data? Oh, then you want to ask these kinds of questions first. Do you have this kind of data? You can just rotate the color wheel around. Is it, is it mobile data? Is it social media data from Twitter? Is it from Facebook? Is it from city sensors? You know, you could get some really, you could get a hilarious little, you know, a little heuristic tool that we could use. I think we should market that. What do you reckon? We'll design it and we can actually release it and see if people would use it.
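Half in jest, but the "color wheel" could literally start life as a lookup table. A minimal sketch, with the question lists invented for illustration rather than taken from any published checklist:

```python
# A toy version of the "data set color wheel": look up your data source,
# read off the questions to ask before analysis. All entries are illustrative.
DATASET_QUESTIONS = {
    "twitter": [
        "What share of these accounts are likely bots?",
        "Which demographics and regions are over-represented?",
        "How were the tweets sampled (search, streaming API, full firehose)?",
    ],
    "mobile": [
        "How easily could individuals be re-identified from location traces?",
        "Who is missing (no phone, opted out, prepaid and unregistered)?",
    ],
    "city_sensors": [
        "Which neighbourhoods have sensor coverage and which do not?",
        "Who could be disadvantaged if this data were released?",
    ],
}

def questions_for(source):
    fallback = ["What does this data set actually represent, and what is missing?"]
    return DATASET_QUESTIONS.get(source.lower(), fallback)

print("\n".join(questions_for("Twitter")))
```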
Algorithmic Discrimination AI generated chapter summary:
Algorithmic discrimination is the way in which we're using data sets to essentially categorize people into ever more precise categories. We see entire new industries being set up. The trick is just to behave so erratically that you don't fit into any category.
Enrico BertiniSee how things happen in this podcast. Should we move on to the second point? I think you said discrimination, right?
Kate CrawfordAlgorithmic discrimination. Yeah. This is a term that's being used to look at the way in which we're actually using data sets to essentially categorize people into ever more precise categories. So it gets interesting. People have been making the claim that somehow, because when you're working with big data it's so abstract, you have these data sets, it doesn't function at the level of group-based discrimination. But it's actually the opposite. If we look particularly at the way that marketing is using big data, it's to try and put you into these categories of what's your gender, what's your race, where do you live, what's your age? And then, much more precisely, what do you like to eat, what do you like to drink? Where do you go out at night? What are your entertainment tastes, what are your political preferences? And we see entire new industries being set up. Companies like Acxiom, who sort of sit in this space of third party data brokers, who are amassing enormous amounts of data about individuals, actually ascribed to individuals, and putting them into these kinds of marketing categories around what you can sell this person. Now, that's fine, up to a point. A lot of people aren't really that concerned about how things get marketed to them. But it's interesting if you put this into a different kind of historical context. So we could go back to American history and you could see where we saw the emergence of redlining, where people who were living in traditionally African American poor communities were not being offered credit and banking loans. And this was a serious enough form of discrimination that we saw federal legislation being passed to prevent it. But what's interesting now is that we still have that legislation. Redlining is still illegal in the offline world. But online, if you choose to show your ad for a particular type of credit offer only to this particular type of person, who has these characteristics, who has this kind of a bank balance that makes them very attractive to banks, you can actually be very precise in determining who will see that. And perhaps somebody who really, really needs that credit offer or really needs that loan will never even see that it exists.
Moritz StefanerRight. And I mean, Facebook, the whole advertising on Facebook is exactly that principle. And everybody loves it because it's so targeted.
Kate CrawfordAlthough, I don't know if this works for you, but for me, my Facebook ads are so untargeted, I can't tell you how far off they are.
Moritz StefanerNo, the trick is just to behave so erratically that you don't fit into any category.
Enrico BertiniIt had to be a game, actually.
Kate CrawfordLike, maybe that was to it.
Moritz StefanerBut also in Germany, there's at least the rumor, and I believe there's something to it, that your, like where you're calling from when you call a hotline, for instance, or, you know, like, I don't know, you're calling your telecommunications provider or so that there's a scoring system that will decide how long you have to wait based on if you come from a rich quarter or a poorer one.
Enrico BertiniWow.
Moritz StefanerYeah. Because, I mean, in the end, you want to, like, comfort the richer clients because they will pay more in the end, and so you make sure they receive better support. Yeah, it's a rumor, but one that seems to be quite well founded.
Could Facebook Know How Sick You Are? AI generated chapter summary:
In the US, we have HIPAA, which is the act that essentially is there to govern health data. But all of that data is completely unprotected by HIPAA. It can be sold to third party providers, and it might be an accurate signal, but it also might be inaccurate.
Enrico BertiniOne question I have then is, under the current legislation that we have in our countries, does the existing legislation prevent doing that or not? I mean. Sorry, Kate, say it again.
Kate CrawfordNo, certainly not in the US.
Enrico BertiniOkay.
Kate CrawfordThere are some. It's interesting to see what's happening in the UK and in parts of Europe, but certainly in the US, no, this is absolutely very common. And what's so interesting, too, is when you start to think about this in the health space. So health data has often been highly regulated. It's been seen as something that, and I think rightly so, should be really protected, and it should have a whole lot of protections for consumers around their health data. But in the US, we have HIPAA, which is the act that essentially is there to govern health data. But if you get sick, for example, what's the first thing you do? Well, you're probably going to open up a search engine. You're probably going to type in your symptoms and see, how serious is this, or what should I take? And all of that data...
Enrico BertiniI no longer do that. Sorry for interrupting you. I get so paranoid.
Moritz StefanerToo scary.
Enrico BertiniYou get so paranoid about that, and it's crazy. Sorry for interrupting, but I'm with you.
Kate CrawfordBut I still do it, even though I know that I do it and I kick myself because, of course, you read this.
Enrico BertiniIt's so damn scary.
Kate CrawfordBut it's an incredibly common reflexive habit that is absolutely widespread. And all of that data is completely unprotected by HIPAA. There's nothing that protects you there. Say, for example, you buy an e-book about cancer survival, or you like a page for a disease foundation on Facebook. These are very interesting signals about your health, your current state of health, that can be mined and can be used to build a picture of you. And that is completely unprotected data that can be moved through any particular service. It can be sold to third party providers, and it might be an accurate signal, but it actually also might be an inaccurate signal. And what's so interesting is that in either of those situations, it can actually still be quite dangerous if that ends up getting connected to your health insurance premiums, or to whether or not somebody decides, you know, do I want to give this person a job, or do I actually want to rent my house to this person?
Moritz StefanerThat's a horrible vision. If the interest in something is a problem, you know, it's like, what world is that? You know, like, regardless of, you know, how it's connected to your well being or not, the fact that your sole interest in something is something that people can judge you for automatically is a horrible thought.
Kate CrawfordI think that's exactly right. I mean, the study that I think is really interesting is the one that was done by Cambridge University looking at the Facebook likes of 60,000 people. And they used those likes to develop a model that could predict very sensitive personal information about users, including their sexuality, their ethnicity, their religious views, and even if they're a previous user of drugs.
Moritz StefanerAnd alcohol, but just in an indirect way. So if you listen to these types of bands or you like these types of movies, then suddenly they can predict that.
Kate CrawfordThat's right. And you'd think, well, you know, just how accurate can that really be? Well, what was really interesting about the study is they showed you the accuracy in the end. And in the end, they were really good at categorizing whether you were Caucasian or African American. They had around 95% accuracy there, which is pretty startling, followed by gender. They were very good at predicting gender. Then male sexuality; apparently female sexuality is a lot harder to predict, it's way down the list. And then even lower than that was political leanings, which is interesting, because I would have presumed that would have been higher, easier to predict.
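For readers wondering how likes can predict attributes at all, below is a toy sketch of the general mechanism: binary like indicators feeding a classifier. It is not the Cambridge study's actual pipeline or data; everything here is made up, and it assumes scikit-learn is available.

```python
# Toy illustration only: "likes" become binary features for a simple classifier.
from sklearn.linear_model import LogisticRegression

PAGES = ["band_a", "band_b", "movie_c", "foundation_d"]

def to_features(liked_pages):
    return [1 if page in liked_pages else 0 for page in PAGES]

# Each row is one user's likes; y is a made-up sensitive attribute label.
X = [
    to_features({"band_a", "movie_c"}),
    to_features({"band_b"}),
    to_features({"band_a", "band_b", "foundation_d"}),
    to_features({"movie_c"}),
]
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([to_features({"band_a"})]))  # probability per class
```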
Where is the Creepy Line? AI generated chapter summary:
Where is the creepy line? Or where is the unethical? Where's that border to be drawn? Is there any rule of thumb there, or a hard ethical rule?
Moritz StefanerYeah, but the interesting thing is, there's a fine line, because, in principle, personalized or targeted advertisements, why not? If I like these three movies, and then I get a banner for a fourth movie I actually like, why not? That's cool. But, so, in your view, where is the creepy line? Or where is the unethical? Where's that border to be drawn? Is there any rule of thumb there, or a hard ethical rule?
Kate CrawfordYeah. Well, this is what is so interesting, because I feel like this test that, of course, was famously coined by Google, don't be creepy, the creepy test, is completely unknown now. We don't really know where the creepy line is. Some of the things that Google is doing I think I would put officially into the creepy category. In fact, just yesterday I was in the ladies' bathroom at MIT and somebody was wearing Google Glass. I was there. This is happening. A line has been crossed.
Moritz StefanerThat is creepy indeed. You're right. And in two, three years, maybe it's not creepy anymore because we're just getting used to the creepiness.
Kate CrawfordThis is what's interesting. So I think in addition to the test being obviously completely blurry and opaque, our sense of what is creepy is changing all the time. So it's actually a really bad heuristic to use, and I think it's failed us too many times for us to even take it seriously. And what was really interesting about that Cambridge University study is one of the points the researchers made: they said they were really worried that this kind of Facebook data is not going to be used just to market things to you. It's going to be used by employers and by government agencies to discriminate against individuals. But even more worryingly, you won't even know that your data has been used to make a discrimination about you, about whether you'll get a job or about whether you'll get health insurance. You would never know that, in fact, your Facebook likes have been used as part of a big data determination. And whether it's right or wrong, that could actually have a very serious impact on your life. This gets really interesting, and it's not really a science fiction scenario. I mean, I like to sort of think about just how far this can go, but these sorts of things are already happening. I mean, a lot of these studies are very much on the cutting edge. They're sort of within, you know, one or two years of what people have been doing. But even so, we're starting to see an uptake of these kinds of technologies as a way to determine what people are like. I mean, obviously, employers are already using things like Facebook to assess people before they hire them, but this is getting well beyond that point. I mean, if you're looking at making predictions about somebody in order to determine whether or not you're going to rent a house to them or give them health insurance, that has some very, very serious implications. We have no way to actually moderate that.
Moritz StefanerI mean, we always extrapolate, I think, from incomplete information. Right. So I don't know. So you want to employ somebody, you talk to them, like, for half an hour and you read their cv. So you would always also look at their clothes and see, like, you know, how they shake your hand and, you know, that's what we do.
Kate CrawfordAnd I think that's a really, that's a really good example because you get to participate in that. You get to. Exactly.
Moritz StefanerAt least you have a chance, like, to make a difference cv, you know.
Kate CrawfordBut this is a situation where you will not even know. And those kinds of predictive modeling processes, in some cases, I actually think, are really primitive. In some cases it's really not that great yet. So you've got this very coarse-grained data which is being used in ways that sometimes can be quite accurate, sometimes not, and you're not even aware that it's part of the mix. So that's what I think is the ethical question that we need to ask.
Moritz StefanerSo what do we demand? Like that, for instance, if there's, like, a company using that data, should they be. So if they have some, let's say a credit company, like a bank, you know, do they have to open up the black box of the algorithm and specify exactly which attributes they use? Would that be a solution?
Kate CrawfordI think not. And this is where it gets really interesting, because, of course, at the level of the algorithm, in many cases, even the designers of the algorithm can't tell you how it's worked. That's so interesting. I mean, if you talk to people at Amazon, if you talk to people at Google, if you talk to people at Microsoft, they will tell you that there are so many algorithms and they're doing a whole lot of things that are by no means easy to point to. But what I think we can do is go further down the chain and look a lot closer to when a decision is being made. So if a bank decides to change your credit rating, we already have regulation to allow you to see how that credit rating functions. Imagine applying that to big data. Imagine if you're going for a job and you have the right to ask, is there a sort of big data component to the hiring? Can I see my file? And they could say, yep, we've hired this company, this is what they've told us about you. Feel free to correct any of this if it's incorrect. That isn't that far away from being possible. That is something that we could actually see as being part of a due process structure. I think at the level of the algorithm, at the level of the data generation, it's too hard. I think at the level of collection, it's almost impossible. Yeah.
Moritz StefanerRight. You would probably need a PhD in machine learning just to understand what algorithm they actually used.
Kate CrawfordYeah, but I think when decisions are getting made, we can actually look at the point of determination and say, what can we do?
Moritz StefanerYeah, it's a great point. Yeah. And again, this chance to sort of interact with a real human and put things straight, I think this whole data machine makes a decision that changes your life is just not right. And I actually know because, so we were buying a house two years ago, and so I was exactly in the situation that a few banks would just turn me down without even talking to me just because I don't fit in some category. So, and it's really upsetting. I can report personally, it is.
Kate CrawfordAnd I think banks have been doing what we loosely call big data for a long time. In many ways, where I locate the prehistory of big data is with the financial sector and with intelligence, with organizations like the NSA, and actually with climate data as well: areas where very large datasets have been collected over a long period of time and have been analyzed for particular purposes. But we're seeing whole new sectors start to do very similar sorts of things, those sorts of determinations, something that banks have been doing for decades. Banks are regulated, sometimes, in terms of those processes, and at least we have the capacity to have some visibility into things like credit scores, but that's not the case in all of these new industries, and that's what's more concerning to me.
How to Identify People From Anonymous Data Sets AI generated chapter summary:
An interesting study shows how many points of spatio-temporal data we need from an anonymized cell phone data set to identify an individual. Only four points are enough to identify 95% of individuals in the data set. It's not by any means an easy process.
Moritz StefanerSo shall we move on to the last point?
Kate CrawfordYeah, I was just enjoying that, the pause.
Moritz StefanerYeah. So the third one was anonymity.
Kate CrawfordSo anonymity is really interesting, because we have entire fields of expertise of people who are working on both how to anonymize data and then how to re-identify people from anonymous data sets. And it's actually extremely difficult. Back in the 1930s, one of the founders of forensic science, a man called Edmond Locard, wanted to find out how many points on a fingerprint would be required to identify an individual. And he discovered that, in fact, it was twelve points. If you could take twelve points of someone's thumbprint, you could pretty accurately identify that individual. What we found out from a very interesting study this year is how many points of spatio-temporal data we need from an anonymized cell phone data set to identify an individual. So those spatio-temporal points: where are you standing and making a call? Where are you in space, and what time of day is it? So have a guess how many of those spatio-temporal points you need to uniquely identify an individual.
Moritz StefanerI think we both know the study.
Kate CrawfordYou know, this is what's so amazing: it's only four, which now we know, if you want to get 95% of individuals actually identified from a data set. That's pretty extraordinary.
Moritz StefanerI mean, that's so low, the number. Yeah, it's horribly low. I mean, it couldn't be lower. You know, it's like, you know, two or three is like basically impossible, and four is like this.
Kate CrawfordYeah, well, yeah, I mean, it's interesting. They found that if you went lower, if you sort of went to two or three points, you could get 50%.
Moritz StefanerOh, wow.
Kate CrawfordSo actually, with fewer points you don't get the 95% accuracy, but you can still identify a lot of people.
Moritz StefanerRight.
Kate CrawfordSo that study, which came out in Nature, is extremely, extremely important, because it tells us a couple of things. It tells us how important metadata is. It also shows us that the way that we walk through cities, the paths that we make, are unique, that we're highly identifiable and that we're creatures of habit. You know, there aren't that many people who live in your house and work in your workplace and take the same paths between those two places. And that's something really fascinating and I think quite lovely. But it's frightening when you think about how identifiable that makes us.
Moritz StefanerAnd the same happens when you bring two data sets together. So it's very hard to maybe identify somebody from their music taste, but if you also know about their food taste, suddenly it becomes a unique combination or something like this.
Kate CrawfordOh, no, please.
Enrico BertiniI'm just wondering, does this mean that any given company, once they get these four data points, is able to identify you, or what? I just want to understand exactly what the implication of that is. What's the meaning?
Kate CrawfordYeah, I mean, it's interesting because I think that the study is very narrowly defined and they were very particular about how it was done. It was using 1.5 million people's cell phone records in an unnamed European country and they had this data set. It was an anonymous data set and they wanted to see how, if you had these spatio temporal points, how many you needed to actually identify individuals in that set. Now, whether you're going to go through the process that they went through to try and identify people is a different question. I think it's not by any means an easy process. And this is what's so interesting about reidentification research. It's not like you can just press a button and easily identify people from an anonymous data set. The issue is more that it is technically possible and that if you wanted to put in enough time and effort that you can do this.
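For readers who want the mechanics, here is a toy reconstruction of the unicity idea behind that kind of study, not the paper's actual code: sample p spatio-temporal points from one person's trace and check whether any other trace in the set also contains all of them.

```python
import random

# A "trace" is a set of (antenna_id, hour) points: spatio-temporal tuples.
def is_unique(traces, person_id, p=4, rng=random):
    """Do p random points from this person's trace single them out?"""
    sample = rng.sample(sorted(traces[person_id]), p)
    matches = [pid for pid, trace in traces.items()
               if all(point in trace for point in sample)]
    return matches == [person_id]

def unicity(traces, p=4, trials=100, rng=random):
    """Share of sampled individuals uniquely pinned down by p points."""
    ids = [pid for pid, trace in traces.items() if len(trace) >= p]
    picks = [rng.choice(ids) for _ in range(trials)]
    return sum(is_unique(traces, pid, p, rng) for pid in picks) / trials

# Toy data: creatures of habit are easy to single out.
traces = {
    "a": {("cbd", 9), ("suburb_1", 19), ("cbd", 13), ("suburb_1", 22)},
    "b": {("cbd", 9), ("suburb_2", 19), ("cbd", 13), ("suburb_2", 22)},
    "c": {("cbd", 9), ("suburb_1", 19), ("airport", 13), ("suburb_1", 22)},
}
print(unicity(traces, p=4, trials=50))
```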
Moritz StefanerYeah. And you have sort of these cascades. So once you have somebody identified, you know, and have their whole communication history from that previously anonymous data, but you sort of, you found your way into that, then you suddenly know much more about the person that also unlocks maybe other data sets. So I do think that there's a very, very real threat there also that. Yeah. If once you get your hands on such a large dataset and can identify people, that you can do a lot with that information.
Kate CrawfordYeah. And it's interesting, because certainly in recent years, even if we just look five years ago, it was actually pretty hard to get that kind of data. Now it's a lot easier. So I don't know if you saw, but in July this year, AT&T quietly decided to change its privacy policy and start selling its aggregated, anonymized, quote unquote, customer records. So if you're an AT&T subscriber, unless you go and specifically tell them that you want to opt out, then your data can be sold in these massive data sets. And very few people know that that change was made. And therefore, very few people will actually take the effort to opt out and get in touch with AT&T. So they're in this data set, and now Verizon is doing the same thing. So it's interesting to think about the fact that these data sets are becoming public. Now, obviously, they cost money, so they're not free, but they're going to organizations that have a vested interest in knowing a lot about who those individuals might be, and now they have that data. So that raises interesting questions. Now, by no means am I thinking that we can roll that back. That's something that those companies can choose to do. But it does raise questions for us as researchers in terms of, well, what does that mean for how we want to use those data sets, how we want to compare and bring data sets together, because we know how intimate and how sensitive and how revealing those data sets can be.
Enrico BertiniYeah, that's interesting. So every time I talk with somebody about the issues with data privacy, my very personal feeling is, I'm honestly wondering, are there any chances that we can really make progress? I mean, honestly, sorry for the negative view, but another option is just to say, look, there's no way to have private data, right? And we'd better educate ourselves that whatever we put out there is just public. Right. I mean, that's another take on it, a completely different kind of view.
Top Five: Data Ethics AI generated chapter summary:
I think we can do a lot to strengthen data ethics. If we have a really good sense of how to use data ethically, that becomes part of how the profession understands itself. There are going to be some amazing new technologies that are going to help us protect our data.
Kate CrawfordOh, yeah. And it's interesting. And we could take that view as a thought experiment, and we could run with it. We could say, okay, everything is public. Your phone records, your browsing history, everything that happens on social media, your location in the city, who you go to, where your doctor is, what kinds of medical practitioners you're seeing. Let's just make all that public. That is definitely something that you could imagine as a kind of a worldview. Now, for a lot of people, that's not going to matter if you're somebody who's in perfect health, if you're somebody who lives in a country that has free healthcare, if you're somebody who doesn't look at anything that might possibly imperil the way that people would think about you or employ you, then that's totally fine. But what's interesting is that that kind of data can be used in very prejudicial ways. And what worries me is less those kinds of people, and more the people who are already vulnerable, who are already in communities who already have less power, because they're the ones whose data is going to be used against them. And this is what we're already seeing in particular ways. It's vulnerable communities who are already seen as subjects of surveillance. I think that we have no privacy. Get over it. Argument is really great for people who are already privileged, and I think it's actually not so great for people who are less privileged. And I sort of feel like there's so much we can do. I don't want to sound like a Pollyanna and to be overly optimistic, but I feel like there's an enormous amount of things we can do. I think we can do a lot. To be thinking about strengthening data ethics. That's something I really care about. I mean, at the moment, if you look at the ACM and the IEEE ethics guidelines, they're both 20 years old. I mean, the things that you could do in computer science in 20 years have changed so radically, and the ethics guidelines haven't been updated. So, I mean, things like that are really important. I also have an enormous amount of faith in pedagogy and teaching and how we actually train the next generation of data scientists. If we have a really good sense of how to use data ethically, that becomes part of how the profession understands itself. And that's actually really important, too. And then I think, thirdly, the thing that, which gives me really hope is thinking about due process, which we talked about earlier. So if data is being used against you in particular ways, or being used to make determinations that affect your life in serious ways. And I don't mean marketing. I mean, this is, this is much more about jobs and healthcare and getting a house to live in. At that level, I think we should have a right to see the data that's being used against us. I think having some of these are actually policy tweaks. I think there's also the level of social norms. So we could think about, we're going to start getting a lot more careful about the kinds of data that we make public, and we might start doing things like opting out particular types of data collection more than we used to. I think that the social level is important. The legal level is important. I also think the technical level is important. There's going to be some amazing new technologies that are going to help us protect our data in particular ways that are on the horizon. 
So between those three things, and obviously those three things all working together, I have an enormous amount of hope that it's not over, that we haven't reached a point where everything is going to be public, because I don't think that would be a very nice future to be living in.
Moritz StefanerI think so, too. And with the whole surveillance thing, it's not about a direct effect on most people, because for most people there is no direct effect. It's the way you change your behavior once you know you're potentially being monitored; that is already the problem, more or less. And many of our listeners are working with data, and I think one big takeaway is to always look one step beyond the data set you happen to have in your hands and ask: given that data set, what is the rest? What exactly is the part I'm missing because I'm only looking at that one piece? Is there a wider data set I could relate it to, or at least check my data against, to see whether it has the same distributions and demographics, or whether I'm missing out on a certain part of the population? So I think it's often just a mental thing, that you consider that possibility at all, that your data might be incomplete.
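A minimal sketch of the kind of sanity check Moritz describes here: comparing the demographic make-up of the data set you happen to have against a wider reference data set. This example is not from the episode; the file names and the "age_group" column are hypothetical placeholders.

import pandas as pd
from scipy.stats import chisquare

# Hypothetical inputs: the data set you are working with, and a wider
# reference data set (e.g. census figures) to check it against.
sample = pd.read_csv("sample.csv")
reference = pd.read_csv("census_reference.csv")

# Share of each demographic group in both data sets.
reference_dist = reference["age_group"].value_counts(normalize=True).sort_index()
sample_dist = (
    sample["age_group"]
    .value_counts(normalize=True)
    .reindex(reference_dist.index, fill_value=0.0)  # groups missing from the sample become 0
)

# Side-by-side comparison: large negative differences are groups you are missing.
comparison = pd.DataFrame({"sample": sample_dist, "reference": reference_dist})
comparison["difference"] = comparison["sample"] - comparison["reference"]
print(comparison.sort_values("difference"))

# Chi-square goodness-of-fit: are the sample counts plausible draws from the
# reference distribution?
observed = sample["age_group"].value_counts().reindex(reference_dist.index, fill_value=0)
expected = reference_dist * observed.sum()
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")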
Kate CrawfordI think that could be part of the color wheel, too, right?
Moritz StefanerExactly. Take one more step out of the box.
Kate CrawfordAnd another one is: who in this data set might be seriously disadvantaged or damaged if you release this kind of thing? There are a whole lot of really useful questions that we need to ask ourselves as people who make and design data systems. It's an extraordinary job, an extraordinary capacity, and the stuff that we can do with data is extremely exciting. But we also have these questions that we need to ask ourselves. And it's funny, there was a symposium at Harvard recently with re-identification researchers. These are people on the cutting edge of computer science who are looking at data sets and saying, right, if I wanted to look at this in such a way that I could re-identify individuals, how would I do it? It's an entire research field. What was really interesting is that people in that field are now starting to ask questions like: I've discovered this huge vulnerability, should I publish it? Because it's going into the hands of people who are going to be using it for reasons that are really, really suspect. Is that worth me just adding another paper to my CV? So it's interesting that those conversations are happening right now, because people know that, in some cases, you can't put the genie back in the bottle. Once we've found ways to identify people, that's going to have ramifications. And that's a pretty extreme example. But even with the kind of everyday data sets we might be working with, you just ask the question: what are the implications of this once it's out? And some of the time that might mean you don't release that technology, or you don't make that data set public, or you have some ways of thinking about who's going to be implicated in that data. These are actually really good questions to be asking. Not everything has to be released. Not everything should be made public. Yeah.
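One concrete version of the question Kate raises, who in a data set could be re-identified or harmed once it is out, is to check how unique each record is on its quasi-identifiers before any release. A minimal k-anonymity-style sketch, not from the episode, with hypothetical column and file names:

import pandas as pd

# Hypothetical quasi-identifiers: attributes that are not names or IDs but can
# still single people out when combined.
QUASI_IDENTIFIERS = ["zip_code", "birth_year", "gender"]
K = 5  # every combination of quasi-identifier values should cover at least K records

data = pd.read_csv("to_release.csv")

# Size of each group of records sharing the same quasi-identifier values.
group_sizes = data.groupby(QUASI_IDENTIFIERS).size()
risky = group_sizes[group_sizes < K]

print(f"{len(risky)} quasi-identifier combinations cover fewer than {K} records")
print(f"{risky.sum()} records carry an elevated re-identification risk")

# A crude mitigation: coarsen one identifier (birth year -> decade) and re-check.
data["birth_decade"] = (data["birth_year"] // 10) * 10
coarse_sizes = data.groupby(["zip_code", "birth_decade", "gender"]).size()
print(f"after coarsening: {(coarse_sizes < K).sum()} risky combinations remain")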
Enrico BertiniI think, related to that, we, people working in data visualization or similar fields, have a kind of bias towards data rather than problems. I've observed this many times. I mean, our work starts with, oh, I got some data, let's work on that.
Moritz StefanerLet's see what we can do with that.
Enrico BertiniYeah, but if you think about how good science and good analytics work, everything starts from a good question. Right. And then you ask yourself, what's the best way to answer this question? How can I get the data that I need to answer it? And then there is a very long process of working with the data itself: either seeing if there is any data out there that can help answer your question, or generating the data on your own. I remember reading about experiments in political science. Most of these people start with a big question and then go out and try to find a way to gather the data they need. And sometimes this means literally going out and collecting the data yourself. And once you do that, you are part of the process of creating the data itself. And this is so much more powerful.
Moritz StefanerAnd then you're much more sensitive to, and aware of, the decisions you made along the way to end up with that specific data.
Enrico BertiniYeah, absolutely.
Kate CrawfordI think that's an excellent point. That's actually something that comes up a lot in our research lab, thinking about how we ask the right questions, because that's really where the most interesting work happens. It's not just, here's a data set, how do I make it look pretty? Or here's a data set, what are some things I can pull out of it? It's, what is a question that hasn't been asked in this way? What's something we can show with data that really does answer something significant? And I think that's also a sign of data science starting to get more interesting. It's maturing beyond the really, really early stages of, oh, look, we've got all this data. Isn't it exciting?
Enrico BertiniYeah.
Kate CrawfordYeah, well, so what? What question are you asking with this data? That's a really good sign that we're getting to that point. And actually, I find that pretty inspiring, because we're getting past that shiny new toy stage, and that means there's going to be a lot of really interesting work done with data in the next few years. And that's the stuff that I think excites all of us, because that's why we're doing it. But this issue of having the right research question is absolutely paramount, and that's really what it's about. That's the value of what you're doing.
Big Data Ethics AI generated chapter summary:
Enrico: We all need to work on our sense of data ethics. We can look to the 20th century to see when fields went through this. Is there like a society or a lobby for data ethics?
Moritz StefanerYeah. And the other thing that really stood out for me in this conversation is that we all need to work on our sense of data ethics: what is a good way of treating data, asking other people how exactly they worked with their data, and making that a societal thing. Because regulations are very difficult in that area, and algorithms are black boxes, so we have to sort of make sure things are going in the right way.
Kate CrawfordYeah. And it's interesting, because this isn't new. We can look to the 20th century to see when other fields went through this. There was the emergence of medical ethics, which came from a whole lot of really awful things happening and people realizing that this was completely unacceptable, that we need very, very clear guidelines around how medical testing is done. Very similar things have happened in anthropology, which has developed, over a very long time, codes of what constitutes ethical ethnographic practice. And that's a key part of the methodology of the field. You would not do incredibly unethical things like deciding, well, I'm just going to spy on this person and write a research paper about...
Moritz StefanerThem, and then vastly overgeneralize.
Enrico BertiniYeah, that's another big issue.
Moritz StefanerNo, it's true, but we are sort of falling back into amateur stages there. But again, that's also because many people working with data don't have that type of humanities background, or aren't even aware that these kinds of standards exist. Right. Is there a society or a lobby for data ethics? Like, is there a club of data ethicists?
Kate CrawfordI'd be interested to start one. Maybe they're out there as a kind of secret society. Really, it should be happening from the organizations that represent us. So for computer scientists it's the ACM, for engineers it's the IEEE, and I think those institutional bodies actually have a lot of power in saying what's acceptable and what's not. I also think, at the level of things like journals, if something is really shoddy data research, where you really haven't thought about the ethics, then it shouldn't be published. This is exactly what happened in the medical field. If you were testing a particular vaccine on kids and you weren't telling them what you were actually giving them, that would be profoundly unethical; it would not be published, and it would not be seen as something that was okay. So because this is kind of a new space, big data science is new, we're still at the stage of having these discussions about what would constitute unethical practice. And we need to have those conversations soon, because a lot of stuff is happening that raises very serious questions and could have serious ramifications. We don't really have those ethical paradigms in place yet, but I think we really need to, and the time pressure is on.
Moritz StefanerAbsolutely. Yeah.
Enrico BertiniDo you think we can actually try to learn something from how these things have been solved in the past, like as in the medical domain, as you just mentioned?
Kate CrawfordI think so. I think that's absolutely where we begin. Hopefully we can actually learn those lessons without going through the disasters, because in many cases what really instituted firm policies around ethics was profoundly unethical practices that turned into terrible disasters.
Moritz StefanerYou mean like the NSA spying on everybody? In Germany, at least, this has totally changed the conversation. Before, it was more like a luxury problem, and I think now it's becoming something where everybody feels he or she is somehow affected.
Kate CrawfordI think it's funny that you say that. We've been having conversations about whether that could be one of the long-term impacts of Snowden releasing those documents.
Moritz StefanerI think so. I think so.
Kate CrawfordThat would be fantastic. I mean, that would be such an important thing to have done that suddenly we can start to say, okay, this affects everybody. Now we have to think about what ethical data use looks like.
Moritz StefanerYeah, fantastic. Enrico, do you have more questions? Or Kate, do you want to, like, make a final pledge?
In the Elevator With Big Data and Ethics AI generated chapter summary:
Kate: I'm just as inspired about getting these questions around ethics and due process right. That's just as sexy to me and just as exciting to me as the data. I think education is huge here. We need to have some balanced view. When do we do the color wheel?
Enrico BertiniI wanted to ask Kate. So before we conclude, tell us something super positive about big data.
Moritz StefanerTell us a big data joke.
Enrico BertiniA big data joke or.
Kate CrawfordTwo algorithms.
Enrico BertiniWalk into a car.
Moritz StefanerAnd they order the beers for each other.
Kate CrawfordThey say, if you like.
Enrico BertiniI mean, I'm totally sure that you are at least as enthusiastic as we are about the opportunities of big data as well. Right? I'm sure.
Kate CrawfordSo, I mean, I wouldn't be where I am, working at Microsoft Research and at MIT, if I didn't think this is absolutely where there is so much exciting potential. Absolutely. But I'm just as inspired about getting these questions around ethics and due process right. That's just as sexy to me and just as exciting to me as the data and what we can do with the data, and we need those just as much. I don't see those things as downers; I see them as exciting intellectual challenges.
Enrico BertiniYeah.
Moritz StefanerAnd once you adopt that mindset, you might come up with whole new research ideas. It's always refreshing to escape from always working with the same type of things, always doing the same type of stuff. Once you realize, oh, hold on, I could flip this whole thing 90 degrees, suddenly it's exciting again, whether you're a researcher or a scientist. Or also if you're a data analyst, right?
Kate CrawfordYeah, completely agree. See, there you go. There's a really positive thing to say about it.
Enrico BertiniNo, but honestly, Kate, thanks a lot, because I think it's super, super important having people like you around, writing down what the problems are and explaining to people that there are problems, especially to practitioners like us. We need a balanced view, and I think education plays a huge role here. We have to start educating practitioners and experts like us, and then we need to educate other people. I've been working in this area for a long time, but I have to confess that I only very recently started reading about, and realizing, that these kinds of problems, the ones you've been mentioning throughout the episode, exist. And I think that's super important. So thanks. Thanks a lot.
Moritz StefanerSo when do we do the color.
Kate CrawfordWheel.
Moritz StefanerAnd when do we found that club? Excellent. So we'll catch up; in half a year we can report on the progress. It was great having you, Kate. I think it's a super important and super interesting topic, and one that, once you start thinking about it, at least for me, you cannot stop thinking about it or noticing the biases everywhere. And, yeah, it's a curse and a blessing, I guess.
Kate CrawfordAnd it was such a pleasure talking to both of you. That was a really interesting conversation.
Enrico BertiniThank you.
Moritz StefanerThank you. Thank you.
Enrico BertiniOkay.
Kate CrawfordThanks a lot.