Data Ethics and Privacy with Eleanor Saitta
Eleanor SaittaIf you are doing good work in data analysis and data visualization, one of the things that you should do is, through your work, teach people what bad work looks like.
Moritz StefanerData Stories is brought to you by Qlik, who allows you to explore the hidden relationships within your data that lead to meaningful insights. Let your instincts lead the way to create personalized visualizations and dynamic dashboards with Qlik Sense, which you can download for free at qlik.de/datastories. That's Q-L-I-K dot de slash datastories. Hey, everyone, it's a new Data Stories. Hi, Enrico. How are you doing?
Enrico BertiniI'm doing great, and you?
Moritz StefanerVery good. Yeah, no, summer broke out over here, so.
Enrico BertiniYeah.
Moritz StefanerYeah, I'm a happy camper. We're actually camping in our backyard right now to test.
Enrico BertiniThat's amazing.
Moritz StefanerLiterally a happy camper. Very good.
Enrico BertiniYeah, that's perfect for kids. They love it.
Moritz StefanerYeah. And we have to test the tent. So what can you do? Yeah, what's up for you these days? You busy?
Enrico BertiniYeah, I think it's okay. End of the semester, finally. Just waiting for my students to submit the final, final version of their projects. I'm really excited. I think the students' projects are getting better and better. This year they're working a lot on streaming data visualization, which is real-time stuff for me. Out of 20 projects, ten at least are on streaming data. So I've been learning a lot of interesting things myself, and I think it's a good test, maybe, for future work. And some of my students are at the ACM CHI conference, the major human-computer interaction conference, presenting some of our work. I'm excited as well; I think they are going to present some good stuff there, so I'm waiting for feedback from them. Unfortunately, I couldn't go, but it's a good place right now. I don't know, have you ever been to CHI? It's a huge, huge conference.
Moritz StefanerNever made it, actually. Yeah, I know. I should go one day.
Enrico BertiniIt's a little confusing, but good.
Moritz StefanerYeah. This year we have a cameo because there's a data edibilization workshop and one of the papers is partly on data cuisine.
Enrico BertiniOh, yeah, yeah, yeah. I actually saw one paper mentioning it. No, actually a whole session that is called Dear Data.
Moritz StefanerYeah, that's the other one. Look at that.
Enrico BertiniThings are happening between academia and, quote unquote, the real world.
Moritz StefanerOne would hope.
Enrico BertiniHow about you? What's up?
Moritz StefanerGood. I mean, I'm busy. I have one big project I'm working on a showcase type thing for a university in Switzerland.
Enrico BertiniYeah.
Moritz StefanerAnd I'm preparing my big US tour. So I will spend June in the United States, hopping from place to place: the Eyeo Festival, the Information+ conference in Vancouver, and then another data cuisine workshop in Boston, actually. And in between, I'll do a few short visits to the Bay Area and, yeah, some nice places.
Enrico BertiniYou should stop by.
Moritz StefanerI will do everything except New York this time. I'm really sorry, but I also need to go to those other places for once. Next time. Next time, Enrico. No worries.
Enrico BertiniYeah, yeah.
Moritz StefanerYeah. But I'll keep reporting.
Enrico BertiniYeah. Looking forward to it.
Data Ethics and Privacy AI generated chapter summary:
Today's episode is focused on data ethics and privacy issues. Eleanor Saitta will be Etsy's security architect. She says there are also very personal data security issues.
Moritz StefanerVery good. So I think we should introduce our guests. So today our episode is focused on data ethics and privacy issues, I think a super important and very interesting topic, one that reaches into many fields and super relevant for everybody, I think. And we have a great expert here. Her name is Eleanor Saitta. Hi, Eleanor.
Eleanor SaittaHello.
Enrico BertiniHi, Eleanor.
Moritz StefanerGreat to have you on. Yeah. Can you introduce yourself, tell us a bit what you're interested in, what you're working on?
Eleanor SaittaWell, I guess the big news that I have is that as of next week, I will be Etsy's security architect.
Moritz StefanerWow.
Eleanor SaittaI'm just starting, so I'm really looking forward to that. In general, my work over the past, gosh, 13, 14 years now has concentrated on the places where security rises above the machine, where kind of big socio-technical systems interact and, you know, start crossing boundaries. The work that I do looks a lot at threat modeling, a lot at how we understand what security is in different contexts, how we understand what the requirements might be for a secure system, what the structure of how we think about outcomes may be. And a lot of that work looks not just at security as such, but at broader concerns of efficacy, or how systems function or fail over time. And it's included everything from peer security to operational stuff in high risk contexts to constitutional law or futures, and kind of broader systems failure work.
Moritz StefanerYeah, I think it's very interesting, and it's something that is often only treated from a corporate perspective, right? Security. But there is, of course, also this whole personal data security issue. As everybody is present in datasets and on the web, there are also very personal data security issues, right?
Eleanor SaittaYeah. I mean, there are personal issues, but also the issue that not all users are at the same risk. And this is one of the things that we frequently run into when we're working with journalists or activists or people at NGOs: the same tools that otherwise create some moderate risk cause really serious risks in specific contexts. Risk isn't distributed equally. In just the same way, there are real problems for, say, women who are at risk of domestic violence, or sex workers. These people have an elevated risk despite, in theory, using the same tools and the same systems that everyone else uses.
The Surprising Ethics of Revealing Banksy's Identity AI generated chapter summary:
Scientists used publicly available data to try to reveal the identity of famous street artist Banksy. The project raises many questions about the ethical implications of using public data. This kind of stuff can change the game in terms of facilitating easier access to data.
Moritz StefanerSo we have a couple of different, let's say, data investigations, data leaks and so on that we wanted to discuss with you, because basically how we came to talk about this topic was when Enrico and I ran across the Banksy investigation by scientists, where there was actually a scientific paper and a whole research project aiming at unveiling Banksy's identity, Banksy being the famous street artist who chose to be anonymous, or wants to be anonymous, or pseudonymous, I don't know how you say that. And then there were data scientists trying to reveal or prove his identity, I guess. And yeah, we found this very striking use of data, which raises many questions. So can you tell us a bit more about this project and your reaction to it, or maybe what it demonstrates from your perspective?
Eleanor SaittaI mean, I think it was interesting from a technical perspective. I don't know that I'd call it a scientific paper as much; I mean, it's a demonstration for people who want to sell algorithms into the kind of anti-terror market. The kind of location analysis that they're looking at is based on pattern-of-use analysis, in this case of where the artwork showed up, and doing a bunch of work to find geographic centers of activity. But all of the actual things that they're intending to use it for are basically figuring out where people live so they can be more effectively killed with drones. Now, in this case, the researchers seem to have decided that because Banksy's work was public enough, he had no interest in retaining any degree of privacy and could simply be used as a sample for de-anonymization without any consequences. I don't really understand, a, how this got past IRB, and b, how they make that ethical judgment with any degree of internal coherency. I don't think that there's any reason for them to assume they have carte blanche to reveal someone's identity. And I think that this is a larger failure that we see often in people doing data research: they assume that certain kinds of scientific, or supposedly scientific, subjects simply don't have any personal privacy, personal rights, any validity to the ethical considerations we would use in other contexts. Now, the fact that this was originally coming from a terror context where people are looking at killing people with drones obviously says certain things about the kind of ethical framework that these people are operating in. I mean, well, we didn't blow up anybody's wedding, so maybe it wasn't so bad. It kind of goes a little weird.
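For a sense of the mechanics, here is a minimal, hypothetical sketch of that kind of pattern-of-use analysis. It is a simplified stand-in, not the paper's actual method: it scores candidate anchor points against observed event locations with a basic distance-decay model, where events are assumed most likely to occur near, but not exactly at, the anchor. All names and coordinates below are invented.

```python
import math

def distance_decay_score(candidate, events, buffer=1.0, decay=1.5):
    """Sum a distance-decay kernel over all event locations.

    Events are assumed most likely to occur near, but not exactly at,
    an anchor point such as a home address (the 'buffer' ring)."""
    score = 0.0
    for ex, ey in events:
        d = math.hypot(candidate[0] - ex, candidate[1] - ey)
        score += math.exp(-((d - buffer) ** 2) / decay)
    return score

# Hypothetical event coordinates (e.g., artwork sightings on a city grid)
events = [(1.0, 2.0), (1.5, 2.5), (0.8, 1.9), (2.0, 3.0)]

# Hypothetical candidate anchor points to rank
candidates = {"address_a": (1.2, 2.3), "address_b": (5.0, 5.0)}

ranked = sorted(candidates,
                key=lambda c: distance_decay_score(candidates[c], events),
                reverse=True)
print(ranked)  # address_a ranks first: its neighborhood best explains the pattern
```

The point of the sketch is how little it takes: a list of public sighting coordinates and a shortlist of candidate addresses is enough to rank people.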
Moritz StefanerSo just to recap, what they did in the paper was they took an alleged this-is-probably-Banksy person in the UK, I think the Daily Mirror or some other newspaper came out with that story a few years ago, and then tried to prove, by, as they say, publicly available data, that it's very probable that this person is actually Banksy. Right. And so the line of reasoning was obviously: if the data is public already, or somebody else has published something already, it's all right to do something, or anything, with it.
Eleanor SaittaYeah. I mean, this feels like it falls into this common fallacy where reducing the friction around accessing some fact from a dataset, or increasing the veracity of it, doesn't count as an act because it's technically already out there. Like, back in the early two thousands, Google spent a bunch of time starting to crawl public record sites, making land ownership records and court proceedings that were previously online but not easy to get at suddenly much more searchable. And this was back when they still sort of pretended that don't be evil was a thing that they were doing. And they're like, well, clearly we're not being evil, because all of this data was already out there. They're looking at privacy from this sort of mathematical conception of was it possible to know a thing, versus privacy from an effects-in-the-world standpoint of how much effort does some random human being have to go through to know this thing? And this is a really common split that you see, where engineers tend to take this mathematical perception of privacy, whereas the people who are actually impacted by these systems are like, no, this is not a value neutral thing. You've actually made something much easier for people that I'm worried about. This kind of stuff comes up around stalking cases a lot, for instance.
Moritz StefanerSo facilitating easier access to data alone can already change the game in terms of ethics, and whether your actions are right or not.
Eleanor SaittaYeah, because, I mean, it's all about friction. If you have individuals who are trying to navigate the complex social space of protecting themselves, or people they care about, or whatever the set of things that they're worried about is, the calculations that they make are made on the basis of: they have certain resources and outcomes, their adversaries have certain resources and outcomes. And it's all this kind of balancing act that's very much driven by friction, by this kind of, what can I do? What will make things easier? What will make things harder? And this is something seen a lot in the high risk space: there's never any absolutes, right? You never are absolutely secure or absolutely screwed. You're always kind of somewhere in the middle. And it's these kinds of shifts in friction that are what make a real difference.
Moritz StefanerComing back to the Banksy case, so I remember in the paper, they justified it, or they didn't really justify it. They didn't even talk about if it's okay or not to do this type of thing, they just did it. But they, they motivated their practice by saying, well, it could be used to chase bad people like terrorists or criminals. And I think they also treated Banksy simply as a criminal and said like, yeah, okay, this guy is like doing artworks where he's not supposed to be doing artwork, so he's sort of one of the bad guys. And that means he sort of forfeited his rights to privacy.
Eleanor SaittaI mean, it was certainly a very naive approach to the relationship between art and the public and political statements. And I guess if you want to take that sort of fascistic approach to public order, I mean, sure. I would counsel most researchers to possibly look at the longer term effects of what they're doing in the world and how it affects people's lives. That said, this is part of where I say that the fact that they are doing this work from the perspective of the military industrial complex very much inflects the kinds of ethical perspectives that they're bringing to it. Right? They're bringing a perspective where order is good and unauthorized art is bad. And, I mean, possibly they might disagree with that, and I would certainly hope that they would disagree with that. But if they're going to disagree with that, then they need to actually do the work and not end up acting like that's the way they see the world, because their actions very much do support that kind of very naive, very black and white perspective.
Moritz StefanerYeah. And I mean, at the end of the day, researchers should not think they're the police, right? And I think it's good that there are certain separations there between the worlds of science and criminology. It gets a bit more blurry, I think, when you talk about journalism, actually, because, again, it can be a great journalistic act to expose bad practices somewhere, and at some point you also maybe have to name names, right? In the context of reporting, it can be good to point out individuals who do bad things and, yes, pull them out of anonymity.
Eleanor SaittaAbsolutely. But how do you make a call?
Moritz StefanerLike, when is it okay? When is it not okay? Can you offer any advice there or what's the best practice?
Eleanor SaittaI mean, I think that this is one of the things that separates out professional journalists. This is one of the reasons why journalism is a profession: they are in the business of thinking about and understanding the structures around these kinds of ethics. Broadly speaking, this is one of the reasons why the neutral point of view in journalism is so rightly maligned in a lot of places these days, because you can't, for instance, do investigative journalism from a neutral point of view, right? Unless all you're doing is simply saying, well, the rule of law is good and neutral, and therefore I will support the rule of law in all cases, regardless of the outcome in human lives. And even there, that's clearly not neutral; that's taking a very specific and very authoritarian position. So as soon as you're saying, well, okay, the work that I'm doing is going to have some effect in the world, I'm committing acts of journalism because I want to in some way change the world, I want to expose things in a way that lets people see the world in a different light, then you've already sort of decided on the kinds of impacts and the kinds of effects that you want. And then from that perspective, you need to understand: what are those effects that I want to see? What are the tactics that I'm willing to engage with? And what is my theory of change that connects these two? If I have an understanding that taking this specific set of actions is within the bounds that I'm willing to engage in, whether that's de-anonymizing artists or revealing the names of people using offshore tax havens, then, okay, how does this result in the impact that I think it will have? What unintentional impacts will this have? And part of that also is understanding, accepting, and taking responsibility for the fact that sometimes you're going to be wrong, and sort of dealing with that, you know, and you may not be able to make any recompense to the victims.
Moritz StefanerYeah.
Eleanor SaittaBut you need to be, you know, you need to be certain in proportion to the impact.
Moritz StefanerYou can't put that back into the box, right?
Eleanor SaittaYeah, exactly.
Can Anonymous Data Be Re-Identified? AI generated chapter summary:
It's really hard to anonymize data really well. Reidentification is really just breathtakingly easy. If you are doing journalism with data, you should work with a team of outsiders to figure out how that data might be reidentifiable.
Moritz StefanerYeah. I think that's an interesting problem. For instance, there was this Buzzfeed investigation, a really strong piece of investigative journalism, a very data-heavy investigation, where they sort of proved that there were good indications that some tennis matches were fixed by certain players. And they deliberately decided not to publish the names of the players, apparently because they wouldn't want to expose them straight away to the public outrage before being super sure it's actually true. But then there were a few people who took their data, which Buzzfeed thought they had anonymized really well, and it was reverse engineerable who the players were, with some heuristics and some tricks. It's really hard to anonymize data really well, right?
Eleanor SaittaYeah. I mean, this is something that happens again and again: reidentification is really just breathtakingly easy. And understanding all of the different aspects on which someone might want to re-identify what's happening in some case is often really difficult. We've seen so many failures of de-identification that it kind of rises to the level that independent review should almost be required. And I don't mean required in a legal sense; I mean, if you are doing journalism with data where you are trying to preserve anonymity on some factor or another, you should work with a team of outsiders to figure out how that data might be reidentifiable. Because, and this is something we run into again and again with any kind of protective work, when you're attempting to build a system, you can't at the same time break the system. There's one mindset for building and one mindset for breaking, and the two don't cross over. So you kind of need an outside team to step in and say, okay, yes, we can help you figure out what you might be exposing that you're not thinking about. For example, there's the case with the New York City taxi records, where they published this big swath of data about taxis with all of the fare identifiers, like any credit card payment records, stripped out, and, I think it was, a hash of the medallion number, so in theory the drivers couldn't be identified. But what was found after the fact is that there were places where someone could be identified at one end of a journey. Specifically, when you had, say, famous people in New York City who take taxicabs, of which there are many, someone is photographed leaving their house at a certain point in time in a certain cab, and then you look in the data set and see, hey, where did that cab go? Because you know where it started and you know who got into it. And then, oh, it went to a strip club. Or a cab stopped at the same house in this place and went to the same house in that place at such and such times, reliably, and then all of a sudden it's like, well, actually, there are very few people who are likely to be taking that cab. And then it turns out that you're revealing a lot of very private data. None of that information was in the dump, but it was very easy, especially when combined with other information, to pull it back out again. In looking at re-identifying geographic trace records, I want to say it was three, maybe four, data points separated by 15 minutes each that is sufficient to identify 90% of the population if it happens during commute time, just because pairs of home address and work address, especially combined with commute route, are almost entirely unique. And that's to within 200 meters accuracy, kind of fairly loose information. So again, really figuring out what anonymous means is really hard. I think one of the lessons of this for people who are working with data is: if you can at all avoid it, just simply don't make raw datasets available, no matter how heavily you think you've anonymized them.
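The taxi case is a nice illustration of why hashing small, structured identifiers is not anonymization. A minimal sketch, assuming an unsalted MD5 hash over a simplified, hypothetical medallion format (the real formats were similarly small): an attacker simply enumerates the whole space and inverts every hash.

```python
import hashlib
from string import ascii_uppercase, digits

def build_lookup_table(candidates):
    """Precompute md5(candidate) -> candidate for every possible ID."""
    return {hashlib.md5(c.encode()).hexdigest(): c for c in candidates}

# Hypothetical medallion format "9X99": digit, letter, digit, digit.
# Only 10 * 26 * 10 * 10 = 26,000 possibilities, enumerable in milliseconds.
candidates = (f"{a}{b}{c}{d}"
              for a in digits for b in ascii_uppercase
              for c in digits for d in digits)
table = build_lookup_table(candidates)

# An "anonymized" value as it might appear in the released data (hypothetical):
leaked_hash = hashlib.md5("7B82".encode()).hexdigest()
print(table.get(leaked_hash))  # prints 7B82: the driver ID is recovered
```

Because the input space is tiny and there is no salt, the "hash" is effectively a reversible lookup, which is why every trip for a given driver could be linked back together.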
Moritz StefanerYeah, but it's such a dilemma, because these are the most interesting datasets. Of course, personally, I love playing with data, and I played with the city taxi dataset. It's huge, I think 73 million rides, and super dense and very detailed. And of course, these are the most exciting data sets out there, the ones of massive human activity, right? I mean, this is the...
Eleanor SaittaAbsolutely.
Moritz StefanerYeah, where the meat is, more or less. And so if we say the companies and the research institutions should lock them away, I think it's a pity. Like, what can we do to both, let's say, give everybody access to this really interesting data, but make sure it's not harmful? Can this be solved in some way, or do we have to choose?
Eleanor SaittaI mean, I think at the end of the day, you do have to choose, right? Because, okay, there's a few different angles here. One, and this is something which the city of New York did not do in any way: managing consent for data is really, really important. Fortunately, in a European context, we don't have any particular choice. If the city of Paris wanted to release equivalent data there, someone would go to jail. You simply can't do that. You're not allowed to, because the people from whom that data was gathered didn't consent for it to be used in arbitrary ways and to be released to the public. And so ensuring that the people who the data is about get to consent to what is going on, and to what the structure of this is, is, I think, a really important start for anything like this. But then, even with consent, figuring out what you can pull out of it is hard. And I think that that's the next big thing: ensuring that when I give consent, I know what I am giving consent to. Am I giving consent to my anonymous ride data being used in a way that sort of exposes something about the geography of the city? Maybe. Am I giving consent for my home address to be published once I'm de-anonymized? Absolutely not. But the problem is that if we don't understand what we can pull out of this data, we don't know what we're consenting to.
Enrico BertiniYeah, but at the same time, I guess it's hard for people to understand what they're consenting to, right? I mean, this is complex stuff with lots of implications and ramifications. So my guess is that the average person just doesn't have any idea what it means to give consent for publishing data, right?
Eleanor SaittaAbsolutely.
Enrico BertiniYeah, I think that's also a big issue. And I think what is really interesting with privacy and datasets is that it's hard for a single individual, on the one hand, to imagine that this data is going to be used against him or herself directly. But then, thanks to the fact that it is possible to have information about so many people, if only one person decides to target only one person, right, as in the Banksy case, then it is possible. So I think it's really, really complicated.
Eleanor SaittaI mean, I think we're also learning increasingly that there are more and more things that will be used against people. I think we're in a position right now where definitely public understanding, but also the more general societal understanding, if we can separate those two, really hasn't caught up with what's possible. Just for instance, look at the number of divorce cases where Fitbit data is suddenly being used in court. No one who was putting on that Fitbit to try and understand, hey, am I getting enough exercise, thought that they were, say, recording their sex life in a manner that was going to be used in their divorce, and yet it turns out that a large number of people have done exactly that. So, yeah, I think that definitely people don't understand what this stuff means. But part of what that ends up meaning is that we, as people who work with data, have an onus to educate, basically. We have an onus to help people figure this stuff out, to help people understand what's going on, because they otherwise don't have the background to get this.
Data Stories AI generated chapter summary:
This week, Data Stories is brought to you by Qlik. Let your instincts lead the way to create personalized visualizations and dynamic dashboards with Qlik Sense. Qlik visualization advocate Patrik Lundblad wrote a blog post on how to make successful data maps.
Moritz StefanerThis is a good time to take a little break and talk about our sponsor. This week, Data Stories is brought to you by Qlik, who allows you to explore the hidden relationships within your data that lead to meaningful insights. Let your instincts lead the way to create personalized visualizations and dynamic dashboards with Qlik Sense, which you can download for free at qlik.de/datastories. That's Q-L-I-K dot de slash datastories. And as you know, often data has a location attached to it, and maps are really among the most fascinating data visualizations out there. On the Qlik blog, Qlik visualization advocate Patrik Lundblad wrote a blog post on how to make successful data maps. It's a really comprehensive overview of how you can work with dots or areas on maps, and even how you can map flows and connections. Mapping is one of the most exciting fields of data visualization right now, in my opinion, and it's good to know some basic rules. So check out the blog post through the link in the show notes; it's really worth your time. Thanks again to Qlik for sponsoring us. And now back to the show.
The dangers of big data AI generated chapter summary:
Those that can have, on the one hand, can do the biggest good in the world. But on the other hand, if it is possible to have easy access to this data, you can do a lot of harm as well. There are starting to be context specific toolkits for understanding how to use data.
Enrico BertiniConnecting back to what we were saying before: I think, in a way, it's almost as if the most interesting and powerful data sets out there, those that can, on the one hand, do the biggest good in the world, are also those that can do the most harm. Right? I mean, I'm thinking, for instance, about my lab, where we work quite a lot with electronic health records. There is a lot of potential there; you can literally, possibly, save a lot of people's lives. But on the other hand, if it is possible to have easy access to this data, you can do a lot of harm as well. And my sense is that this is almost always true: the more good you can do, the more harm you can do. I don't know, it's a terrible conundrum.
Eleanor SaittaI mean, the more impact you can have in the world.
Enrico BertiniYeah, yeah, exactly. But how do you actually go about solving this? Do you think there is a way to solve this problem? Because, on the other hand, I myself wouldn't be ready to say, you are not allowed to use this, and we should restrict access as much as possible. I mean, it's really hard, right? What's your take on that?
Eleanor SaittaI mean, my take on that mostly is that there's no generic answer. Right.
Enrico BertiniYeah.
Eleanor SaittaThat we. It would be lovely if we could simply say, well, here's how you properly anonymize data and here's the ethics book. And here you go. Right.
Enrico BertiniYeah, it doesn't work that way. Right.
Google's Deep Learning plan for NHS data AI generated chapter summary:
Google team getting access to big chunks of NHS health records to plug into deep learning system. There's always going to be a need for that kind of second check of like, did you guys think about what's actually going on here? There are reasons to be careful and curious and hopefully sophisticated consumers of news in general.
Eleanor SaittaIt doesn't work like that. I mean, there are starting to be context-specific toolkits for understanding how you should look at using data in very specific cases, and how data can flow in very specific cases. A friend of mine wrote a great document that looks at how governments and kind of grassroots, what they call volunteer technical communities, can interact in the context of disaster response. Here's something where, yes, there's a huge impact for data, there's a lot of time sensitivity, there's a lot of these kinds of really important processes going on. There's enough commonality across the different use cases that you can start saying: these are the responsibilities that you have in such and such a context, these are the broad principles that you need to take into account in such and such a context, to start figuring this stuff out. I think that, for instance, we might in five or ten years start to have a similar set of consensus, or a similar set of proposals, even in the electronic health records space, which right now is unfortunately kind of a mess. I don't know if you guys saw the news that there's a Google team that's getting access to big chunks of NHS health records to plug into some kind of deep learning system, for no particularly well understood outcome or reason. They haven't said what they're going to do with it, and I think the answer is they don't know yet, because it's a research team, and that's the point. But obviously there are serious issues with consent there. There are serious issues especially given that at least one of the hospitals included specifically has an AIDS treatment center. I think there are some class issues in terms of which hospitals got selected. So there's a bunch of complexity here that needs to get taken into account. Over time, we'll start figuring that out, but I think that there's always going to be a need for that kind of second check of, hey, wait, did you guys think about what's actually going on here? Have you had somebody take a real swing at the way you're anonymizing this data and see if there are trivial, obvious breaks? To go back and have, probably, an outside researcher, and I don't think this is quite an ethics board thing, because this is something slightly different, take a look and say, hey, are you creating axes of social discrimination in the work that you're doing? Does this work uniquely expose certain groups to specific harms? I mean, I know Uber had an internal dashboard for a while, which was their one night stand dashboard, right, where they had basically figured out, oh, there are certain patterns that tend to indicate one night stands, and so we can actually really easily cull those from the dataset and bring them out, and, haha, it's kind of funny. Except, of course, what does this do, again, around domestic violence? What does this do around sex worker rights? What does this do around all sorts of different groups of very at-risk individuals who could be identified by that, and some of their abusers might, you know, be Uber employees?
Enrico BertiniSure.
Eleanor SaittaYou know, and obviously, you know, using Uber as an example of bad data management practices is almost cheating because they're like such a stereotypically horrific example of bad privacy management. But they are a really good example also of the kind of harms that come out.
Moritz StefanerYeah. And again, they sit on all this richness of data, and it is relevant how they use it. We've discussed the publishing side and the data analyst side, but I think there's also the audience perspective. For instance, now the Panama Papers have been made public and you can search for names and companies, and, you know, there's hundreds of thousands of names. You will find something on somebody.
Eleanor SaittaRight.
Moritz StefanerBut the question is: should we blindly retweet anything that has a celebrity name and some data source in it? Or shouldn't we, as audience members, more often ask a critical question, like how solid the reporting behind it is, and things like this?
Eleanor SaittaI think that there are definitely reasons to be careful and curious and hopefully sophisticated consumers of news in general. That obviously goes much beyond data leak reporting into any news or any information that you get in general, though. And in that case, I would actually say that I don't believe we have any reason to suspect that data-centric stories deserve that much higher a degree of suspicion than any other kind of story, given the prevalence of bad reporting and propaganda online just in general. I think that there's just a general need for suspicion and sophistication there.
Moritz StefanerThey often seem more convincing, though. When you read there's hundreds of thousands of documents behind something, or there's measurements being made that prove who Banksy is, it seems more authoritative, even if it might be just crappy data analysis that leads to certain insights.
Eleanor SaittaI think that there is definitely a need to start becoming more sophisticated in terms of how we think about the data side of things specifically. That said, again, this is something where the onus is on journalists to educate people. If you are doing good work in data analysis and data visualization, one of the things that you should do is, through your work, teach people what bad work looks like. Teach people how to read an article, how to understand a data set, how to understand these kinds of analytic processes, because you want them to understand why this is useful and why that isn't. You want them to get the difference. If nothing else, if you're doing good work, that makes your work look better. But it's also just your general responsibility to your audience to educate them and help with that sort of thing.
How to Address the Privacy Problem of Data AI generated chapter summary:
How to address privacy problems related to data? Either we go by law, we create new laws that try to address some of these problems. Or, as you said, educating people. Where do you draw the boundary there?
Enrico BertiniYeah, this actually makes me think about how to address these kinds of privacy problems related to data. And please correct me if I'm wrong, but I think I see two main routes: either we go by law, we create new laws that try to address some of these problems, or we go by, as you said, educating people. And I guess there is also an overlap there: there are things that can be addressed by both routes, some that can be addressed only by law, and some that can be addressed only by educating people. So I'm curious to hear from you a couple of things. One is, where do you draw the boundary there? What do you think should be addressed by new laws, and what should be addressed by education? And how do you actually go about educating people? Because I believe that there might be many cases out there, I mean, thinking about myself, when I work with data, where it is possible that I just do something wrong and don't even realize it. So my guess is that having strategies to educate people about the value of privacy, and how you can very easily screw up, or even, as Moritz said, how you react to some messages or news, is a very, very important component.
Eleanor SaittaI mean, I think what we're going to do is we're going to try all of the different strategies at once, and we're going to screw up a lot, and it's going to be pretty terrible for a while, and we're going to kind of muddle through it, because that's what humans do and that's definitely what big, complex societies do. And honestly, it is going to be terrible in places, but at the end of the day it's also going to be okay, because it's going to be what it's going to be. So for instance, if you're talking about a medical study and you want to talk about the significance of results, you should have a little bit of information, and it can be like two sentences, right? Two sentences and a link to Wikipedia to talk about: what does significance mean? It doesn't have to be, oh, now we're going to give you a half an hour class on how to calculate a p value. That's not the relevant level of education for a consumer. And in the same sense, maybe you do want to have: hey, we're ProPublica, just to pick on them because they're friends, we're ProPublica and we're going to have a really in-depth exposé, in conjunction with this other reporting piece that we're doing, on how to talk about significance in data. And so you can learn everything you ever wanted to know about how to calculate a p value, or how map visualizations can be used to manipulate data, or whatever the relevant subject is. So you may have some of those bits and pieces, and then you patch it together over time. You don't need to do it all at once. I mean, we're in the middle of a giant civilizational-scale educational project of trying to figure out how to get human beings to understand big complex systems. And we're mostly failing at it, and we're going to keep muddling around for a long time, and that's fine. And I think the same thing is true for data stuff as well.
Where to Start Learning Data Privacy AI generated chapter summary:
In the journalistic circles, these questions are not new. The question of privacy and the impact of your actions when publishing something is baked into the profession. There aren't magic bullets yet because we haven't figured out what's important in different domains. And I think it'll take a while.
Enrico BertiniYeah. So do you have any suggestions about where people can quickly learn, say, the fundamentals of data privacy? Right. So say I want to, is there anything out there like the ten commandments of data privacy or something along these lines? I would totally be happy to read it and spend, say, a couple of hours.
Moritz StefanerOnly five weird tricks, I'm afraid.
Enrico BertiniIs there anything like that out there?
Eleanor SaittaI'm not sure offhand. I know there's a bunch of data journalism centric conferences out there. NICAR, the National Institute for Computer-Assisted Reporting, has an annual conference, and I think some of those folks are starting to pull together some of these kinds of best practices. I know NICAR has a bootcamp on how to use some of this stuff. It's not necessarily entirely there yet, but it's getting closer. Honestly, like with so many things, start by reading Wikipedia on de-anonymization and re-identification, and chase references, and that kind of thing. Again, there aren't magic bullets yet, because we haven't figured out what's important in different domains and a lot of these things, and I think it'll take a while. But if you're doing this kind of work, yes, look at what the journalism and the data journalism world is doing, because they are the people who are pretty much on the front line of this. And then just follow the conversations, because it's going to be an ongoing conversation for a while.
Are You More Vulnerable Than You Think? AI generated chapter summary:
Eleanor: What are your tips for protecting yourself or being maybe less vulnerable? In the worst case, what do you say? People should use more encryption, have a couple of Personas in parallel, use less cloud services. It's definitely worth thinking about what the applications you use are exposing to the world.
Moritz StefanerYeah, and you're absolutely right: in journalistic circles, these questions are not new. If you look beyond the topic of data specifically, the question of privacy and the impact of your actions when publishing something is baked into the profession, basically. I have a last question; we have to wrap up soon, unfortunately. But the last question is coming more from the consumer side, or, yeah, everybody is a person who wants to protect their privacy. What are your tips, in this messy in-between time that you describe, for protecting yourself, or being maybe less vulnerable towards being exposed in the worst case? What do you say? Should people use more encryption, have a couple of personas in parallel, use fewer cloud services, or doesn't it make a difference anyway? What's the personal perspective here?
Eleanor SaittaI mean, as always, it depends what you're worried about, and what the threat model and the risk model that you're thinking of is. If you're someone who has an abusive ex, then, yeah, you may need to do a lot more work to maintain a low profile online, at least as far as your address is concerned, that kind of thing. On the other hand, in general, I guess the biggest thing that you can probably do is get a password safe, use different passwords for every site, and use strong passwords, as far as making sure that compromises in one place don't lead to compromises in other places. If it's relevant for you, credit monitoring may be something that you want to look at, and I hesitate there, because there are so many kind of scammy credit monitoring services; it's its own kind of morass. I think it's definitely worth thinking about what the applications you use are exposing to the world, and choosing whether you're okay with that. And I'm not going to say that you should make some specific choice or other; you get to make that call as far as what you're okay with. But I think looking at what those applications expose, and thinking about the kinds of applications you want to run, you know, there's a lot of stuff out there that's run by scammers or spammers or whatever. So develop some kind of literacy around that sort of thing, and just be a bit careful. That said, you are part of a civilizational moment where we are bad at this thing, and that means that you're going to be bad at this thing, too. And there's only so much that you can do unless you want to spend a lot of effort on it. So expect that things may go wrong, but you still have to exist within this larger world, and there's only so much that you can do to fix that in any given way.
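To make the "different, strong password for every site" advice concrete, here is a minimal sketch using Python's standard secrets module. It's roughly what a password safe does for you under the hood; the site names are placeholders, not real services.

```python
import secrets
import string

# Pool of characters for generated passwords
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 20) -> str:
    """Generate a cryptographically strong random password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# One independent password per site, so a breach at one service
# can't cascade into your other accounts (hypothetical site names):
for site in ("example-mail", "example-bank", "example-social"):
    print(site, generate_password())
```

The key design point is using the secrets module rather than random, since random is not suitable for security-sensitive values.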
Moritz StefanerYeah, yeah. My guess is in 2025 or something, we all declare identity bankruptcy and everybody can pick a new name and we start over. No, it's a fascinating topic, and I think it's really good to sometimes think about the other side of things. We often have guests and conversations that are very excited about data and do cool stuff with data, but I think it's also very important to think about all the potential hazards, and to spot when something's going wrong with data. And I hope this episode contributed a bit to sharpening your view on that. Thanks so much, Eleanor, for being on the show. Again, we could have talked another hour with you, but we have to wrap up soon. You should check out the blog post for the episode; there will be lots of materials, and maybe we'll put up a video of a talk or so, if you want to hear more from Eleanor. And we're super excited to see you going to Etsy. I think that's a great move, and we'll follow what you can put into practice there.
Eleanor SaittaYeah.
Moritz StefanerSo thanks so much.
Eleanor SaittaCool. Lovely to talk to you guys.
Moritz StefanerYeah, thank you.
Eleanor SaittaTalk to you later. Bye bye bye.
Data Stories AI generated chapter summary:
Hey, guys, thanks for listening to data stories again. We have a request if you can spend a couple of minutes rating us on iTunes. Here's also some information on the many ways you can get news directly from us. Don't hesitate to get in touch with us.
Enrico BertiniThank you. Bye bye. Hey, guys, thanks for listening to Data Stories again. Before you leave, we have a request: if you can spend a couple of minutes rating us on iTunes, that would be extremely helpful for the show.
Moritz StefanerAnd here's also some information on the many ways you can get news directly from us. We're, of course, on Twitter at twitter.com/datastories. We have a Facebook page at facebook.com/datastoriespodcast, all in one word. And we also have an email newsletter, so if you want to get news directly into your inbox and be notified whenever we publish an episode, you can go to our homepage, datastori.es, and look for the link that you find at the bottom, in the footer.
Enrico BertiniSo one last thing that we want to tell you is that we love to get in touch with our listeners, especially if you want to suggest a way to improve the show, or amazing people you want us to invite, or even projects you want us to talk about.
Moritz StefanerYeah, absolutely. So don't hesitate to get in touch with us. It's always a great thing for us. And that's all for now. See you next time, and thanks for listening to data stories.
Enrico BertiniData Stories is brought to you by Qlik, who allows you to explore the hidden relationships within your data that lead to meaningful insights. Let your instincts lead the way to create personalized visualizations and dynamic dashboards with Qlik Sense, which you can download for free at qlik.de/datastories.