Visualization and Statistics with Andrew Gelman and Jessica Hullman
This is a new episode of Data Stories. On this podcast, we talk about data visualization, analysis, and more generally, the role data plays in our lives. In this episode, Andrew Gelman and Jessica Hullman talk about the relationship between statistics and data visualization. If you enjoy the show, please consider supporting us.
Andrew GelmanPeople don't always understand the importance of style or the necessity of style, but it's not a luxury. It's more of an organizing principle.
Enrico BertiniHi, everyone. Welcome to a new episode of Data Stories. My name is Enrico Bertini, and I am a professor at Northeastern University in Boston, where I teach and do research in data visualization.
Moritz StefanerAnd I'm Moritz Stefaner. I'm an independent designer of data visualizations. And, in fact, I work as a self-employed truth and beauty operator out of my office here in the countryside in the beautiful north of Germany.
Enrico BertiniExactly. And on this podcast, we talk about data visualization, analysis, and more generally, the role data plays in our lives. And usually we do that together with a guest we invite on the show.
Moritz StefanerBut before we start, just a quick note. Our podcast is listener supported, so there are no ads. But that also means if you do enjoy the show, you might consider supporting us. You can do that with recurring payments on patreon.com/datastories, or you can also send us one-time donations via paypal.me/datastories.
The Future of Data Visualization AI generated chapter summary:
Andrew: I teach statistics and political science at Columbia University. Jessica: I'm an associate professor of computer science at Northwestern University. I do research on how people draw inferences from data, usually from interfaces.
Enrico BertiniExactly. Okay, so I think we can get started with the main topic and guests for the show today. So today we have two people on the show to talk about, I think, a really relevant topic. I think, in general, what is the relationship between statistics and data visualization? And to talk about that, we have Andrew Gelman and Jessica Hullman. Hi, Andrew and Jessica. Welcome to the show.
Moritz StefanerHello.
Andrew GelmanHello.
Enrico BertiniSo, as usual, we start by asking our guests to introduce themselves. So maybe, Andrew, you want to go first and give a brief introduction?
Andrew GelmanI teach statistics and political science at Columbia University.
Enrico BertiniJessica?
Jessica HullmanHi. Yeah, I'm an associate professor of computer science at Northwestern University. I do research on various topics related to how people draw inferences from data, usually from interfaces. So I care about things like visualization.
The Data Visualization blog AI generated chapter summary:
Andrew started the blog several years ago. It's called Statistical Modeling, Causal Inference, and Social Science. There are a lot of interesting discussions about statistics and data visualization. It is difficult to write for the blog.
Enrico BertiniOkay, so I thought we would start our conversation from the blog that I believe Andrew started several years ago. It's called Statistical Modeling, Causal Inference, and Social Science. I think it's a really influential blog, and I remember reading it for many years. And it's a really interesting community of people, and there's a lot of interesting discussions about statistics, but also political science and science in general, and also a lot about data visualization. And I believe Jessica joined the team recently, so we have seen even more data visualization conversations in that space. So, Andrew, I was thinking maybe you could give us a little bit of an overview of what the blog is about, and maybe even say how it started and what the main topics are there.
Andrew GelmanIn 2004, I was working with a postdoc, Samantha Cook, and we had an idea of setting up a blog and a wiki to help us communicate with each other. The idea of putting stuff on a blog was that then other people could see things too, and we could get input from other people. Then the wiki was supposed to be where we put our various ideas. The wiki got hacked and we had to take it down. But the blog was useful. I learned, though, what was hard. It's hard for most people to write stuff on a blog. Sometimes I would suggest that students or postdocs write a post and they would find it too difficult a task. They would find it too much pressure. So it ends up mostly being me. But then various other people, maybe about 15 other people, including Jessica, have had stuff to say. So I've asked them if they could write for it too. So it's a way of having conversations.
Jessica HullmanIt is difficult to write for your blog. I was just going to say, Andrew, maybe it's hard for you to understand, but I think, I mean, you've established quite a record with it, and I think a lot of people tune in to hear what you're gonna say. So it is. I understand why other people would be like, oh my God, it's too much pressure. 'Cause I had to just get over it and be like, I don't care if they're gonna compare me to his posts, then they're not gonna be as good. But it is tricky.
Andrew GelmanYour posts are definitely better than the average post of mine.
Enrico BertiniSo, yeah, I think what is really interesting there is that every time I look at the post, there's such an interesting series of comments below. You seem to have a very active community around it, and it's always very thoughtful and yeah, and I don't know if you need any special moderation, but also, normally the comments are pretty interesting and nothing too bad. There's no crazy people writing crazy stuff as far as I can tell.
Andrew GelmanYeah, it's not so interesting for the crazy people, I guess.
Enrico BertiniYeah.
The role of style in data visualization AI generated chapter summary:
Everybody needs to have their own theory of statistical graphics, in the same way as if you're writing, you need to have a theory of writing. The more explicit you can make that model or style or system, the more you can then know when to improve it or abandon it.
Andrew GelmanBut I wanted to pick up on something that you said earlier, not about the blog, but about statistical graphics. Because I'm a user of graphs ever since I was a physics student, and graphed data and graphed curves and models and so forth. And I think over the years, it struck me that everybody needs to have their own theory of statistical graphics. Like in the same way as if you're writing, you need to have a theory of writing. Or if you're drawing, you can't just say, I'm going to draw what I see. You have to have a kind of approach, its goals. Or if you're making music, you can't just say, hey, a bunch of us, let's form a musical group and do music. Right? You have to have music that you want to do. You have a certain style. It doesn't mean your style is better than everybody else's, but you need that. And I think that for quantitative things, people don't always understand the importance of style or the necessity of style. Like, style is seen as a kind of luxury, but it's not a luxury, it's more of an organizing principle. And that's also clear with writing, that if you're writing, except for perhaps the most functional writing, like the instructions for how to operate your microwave oven or something like that, you need to have a style, because if something is boring to read, then no one will read it. If you're a teacher and you teach a class and write a two-page document for the students to read, they won't read the two-page document. And so it's really needed. And I think the flip side of that is that, on the other hand, you have people saying, oh, someone's a designer, as if there can't be any connection to science, because design is supposed to be some separate thing. But there one can draw the analogy to something like building bridges: these things are designed, but they still have to work.
Enrico BertiniYeah, I really like the way you are describing this, because in a way it reminds me of something that I was hoping to discuss: the role of theory. Right. A theory doesn't have to be perfect or always super predictive, but it's a way to organize your thoughts so that you can think systematically about something. So I don't know if this rings any bells on your side, but the way you describe style reminds me how important it is to have a theory, and also a theory of graphics in this specific instance.
Andrew GelmanWell, it's like they say in chess, having a plan won't win you the game, because presumably you're playing against someone else with a plan, too, and you're not both going to win. But if you don't have a plan, then you'll lose. You won't be able to move forward. And part of having a plan is recognizing, being aware of that you have a plan, being aware of what the plan is. And then when things go wrong, you can change things. So actually, it's like putting your, as a scientist, like putting your marker down. I'm not really into, like, betting. It's not like I would say you literally have to bet money on things, but, like, conceptually setting it down and saying, this is the model I'm going with is very valuable, even though, or I should say especially though we know that that model is going to be wrong and it's going to fall apart at some point. But the more explicit you can have that model or style or system, the more that you can then know when to work on improving it or abandoning it.
Jessica HullmanYeah, we actually wrote a paper about some of this as it applies to sort of theories of visualization for exploratory analysis, where I think that's a place in visualization where we're building all these tools to help people do interactive data analysis through visualization. But we don't always, like, if you ask us, the people developing these tools or the researchers in the area, like, what our sort of guiding principles are, I think it's often what we just want to let people explore data as easily as possible. But I think you can easily run into places where you just, like, you have no theory to tell you how to design something in sort of a better way versus a worse way. Like, you just don't know. And so we wrote a paper for the Harvard Data Science Review, like, about a year or so ago, where we sort of talked about this as applies to exploratory visual analysis. And we argue that even if it's a bad theory, like Enrico was saying, or both of you guys were saying, that it still can be useful because you need to know sort of how you were wrong. And if you never state what you're going for, what you think the objective is, how do you know when you were wrong?
Enrico BertiniYeah, that's an area of data visualization that I really love, and I think people tend to talk less about it. I think there is more of a, I think the general idea with visualization is that it's a tool for communication, but there is less, I would say there is less discussions about how to use it for exploration. And by the way, the word exploration itself is so contentious in a way.
Moritz StefanerIt's funny, I thought it's the other way around that everybody talks about exploratory.
Jessica HullmanYeah, I was kind of thinking in academia, right?
Moritz StefanerIn academia.
Enrico BertiniIn academia. Yeah, yeah, yeah, yeah, yeah. Absolutely.
Moritz StefanerI mean, and nobody, like, really gets what communication means.
Andrew GelmanWell, I think that one problem is communication is often viewed as being a kind of unidirectional thing. So, like, they'll say things like, scientists should learn how to tell stories because people think in terms of stories. And I hate that kind of attitude. I mean, sure, people think in terms of stories, but this attitude that, oh, you're the scientist, you already know the answers, but now you have to convey it to people. So you have to learn how to be a storyteller and have a good bedside manner. Right? Like, it's all connected with, oh, you don't want to be a jerk, right? It's like narrative medicine, all this stuff. And what people are actually doing there is great, but the framing that it's all about how to communicate truths to people, I think, is misleading. I think it's more accurate to say that we're people, too, and we learn from stories. So this is something that my colleagues and I have been thinking a lot about over the years: what makes us believe things. And often we're convinced by stories. And I characterize the effect of stories as being anomalous and immutable. By anomalous, I mean a story is a surprise. So the convincing stories have some twist in them, something unexpected. Even if you think about a scientific method: I didn't think this method would work, and then it did, or whatever. And then it's immutable in the sense that a good story is grounded in reality. Like, if you kick it, your foot hurts, right? As in the famous Boswell story. And so we learn from these stories, which are a reality. I mean, maybe the term story isn't the best in that sense, because I'm not talking about stories that are made up, I'm talking about true stories. But we learn from these. And it's kind of necessary that the stories have this grounding in truth so that they can disprove our theories. And there's a sense in which a good story, like, if you were to take all of the things that people have said in your podcast, right? So maybe you've done however many podcasts, and each podcast, on average, maybe there are five stories, like when someone tells you, hey, let me tell you a story, blah, blah, blah. And if you look at those stories, some of them are going to be just made up, right? I mean, we have these horrible examples of stuff where people just say, like, there is a famous example from a few years ago in a book where somebody said that a certain data problem caused 70 deaths a year in a small town. And it was like, how the hell, it made no sense, right? So they just made it up. But the stories that are good, if you could in theory track them down, I think they would have this characteristic that they disprove an implicit model of the world.
Jessica HullmanRight.
Moritz StefanerDoes it have to do with news newsworthiness as well? Like in journalism you have certain newsworthiness criteria.
Andrew GelmanExactly. Dog bites man versus man bites dog. But here's the point: when something is surprising, it's surprising relative to an expectation. So there's a model of the world there. So when we talk about discovery and surprise, there are theories implicitly there already.
Exploratory Visual Analysis and the Model Check AI generated chapter summary:
A good graph is helping us check a model, some implicit model, often, sometimes an explicit model. There's always some expectation and the graph tells us how much the data deviate from that. You can think of a visualization as giving you almost like a test statistic in a hypothesis testing type framework.
Jessica HullmanYes, this is, yeah, I was going to tie this back to the exploratory analysis thing as well. Like I think one of, I mean, I think we talk about exploratory visual analysis a lot in vis, probably more than communication. But, you know, it's, there's always this role of expectations and what you're bringing, like what you're expecting to see both in communication and exploratory visualization analysis. And I think at least in visualization research, that's something that we've always sort of backgrounded. We've always acted as though the data speaks for itself when it obviously does not. Yeah, so I think, yeah, that's all. Like theories of how visualizations act as a model check are one thing that Andrew and I have, like in the same paper I mentioned already, he had been thinking about this years ago as a way to sort of think about the role of graphics in exploratory data analysis and how in a sense you can tie it to confirmatory data analysis through this idea that a good graph is helping us check a model, some implicit model, often, sometimes an explicit model, like in confirmatory data analysis, but there's always some expectation and the graph tells us how much the data deviate from that.
Enrico BertiniSo when you say model check here, do you mean like checking the model that you have stored in your head like a mental model?
Jessica HullmanYeah, I mean, I think, yeah. So in exploratory visual analysis, I think there's always some background assumption potentially, and a lot of times, and this is stuff that Andrew had spoken about back in 2003 in a paper, so you can cut me off whenever, Andrew, and tell it yourself, but basically you can think of a visualization as giving you almost like a test statistic or a vector of test statistics in a hypothesis testing type framework, if you want to look at it that way, where you have something that you're trying to check for. You make a graph in order to check: how well does my data conform to some expectation? And sometimes that's really explicit and built into the visualization. In a model fitting or preliminary model fitting stage of a workflow, you're looking at things like maybe residuals, and you know exactly how to read the chart because it's built into it: if your data deviates from the expectations that you want your residuals to fulfill, then it'll be obvious because you'll see deviations from symmetry in the plot. Or even a scatter plot: often, if you have a bivariate scatter plot, the most common sort of built-in thing that you're checking against is a straight diagonal line representing perfect linear association. So this idea that Andrew started talking about back in 2003, and other people in statistical graphics have also gotten into, like Andreas Buja, Dianne Cook and others, there's this work in sort of graphical statistical inference that gets into this idea of how visualizations can function sort of as model checks. But then within that, people have gone in different directions, where Andrew's original formulation, I believe, was sort of more in a Bayesian direction, where it's not that we're testing some hypothesis and we just want sort of our p value or kind of yes, no answer. It's that the comparison you're doing mentally when you look at a graph is kind of akin to doing a posterior predictive check in Bayesian stats, where you're sort of imagining: under my expectations about the process that created my data, what do I expect the data to look like? So what would reasonable data look like under the predictions I want to make? And how much does the data that I actually got compare to what I would expect? So it's almost like, you could imagine on some level, maybe this doesn't happen all the time, but it's almost like when you look at graphs, you're sort of imagining reasonable data under some set of expectations you have, and you're comparing that to what you see. Andrew, I don't know if you want to say anything there. I think that was just sort of the idea I wanted to bring up.
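To make the posterior predictive framing concrete, here is a minimal Python sketch (not from the episode; the normal model, the flat prior, and every number in it are illustrative assumptions). It simulates replicated datasets under a simple fitted model and compares a test statistic of the observed data against the replications, the numerical analogue of the visual comparison described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "observed" data; assume we model it as Normal(mu, 1).
y = rng.normal(loc=2.0, scale=1.0, size=50)
n = y.size

# With a flat prior and known sigma = 1, the posterior for mu is
# Normal(mean(y), 1/sqrt(n)) -- a deliberate simplification for this sketch.
mu_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(n), size=1000)

# Posterior predictive replications: one fake dataset per posterior draw.
y_rep = rng.normal(loc=mu_draws[:, None], scale=1.0, size=(1000, n))

# Compare a test statistic between the data and the replications.
t_obs = y.max()
t_rep = y_rep.max(axis=1)
print(f"T(y) = {t_obs:.2f}, Pr(T(y_rep) >= T(y)) = {(t_rep >= t_obs).mean():.2f}")
```

Plotting y next to a few rows of y_rep would be the graphical version of the same check: imagined reasonable data under the model, compared with what you actually saw.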
The paradox of stories in science AI generated chapter summary:
We consume stories in order to refute various models of the world. But the paradox is: how can we learn from surprising things? It's related to a Popperian or Lakatosian view of science, which is that we alternate between normal science and scientific revolutions.
Andrew GelmanWell, yeah, I have more to say about that, but actually let me jump to something else, which is a paradox: we learn from stories, and the best stories have surprises in them. I would almost argue all stories have surprises, in the sense that if there's no surprise, you don't bother telling the story. So just as we use graphs to learn and to discover, which means to be surprised relative to our implicit models, we consume stories in order to refute various models of the world. But yet the paradox is: how can we learn from surprising things? It seems the usual way we think about statistics is that we learn from the expected, like random samples, right? If I'm going to do a survey, I don't say, I found the 1000 weirdest people in America and asked their opinions about things, and I really wanted to be surprised, and you'll never guess what I found. No, what you'll do is ask a representative sample of people, and you don't really have the goal of being surprised. So this was sort of bothering me, actually, because the paper that my colleague and I wrote about how we learn from stories I directly connected to the papers that Jessica mentioned, where earlier I had written that graphics are a form of model check. So then I argued stories are a form of model check. But then again, people ask the question: how can it be that you learn from anomalies? And my resolution of this is that it's related to a kind of Popperian or Lakatosian view of science, or Kuhnian view or whatever, I guess the standard view now of science, which is that we alternate between normal science and scientific revolutions. And when we're doing normal science, we want kind of representative samples and we want to build theories and modify our theories. Then when we're doing revolutions, we're trying to see what's wrong with our theories, and there we're looking for counterexamples. Now, I'll only say one more thing, which is that when you say this, it always sounds like revolution is the hero and normal science is like the loser in this game. But that's not true, because the revolution only exists because there was the normal science that allowed it. And the goal of the revolution is to replace it with a new normal science. So both of these steps are important.
Anomaly and the Scientific Process AI generated chapter summary:
How open are people really to changing their minds based on statistical information? Should you have to go out explicitly searching for anomalies to break a theory? Or could we, just by seeing the data that we collect, find anomalies?
Moritz StefanerI have a question here. How open are people really to changing their minds based on statistical information? I think that has maybe been an interesting research topic over the last few years.
Jessica HullmanI was thinking, too, as Andrew was talking about anomalies, should you have to go out explicitly searching for anomalies to break a theory? Or if we were all honest scientists, could we kind of, through our own, like, just seeing the data that we collect, find anomalies? Like, if we're willing enough to sort of admit when our mindset is not right, then we should be seeing probably anomalies a lot.
Andrew GelmanWell, this is kind of related to this, like, unitary nature of consciousness thing, or even the idea that we talked about earlier of having a plan. It seems pretty fundamental, like, to mathematics, the way cognition works in general, not just like human brains, that it seems like there needs to be this executive function and this alternation of processes. So, like the same, you need a division of labor somehow. So maybe one scientist could create and refute her own theories and gather data, but maybe not all at the same time. I mean, another example is, in math class, way back when, when you're asked, sometimes you're asked to either prove something or come up with a counter example, and they always say you can't do both at the same time. You have to first assume it's true and try to prove it, and then if you can, stop and assume it's false and. And try to do that. So we do kind of use the multiple humans in the system to play different roles.
Visualization: A skill and a practice AI generated chapter summary:
Visualization is really a skill and a practice, and there's no single right way; it's a highly personal thing how you do it. The deeper you go, the less clear it is how things should be.
Jessica HullmanYeah.
Moritz StefanerI have another question, looping back to the beginning. So I think you rightly explained that visualization is really a skill and a practice, and there's no single right way; it's a highly personal thing how you do it, and there could be many ways to do it right. Is it the same for statistical practice, for applying statistics? Or is it, in statistics, more that for a given problem there is a correct solution?
Andrew GelmanOf course, there are many ways of solving problems. I actually wrote something once about what I called the methodological attribution error, which is people attributing to their method what's also a property of their skill. And so you'll see this with renowned statisticians, or maybe not so renowned statisticians also, that they just think some method is inherently better. But, yeah, there's always so many unwritten.
Moritz StefanerRules, and they might just be better at applying it or failing to apply another technique successfully, which somebody else might have.
Andrew GelmanYeah, I'm better at some techniques than others. That's how it works. So there's an interaction.
Moritz StefanerYeah, yeah. It's interesting because from the outside, so I have only statistics 101 knowledge, right, and to me it always seemed there's this clear decision tree of: if your data is shaped like that, then you need to apply an ANOVA test or something like this. Right. And we have the same for graphs: if you want to spot outliers, use a scatterplot. Right. And so I was always wondering how hard-cut these rules are. And I'm glad to hear that it seems very similar, actually, that the deeper you go, the less clear it is how things should be.
Andrew GelmanI was influenced by a colleague, David Krantz, a psychologist, who was telling me about decision theory. And he said the simplest version of decision theory is what you learn, like von Neumann and Morgenstern: you have a decision tree that you need to evaluate, so you compute all the things that go into it and you compute the tree. And then the next level of sophistication is to say, no, actually drawing the tree is important. And there's lots of psychological experiments where they show people the tree and it's missing a branch and, like, people don't realize. So, like, a lot of examples where the best decision is something that wasn't in the tree in the first place and no one tried it. But then he said that's also not enough. And so his take on decision analysis is that you start with goals. So you have goals and resources and stakeholders, and all of that sounds kind of soft, but it's not really softer than trees. So you basically start with design thinking again. Well, yeah, you start with your goals and then all the other things, your constraints and your resources, and then you consider ways of getting there while being open to the fact that your goals might change and so forth. So, yeah, statements like, if your data look like this, you should use this model, that's horrible, because you really want to be starting with your goals, and not in an empty way, like, oh, yeah, my goal is to publish a paper, my goal is to get this data analysis done. Like, you know, your serious goals, whatever they are.
Moritz StefanerMy goal is to not get shouted at on the Internet foremost.
Andrew on Talking to Jessica AI generated chapter summary:
Andrew: Thanks for the opportunity for talking with you all; this is always fun. See you all later. Okay, and now we can continue the rest of the episode with Jessica.
Andrew GelmanYeah, well, I have to go now, so I'll see you all later. But thanks for the opportunity for talking with you all; this is always fun.
Enrico BertiniThanks so much.
Moritz StefanerWonderful. Thanks for joining us.
Enrico BertiniThanks, Andrew.
Andrew GelmanSee you all later. Thanks again. Bye.
Comparison of Statistics and Visualization AI generated chapter summary:
In visualization, we sort of take kind of like the very initial stages of exploratory analysis. The whole idea is that I want to use stats to help me figure out what signals are actually there. When is visualization sufficient without any follow up?
Enrico BertiniOkay, and now we can continue the rest of the episode with Jessica. So one question that I had, going back to, let's say, the comparison between statistics and visualization: in my head, I'm always like, I can't say that one is better than the other. Right? I don't know. So, for instance, when we teach visualization, we show the Anscombe quartet. I think we have a huge bias there, because it's almost like, take that, statistics, right? This is so much better.
Moritz StefanerThis is how I use it at least.
Enrico BertiniRight, right. Everybody uses it that way. It's like, here we go, that's why we need this. But no, because I think there's almost like a dance between having a lot of details, so that we can maybe, related to what Andrew was saying, see the surprising elements. But you can't reason only with surprise. Right. And surprise can also overwhelm you, and you may lose the signal as you look at a lot of noise. Right. So I really see that as a dance between these two: the surprising, the particular, but also surfacing the signal. So, yeah, how do you think about that?
Jessica HullmanI mean, it reminds me of, you know, Tukey and others who have written about the exploratory data analysis process, where, I mean, my personal view is that in visualization we sort of take kind of like the very initial stages. Well, the very initial stages of exploratory analysis are often just making sure there are no massive chunks of missing data, figuring out what variables you have to begin with. But I would say we take this next stage of: I'm just trying to see where is there some signal, or what looks like a pattern, what kind of relationships do I think I see? We in visualization think of that as exploratory analysis. It's very open ended, clicking around to find patterns. And I think we design kind of as though that is what exploratory visual analysis is all about. Tukey, for instance, though, talked about this sort of intermediate phase where you've sort of generated some hypotheses about possible relationships between variables or the nature of certain distributions, and then you sort of need to know: how much can I believe what I think I see here? And so, you know, exploratory analysis also involves things like starting to fit models to try to explain. Like, if I think that this set of variables seems to be predictive of some other variable that I care about, I would actually start fitting models and looking at deviation in things like residuals, seeing how well do my models actually explain what I'm seeing? And the whole idea is that I want to build up some sort of mental model of the data generating process. I want to use stats to help me figure out what I can believe in terms of what signals are actually there and what is just maybe not going to hold up when I inspect it more closely. I think there's this weird way in which graphs are used not just to show us things that maybe we didn't expect to see or we did expect to see, but also to give us some information about how much we should believe those things. And I think that's where it's sort of ambiguous. So actually, from Andrew's blog, I learned about this sort of informal term someone used called the anthropic principle of statistics, which is like: there are certain problems where you would use statistics. If the signal is so huge relative to the noise, you don't really need to be running stats on that, you can just sort of see it, and maybe you just make a graph and it's obvious. If the noise is very large relative to the signal, then it's sort of hopeless, and you could do stats, but you're still kind of dealing with too much noise. And so statistics is useful for this sort of middle set of problems. And so one of the questions of the last year or two for me is, you know, we have the problems in the middle where we can maybe use visualization separately from stats. What do we think is a problem where visualization is simply not going to work? When is visualization sufficient without any follow up? And I don't think there are true or right answers. These are things we have to sort of figure out as a field. But we haven't always made explicit sort of what our assumptions are.
I think we often maybe implicitly, when I look at what people write about exploratory analysis and designing for exploratory visual analysis, I think there's this assumption that people can click around for patterns and maybe they care enough about finding the right answers. There's some application or some reason why they're analyzing the data. We trust that they will look at things enough and make enough views, and some of the views will be disaggregated enough that they can get a sense of the noise. And so we don't have to worry about explicitly supporting these signal to noise kind of judgments. We can just let people use these tools, and they will figure out what they can trust and what they cannot trust. And probably they'll follow it up, if it's really important, with some further data collection, and then they'll officially test any hypotheses that are really important. I think we just sort of assume we don't even talk about a lot of it. But I think I get the impression that that's kind of what we imagine. And I think there's really interesting questions. I think as someone who studied uncertainty for a long time and for a while, I was like, we have to be visualizing uncertainty way more than we are. I think some of the things I've seen in my research, just with how robust these tendencies people have to just want to see things summarized or to just want to rely on statistical summaries over sort of raw data are. I mean, they're really kind of compelling in the sense that I think there's a lot of cases where people sort of looking at aggregated data can actually work. It just kills me that we don't have any sort of good formal way of describing why that is. So I think I've sort of. One of the reasons I feel like I'm being pushed more towards theory in the sense of trying to set up almost mathematical frameworks to understand some of these things, is that I want to understand why? Or how do you explain that if you have someone clicking around in a biz system, trying to find patterns that ultimately they are doing okay, ultimately they find the correct ranking of patterns, or whatever it is, for the task. So I think, yeah, I'm really curious just to use theoretical frameworks to explain things that I don't understand. Why does this work out? I think, for instance, maybe there's certain ways in which a visual analysis process is redundant, where you're looking at the same data in multiple ways. And so if people kind of under update their beliefs, often when they see a data sample, if you're looking at the same data sample multiple times, maybe sort of over time, you're kind of like internalizing it. I think there's all sorts of ways in which behavioral econ can sort of help us, as well as theories of statistical learning. So, yeah, I think it's like, I don't know that that's where mainstream vis is ever going to go. But for me, it's sort of these questions about exactly when is visualization sufficient open this whole can of worms that just makes me think, like, okay, we have to sit down and really try to figure out, can we explain to ourselves how this paradigm works?
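The "anthropic principle of statistics" mentioned above can be illustrated with a tiny simulation; this sketch is purely illustrative (the two-group setup, effect sizes, and noise levels are assumptions, not anything discussed in the episode):

```python
import numpy as np

rng = np.random.default_rng(1)

def two_group_estimate(effect, noise_sd, n=30):
    """Simulate a two-group comparison; return the estimated difference and its std. error."""
    a = rng.normal(0.0, noise_sd, n)
    b = rng.normal(effect, noise_sd, n)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    return diff, se

# Three regimes: obvious from a plot alone, helped by stats, hopeless either way.
for label, effect, sd in [("huge signal", 5.0, 1.0),
                          ("middle ground", 1.0, 2.0),
                          ("mostly noise", 0.1, 5.0)]:
    diff, se = two_group_estimate(effect, sd)
    print(f"{label:>13}: estimated difference {diff:5.2f} +/- {se:4.2f}")
```

In the first regime a plot alone settles it, in the second the standard error is what tells you whether to trust the apparent difference, and in the third neither a plot nor the statistics can rescue the comparison.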
Enrico BertiniYou touched upon so many interesting points. I don't know where to go next.
Unveiling the Hidden Messages of Data Visualization AI generated chapter summary:
visualization is often this first step towards letting us take what we know and try to apply it. The current tools don't seem to really include any special functions that help people reason more about their prior knowledge. There is a huge, there's an interesting space there where something could be done.
Moritz StefanerI have a question. I'm sorry, I'm still stuck with that Anscombe quartet thing. So for those listeners who don't know what it is: it's sort of a toy example to demonstrate why visualization is cool. The idea is you have four artificial data sets that all have the same summary statistics, same mean, same standard deviation. Like, the broad summary statistics are identical, but when you plot them, you see four very different shapes. Right. And so I was wondering, is there an inverse Anscombe quartet, where we would have, like, four super similar plots, but the statistics tell us a hidden message or something? Are you aware of anything like that?
Jessica HullmanI can't think of anything. Yeah. That is a strange question.
Moritz StefanerOne thing might be, like, sometimes fat-tail distributions are really hard to see, but easy to…
Jessica HullmanYeah, I mean, like, yeah, that's a good example, I think, like, the behavior at the tails can really impact, like, you know, how you model data and stuff. Like it actually matters.
Moritz StefanerNever see it in a graph because.
Jessica HullmanYou might not notice it tiny and.
Moritz StefanerVery stretched out, but it still makes a difference, you know, stuff like that. So. So we might have biases towards really what is plotted well. Right in our analysis, probably, yeah.
Jessica HullmanInteresting. Yeah. I think of Anscombe's quartet, I guess, as being about this multiplicity of statistics, you know, like, that's why we visualize data: you can have the same statistical summary, and the data looks very different. And I think it's kind of interesting that recently this seems to come up more in machine learning. Like, you can have multiple fitted models that seem to do equally well on your test set or in your, like, sort of IID setting. But then when you probe them along ways that matter to humans, like, how do they deal with gender, et cetera, they can give you very different answers. So I think, yeah, I mean, visualization in the sense of just trying to put the data in some form where you can bring your prior knowledge to bear, I think that's maybe Anscombe's quartet. We don't really think of it that way. It's like, oh, the answer is right there, they're all different. But I think visualization is often this first step towards letting us take what we know and try to apply it. I think we like to leave that kind of implicit: people will just bring in their knowledge and they'll know what to do next.
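For listeners who want to see the Anscombe point in code, here is a short sketch; it assumes seaborn (which bundles the quartet and downloads it on first use) and matplotlib, and simply contrasts the shared summary statistics with the very different plots:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet: four datasets with near-identical summary statistics
# but very different shapes once plotted.
df = sns.load_dataset("anscombe")  # columns: dataset, x, y

summary = df.groupby("dataset").agg(
    x_mean=("x", "mean"), x_var=("x", "var"),
    y_mean=("y", "mean"), y_var=("y", "var"),
)
summary["corr_xy"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
print(summary.round(2))  # the four rows are almost indistinguishable

# The differences only appear once you plot them.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=2.5)
plt.show()
```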
Moritz StefanerYeah. And you want to get from a lot of anecdotes to a theory or a model ultimately right.
Jessica HullmanIn either case, a lot of it is telling yourself stories, trying to explain things to yourself. So, yeah, I think we could give people better tools, though, as they're telling themselves stories, to make sure that their stories are kind of accurate. So, like, uncertainty visualization is sort of in that line, you know, like, let's…
Moritz StefanerShow it to you, maybe even record the stories, you know, all that stuff like the construction process, like the sense making, you know.
Enrico BertiniAnd it's interesting that the current tools don't seem to really include any special, I don't know, functions that help people reason more about their prior knowledge or their beliefs, or even about building models or externalizing their knowledge. I think there's an interesting space there where something could be done. And I think you, Jessica, did some work in that space where you ask people to explicitly first build a model or externalize just a belief, and then…
Moritz StefanerOr the, you draw at first, like, what do you think the statistics look like? Right.
Jessica HullmanYeah. I like that stuff. I mean, I think the open-ended sort of you-draw-it-first thing is interesting. The other stuff, you know, we did, like eliciting priors, where we would have some Bayesian model, and then we wanted to see how well this Bayesian model of cognition explains what we actually see in terms of how people update their beliefs. And I think that's a good example of where having sort of a theoretical framework, even if it's wrong, like even if people deviate, because people do deviate from the rational Bayesian update in various ways, you can still learn a lot about how they're off. But, yeah, I think it's not that we don't design ways to incorporate prior knowledge. It's just that they're all extremely implicit. And even Tableau, I didn't know this for a while, but there's a whole analytics pane in Tableau where you can add regression lines and you can see intervals of various types, but it's very sort of rigid and constrained; you get some number of choices. And I think it's really hard, though, like you want people to sort of. And actually my former student, now faculty, Alex Kale and I have been working on something related to some of the ideas in the paper with Andrew, where it's like: what would this new generation of visualization tools look like where you could come in and not be a seasoned statistician, but still use the tool to work up towards these preliminary statistical models that help you understand how much does this variable explain this other one, et cetera. So I think there's a really interesting space of how you give them this scaffolding in visual analysis tools, so that rather than their prior knowledge just driving them to click around in all different ways and affecting what graphs they draw in this very implicit way, the tool gives them back something. The tool suggests, in our case, if they're looking at a certain combination of variables: do you want to try fitting a model to see how well you could predict this dependent variable based on the variables that you seem to think are important? I think there's a very big space, but the whole thing with eliciting people's beliefs and stuff also gets tricky. Ultimately, doing data analysis is hard already and cognitively overwhelming, so you can't be asking people a bunch of questions. And so I don't know. Yeah, it's an interesting, interesting space.
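As a concrete, entirely hypothetical version of the prior-elicitation comparison described above: encode an elicited prior as a Beta distribution, compute the normative Bayesian posterior after seeing the data, and compare a person's reported belief against that benchmark. All numbers below are made up for illustration and this is not the setup from Jessica's studies:

```python
from scipy import stats

# Hypothetical elicited prior over a proportion, encoded as Beta(3, 7):
# the person thinks roughly 30% with moderate confidence.
prior_a, prior_b = 3, 7

# Data shown in the chart: 14 successes out of 20 trials.
successes, trials = 14, 20

# Normative Bayesian update for a Beta prior with binomial data.
posterior = stats.beta(prior_a + successes, prior_b + (trials - successes))
print(f"Normative posterior mean: {posterior.mean():.2f}")

# The belief the person reports after seeing the chart (hypothetical number),
# compared against the normative benchmark to quantify under-updating.
reported_mean = 0.45
print(f"Reported belief: {reported_mean:.2f} "
      f"(off by {reported_mean - posterior.mean():+.2f} from the benchmark)")
```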
Enrico BertiniYou're making me think now about the idea, going back to exploratory data analysis, that maybe there's an excessive focus on this idea that with visualization you can just explore data for the sake of it. Right. So it seems to me that maybe that led to designing these tools in a way that a person opening the application for the first time can pretty much do anything with it, and if they don't do something, there's basically no guidance. Right. It's completely open. But maybe there's space for something that is more guided. In general, I think it's an underexplored modality, where the system might actually guide you through a number of steps without being excessively rigid. Right.
Jessica HullmanYeah. We've been building something, and hopefully we'll have the paper out soon, that's sort of an exploratory tool. It's kind of like a version of Tableau, but with built-in model checking. And I think, yeah, it is trying to give you tools that are a little more guided. Like, it's not telling you exactly what to do when. But one thing we see is that you do sort of have to be careful, because for certain people, you know, like, if you come in having sort of a statistical workflow that you typically use, you might know, like: I don't want to jump into model building right away, I need to just look at things, and then I'll get to the model check stuff later. And we've seen some people use our tool in that way. They just create a bunch of graphs, and then they sort of call up the modeling part of the interface. But then we also see people, the ones who aren't as experienced with modeling, who just want to jump right away into building models. And I don't think that's good either. So it's, I guess, a user experience design type question: how you gradually introduce things is going to matter in the end.
Moritz StefanerIf you just randomly apply models that somehow fit, and you have no theory of the domain, right, and no idea of causality, it's always going to be a bit nonsensical, and so…
Model Checks & Visualization AI generated chapter summary:
We talk about how to test visualization tools with non experts. How do you do model checks in a way that's sensitive to all the different ways things can be off? There's almost a type of visual literacy or data literacy that has to come along.
Jessica HullmanRight. That is actually, that's something we see with this tool we built as well, like that some people, it's almost like if you come from sort of like an ML kind of, you know, background, you're, you just want to like be trying out, like just swapping in variables in some statistical model until you can like, you know, get the best predictive accuracy. And in a visualization context, you're trying to make sure that the predictions from the model that are plotted against the data best match the data. But that is not, I mean, it's sort of contrary to this idea that when we do exploratory data analysis, we're trying to really understand and maybe test our expectations about how the data were generated, which means we often do have in mind we don't just care about any variables. We need to be able to have some plausible explanation for why that variable might matter. So, yeah, it's, yeah, and you need.
Moritz StefanerTo have a lot of knowledge about what's a plausible range. Can this value even be below zero? Right. So we had this case with the COVID excess mortality in Sweden and in Germany, where there were just different spline fitting techniques, and some of them were just better, but you couldn't explain that mathematically; it was more that you know a lot about the domain. And so I think that's also when it gets interesting, when again, maybe it's a matter of being skilled at finding the right model by applying statistics and visualization, but also knowing what to look for and what has worked in the past, and, you know, all that practical stuff.
Jessica HullmanTraining people on what to look for is another thing. Yeah. Like with trying to build model checking abilities into a visual analysis tool, Alex Kale and I worked on some of that. One thing we ran into is that people don't know: if we're trying to do this for people who don't have a whole bunch of stats training and just have some exposure to linear models, maybe we've got to teach them what different types of misfit look like. Because looking at predictions against observed data to check your model in a graphic is this very multidimensional thing. It's not like there's just one way that the observed data can deviate from the model predictions; there are many different things you can look for. How does the model do at the tails of the distribution? Is it biased overall? There's almost a type of visual literacy or data literacy, I think, that has to come along with tools that build more of this stuff in, where you're helping people understand how to do model checks well, in a way that's sensitive to all the different ways things can be off.
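A small sketch of what "different types of misfit" can look like in practice, using synthetic data and a plain least-squares line from NumPy (an illustration only, not the tool discussed above): the straight-line fit is unbiased on average yet misses systematically at the tails of the curved data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with mild curvature, fit with a straight line anyway.
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + 0.08 * (x - 5) ** 2 + rng.normal(0, 0.5, x.size)
slope, intercept = np.polyfit(x, y, 1)      # least-squares line
resid = y - (intercept + slope * x)

# Two of the many "directions" of misfit: overall bias, and behavior
# at the tails versus the middle of the predictor's range.
tails = (x < 2) | (x > 8)
print(f"overall bias:   mean residual = {resid.mean():+.3f}")
print(f"at the tails:   mean residual = {resid[tails].mean():+.3f}")
print(f"in the middle:  mean residual = {resid[~tails].mean():+.3f}")
```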
Enrico BertiniYeah. This reminds me of a growing concern that I have. Over the years, I've become more and more concerned with the idea that we test visualization tools with non-experts. I think there's a huge, huge limitation there. And I don't know, I think if you run an experiment that is based on very low level perception, maybe it's fine; pretty much any person is equivalent to another. Right. But as soon as you test something that requires some domain knowledge in order to understand the data, then it's like, I had experiments where we had both novices and actual practitioners who are familiar with the data. Just night and day. They're not even comparable. It's completely different, kind of.
Jessica HullmanNo, I totally agree. Yeah. It's convenient samples usually. I mean, invis for sure.
Enrico BertiniYeah. Okay. Whole different topic.
Jessica HullmanYes.
Moritz StefanerCool. That was quite.
Enrico BertiniWow.
Moritz StefanerYeah, great episode. I like it.
Enrico BertiniSo many interesting things, each of these topics. We could go on for hours. Right.
Moritz StefanerIt's very fundamental.
Enrico BertiniYeah. Okay. Thanks so much. Thanks for coming on the show and hope to see you soon.
Jessica HullmanYeah, nice to chat with you guys.
Moritz StefanerYeah, wonderful. Thanks for joining us.
Enrico BertiniThank you. Thanks so much.
Moritz StefanerHey, folks, thanks for listening to Data Stories again. Before you leave, a few last notes: this show is crowdfunded and you can support us on Patreon at patreon.com/datastories, where we publish monthly previews of upcoming episodes for our supporters. Or you can also send us a one-time donation via PayPal at paypal.me/datastories. Or, as a free way…
Enrico BertiniTo support the show, if you can spend a couple of minutes rating us on iTunes, that would be very helpful as well. And here's some information on the many ways you can get news directly from us. We are on Twitter, Facebook, and Instagram, so follow us there for the latest updates. We also have a Slack channel where you can chat with us directly; to sign up, go to our home page at datastori.es and there you'll find a button at the bottom of the page.
Moritz StefanerAnd there you can also subscribe to our email newsletter if you want to get news directly into your inbox and be notified whenever we publish a new episode.
Enrico BertiniThat's right, and we love to get in touch with our listeners. So let us know if you want to suggest a way to improve the show or know any amazing people you want us to invite or even have any project you want us to talk about.
Moritz StefanerYeah, absolutely. Don't hesitate to get in touch. Just send us an email at mail@datastori.es.
Enrico BertiniThat's all for now. See you next time, and thanks for listening to Data Stories.