Welcome, everyone, to data visualization. Thank you for submitting the survey; sixty of you have never worked with data visualization. This is neither good nor bad, but today we will try to take away that lack of familiarity with the subject and give you a sense of the directions you might take.

Back in the day, I worked at a data visualization company called Carto, which became a global leader in data visualization. We identified a huge gap between humans and numbers, and we were lucky enough to work with Google, the UN, and NASA. We used the size of the moon to illustrate this gap. When someone asked what the size of the moon was, no one knew, but when we said it was 3,400 kilometers, the same size as Australia, it made sense.

Data visualization has gone through a huge tipping point with Covid, as people consume millions of data points through different graphs and charts on a daily basis. We are looking at hundreds of thousands, if not millions, of data points, which include intensity, date, and location. We will be looking at different ways to tell stories with data, such as this map, which shows a time lapse of how the Covid outbreak spread, starting in Wuhan and then moving to Italy, Spain, and the US.

How can humans make sense of so much data so quickly? One project I was part of back in the day, with Twitter, visualized data geographically as a time lapse. We had hundreds of millions of tweets, and we could see the entire evolution of the Ice Bucket Challenge campaign. It started in Boston and spread through the East Coast of the US, then the UK, and eventually the entire world. This was a great way to communicate such a large amount of data in a short period of time.

Another example is how we can better make sense of relationships. The New York Times has been a leader in data visualization, and one of their main data visualization practitioners created D3, a JavaScript library for data visualization.
For example, if we have more than 800 players, how can we make sense of the relationships between them if not through visualizing the data? We will be covering this again and again in the course. As data scientists, we should graph the data before analyzing it. This is something we will be doing throughout the course, employing visual techniques to make sense of the data and draw accurate conclusions.

We're going to really sharpen our skills in exploratory data analysis. Every time we receive a data set, how can we make sense of the data that we have in our hands as soon as possible? That's something we'll be doing as well, because in the end, data visualization forces us to see things, including things that we didn't expect to see. Many times, we go through the numbers and barely get any surprises. If we compute a standard deviation and get a value, it might surprise us, but once we graph the data, we might get a lot of aha moments.

The most important thing to recognize here is that visual perception is a route to knowledge. Perceiving through vision is an extremely fast process, whereas cognitive processes, the act of thinking, are much slower. Let me give you an example. Let's count how many Vs are here. Please mute and try to answer in the chat: how many Vs do you see here? Maybe. So it took us seven or eight seconds, we had different answers, and nobody was entirely sure. What if I just do this? Do you have any doubts now about how many Vs there are? No? Why? Because right now you're not using your cognition. You're just using your perception. We'll have a class on cognition and perception. Here, we're just using our perception to spot the Vs, just through encoding. We're spotting those differences faster. We don't need to go through them in a sequential way. Is it a V? No, let's go to the next one. Is it a V? No, next one.
If we just encode information in the proper way, we just understand. We just see it, and we can easily count the Vs or anything else we want, and that's also something we'll learn throughout the course. How can we better encode and present information so that what we want really stands out to our audience? That part is crucial. And also, why are people trying to make some information stand out? Are they trying to fool us or deceive us? So there will also be an ethical part that we will cover throughout the course.

Let's go back to the senses. Seventy percent of the body's sense receptors are in our eyes. We need to think about the human visual system as an extremely powerful pattern seeker, and we should use that: figures and shapes are way easier, more accessible, and faster to grasp. They're also more memorable, more fun, and more influential. That's why, if in the future someone asks us about the Covid outbreak and its geographic spread, we might think about that visualization. We might not think in numbers. We're visual creatures. That is something we need to keep in mind. Our eyes can process information faster, and they can do it ahead of what we think. We first perceive through our eyes, and before the information actually reaches the brain, we have already jumped to conclusions. It's called pre-attentive processing, and we'll cover it in more depth in session five. Why pre-attentive? Because it happens before we actually think about it: we already perceive the four Vs. So it's very powerful for pattern recognition.

Any questions so far? So far, so good. Okay. So let's take a step back. Let's become a little bit academic here about what visualization is per se.
It is the process of transforming data, which is abstract most of the time, into graphical representations that might be interactive or not. We're going to be using both interactive and static graphs, charts, and other objects, not just for the purpose of making the data beautiful, but also for the purpose of confirming our assumptions and presenting information. We're going to be mostly focusing on exploration and presentation throughout the course.

Most people, when they get into this course, think that data visualization is all about design and making beautiful things. However, data visualization is not just about design; it is also about having the right information. If you don't have good data and good information, your visualizations will not be good. We're going to cover a lot of statistics and design in the class, but the most important thing is why we want to visualize. There are two main buckets: communication and exploration. Communication is about discovering and understanding stories in the data and presenting them to others in a graphical form. Exploration is about making patterns and exceptions in numbers visible and extending our capacity to process this information.

Even our everyday language has become visual, with emojis, memes, and other visuals that resonate and stick in our heads. If we're driving on the road and we start processing all the different signs in a sequential way, as I said before, by the time we finish processing them we're way past the exit. What visualizations allow us to do is think, and hold in our minds way larger volumes of information than we could with numbers or even words. So visualization is going to help us think. But it also helps us persuade and make a point, and I'm going to open the room for you. What do you see in this graph? Remember, there are no good or bad answers. Just, what do you see? Okay, there you go.
So here you want to make a point, you want to persuade, just by encoding, by changing the intensity of the color. Everyone is looking at Amazon. It's as simple as that. And we're going to learn how to encode data in different ways: through size, motion, tilt, and many others.

So here we're able to persuade, but sometimes it's also about creating a common vision and aligning people around the actions that need to be taken. Just think about the rise of Netflix and the demise of Blockbuster. How can you tell this story better than this? Let's put ourselves ten, twelve, fifteen years back in time. We could talk for hours about how Netflix was the next big thing and how Blockbuster was really going to crash. But once you create this type of graph, how can anyone argue against it? So data visualization is also very powerful for convincing people and aligning them under a common vision.

Have you seen the Hans Rosling video? No, nobody? Because in other classes sometimes they have. "I did, but I don't remember." It's memorable. Okay, so Hans Rosling was a doctor and a physician who specialized in statistics. There was a TED Talk in 2006 that was a huge tipping point for him. I'm going to show it to you. It's only twenty minutes, but we're just going to look at it for two minutes. It's brilliant. He also published a book that I really recommend, called Factfulness. Bill Gates says it's the best book he's read. You're going to laugh as well, because this guy is hilarious. Let me know if you can hear the sound or not. Thirty years in some kind. You can hear it, right? Okay, we're going to watch this for just a minute or two, but it's up to about seventy years. In 1962, there were really two types of countries: industrialized countries with small families and long lives, and developing countries with large families and relatively short lives.
Now, what has happened since 1962? We want to see the change. Are the students right? Is it still two types of countries? Have these developing countries got smaller families and longer lives, or have they got longer lives?

So basically, he was challenging the common belief that there were two worlds, the developed world and the developing world, a belief that was created in the sixties. He showed through data how what some people call the developing world or undeveloped world actually caught up with the developed world. He also created a tool called Gapminder, and at some point even Google bought the software. Right now, it's actually the property of Google. So it was a very important point for data visualization. Any thoughts? Okay, let's continue. Please, any questions, any thoughts, anything? Just stop me. Put something in the chat. Don't hesitate.

So that talk from Hans Rosling was a great example of how to convince people that we were not in that world anymore. Right now, we're in the 2000s. We don't have an undeveloped world anymore. All those countries have really come a long way, contrary to the common belief. And he did it through data. The data come from the UN, which has been gathering it since before the 1960s. So this is all based on data.

The second part, and that's also something we're going to be covering a lot, is exploration. Just think about your role as data scientists. You receive a dataset. You need to make sense of it. We need to explore numbers. We need to make the relationships between attributes visible. We need to understand the patterns in the data. So we're going to be looking at different examples of how we can better explore data visually. Has anyone heard about the real John Snow, not the Game of Thrones one? No? You're going to love this. Nobody knows about it.
Okay, so John Snow was actually a real person. He was a physician and researcher back in the 1800s. London suffered a huge cholera outbreak, and thousands of people died because of it. So, of course, all the city officials, researchers, and universities were involved in thinking about it. Think of how everyone focused on Covid. So imagine, in Great Britain back in the day, thousands of researchers really trying to understand what the cause was and where this outbreak came from. After hundreds of studies, nobody could really get a grasp of what the main cause was and how and why it was spreading.

Snow was actually very skeptical, because the dominant theory was that the disease spread through the air, and he was not convinced that that was the cause. So he began tracing the source of the disease. He created different maps of the city and began plotting the cases that happened in the city. He started identifying some clusters here and there, and then he started overlaying different data points. It could be buildings, it could be hospitals. At some point, he noticed a water pump in the city. Voilà! He discovered that this was the source of the outbreak: a public water pump on a specific street called Broad Street.

So getting to that conclusion through effective data visualization was the first step. The second step was to convince the government that this was the source: we need to remove the water pumps, we need to take action. So he experimented with different visualizations, and finally he came up with this map, which became probably the best case we have in history of exploratory data analysis, because it saved millions of lives just through visualizing data. Any questions about this?
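As a rough sketch of the kind of tally behind the Broad Street story, the Python below assigns each case to its nearest pump and counts cases per pump. Everything in it is an assumption made for illustration: the coordinates, the names "Pump B" and "Pump C", and the use of straight-line distance on an arbitrary grid. It is not Snow's data or his actual method, only a way to see how a cluster around one pump shows up once cases and pumps are overlaid.

```python
from collections import Counter
from math import hypot

# Hypothetical pump locations on an arbitrary grid (NOT Snow's real data).
PUMPS = {
    "Broad Street": (5.0, 5.0),
    "Pump B": (1.0, 8.0),
    "Pump C": (9.0, 1.0),
}

# Hypothetical case locations, invented for illustration only.
CASES = [
    (4.6, 5.2), (5.3, 4.8), (5.1, 5.9), (4.2, 4.4), (6.0, 5.5),
    (5.5, 6.1), (1.4, 7.6), (8.7, 1.5), (4.9, 4.1), (6.2, 4.7),
]


def nearest_pump(case, pumps):
    """Return the name of the pump closest to a case (straight-line distance)."""
    x, y = case
    return min(pumps, key=lambda name: hypot(x - pumps[name][0], y - pumps[name][1]))


def cases_per_pump(cases, pumps):
    """Count how many cases fall nearest to each pump."""
    return Counter(nearest_pump(case, pumps) for case in cases)


def text_bars(counts):
    """Render the tally as one text bar per pump, largest first."""
    return [f"{name:<14}{'#' * n} {n}" for name, n in counts.most_common()]


if __name__ == "__main__":
    for line in text_bars(cases_per_pump(CASES, PUMPS)):
        print(line)
```

Even in this toy version, the longest bar does the persuading: the count per pump is the exploration step, and the picture of one pump dominating is the communication step.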
Okay, some people talk about this not as the beginning of data visualization, but of data exploration. Besides this case, there are others that we will see throughout the course, but this was probably a major point. And of course, there were several contributors, and a lot of people followed John Snow after this.

The second thing we're going to be talking about is how to better understand patterns and make sense of them. Pattern recognition is always the bread and butter of data exploration, right? Look at this very simple graph. It shows the trend of the name Lily in the US. What do we see? I want to start putting more on you guys. Correct? Correct, exactly, and we can spot that very easily. The conversation in this case was about the correlation between the name Lily Angish, who was a very famous actor in the 80s, and the name's popularity. With the rise of the internet and people researching cool names, it became popular again. We can understand these patterns very quickly.

We can also show relationships between variables, such as the correlation between work travel and stress. The graph shows that there is no direct correlation between the number of trips and stress levels. Even with one trip per year, stress levels can be very low or very high, and the same holds as the number of trips increases. We can also look at qualitative data and display relationships that are not quantitative in nature, such as the connections between different characters in a story. We can represent this data as nodes drawn as circles, with the relationships between them drawn as lines. Throughout the course, we will be covering how to better communicate, understand, and explore data. We will be doing this mostly in sessions three, four, and six.

Yeah, that was very cool. So many ideas. Okay, so we're going to go through this exercise.
It's about soccer, or football, so sorry to those who don't like the topic, but it's still valuable. Even if we were talking about elections, don't think about the content too much. Just say what you see. If your kid, niece, or friend comes to you and shows you this graph, what is the first thing that comes to mind? Okay, so we can go ahead. Performance evolution, effectiveness. As he played games, the goals and games increased; how his skills developed across time; monthly comparison; performance goals and higher band games. Recently, the more experience he got, the better; fewer games at first, more games and goals later. The opposite; performance improvement was most effective in 2014; an eight-year improvement. There's almost a direct correlation between both. His performance in the team increases. There's something out of the ordinary in 2014-15. This year, performance picked up.

What else? That's pretty good. What else? I'm sure we can squeeze more information out of this graph. Correct, injury. There might have been an injury in 2013. Look at all the information we came up with in just two minutes, and with such a simple graph. It's just a simple side-by-side bar graph. Look at all the information it's presenting. The gap between his games and goals tends to be steady. That's true, starting in 2011. And still we're coming up with ideas. We're coming up with new conclusions, new data points, new pieces of information as we keep looking at the graph.
And again, this graph doesn't have anything that really stands out. Look at all the information it's presenting. But let's think about the process and how we decoded the graph. First, we perceive the graph: we talk about evolution over time, an increase. That is the first thing we do when we look at a graph. Understanding how people decode and consume visualizations will allow us to better understand the data and present it effectively, because it's all about data effectiveness: how to make the data stand out, not the design. This is a very simple design, but it conveys a lot of information. We're going to be looking at ugly graphs as well, though not today. But if we had, for example, a photo, would that add more information? Or if we saw a ball, would that add more information? No. Here we're just looking at the data.

As we said before, first we perceive. We see that upward trend, and at some point the trend decreases. First, we just perceive high-level trends. That's how our brain works. We are not really making a lot of sense of the data yet. The second step is what we call interpretation. We start saying, okay, what does it mean that there is an upward trend? There were comments about the ratio between goals and games. So what does that mean? That's when we start interpreting the data. We start making sense of it. We start saying, okay, what does this mean?
What does this mean? Maybe something happened in 2013. That's when our cognition starts working. Some people have previous experience and acquired knowledge. Some people may not even know who Messi is, and that's fine. But the ones that do, the ones that have some knowledge, start using that knowledge as well, and they start comprehending. They say, "Maybe he was injured. Maybe this year he played the World Cup. Maybe this year he had a different coach." They start putting their experience and knowledge to work. That's where perception meets cognition, and that's how we get to knowledge.

So this is the process we follow every time we see a graph. First, we perceive it and start interpreting it. Then we get to create knowledge. We comprehend the graph regardless of the topic. There's actually a graph with Tarantino movies and their ratings that I don't have here. But regardless of the topic, anyone can make sense of this graph, and that's the most important thing. We don't need to know Messi or anything like that.

Let me show you an example, since we're talking about perception. Any questions or comments? In the meantime, I'm going to ask you to tell me what you think about this visualization. Click on it and let's do the same. Okay, very quickly. Tell me what you see first, and then we'll start comprehending the graph. This was featured in the Washington Post; it was highly controversial and became very famous. What do you see? There you go. But let's look at it more closely. It's about, for example, the big contrast that we see in this graph. Okay, this spike. What else? I don't know who mentioned it before, but what we mostly see is the contrast, and that is the idea this graph wants to show. The contrast between the coalition military deaths and the civilian deaths is pretty sad. The difference is not one to ten; it's even larger than that.
It's crazy. That's what this graph was showing. But how do they do it in such an emotional way? Someone said that it was trying to be sensationalist. As I said before, there was a lot of controversy around this graph, but it actually follows all the principles of effective data visualization. It's trying to make an impact, but there's no gimmick or cheating. If you think about it, it's just an inverted histogram. It's as simple as that. It's an inverted histogram in which they made the design decision of making the lines red. This is just showing data, just by inverting the Y-axis and making that choice of color. Look at the impact that it's creating.

Good point. And that's the beauty of visualization: there can be so many interpretations. I always say that creating a graph is like telling a joke. You need to know who the audience is going to be and adapt it to that audience. People might take it differently. We're going to learn how to do it in the most effective and appealing way for the largest number of people. But you can never please everyone, and not everyone will interpret data in the same way.

Yes, that is also great. In the end, a visualization should force us to think more about the topic. How is this correlated with other variables? Is there any causation behind it? Are there any more comments on this? Very fair point. But then again, that's your cognition telling you that. You have some background, and you know the story. But at first, I'm sure that, like most people, you were shocked because it didn't look like a lot. But that's great. We need to have a critical eye every time we consume information, and people normally don't. I believe it's actually next class that we're going to be looking at how journalists, but also companies, lie through visualization, and it's just so unethical. The thing is, we fall into the trap so many times.
But we're going to be training our analytical eye and mindset in order to be able to perceive those things and not be fooled. We're not going to be showing you how to fool people, but at least how not to be fooled.

Let me see if it's around here. It was very interesting, because the same data can be told in such different ways. This is actually a declining situation; it's improving overall. But it's all about perspective and how we present the data. Here, just by changing the color and using the normal Y-axis, we can communicate a totally different story. Very good point, but it's even more interesting that nobody said it was actually declining, that it was a downward trend. Nobody even said that, just because the graph on the left-hand side is inverted. So we don't see that; we just get the impression that it's dropping. So design choices can make a huge impact on how people perceive and process data and information. Let's continue. Okay, any questions so far, any comments? Okay.

It's also important that we learn when not to visualize. Not every time we get numbers or data do we need to visualize them. So when is it okay not to visualize data? In what situations do you think there's no need to visualize data? For example? Okay, can you repeat that? Sorry. Okay, fair enough. One hundred percent, exactly. Yes, correct. All those answers are correct, and that's why I will never push you to visualize just for the sake of visualizing. We also need to be strategic about it. We just think, okay, does it make sense or not? And there are also some specific questions for which we don't need a visualization. If you tell me that Barbara is the best student, that's it. We don't need to visualize it.
If it's someone else, whoever that is, it's okay. We don't need to visualize that either. If we're comparing two specific sets of values, we need to show the values themselves so people can compare them. For a small set of numbers, we often don't see it visualized. Sometimes we see the graph, but many times we don't, because you need the number in order to make it efficient. So just think about when you need to visualize and when you don't. That's something to keep in mind. Okay, we're going to go until four fifteen, four twenty at the latest, so let's keep moving.

Why is there a need for visualizing data? Data can be so big. There has been a huge evolution in this space in the last five years, with many job offers for data visualization experts in large companies. It is now a requirement to know how to show and share data. I think the space has reached a plateau in the last three years, with only small incremental changes. There is more data available than ever, which makes it more relevant to make sense of the data and requires people with the right skills. Technology has also advanced, making it easier to visualize and understand data. There is also a cultural shift towards data-driven decisions and transparency. An example of a groundbreaking data visualization tool is a dashboard created for the Mayor of New York in 2014, which allowed him to understand what was happening in the city.

Data visualization allows us to extend our capacity to think and store information, understand complex problems, ask questions, generate insights, and communicate effectively. It is becoming increasingly important for people to understand data, and taking a master's degree in this subject is a great decision. It won't be easy, but it will be worth it.
In the next eight minutes, we will be learning about cloud data visualization and how to communicate it. We are going to learn how to present data from different domains, such as puritory data, text data, and image or graphic data. We will be going through some theory, but we will also have fun with the principles of data visualization. We will learn how to present information in a way that maximizes the perception and cognition of our audience, how to navigate through data effectively, and how to be a data visualization expert. We will also learn how to create powerful dashboards and visualizations that can make an impact. There will be some theory, mostly around graphical excellence and how to accomplish it. We will also be creating a lot of charts and graphs in the class.

Class participation is very important, not just for your grades, but also because it adds value to the class itself and to your colleagues. There is no such thing as a bad comment, and all perspectives are extremely valuable. Please do not hesitate to share your thoughts. Everyone's comments are valuable, and not all of them have to be brilliant. We will also be going on the Forum, and I will adapt to your pace. There will be two individual assignments, and the final one will be a project in Tableau. Everyone is expected to participate and add value to the conversation. We will mostly cover exploratory data analysis that can be done in Python. The last two weeks will be focused on Python, and the material will be shared after the class.

We're in Tools 5, not 4 anymore. We'll be using GC to the version in Python and Tableau. All of them are free and open source. I'll give you a free license for Tableau. It's a good tool. Data scientists don't need a tool like that.

I just started two months ago as a General Manager at Microsoft for Startups. I manage the relationship with big scale-ups and unicorns all over Europe.
Most of them have big data visualization parts, but it's still good to learn how each of them approaches it differently. I've been in the US scaling different companies, mostly in the night space, like Carto. I'm also a computer and data scientist, although I don't work as a data scientist anymore. I've been mostly on the research and business side. I do research not just on the database system itself, but mostly on open data. I'm part of the French Beta communities as well. I'm happy to help as much as possible, so don't be a stranger. Reach out. I try to be approachable; give me 24 hours to respond, but it shouldn't take more than that.

For the class, I'm creating a visualization project. I don't know the policies, but I'm happy to help as much as possible. The final project is mostly for you to enjoy. I let people choose their groups or pairs, and the tools they use. We'll have around five to six weeks to work on it, so there will be plenty of time. Tableau has been the leader in data visualization, even though Power BI is more powerful from an analytics standpoint. Tableau is also good for learning how to apply graphical excellence principles. In Tableau, you can basically create any visualization you want right out of the box, or you can fully customize it. We're going to learn how to do both, using the principles of graphical excellence.

Can you use that for the final project? Yes, it was supposed to be. Let me check, because I don't know why I was using the old online campus, but they told me that the data was migrated to the new one. Let me check; if not, it will be uploaded today. Any other questions? Well, it's Saturday, 4 PM, so I'm sure you're all looking forward to enjoying the weekend, and I won't take up any more of your time. It was great meeting you all. I'm very excited about the course and about seeing what you come up with, and I'm looking forward to making great things happen. Fantastic. Thank you very much.