Hello! Good morning! How are you? Fine. Hello! Let's see if the rest of your classmates are coming. Were you here yesterday? What did you do? I think you have classes with a lot of technology right now. Yes, I mean it. So what electives did you take? What have you learned? What are you expecting from this course? Is anyone else coming? Okay, so let's dive into this. Hey, what do you like about deep learning? Do you want to explore it further? Is that the reason, or do you want to create a company and get into AI? You see that the demand is now estimated at a business of 2-3 billion just from the start. Yes, they put the cost at around $45 per month or so. This is an incredible new business. Well, it's based on large language models like GPT-3, so that is on the side of natural language processing. You will see something like that. Well, I think that we are almost all here: we are 22, and we should be 26 if I counted right. But okay, let's start. I think I'll share my screen. Can you see my slides? Perfect. No, I need to adjust this. Okay, perfect.
So welcome to a new class of deep learning, where we are going to focus on deep neural nets. What we are going to do is try to understand the underlying, fundamental concepts of deep learning, which are no different from those of other machine learning techniques. That is what we start with today: understanding the deep learning ingredients, which are common to all the procedures and algorithms in machine learning. We will also talk a little bit about me, the class, and the course.
Then we will talk about those deep learning ingredients and look at the simplest deep learning model, probably linear regression, with those new eyes, to understand what we are doing. Then we will try to unbox gradient descent, which is probably the most famous algorithm for training in machine learning, and is the one that we will also use in deep learning, with some flavors we will see.
A little bit about me: I did my master's degree in computer science, and then I moved to what the computer science community calls the dark side, which is electrical, telecommunication and computer engineering. My PhD was about machine learning, because it was about classifying speakers from speech using machine learning techniques. At that time, when I was at Google in New York, in the speech group, we started to use neural nets. Before that, we were using other classical methods, like hidden Markov models, support vector machines, and logistic regression. So we started to use neural nets and I was like, wow, this is really cool. So I started to dig into the underlying concepts of deep learning.
There was a lot of engineering behind making this complex signal, speech, able to pass through those systems. It was around 2010 when an intern from the University of Toronto, from the group of Geoffrey Hinton, one of the pioneers of deep learning, came to New York and started to work on some English benchmarks from the ASR (automatic speech recognition) system. He got pretty nice results just using deep learning, and that was a turning point where we could start to see those deep learning structures working with speech. That's part of the history that I will tell you during this course, but you will see that neural nets outperform our classical systems in speech and speaker recognition, computer vision, and other problems where we have a lot of data, a lot of variability, and complex input signals. They are amazing.
So what is the course about? What are we trying to do? Well, within the course we are fundamentally talking about neural networks, but at a fairly basic level, because we only have 6 weeks. Nevertheless, some of you are more advanced and you may ask me for more complex things. Let me know, because I will be very happy to answer those kinds of questions, but we will try to go step by step to understand the most important things. After this course, I think you will have a very good idea about deep learning, how to build your own neural nets, how to play with them, how to deploy them to the cloud, and all those kinds of things. What are the things that we are going to explore?
The fundamentals behind them are something that we will start today. Why do we call it deep learning or neural nets? What are we talking about? And why are they so important or successful in some tasks? We will learn that training has a cost. But inference, where you deploy and test your system, can be very simple, because in the end it's about matrix multiplication. Once you train your neural net, you have the weights and you freeze the neural net, the rest is just matrix multiplication. So you have probably heard that they spent more than 4 million dollars and used a lot of GPUs, CPUs and different hardware to make the training faster, and the training probably took more than 2 weeks. But once you have the weights of your neural net, when you finish the training, what you have in the end is a bunch of numbers that you can put in matrices, and the inference time is not so bad. That's why they are very suitable for industry.
We are also going to explore the most famous architectures and standards in neural nets. We will start with the multi-layer perceptron, which is the one we have in mind when we think of a neural net, then we will move to convolutional neural nets, and then to recurrent neural nets. And we will finish with the supervised models which are very interesting right now. Along the way, we are going to revisit these deep learning fundamentals, which are very important for understanding the rest of what we are going to do. And we are going to work in our notebooks with TensorFlow and PyTorch, which are libraries that everyone is using right now. They will allow us to build neural nets, and you will see that they are simple to use. At some point, I will put up some examples with PyTorch.
The most common and important libraries for deep learning right now are TensorFlow and Keras, written in Python.
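Going back to the point that inference with frozen weights is just matrix multiplication, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the layer sizes, the weight values, the ReLU activation, and the helper names `matvec`, `relu` and `predict` do not come from the lecture. In practice a library such as TensorFlow or PyTorch would do these multiplications for you.

```python
# Hypothetical frozen weights for a tiny net: 3 inputs -> 2 hidden units -> 1 output.
# After training these are just numbers stored in matrices.
W1 = [[0.5, -1.0, 0.25],
      [1.5, 0.5, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -2.0]]
b2 = [0.5]


def matvec(W, x, b):
    """Return W @ x + b for a matrix W (list of rows) and vectors x and b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def relu(v):
    """Element-wise max(0, v); the activation choice is an assumption."""
    return [max(0.0, v_i) for v_i in v]


def predict(x):
    """Inference: two matrix-vector products with a non-linearity in between."""
    hidden = relu(matvec(W1, x, b1))
    return matvec(W2, hidden, b2)


output = predict([1.0, 2.0, 4.0])
print(output)
```

No gradients and no training data are involved at this stage, which is why inference is cheap compared with training.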
In 2010, Google started a C library for deep neural nets called DistBelief (or something like that). The problem with C was that it was very difficult for researchers at Google to work with. Python is much easier for experimenting and changing things. So what the people at Google Brain did was create a high-level library in Python, called TensorFlow, that calls this C code. This is something that happens in industry every day. The C library, in turn, was using Fortran libraries for the low-level code, for example for matrix multiplication. You may be familiar with Fortran: it is very fast and has been around since the 1950s. So the C wraps some Fortran code, and the Python sits on top of the C. In the end, what you have is a library that lets researchers do all kinds of things in an easier way, which is what we are going to use in our class.
I want to be clear that this is not a course about maths or coding. Of course, we are going to see a little bit of maths, and you can ask me if you want to go deeper into any method. What is important is that you understand what you are trying to maximize or minimize when, for example, classifying data. We are going to do some coding, but in most cases 90% of the code will be provided by me. I want you to have the notebooks and do your own work, but I don't want you to spend a lot of time writing your own code. What is important is understanding the concepts and learning how to apply them. This field is changing very fast, so the only way to stay up to date is to understand what is behind all these technologies. It is not about using this as a black box, because if you use it as a black box, what you learn will not be worth much in 5 years. People will be doing totally different things in terms of how they build things, but the underlying ideas will still be the same. So I want you to understand what you are doing, what the main concepts are, and learn how to apply them.
Does that sound good to you?
Oh, is someone there? "I just had a question about the code you provided. Will there be any sections for visualization, showing how the units behave?"
We are not starting with that, although visualizing what is going on in there would certainly be interesting. When you start with a feedforward neural net and you want to visualize something, it's pretty difficult to get a full understanding of what is going on, and we will learn why. Basically, it is because the information is distributed across the net at different levels, so it's very difficult to have an intuition about what is going on. You have global metrics, but you don't have a visualization. But when we move to convolutional neural nets, at least when we are working with images, what you have in every layer is an output that you can interpret like an image. So in that case you can visualize more or less what the convolutional neural net is doing, and you can have some intuition about it. Even so, again, the information is going to be distributed. This is a hot topic in research right now, trying to visualize or understand what the neural net is doing, because it's a big black box. It's hard for us to understand exactly what is going on, because there are a lot of connections and a lot of different information at several levels. So this explainability, or the visualization of the neural net, is something that we want to explore because it's important. In some cases it's very important to understand exactly what is going on, as you can with decision trees.
You know, when you're working in your company and you create a decision tree, or you're doing boosting, in the end you have some rules and you can explain to the CEO, "Well, we are deciding this because of this rule, and this, and this, and we are combining them or doing boosting, but in the end it's based on this algorithm with my data." It's not so easy to do that with a neural net. So it's a very good question, and we will see how, in some cases, we can get some idea and some intuition. But across all those layers, it's pretty difficult to see what is behind all this.
So, well, to put some context on where deep learning sits: I like to say that artificial intelligence is machines behaving like humans. Okay? You can use rules and other kinds of things, but the idea is that artificial intelligence is behaving like a human. One tool for doing artificial intelligence, among others, is machine learning, which is about machines that are able to learn, because if you want a machine to behave like a human, you want the machine to be able to learn. That's why we put machine learning within the scope of artificial intelligence. Well, deep learning is just another way to do that: the machine is able to learn. Equally, there are other methods that you have learned, like Gaussian mixture models, which are just another way to do it. Deep learning is probably more interesting, because we are going to work with different levels of knowledge. Working with different levels of knowledge is always very interesting for understanding complex concepts, like how our brain works in different layers of knowledge. And what is the power of that? The power of all that is that in every layer we are probably going to be able to solve small problems, and then we are going to be able to combine all those small problems to solve a big problem. Each one is an easier problem than the final problem.
But those small problems are related to the final one, and in the next level of knowledge we will combine those outputs. So when you have a hierarchical structure of learning, what you have in the end is a divide and conquer strategy. In the first level, you are trying to understand or solve easier problems, in the next layer you are combining this information, and in the next one you are combining the output of the preceding layer. So it's a divide and conquer strategy, as it works in our brain. Compare that with support vector machines or Gaussian models: there you have just one level. y = Wx + b is one level of knowledge, and you don't have further layers for combining things. We will see this in detail in our upcoming classes, when we focus on deep learning. But just for now, deep learning is a method of machine learning, and we use it for creating artificial intelligence.
If you don't have a lot of data, and the problem is easier than you think, you can probably use regression or a support vector machine. Deep learning works better when you have a lot of data, and when you have a problem that is really difficult and has a lot of variability in the input. Okay, so you know: in machine learning, there is no single algorithm or technique that works for all problems. If you have complex signals, complex inputs with a lot of variability, such as face recognition or image classification, then a neural net is the best approach.
If you have just tabular data, with variables that can be continuous and discrete, then a neural net is probably not what you are looking for.
What if I change the features, and turn the Cartesian coordinates into polar coordinates? When you convert from Cartesian to polar, what you have in the end is something that is very easy to classify. Well, this is one of the things that the neural net is going to do for us: changing the features, doing feature transformation, to try to solve our problem in an easier way. We will see this. Definitely, it's a black box for us. We will see that we are going to put trainable weights here, and we are going to learn those trainable weights. We will see this step by step. But in the end, what we are going to have at the output is a cost function, a mathematical function, and what we want is to minimize this cost function. So we are going to have a cost here, and we want to minimize it. The neural net itself, by training with gradient descent, will choose those weights for us, and what it is doing in every layer is going to be a kind of mystery. Sometimes we are able to plot it, if we are working with convolutional neural nets or something like that. But in most cases the information is going to be really distributed. We are going to have a lot of layers and a lot of neurons. We will see that. And in the end, it's very difficult to see what is going on from one layer to the next. We are going to have some intuition at some moments, but for now it's very difficult to see what is going on. Okay, that is about the explainability of the neural net, or the visualization, which is not an easy topic.
Okay, so let's move on. What other things are we going to do? Well, when we talk about convolutional neural nets, we will work with classification, trying to classify images.
Is this a cat or a dog, or what is this? We are also going to see algorithms for doing classification and localization. So it's not only "this is a cat", but also "this is the center point of the object, this is the width, and this is the height". So I can plot the bounding box of the object in the image. And we can do something more complex like object detection, where we are not looking for just one single object. We are looking for different classes, cats, dogs, and so on, and we are doing the identification, and we are also going to give the class and the location, the bounding boxes. And right now we also have algorithms for doing instance segmentation. It's a very hard problem. It's a tough problem. What we are looking for in an image is not only that the objects are here, but also which pixels belong to each of them. Okay. And you can see, well, for this example it's not very interesting. Probably it's enough to have the bounding box. But right now we are using this segmentation for working with medical images. Imagine that you are looking at a lung or a brain. It's not the same just to put the bounding boxes as to say exactly in which pixels you have the tumor. It's much more interesting for the medical doctor who has to supervise this. Okay. It's a harder problem, also because every single image in your training data set has to be labeled: every single pixel has to be annotated as background, or cat, or dog, or duck. So in the end you need more labeled data, which is expensive, but the final solution is much more interesting. It's also used right now in autonomous driving, because it's not only "this is the pedestrian" or "this is the road". No, give me exactly the pixels, because it's much more accurate, and you are working with a car that has to be accurate. You know, for example, this is a camera on an avenue, and this is detection in real time.
I don't know if you can see the boxes, but this is a deep neural net based on convolutional neural nets, which is one of the things that we are going to learn to do. Basically, there is a box on every object, with different colors: blue is for the cars, another color is for the pedestrians, and they are all on the road. You can see the class and a number, the probability of belonging to that class. This is used for counting the different cars and people that are passing by a supermarket, something like that. This was unthinkable 10 years ago. The algorithms that we had in computer vision were not working in real time, or at least not with the accuracy that we want. And now this is a reality. There is an ethical part here, because it's a little bit disturbing: we will learn that you can do all this detection, but you can also do face verification with just one image. It's an ethical part that the community probably has to discuss, because everyone can build those kinds of systems right now.
Autonomous driving is much more difficult than the previous example, with many more classes and different environments: snow, rain, clouds. It's also measuring the distance. In the bounding box you have something like 33 meters. It's also reading signs. Well, this is also a neural net, a convolutional neural net, working in real time. Right now, the state of the art is definitely deep learning, with some flavors of the different topologies of deep learning. Before, we used linear classifiers with grids: you take the image, make a grid, and for every cell you try to do the identification using a support vector machine or something like that. But since 2012, with the first convolutional neural network trained on a large image database, I think everyone is doing this.
Okay. So, more about the class before going into the things that we have to see today about the ingredients of machine learning.
The rules are very easy: respect, do your own work because otherwise it's not going to be worth it, and always assume the best of people. This is especially interesting because you are a very heterogeneous group. Some people come from engineering, other people come from business. I also used to have classes at the university just with engineers, and that's good for the math, but it's not as good when they think about the projects or the things that can be done with the tools. It's kind of different. Here, what you have is an incredible group with a lot of seniors. So, assume the best of people. In the end, what you get together is going to be much better than what each of you would get separately.
Evaluation. We always have 20% for class participation. We have a lot of forums, so you can participate a lot there. For instance, I will post different videos on concepts in the forums, where I will ask you to do some small exercises. Don't worry, even if someone has already solved the exercise, please do it again on your own. It's really important to do those kinds of things. So if I post an exercise and you want to do it, even if 20 people did it before you, do it, because it's a way of demonstrating that you can do it, and also that you are participating. Then, practice work in groups. We are going to do two practices because we have just six weeks. They will be practices that you can do in two weeks. It's not a big project. And then we have the final exam, which is going to be very easy for you because by then you will know all the concepts that we are going to use. I want you to learn, and it will be a multiple-choice test. It's going to be easy. Okay, so we have the final exam, which you are going to nail. Yes, one single answer is the right one. I know some people like to put "several of them", "none of them", "both", and all those kinds of things. I didn't like it in my days as a student, so I'm not doing that.
You'll see. About the groups, I'm not really sure: did you make them already? I think a group has to be between 5 and 7 people or something like that. Well, for me, between 5 and 7 is more than okay, no more. And if some of you send them to me, I don't know who the class representative is, well, send me the groups. If you just tell me this is group A, this is group B, this is group C, for me that's more than okay. One of you or whoever. If you all agree on that, I agree with that. Okay, thank you.
The classes. As I said before, we are going to move from logistic regression to artificial neural nets. You may be wondering why we still use logistic regression. Well, logistic regression is like a baby neural net. You will see that the elements we use in logistic regression are the same as in the artificial neural net, but in the artificial neural net we are going to have more layers. Then we have three important sessions about the multi-layer perceptron, that is, neural nets with more than two layers.
Okay, but it's not only the slides. I'm going to record small videos for the important concepts so that you have them there for review if you want. I'm also adding some videos that I think are interesting from YouTube, and classes from other professors that could be important for understanding the concepts better. Additionally, I'll add some papers or notes that are probably out of the scope of the course or more advanced, just in case you want to explore further what we are talking about. We are following a very famous book about deep learning by Goodfellow, and I will post the chapters of this book in PDF for you to review if you want to know more than what is in the slides. I will also try to summarize it in my videos and add more sources there.
You will also have a folder with notebooks, some of which we will do together, and some just for you to see how we use things. It's okay if you don't do anything with those; they are just there for you. We will see all these kinds of things during the course. Any questions here? Well, let's move on.
As I told you, for the practice sessions we will use TensorFlow. The most famous frameworks are TensorFlow and PyTorch. I like both of them, and they are very similar. If you learn how to use TensorFlow, you are learning how to use PyTorch, and vice versa. So probably the most common setup for deep neural nets is Python, with some low-level libraries. I like to use Google Colab for two main reasons. First, because all the libraries that you need are already set up there, so you don't have to install them on your computer. And second, because you are using servers in the cloud, which is what you are going to do in your company. You are not going to work on your own computer; you are probably going to run all of this in the cloud, so you don't have to worry about that. And if you need heavy computation, it is much better to use Google's or Microsoft's machines than your own computer. So I highly encourage you to use Google Colab. There is something similar from Microsoft, but it's not open, so you have to pay. It's okay for me, and it's also possible to do it on your own computer. But if your computer has problems with the libraries and all those kinds of things, it's probably not worth it. You can use your faculty account, because it is a Google account. I think that is the idea.
And then for the rest of the class, let's talk about deep learning and the most important things that we need to know for working with it.
There are a lot of definitions of machine learning, but one of the most important is: the field of study that gives computers the ability to learn. So the idea is that computers can learn. There is also a more formal definition: a computer program is said to learn from experience with respect to some task and some performance measure if its performance on the task, as measured by that performance measure, improves with experience. Whatever your task is, whether it's classification or regression, you have a task. For instance, it could be classifying cats and dogs from images. And you have some performance measure. This is very important, because we need to measure how our algorithm is performing on the task.
So computers are able to learn, in a more formal way. With all the ingredients, such as training data, inputs, outputs, a cost function, and a learning procedure, we are able to understand this formal definition. There are five ingredients that are used in many machine learning algorithms. The first three are the following: an input, in this case images of cats and dogs; an output, which is whether it is a cat or a dog; and a mapping function, which is the hypothesis linking the input to the output. We will get to the other two in a moment. First, the input has to be converted into something that can be fed into the system, with the values normalized between 0 and 1.
The conversion of the input is typically done with three matrices of points, because we are working with colors and using the RGB structure of red, green, and blue. Each matrix represents the intensity of every pixel, 784 values per channel, which are then normalized between 0 and 1. The mapping function is a mathematical model that links the input of 784 by 3 values to the output of two continuous numbers, which represent the probability of belonging to each class. This is what is used in all machine learning algorithms: we have inputs and outputs, and a mathematical function to link them.
Okay, this is the black box, which in the end is doing the link. What is interesting is that, when we have the three elements, the inputs, the outputs, and the mapping function, we still need something to measure whether we are doing well or not. So we need a fourth element. We are talking about five elements: the inputs, the outputs, and the mapping function, and the fourth element is something that says, "This is not really well done." For example, say the probability of being a dog, given my system, is 40%, and of being a cat is 60%. Definitely, that is not right: no matter whether this is a cat or a dog, one of them should be one and the other should be zero, because it is definitely one of the two. So we need something else. We need something to measure whether we are doing well or badly on our task, and that is what we call the cost function. The cost function is at the heart of machine learning: something to measure whether we are doing well or badly on our task. So in this case, for instance, we have this output from our system, but we also have the ground truth, because we are doing supervised learning. We have the labels, and we know that this was a dog. So we can say, well, the error is 0.6. And this is interesting: we have something that measures the error that we have.
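The input normalization and the 0.6 error from the cat/dog example can be written out in a few lines. This is only a sketch under assumptions: the pixel values are made up, the image is taken to be 28 x 28 (which gives the 784 values per channel), and the error is measured as a plain absolute difference, which is one simple choice and not necessarily the cost function used later in the course.

```python
# Hypothetical raw image: 784 pixels per channel (e.g. 28 x 28), 3 channels (RGB),
# with integer intensities between 0 and 255. The values are made up.
PIXELS_PER_CHANNEL = 784
CHANNELS = 3
raw_pixels = [(i * 7) % 256 for i in range(PIXELS_PER_CHANNEL * CHANNELS)]

# Normalize every intensity to the range [0, 1] before feeding the system.
normalized = [p / 255.0 for p in raw_pixels]

# Output of the mapping function for one image: one probability per class.
predicted = {"dog": 0.4, "cat": 0.6}

# Ground truth from the label (supervised learning): this image was a dog.
ground_truth = {"dog": 1.0, "cat": 0.0}

# A very simple error: how far the predicted probability is from the label.
error_dog = abs(ground_truth["dog"] - predicted["dog"])
print(round(error_dog, 2))
```

A perfect prediction would give an error of 0 on this image, which is the direction learning is supposed to push the system.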
Okay, and why is it interesting to have that error? Because it is the element that we will use to see if we are learning. And what is learning? We are going to say, well, I'm going to put in some mapping function, some mathematical function, but I'm going to make it flexible. What does making your mapping function flexible mean? Well, I'm going to put some trainable parameters here, something that is not really defined until you give me my training data. So instead of giving me, I don't know, y = 5x + 2, you are saying, well, I'm going to put y equal to W multiplied by my input, plus B. W and B are going to be trainable parameters. So what you are giving me is a flexible mapping function. And now the idea is that what we have to do is adjust these trainable parameters W and B. To do what? To minimize the cost function. So learning, indeed, is the process of adjusting your mapping function by tuning or changing your W and B to minimize the cost function. This is learning, okay, in machine learning. And this is how it's going to work. We are going to put in a mapping function which is full of trainable parameters, and the idea is that we will use the training data in some magical way to adjust these weights and biases. This is what we call minimizing the error, or the cost function, on the training data: adjusting the mapping function by changing those W and B.
So let me give you an example. We want to classify these blue circles and these red crosses, with one side for class one and the other side for class two. And we want to find a straight line that separates them, given some training data.
Well, we can have this hypothesis. Let me put it in blue. The hypothesis that you can give me is y, which is the class, is W x plus B. X is something that is fixed, okay, because it's your input. But you can say to me, well, this W and B are trainable parameters. We can put them in green, because they differ from x. Okay, well, if you are looking for a straight line to separate those classes, then with this formula, which is the formula of a straight line, you can plot whatever straight line you want, because W is the slope of the straight line and B is the intercept, how far you are from the origin. Okay, let me go back. So this is the idea. We have a mapping function which is flexible. We have the training data. We have the cost function. And the idea is to adjust this mapping function by changing W and B to minimize the cost function. With this formula and hypothesis, you are not just saying to me that you are looking for a straight line. You are saying that you are looking for a particular straight line, given by W and B, which is something that you need to adjust. So if you start, for instance, with this W and B in orange, those are the ones that define this straight line. Your cost, your error, is going to be high, because remember that the objective is to classify the data, and you definitely cannot classify this data using this straight line. But, well, during training you can change W and B. Probably at some moment of the training you have something like this. We have the purple W and the purple B, which are different. Now W is negative, so you have a negative slope, and you have a different B, which is this distance. Okay? And now the cost is medium, because more or less you can classify this. You can say, well, you are probably missing this point and this point, but more or less it is working. And definitely, if you have something like this, it is going to work better.
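The three lines described here (high, medium, and low cost) can be sketched as follows. The data points, the specific values of W and B, and the choice of "number of misclassified points" as the cost are all assumptions made for this illustration; the slide's actual points and lines are not recoverable from the recording.

```python
# Hypothetical training data: (x1, x2) points with a class label.
# Class 0 stands for the blue circles, class 1 for the red crosses.
training_data = [
    ((1.0, 1.0), 0), ((2.0, 1.5), 0), ((3.0, 1.0), 0), ((1.5, 2.0), 0),
    ((3.0, 5.0), 1), ((4.0, 4.5), 1), ((5.0, 5.0), 1), ((4.5, 3.5), 1),
]


def classify(point, w, b):
    """Return 1 if the point lies above the line x2 = w * x1 + b, else 0."""
    x1, x2 = point
    return 1 if x2 > w * x1 + b else 0


def count_errors(data, w, b):
    """Cost used in this sketch: number of points on the wrong side of the line."""
    return sum(1 for point, label in data if classify(point, w, b) != label)


# Three candidate (w, b) pairs, named after the colors used in the lecture.
candidates = {
    "orange (start)": (2.0, -8.0),
    "purple (during training)": (-0.2, 1.8),
    "final": (-1.0, 6.0),
}

for name, (w, b) in candidates.items():
    print(f"{name}: w={w}, b={b}, errors={count_errors(training_data, w, b)}")
```

Only W and B change between the three runs; the inputs stay fixed, which is the sense in which the mapping function is flexible.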
So this process of changing W and B, this process of changing the trainable parameters that you put into your mapping function, is what we call learning. Okay, we are going to use the training data to learn those patterns. So the first thing is that you hypothesize that you can separate those classes with a straight line. This could be wrong. In this case it's not wrong, because just by using our eyes we can see that we can do it with a straight line. Okay. But once you say that you can find this straight line, the good thing, the magic thing here, is that you are not saying which line it is; you are going to use the training data for that. You only put in the two trainable parameters, which in this case are the slope and the intercept. Is that clear? Yes, okay. So learning, and this is the same for any machine learning algorithm no matter how complex it seems, is to minimize a cost function, which is the error that you have on your training data, by changing the trainable parameters and therefore adjusting the mapping function for the better. This is learning. Okay. So right now we have four components: the inputs and the outputs, the mapping function, and then the cost function, because you need to measure how well your mapping is working. Okay. Now, what we need, the fifth element, is how to learn, which is a learning procedure: the algorithm that, taking the training data, is able to change for the better this W and B, or whatever your trainable parameters are, in order to minimize the cost function. So we need the fifth element, which is the learning procedure. But before that, we will start with linear regression. Let's just start. The other problem was classification. This is regression. The difference between classification and regression is that in classification your output is discrete.
So you are just trying to separate classes, and you are giving a probability for every class, while in regression what you want is an output which is a continuous value. Okay, because, for instance, what we are going to do here is predict the price of a house given whatever features. So the final output is a continuous number. If you are predicting the next value of a stock, what you are doing is regression, because the final goal is to have a single value, which is continuous. Okay. So a simple linear regression problem is having the size of a house and trying to predict the price. And what we have is training data. Okay, those blue circles are the training data. And what we are looking for in linear regression is a straight line that fits the data in the best possible way. If you do that, then you will be very good at predicting the price of different houses. This is the point in this case: you put in this x data point, and the price is going to be here. This is exactly the one that you have in the training data, and this is because your model is following the data. Okay, but how do we build this linear regression, and how do we train this straight line? Well, the hypothesis is that there is a straight line that follows the data, which is not a bad one indeed, because these data are more or less linear. Probably you have other training data for other parts, so probably you could do a better fit. But at the very beginning, we say, "Well, I think that with a straight line I'm going to be able to more or less predict the price of a house given the size." Okay, but the question now is how to train this. Let me put it another way: if I decide that the hypothesis is that I can find this W x plus B, the question is how to find this W and B. That's the problem that we need to solve.
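A minimal sketch of the hypothesis on its own, before any training: the same function W x plus B gives very different price predictions depending on the values chosen for W and B. The house sizes, prices, and candidate parameters are invented for illustration.

```python
# Hypothetical training data: house size (square meters) -> price (thousands).
sizes = [50.0, 80.0, 100.0, 120.0, 150.0]
prices = [150.0, 240.0, 310.0, 350.0, 460.0]


def predict(size, w, b):
    """Hypothesis of linear regression: a straight line, price = w * size + b."""
    return w * size + b


# Three candidate lines. Nothing tells us yet which (w, b) is best;
# that is exactly the problem the learning procedure has to solve.
candidates = [(0.5, 20.0), (2.0, 100.0), (3.0, 5.0)]

for w, b in candidates:
    predictions = [predict(s, w, b) for s in sizes]
    print(f"w={w}, b={b}: predictions={predictions} (real prices={prices})")
```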
We can start with something like this, and it's definitely not predicting the price given the size. This one is also not a very good one, and this one is better, but it's not the best. What we want is this one, the first one. Okay, so which are these W and B? Well, let's go step by step. The hypothesis is W x plus B. We have a cost function that we will see, and then we will see the optimization procedure. What is the cost function in linear regression? This is a question: which is the loss or cost function that we usually use in linear regression? Come on, someone say "least squares." Least squares is the typical one for linear regression. How does least squares work? It's quite simple. We have these training samples, and we are going to build our cost function here. Let's try to follow our intuition. We have the blue points, which are our training data. We decide on a hypothesis, and the hypothesis is that one. And we start with some random W and B. How can we measure whether this is working or not? Well, one of the things that we can do is measure the error on our training data. I would do the following: we are going to pick one data point, for instance, this one. Okay, and this data point has this price. This is the real one, the ground truth, the real price of this house, and the output of my hypothesis right now, given the W and B that I put here at the very beginning, is this one. Okay, this will be my output.
The difference between the reality and my estimation is an error, and I have one for this data point, for this one, and for this one. For this one I don't have any error, because the price matches my estimation exactly, so it will be 0. So the cost function is going to grow with your errors. If you don't have any error, it will be 0, which is what we want. Okay, we want lower costs. Does that mean overfitting? Well, this is the thing, and we will review it at some point. When you have a cost which is very, very low on the training data set but not on the validation data set, where you have a high error, that is when you have overfitting. If you have a very low error in training and also on the validation data set, this is not overfitting. This is what you want. So having a lower cost function does not necessarily mean you are going to have overfitting. We will see that. To guard against overfitting, you need to have two different data sets, one for training and the other for testing, which you never use in training. That is how you check that you don't overfit: you are working very well in training, and you are working just as well on the validation data set. These are different things. If you are working perfectly on your training data set and you are working perfectly on the validation data set, this is not overfitting, and you are going to have a very low cost function in training. Okay. To take the errors into account, you need to do it with all the points that you have in the training data set. The second step for building least squares is very easy, and I'm following the intuition behind least squares. You see least squares in the books and you think, well, okay. But what is behind least squares? Well, this is what you would do if you didn't know least squares, so let me follow the reasoning. The second part is to aggregate those errors.
I'm going to aggregate them because I have an error at this training data point, at this training data point, and at this training data point, so the only thing that I'm doing is summing up over all the training data that I have, which are M elements. And yes, I'm summing up all the errors and then normalizing by the number of training samples that I have. I'm doing that because obviously the cost function is going to be different if I have 10 data points or 1 million: just because I'm summing up, the cost is going to be larger than in some data set where I have fewer data points, even if the model works just as well. So I'm just normalizing to make sure I'm taking the errors into account fairly. This is taking the average of the errors: when you sum up something and you divide by the number of elements, you are taking the average. So for me, the cost function that we have created right now is okay. But if you go to any book, you will see that least squares is the same, but made quadratic. So you need to have it squared. Here we have it with the square. Okay. And probably you will see that they also added this 2 here. This is a mathematical trick for cleaning up the equation when we take the derivative, and at the very beginning you can forget about it. But what is interesting is, why do you think that they put the square? There are three reasons why we put the square there. One of them is negative values, yes, as one of you said. To maximize the value? No, why would we want to maximize something? To penalize large errors, yes. So the first one is that overestimating and underestimating are both errors. If 5 is my estimation and 3 is the real one, this is an error of 2. If my estimation is 1 and the ground truth is 3, this is an error of minus 2, which is the same quantity. If I take the square, I have 4 versus 4, so one thing the square is solving is the negative values, which is the first reason that you mentioned.
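The construction above can be sketched with the lecture's own numbers (estimations 5 and 1, real value 3 in both cases). The function names are hypothetical, and the division by 2M follows the convention mentioned in the lecture of adding a 2 to simplify the derivative.

```python
def mean_signed_error(predictions, targets):
    """Average of (prediction - target): positive and negative errors can cancel."""
    m = len(predictions)
    return sum(p - t for p, t in zip(predictions, targets)) / m


def least_squares_cost(predictions, targets):
    """Least squares: squared errors, summed and normalized by 2 * M."""
    m = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * m)


# One overestimate (5 vs 3) and one underestimate (1 vs 3).
predictions = [5.0, 1.0]
targets = [3.0, 3.0]

print("mean signed error: ", mean_signed_error(predictions, targets))
print("least squares cost:", least_squares_cost(predictions, targets))
```

Without the square, the errors of +2 and -2 average to zero and the model looks perfect; with the square, both count as 4 and the cost reflects that neither estimation is right.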
That is fine. If you put the absolute value, the norm, you would have the same effect, but we want the square for something else as well. Still, at the very beginning, this is one of the main reasons. The second main reason is the one that you said: to penalize large errors. If I have a very large error, it is going to be penalized more than a small one. So this is the second reason, and there is a third one that we will see in a moment. It's not the same for me to have something like this: 10 minus 3 is 7, which is an error, and it is greater than 5 minus 3, which is 2. And this is okay. But what if I want to say to my learning procedure that this 7 has to be penalized because it's a big error? Well, in this case I square them: 10 minus 3, squared, is greater than 5 minus 3, squared; that is, 49 is greater than 4. Okay. So what I'm saying to my learning procedure is that this is not allowed, because if you want to minimize this function, you need to minimize those values. So one of the reasons why we use the square is to penalize large errors. Now, why else do we put the square? This is probably the most important reason. You can also get it from maximum likelihood, and I could explain that, which is mathematically sound, but there is another reason that makes it very important to put the square, and we will see it right now. You are making your cost function convex. And what is convex? Let me give you an example. You are doing this: you are creating a curve with one single absolute minimum. Why is this interesting? Because we are going to apply gradient descent. But before applying gradient descent, let me explain what this is. This is a cost function. I call the cost function L, because in the literature you are going to see that we call this the loss or cost function, so I put L for loss. And this is your cost function, a mathematical function that depends on an input.
That input is W, which are your trainable parameters; indeed, it depends on W and B, but for the sake of simplicity I'm putting just W. How does it depend on W? Well, in least squares we agreed that the loss is one over M times the sum, from one to M, of my hypothesis minus y, squared, where y is the label of each element in my training data set. If you want to put the 2, we put it here. So just by changing W, the cost changes. So what we have here on the x axis is W. Okay. We are working with just two dimensions and just one W. By changing W, what you are doing is changing, remember, the straight line that you are using for regression. Okay, because changing W and B changes the straight line, giving more or fewer errors. So we are now in the plane of the cost function, where, by changing W, you are changing your mapping function and you are fitting your data better or worse. Okay. So I can choose different values of W. Imagine that I choose this W, which will be the slope of my straight line, and this axis goes 1, 2, 3, 4, 5, whatever. Say this is 15 or whatever. With this I have an error, which is that one. This is the cost that I'm suffering for choosing this W. And I say suffering because I could definitely do better. For instance, if I choose this one, that was W0 and this is W1, this will be better just because the cost function is lower. Okay? And which will be the best? The best will be the one that minimizes the cost function. Okay, this will be the W that I'm going to call W*, W star: the one that minimizes the cost function. This is the one that I'm looking for. Okay. What happens if I don't make my cost function quadratic? It could have a different shape. So what can I do to minimize it? Well, I could use a learning procedure. But the first thing that you could try, before using one, is this: could we use an analytical solution? Yes? You don't think so?
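The picture of the cost as a function of W alone can be sketched numerically. The four data points are invented (they lie exactly on a line of slope 2), B is left out as in the lecture, and the cost includes the optional factor of 2 in the denominator.

```python
# Hypothetical training data lying exactly on the line y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss(w):
    """Least squares cost L(w) for the hypothesis y = w * x (no intercept)."""
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Evaluate the cost for a handful of candidate slopes.
candidate_ws = [0.0, 1.0, 2.0, 3.0, 4.0]
costs = {w: loss(w) for w in candidate_ws}

for w, cost in costs.items():
    print(f"w = {w}: L(w) = {cost}")

# Among these candidates, the best one is the one with the lowest cost.
w_star = min(costs, key=costs.get)
print("best candidate:", w_star)
```

The costs fall and then rise again around a single lowest point, which is the convex, bowl-shaped curve that the square produces.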
So what we are going to do is to use a learning procedure to minimize this cost function. But first, I don't know if you remember from your bachelor's or school classes, but if you want to find this point in an analytical way, you can do it. How? Just by taking the derivative of the cost function with respect to the parameters and setting it to 0. The solution of this equation will give you W*, W star. Okay. So if you're looking for W*, this is L and this is W, you can find it. Why? Because the derivative of the cost function with respect to the parameter is nothing else than the slope of the tangent, and the slope of the tangent at the minimum is 0. If you go to any other point, for instance here, the slope is not 0. This is a positive tangent, a positive slope. This is another one. But there is exactly one point where you have 0, and there you can find W*. You can do this with the equation of least squares. If you work it out, you will find that the solution W* is your training data X, put into matrix form: X transpose times X, inverted, and multiplied again by X transpose and y, which are the labels. Okay, this is what we call the normal equation. So there is a way to find the minimum, which is the optimal value that you are looking for, using an analytical equation, this one, the normal equation. So in the normal equation, if I have 100 samples, your X is 100 samples with just one single feature. In our case, we are using just the size of the house, so we have something which is 100 by one. You put it into a matrix in Matlab or in Python. You multiply its transpose, which is one by 100, by X itself, which is 100 by one. That means that you can multiply those matrices, and the result is going to be one by one. Doing the same dimensional analysis on the other part of the equation, you have X transpose, one by 100, and you have your labels. You have a label for every single data point, okay, the label is the real price, so that is 100 by one. So again you have something that you can multiply, and the result is one by one. Finally, you take the inverse of the first part.
You still have a one by one matrix, multiplied by a one by one. What you have is a single value, which is exactly the value that we are looking for, W*. Remember that your mapping function is y = W x + b. We are forgetting the b right now, but it would be the same. And here we have a single feature, which is the size, so the only W that we need is a single value. Okay, so you can do this and you are done with linear regression. You don't have to use any other learning procedure. You can apply the analytical solution, the normal equation. So the normal equation is okay, but it's not feasible to use the analytical solution when working with large data sets. It doesn't scale. Okay, no one uses the analytical solution when working with large problems, because it means that you need to multiply very large matrices and take the inverse, which is much worse. So in real problems we need to forget the analytical solution. Okay, we need something else. And I'm going to end with the solution, which is the most famous algorithm in machine learning: gradient descent. Gradient descent is the one that you are using in linear regression, logistic regression, support vector machines, or decision trees. In whatever case where you have to learn something, what we are using is gradient descent with different flavors. And, as we will see, in deep learning we will also end up using gradient descent. How does gradient descent work? This is the final part of this lecture. In gradient descent we have the following. It's going to be iterative, so it's not going to give me the solution at the very beginning. I don't know the solution, which is the one that I'm looking for, the one that minimizes the cost function, the one that makes my mapping function fit my data in the best possible way. Okay. This is still my cost function.
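For the single-feature case walked through above, the normal equation W* = (XᵀX)⁻¹ Xᵀy reduces to ordinary arithmetic, because both XᵀX and Xᵀy are one by one. The sketch below uses invented house data and plain Python so the dimensions stay visible; with many features the same formula would be evaluated with a matrix library, which is where the cost of the large products and the inverse comes from.

```python
# Hypothetical training data: house size -> price, one feature, no intercept b.
sizes = [50.0, 80.0, 100.0, 120.0, 150.0]    # X, an M x 1 column
prices = [150.0, 240.0, 310.0, 350.0, 460.0]  # y, the M labels


def normal_equation_1d(xs, ys):
    """W* = (X^T X)^-1 X^T y when X has a single column."""
    xtx = sum(x * x for x in xs)               # (1 x M) times (M x 1) -> 1 x 1
    xty = sum(x * y for x, y in zip(xs, ys))   # (1 x M) times (M x 1) -> 1 x 1
    return xty / xtx                           # inverting a 1 x 1 matrix is a division


def loss_derivative(w, xs, ys):
    """Derivative of the least squares cost with respect to w."""
    m = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / m


w_star = normal_equation_1d(sizes, prices)
print("W* =", w_star)
print("slope of the cost at W*:", loss_derivative(w_star, sizes, prices))
```

The second printed value should be zero up to floating-point rounding, which is the "derivative equal to 0" condition that the normal equation solves directly.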
And this is the W that I'm looking for, my goal. This is the one that I want. Well, gradient descent works in an iterative way. So we are going to pick a random W. Okay, we are going to choose a random W in the space of W, whatever number. Here. Okay, we pick a random one, and iteratively we are going to repeat the following equation. And the following equation, which seems to be very complex, is this. The new W, the new W that I'm guessing, is going to be the old one minus this quantity. And what is this quantity? Well, we have a positive number here, which is alpha, which is what we call the learning rate, and for right now, for the sake of simplicity, you can forget about it. We are going to talk about the learning rate later. And we have this, the derivative of my cost function with respect to the parameter, evaluated at the old point. Well, and what is that? If I choose W0, for instance, here, this is my first guess, the random one that I pick first. The only thing that I have to do is take this derivative, and the derivative is nothing else than the slope of the tangent at this point that I put here. This is the tangent, in red, and this is its slope. Okay? This is a number, and this is a number too. I'm going to multiply those numbers, and I'm going to subtract the result from W0. So let's see, with this example, what is going to happen. Well, W1, my next guess, is going to be W0, okay, which is this one, I'm following the example, minus something which is going to be positive: alpha, the learning rate, is small and positive, and in this case the slope of the tangent is positive. I don't know the exact number. Well, I could compute it, because we are working on a graph, but the slope, is it positive or negative? Positive.
It's positive. So what I'm saying, basically the intuition, is that W1 is going to be W0 minus something which is positive, multiplied by alpha, which is positive. And this basically means that my next guess, W1, I don't know exactly where it's going to be, but it's going to be to the left of W0. I don't know, maybe here, or maybe here, but to the left of this. And I like this a lot, because I know that W* is to the left. Okay. So now W2 will be W1 minus something which is positive, always, the learning rate, times something. Imagine that we are here. Let's take the derivative here. It is still positive. It's lower than the one before, but it's still positive. So W2 is going to be also to the left of W1. So iteratively, I am going toward the point that I want. When is gradient descent going to stop? You need a convergence criterion at some point. But what is the convergence criterion right now, here, with this equation? What can happen at some moment that makes my new W3, or 4, or 5, be equal to the previous one? Well, this is going to happen when this quantity is equal to 0. So this is the convergence criterion. Exactly: alpha is something positive, but if the slope is exactly 0, the new point is going to be the same as the previous one. And this means that I found the point that I want, because remember that I'm trying to find the point where the slope of the tangent is exactly zero, and the point that I'm looking for is the only point where the slope is 0. That's the magic of gradient descent. I'm going directly, looking for a point where the gradient is exactly zero, which is exactly the one that I want. You see that? So let's look at another example, just to finish. Instead of starting at this point, suppose you start over here.
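The update rule and the stopping criterion described above can be sketched in one dimension. The data (points on a line of slope 2), the learning rate of 0.05, and the small tolerance used in place of an exact zero slope are assumptions for this illustration.

```python
# Hypothetical training data lying exactly on y = 2 * x, so W* = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss_derivative(w):
    """Slope of the least squares cost (with the 1/2M factor) at w."""
    m = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / m


def gradient_descent(w0, alpha=0.05, tolerance=1e-9, max_steps=10_000):
    """Repeat w_new = w_old - alpha * slope until the slope is (almost) zero."""
    w = w0
    for _ in range(max_steps):
        slope = loss_derivative(w)
        if abs(slope) < tolerance:  # convergence: the update no longer moves w
            break
        w = w - alpha * slope
    return w


# Starting to the right of W*: the slope is positive, so the first step moves left.
w0 = 5.0
w1 = w0 - 0.05 * loss_derivative(w0)
print(f"w0 = {w0}, slope = {loss_derivative(w0)}, w1 = {w1}")

print("from w0 = 5.0  ->", gradient_descent(5.0))
print("from w0 = -1.0 ->", gradient_descent(-1.0))
```

A tolerance is used because, in floating-point arithmetic, the slope rarely becomes exactly zero; it only gets small enough that further updates are negligible.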
I told you that we are picking a random W, so we can land here. So we start here with W0. What do you think? To find this W*, do I need to change something in the formula, or is it going to work? Exactly. W1 is going to be W0 minus something positive times something negative, because now the slope is negative. Okay. Sorry, the tangent just touches the curve at this point; it is not crossing it. Thank you. It looks like that just because I'm drawing it with the pen. Okay. So W1 is going to be to the right of W0, because minus times minus is plus. So you are going to add something to W0, and you are moving into this part of the curve. Until when? Until you find exactly a point where you have zero slope. At that moment you are going to finish your search. And what about... yes, sorry that we don't have much time right now. What is the role of alpha, which is the learning rate? The learning rate sets how fast, or how large, the jumps are that you make between one point and your next guess. So if you start here and your learning rate is large, you can make large jumps. If it's very small, you are going to make small jumps. That's why the learning rate determines the time that you need for convergence. What happens if you make it very, very large? You can miss the point. You can miss the minimum, because you can jump from here to here, and you can start bouncing. That's why in many libraries, when you press fit and start to train your algorithm with a learning rate, they modify the learning rate at every step. According to what? According to the slope. If the slope of the tangent is small, it means that you are close to the point that you are looking for, and you will need a small learning rate to go carefully to your minimum. And at the very beginning, the slope is very, very large.
So the learning rate can also be large, because you are very far from the point that you are looking for, and if you are very far, probably the best thing is to make large jumps. You are probably going to converge quicker than by doing something very, very small. Okay, that's it. That's the role of the learning rate. Is it more or less clear? This is the algorithm that we will use all along. Gradient descent is what we are going to apply in deep learning, and it is the one that you use for minimizing the cost function in support vector machines, which means that you are maximizing the margin, and it is the same that you are doing in linear regression, or in trees, or wherever you are using gradient descent. Again, what you are doing is minimizing the cost function by changing the trainable parameters. The trainable parameters are changing your mapping function. To do what? To fit your training data in the best possible way. That's all for today. We will go on to logistic regression and then to artificial neural nets, and you will see this using different techniques. Okay. Any questions? Yes, you can have several minima. This prevented the use of deep learning for more than 20 years, from the eighties to the beginning of the 21st century. What did you say, David? Just for these kinds of things. We will see how to prevent that in the rest of the classes. But this is a problem. If you start here and run gradient descent, you are going to end up here, because gradient descent doesn't know that there is another point here. It is just looking for something with a slope equal to 0. If you start here, you're going to fall into a local minimum. If you start here, you're going to be lucky, and also here. So in the eighties, people ran the neural net more than 2 or 3 times, or sometimes 1,000 times, each time picking different initial weights to see if they were able to find these kinds of points.
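The local-minimum problem and the "run it again from different starting weights" workaround can be sketched on a one-dimensional non-convex cost. The polynomial below, the learning rate, and the fixed list of starting points are all assumptions chosen only so the effect is visible; they do not come from the lecture.

```python
def cost(w):
    """Hypothetical non-convex cost with one local and one global minimum."""
    return w**4 - 3 * w**2 + w


def cost_derivative(w):
    """Derivative of the hypothetical cost above."""
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, alpha=0.01, tolerance=1e-8, max_steps=100_000):
    """Plain gradient descent: it stops wherever the slope is (almost) zero."""
    w = w0
    for _ in range(max_steps):
        slope = cost_derivative(w)
        if abs(slope) < tolerance:
            break
        w -= alpha * slope
    return w


# Several starting points stand in for the different initial weights.
starts = [-2.0, -0.5, 0.5, 2.0]
results = [(w0, gradient_descent(w0)) for w0 in starts]

for w0, w_end in results:
    print(f"start {w0:+.1f} -> ends at w = {w_end:+.3f}, cost = {cost(w_end):+.3f}")

# Keep the run that reached the lowest cost.
best_start, best_w = min(results, key=lambda pair: cost(pair[1]))
print(f"best run started at {best_start:+.1f} and ended at w = {best_w:+.3f}")
```

Runs that start on the right end in the shallower minimum and never learn that a deeper one exists; only comparing the final costs across restarts reveals which run was lucky.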
Okay. So with a neural net, you can run into this problem, and we will see that. You can fall here, but you don't know it. Well, indeed, you would probably fall there all the time. This was a real problem. We will see why this is no longer a problem when we are working with complex problems with a lot of data. But this is a problem of gradient descent when you don't have a convex function. This is a non-convex function, where, going in a straight line from one point of the curve to another, you cross the curve. Okay, that's one of the definitions. So this was the problem, falling into a local optimum, and for 20 years it affected people working on neural nets. We will see why this is no longer a problem. Any other questions? I think that this Monday to Thursday I'm going to open the forum. Once I open it, it will stay open for the rest of our lectures, no matter what the day is. So it's Monday to Thursday, but it will be open every day. Okay, because it's the same to us. That's something for the third class. Okay, ready? Yeah, then have a very good weekend. You too. Thank you. Bye bye.