Hello! How are you? Hi there! Good afternoon! We have a few minutes before the rest of the class joins. Can you see my slides? Perfect. That's all.

Hello! Good afternoon, everyone. I guess we can start right away, if you don't mind switching on your cameras so we can see each other. As always, feel free to open your microphone and ask any questions you have during this session.

So today is the big day. We are going to talk about language modeling. In the two previous sessions, we discussed the basic building blocks of natural language processing and the steps we need to create a machine learning pipeline. We defined a natural language processing pipeline, and we talked about the steps to move from raw text to pre-processed text. We discussed removing stop words and enriching the representation of the text, for example with the dependency parse. We also talked about creating a document-term matrix, with binary, frequency, or TF-IDF weights.

Now we modify the natural language processing pipeline by adding a detour. Instead of manually creating a feature-based representation, we are going to try to learn the representation automatically. This is the idea behind language modeling. We are trying to create a model that is able to understand the relationships between words, the semantics of the words, and the different aspects of the language. On top of that, the representation you build by hand for one domain is not something you can reuse across domains. So we are going to try to solve these problems by creating another type of representation.

Here you have a noun, and you have an unconnected one over there. You are expecting another. Now, this makes sense. But if you move one level up, to semantics, the meaning of the words: do you understand what a mushroom or pepperoni is? You do understand that, but it's a bit of a stretch.
What are the types of ingredients that you typically put on pizzas? And not only that: if you move to the pragmatic level, you can understand that in order to complete the sentence, if I'm telling you that I always see something on it, perhaps to keep completing the sentence you need to know, or understand, or infer, that I would like to order this pizza. So, for example, you could continue the sentence with "Do you want me to order this?", if it's something you would like. This is not so silly anymore, right? You are able to do that because you have a strong understanding of the language. And you can expect that if an artificial intelligence system is able to do it, it is because it understands the language too. So this is the basic idea behind training language models. You can use the model to generate new text by completing sentences and, even better, you can take the representation generated by the language model and use it in another task. This is called transfer learning. So don't think of language models as just producing sentences; think also of the representation they generate internally.

The first thing you do is calculate the probability of the first word in the sequence. Then you calculate the probability of the second word in the sequence, given the first one. Then, what's the probability of the third word, given the first and second words? This is what you apply to the example that we have. In order to model the probability of the sentence, the first thing we need is the probability of the first word. Second, given the first word, we need the probability of the second word, and so on. And you multiply these probabilities together. This is how you do it. Okay, let's do it together. How do you compute this first probability? How do you compute the probability of the first word in the sequence? What do you think would be a good idea?
Right, you have a training dataset, a training corpus, and you can basically count the number of times that the word appears in the corpus and divide that by the total number of words in the corpus. So you will have the probability of the word. It's that simple, right? In the end, I always say the same thing: if you tend to get scared of probabilities, probabilities are just a fancy way of counting things. The only thing you need to know is what you would like to count. This is simple. Okay, let's move to the second one. How do you compute this probability? How do you compute the probability that this word appears after the first word? Any ideas? How would you apply Bayes' theorem here? It's a little bit simpler than that. Imagine that you want to compute the probability of this word given the previous one. Remember, probability is just counting things. I would like to count the number of times that I see the two words together, one right after the other, in my corpus. And I would like to normalize that by the number of times that I see the first word. That's the idea. If I just counted the number of times the pair occurs, I could get thrown off by the fact that the first word is a very common word, so I would like to account for that. So you count the number of times that the pair appears, and you divide by the number of times that the first word appears. In other words: when you see the first word in your dataset, how many times is it followed by the second word? It's a simple computation, right? It's something you can do with the dataset that you have.
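The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus and the function names `unigram_prob` and `bigram_prob` are hypothetical, not something shown in the lecture.

```python
from collections import Counter

# Toy corpus (hypothetical); each sentence is already split into words.
corpus = [
    ["i", "order", "a", "pizza"],
    ["i", "order", "a", "salad"],
    ["i", "like", "pizza"],
]

# Count single words and pairs of consecutive words.
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (prev, word) for sent in corpus for prev, word in zip(sent, sent[1:])
)
total_tokens = sum(unigram_counts.values())


def unigram_prob(word):
    """P(word) = count(word) / total number of words in the corpus."""
    return unigram_counts[word] / total_tokens


def bigram_prob(prev, word):
    """P(word | prev) = count(prev followed by word) / count(prev).

    Simplification: sentence-final occurrences of `prev` are also counted
    in the denominator, exactly as in the verbal description.
    """
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(unigram_prob("i"))          # how often "i" appears overall
print(bigram_prob("i", "order"))  # how often "i" is followed by "order"
```

Nothing here goes beyond counting: the conditional probability is just the pair count normalized by the count of the first word.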
But it's simpler to just do the counting over sentences. Something else you could do is model the relationship of each sentence to the previous one, so that you don't chop your understanding into isolated pieces. So, okay. We know how to compute this probability and this one. It is simple, right? I hope you follow me on that. You agree with me that in order to compute the rest of the probabilities, we can basically do the same. If we want to compute this probability, the only thing I need to do is count the number of times that the whole sequence, in this order, appears in the dataset, and divide by the number of times that the preceding words appear in the dataset. So you are able to compute the whole product.

This is what I do over there. Do you envision a limitation with this approach? That's it, and thanks for that, because you have basically explained n-gram based modeling. How many times do you think this sentence will appear in Wikipedia? Not so many times, right? If you're lucky, a couple of times, three times. As you can expect, estimating a probability from just two or three occurrences is not a very good idea. The probability you estimate is not going to be very reliable. And past some point, as I mentioned, if you keep extending the context you won't observe even a single example, so you won't be able to model this product. When you try to model long-term relationships, you cannot do that following this approach. But I hope all of us have had the chance to play with GPT or something similar, right? There I want to generate really long sequences while staying coherent with the information in the initial prompt, even if you input a prompt with hundreds of words. So how do you do that?
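Written out, the product of probabilities discussed above is the chain rule, and each factor is estimated by the counting just described. The notation is an assumption for illustration: $w_i$ is the $i$-th word of the sentence and $C(\cdot)$ is the number of times a word sequence appears in the training corpus.

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
P(w_i \mid w_1, \dots, w_{i-1}) \approx \frac{C(w_1 \dots w_i)}{C(w_1 \dots w_{i-1})}
```

For long histories both counts are tiny or zero, which is exactly the limitation raised in the discussion.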
The initial approach we have is the one you mentioned, right? It's called n-gram language modeling, and it relies on the Markov assumption. We make the simplifying assumption that in order to model this probability we don't actually need the entire sentence, all the words seen before, but just the previous one, two, or three words, or five, or something like that. So, for example, I can approximate this probability using only the two previous words. These are n-gram based models. I hope you remember n-grams from the practice we had. There you could create these n-grams, which are basically sequences of a given size: two-grams, three-grams, and so on. And as you experienced in the practice, if you use a really large sequence size, it didn't work, right? The optimal performance was achieved at around two, three, four, five. We talked a little bit about that on the forum. It was because, if you take a long sequence, you are not going to see it often enough in your data, so you won't be able to model it. So this simplifying assumption is what we applied prior to deep learning, because that was the only way in which we were actually able to train these language models. Otherwise, if you try to model 20 or 30 or 70 positions back, you are not able to do that. Okay, these models were not perfect, but they worked. In fact, I'm talking about something that started around the sixties and the seventies. We have always had language models, but honestly, we did not pay a lot of attention to these approaches because they were rather limited, I would say.
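A minimal sketch of the Markov assumption in code, assuming a toy corpus and hypothetical helper names (`ngrams`, `train`, `sentence_prob`); the `<s>` padding symbol for sentence starts is also an assumption. It shows how a sentence probability is approximated from the previous n-1 words only, and how a longer context quickly runs into unseen sequences.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(corpus, n):
    """Count n-grams and their (n-1)-word contexts, padding sentence starts."""
    ngram_counts, context_counts = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent
        for gram in ngrams(padded, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def sentence_prob(sent, n, ngram_counts, context_counts):
    """Product of P(word | previous n-1 words), each estimated by counting."""
    prob = 1.0
    padded = ["<s>"] * (n - 1) + sent
    for gram in ngrams(padded, n):
        context = gram[:-1]
        if context_counts[context] == 0:
            return 0.0  # context never observed in training
        prob *= ngram_counts[gram] / context_counts[context]
    return prob


corpus = [
    ["i", "order", "a", "pizza"],
    ["i", "order", "a", "salad"],
    ["you", "order", "a", "pizza"],
]
new_sentence = ["you", "order", "a", "salad"]  # plausible, but not in the corpus

for n in (2, 3, 4):
    counts = train(corpus, n)
    print(n, sentence_prob(new_sentence, n, *counts))
```

With two-word sequences the unseen sentence still gets a non-zero probability; with four-word sequences the model has never observed "you order a salad" and the estimate collapses to zero, which is the sparsity problem described above.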
So, for example, following this approach, this is the kind of sentences you are able to generate. Let's focus on the trigram. The idea is that I'm trying to predict the next word given the two previous words. And as you can see over there, you have something that looks like English, and it looks like the language you could find in the Wall Street Journal. It is not the quality you expect nowadays from BERT or GPT and so on. It's not bad, but it's limited. You can't have an actual, meaningful conversation with it; it's not an actual sentence, right? It's something that looks like a sentence, but not quite. But it was still a revolution. Yes, we have had these models for 40 or 50 years, but we didn't care much because their performance wasn't impressive. As you can imagine, the kind of language model they were able to create was rather useless. So what's the point of GPT? It's a language model too. How is it able to achieve such astonishing performance? We'll try to explain that, go through the architecture, and discuss GPT. We'll also create representations that will be useful for the next session, where we'll talk about text classification and how we can take advantage of them. Does anyone have any questions or comments?

We had to wait almost a decade until we actually saw some models, many based on recurrent neural nets and, since 2017, on transformers, which is another type of architecture, to actually see good performance on this. When we apply deep learning, the question is what type of neural net we want to apply. Are you more or less familiar with these ideas in deep learning? The scope of this class is not for you to dive deep into recurrent neural nets, because that is something you are going to do elsewhere in the program.
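Going back to the trigram generation described above, here is a hedged sketch of how such text can be produced: count which word follows each pair of words, then repeatedly sample the next word given the two previous ones. The tiny corpus, the `<s>` and `</s>` boundary symbols, and the function names are assumptions made for the example, not the setup used on the slide.

```python
import random
from collections import Counter, defaultdict


def train_trigram(corpus):
    """Map each pair of previous words to a Counter of observed next words."""
    model = defaultdict(Counter)
    for sent in corpus:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            model[(w1, w2)][w3] += 1
    return model


def generate(model, rng, max_len=20):
    """Sample words one at a time, each conditioned on the two previous words."""
    w1, w2 = "<s>", "<s>"
    out = []
    for _ in range(max_len):
        counts = model[(w1, w2)]
        words = list(counts)
        weights = [counts[w] for w in words]
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == "</s>":  # end-of-sentence symbol was sampled
            break
        out.append(nxt)
        w1, w2 = w2, nxt
    return out


corpus = [
    ["the", "market", "fell", "sharply"],
    ["the", "market", "rose", "sharply"],
    ["the", "index", "fell", "slightly"],
]
model = train_trigram(corpus)
sentence = generate(model, random.Random(0))
print(" ".join(sentence))
```

Every three-word window in the output was seen in training, so the text looks locally fluent, but nothing ties the end of a long sentence to its beginning.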
The only thing I want you to understand is that the whole point is that it is able to model a sequence. A text is a sequence of events, and the word that I am uttering right now is related to the previous one. Recurrent neural nets are typically good for modeling text, most kinds of time series or, more formally, any kind of sequential information, such as audio or video. You have an input layer, a hidden state, which could be more complex, and an output. The idea is that the recurrent neural net should be able, given this input, to model it in a clever way in the hidden state so that it can output the information you want. In this case, the information you want is: if I give you the first word of the sentence and the second word of the sentence, what is the next word of the sentence? Remember that the whole point of language modeling is this problem: if I give you part of a sentence, predict the next word. With this input and this input, what output do I want to predict? You can keep going, though probably not for very, very long sequences. What's the difference? Because of the hidden state, you are now actually able to model much longer sequences. In fact, you will hear about a specific type of recurrent neural net called the Long Short-Term Memory network, meaning a type of neural network that is great at modeling short-term dependencies, that this word is related to the previous one, but also long-term dependencies, where a given word in the sentence is perhaps related to the previous sentence, or to the first word in the sentence. That's the whole point.
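To make the "input, hidden state, output" picture a bit more concrete, here is a minimal sketch of the recurrent update in plain Python. It assumes the standard simple (Elman-style) recurrence with a tanh activation; the weights are arbitrary hand-picked numbers, not trained values, and the output layer is left out.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Feed the inputs one by one; the hidden state carries the history."""
    h = [0.0] * len(b_h)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h


# Arbitrary illustrative weights (assumption, not learned).
W_xh = [[0.5, -0.3], [0.2, 0.8]]
W_hh = [[0.1, 0.4], [-0.2, 0.3]]
b_h = [0.0, 0.0]

word_a = [1.0, 0.0]  # one-hot vector for a first hypothetical word
word_b = [0.0, 1.0]  # one-hot vector for a second hypothetical word

print(run_rnn([word_a, word_b], W_xh, W_hh, b_h))
print(run_rnn([word_b, word_a], W_xh, W_hh, b_h))
```

The two runs see the same words in a different order and end in different hidden states, which is the sense in which the network models a sequence rather than a bag of words. It also has to walk through the words one after another, a point that matters later when the transformer is introduced.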
Again, the details of recurrent neural nets and long short-term memory networks, and the difference between a long short-term memory network and a default recurrent neural net, are something you will learn in due course.

Now we are able to do exactly what we wanted to do: generate new sequences of words, each related to the previous ones. The metric used to evaluate language models, without diving into the details, is called perplexity. It tries to measure how good the probabilities estimated for the words that come next in the sentence are, meaning how well they match the words in the text you feed in. For example, if you input "I order a pizza with cheese" and the model outputs "pepperoni", that makes sense because it is likely in the dataset. By applying long short-term memory networks, we were able to improve by a lot; lower perplexity is better. This was a kind of breakthrough in natural language processing. The perplexity levels we have with GPT models are in the teens, which is an order of magnitude better than the perplexity that we had with traditional models. However, there are still limitations with these neural nets. For example, we are not able to consider the order of the words or the different meanings of the words. Additionally, the vectors are not domain specific. For example, if you want to create a language model for modeling contracts, a model trained on Wikipedia or on a corpus of news will not have the same vocabulary. Ideally, we would like the representation to be related to the kind of documents we have. Between 2014 and 2016, we were able to account for this by using different types of architectures.

That was when the real impact on natural language processing came, with this model called ELMo. The idea of ELMo is that the model is a recurrent neural net. Okay? And what is the main difference compared to the previous ones? The whole point of ELMo is that it tries to create a context-based representation for the words.
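As a small aside on the perplexity metric mentioned above, the usual definition can be sketched as follows: take the probability the model assigned to each word that actually came next, average the negative log-probabilities, and exponentiate. The function name and the example probabilities are hypothetical.

```python
import math


def perplexity(word_probs):
    """exp of the average negative log-probability assigned to each true word."""
    n = len(word_probs)
    avg_neg_log = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log)


# Hypothetical probabilities that two models gave to the same four true words.
confident_model = [0.5, 0.4, 0.6, 0.5]
unsure_model = [0.05, 0.1, 0.02, 0.05]

print(perplexity(confident_model))
print(perplexity(unsure_model))
```

Lower is better: a model that always spreads its probability evenly over k candidate words has a perplexity of exactly k, so the number can be read as "how many words the model is hesitating between" on average.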
So the idea is that you still have the words as input, and you are trying to predict the next word. But in the previous example, we only trained the weights of the model. That was the only thing we were changing; the representation was always the same. Here we also retrain the embedded representation. By doing that, what you are actually doing is creating an embedded representation that is specific to the given sentence. That means the representation of a word will depend on the context in which it appears. If I use the same word in a different sentence, it will have a different representation, one that is relevant to the sentence in which it appears. So if the word you are seeing appears, I don't know, in a contract, and then it appears on Twitter, the meaning, the representation, will be different, because the context is different. Okay? If you want to do that, you need to use the dataset to train the representation. And the point is more than just having a dataset. For example, with Wikipedia, there are going to be some words that have a different meaning depending on which Wikipedia page they are on, because it's not the same to be on a Wikipedia page talking about one thing or another. So that's the point. The representation of the words will depend on the rest of the words that appear in the context. That's the idea. If you have a training dataset of legal contracts or similar documents, the representation will be adapted to them. But of course, if you then want to apply the model to a different domain, one the model does not know about, you need to do something that we will see later on and explain in more detail in the next session, which is called fine-tuning. Basically, you will need to train again on the new data.
Because you had not exposed the model to your legal contracts before. Okay, the main difference is that if you train your neural net on legal contracts or on Wikipedia, the neural net will be able to understand that, in this context, the meaning of the word is this one, and that is because of the rest of the words that appear there. That's the whole advance of ELMo compared to previous representations. Previously, the representation of a word was the same no matter which context it appeared in. So this is a huge revolution. Why is that, by the way? For more details, I do recommend that you check Jay Alammar's blog, because it explains all of these ideas and all of these deep learning architectures in a very good and intuitive way, in case you want to dive deep into these architectures. This is also going to be relevant for what comes next. Why? Because I told you that this is a language model. The idea is that if I give you the sentence, you will be able to predict the next word, which is an interesting task, but it is not actually solving any real problem, right? So what they did, and this is the important part of language models: if you look here, you have an input layer, some hidden layers, and the output layer. And this output layer is basically a language modeling layer.

By the way, these output layers are commonly referred to as the head of the neural net. So you have a language modeling head, meaning that you use this representation to predict the next word. This is how it works: you remove this language modeling head and put a classification head on top instead. So basically, you take this representation and use it to train a text classifier. And now you are able to do classification. Another thing you can do is use it for machine translation.
Given a sentence in one language, you can translate it into another language. Or you can put on another head, for predicting whether two words are related to each other, or for whatever natural language processing task you want. This is what they did, and they were able to improve on every one of these tasks. The advantage was not that they created an excellent text classifier or an excellent machine translation system. That was not the point. The point is that the representation was excellent. Once you have a representation that is this good, whatever you apply it to will perform better. I hope you remember from machine learning the idea of feature engineering: you spend a lot of time working on your dataset in order to add new features, remove irrelevant features, scale the features, and so on, so that you have a good representation of your data. Once you have that, even a simple classifier works rather well. This is more or less the same idea. Of course, it is not completely automatic, but the idea is that you have created a representation which is so good that, when you apply it to very different domains, it performs well. These domains are what we actually call tasks. Okay, that was the first big revolution. Just to introduce some of the terms: you have a pre-trained model, a model that has already been trained, so you don't need to train it all over again. You just take it and apply it to another context. This idea of applying a pre-trained model to different problems is called transfer learning, because you have some learning that you transfer to different tasks. And retraining the model for classification, and so on, is typically called fine-tuning, because you fine-tune the model.
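The idea of keeping the pre-trained representation and swapping the head can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for the example: the `PRETRAINED` word vectors stand in for a real language model's representation, the encoder simply averages them, and the new head is a logistic regression trained with plain gradient steps. Only the head's weights are updated; the representation stays frozen.

```python
import math

# Hypothetical "pre-trained" word vectors standing in for the representation
# a real language model would provide. They are never modified below.
PRETRAINED = {
    "great": [1.0, 0.0], "good": [0.8, 0.1],
    "bad": [0.0, 1.0], "awful": [0.1, 0.9],
    "movie": [0.5, 0.5],
}


def encode(tokens):
    """Frozen encoder: average the pre-trained vectors of the known words."""
    vecs = [PRETRAINED[t] for t in tokens if t in PRETRAINED]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def train_head(examples, epochs=200, lr=0.5):
    """Train only a small classification head (logistic regression)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for tokens, label in examples:
            x = encode(tokens)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            err = p - label  # gradient of the log-loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(tokens, w, b):
    """Return 1 (positive) or 0 (negative) from the frozen representation."""
    x = encode(tokens)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


train_set = [
    (["great", "movie"], 1), (["good", "movie"], 1),
    (["bad", "movie"], 0), (["awful", "movie"], 0),
]
w, b = train_head(train_set)
print(predict(["great", "movie"], w, b), predict(["awful", "movie"], w, b))
```

In a real setting the encoder would be a large pre-trained network and fine-tuning may also update its weights; the sketch only shows why a good representation lets a very simple head do the job with very little labeled data.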
You don't train it all again, because the model has already been trained on language modeling. You just fine-tune it. Think of it this way: first the model is taught how to speak English. And once you know how to speak English, I can easily teach you how to play soccer, how to cook, or how to do machine learning. But if you and I don't speak the same language, it's going to be very difficult for me to explain to you how to play soccer, right? So the language modeling task teaches the computer how to speak a given language. And now that the computer is able to speak this language, you can train it to do several other things. That's the whole analogy. This has been, by the way, one of the most important concepts in artificial intelligence. It is something we are now trying to apply to all sorts of scenarios, computer vision, tabular datasets, video, music, and so on: pre-train a model on a task that is simple to set up. For example, language modeling is rather simple, and the dataset is readily available, because you can basically download text from the internet. Then you can use this dataset to pre-train a model.

You remove a word from a sentence in the dataset and predict it. So basically you get the labeled data for free. You can train a huge model, and then you can use this model to solve a specific problem. For that, you only need a small dataset. We will talk more about that with the example of classification. That's the whole idea. Is this clear? Are there any questions? Do you need me to go over some of these concepts again? Any questions before moving on? No, not really. It depends on what you want to do. Typically, they train these models on newswire, which is basically a corpus of news reports, and also on Wikipedia and the BookCorpus, which is a corpus of books.
So now, this is an important topic, because it depends on what you are doing. We know about issues such as gender bias, and they are trying to avoid them. But of course, if you use a dataset which is biased, that is what the model is going to learn. So you need to be extremely careful about the data you use to train your model. Typically, they use these large datasets for training. Now, what they are doing is basically downloading resources from the Internet and curating them in order to try to make sure they don't carry any kind of bias. But here we are talking about billions and billions of documents. So yes, OpenAI, Google, Facebook, and so on are very careful about what they say about it, so you should always refer to the paper, where they talk a little bit about the selection of the dataset they used for training. But basically you need terabytes of data to train these models. That is the downside: you need a lot of data.

So we have what we need. Previously, we had n-gram models, and we were not able to model long-term sequences. Now we are able to model longer sequences. But this is not 100% true. Recurrent models actually struggle when you try to model a sequence of 256 words, or of 1,000 words. On top of that, as we will see later on, one of the main ideas here is that if you keep scaling up the model, the model will perform better. If you have a model which is 10 times larger, you expect it to perform better as well. You can't scale these recurrent neural nets much, because if you do you will not be able to train them; it will take forever. So we need to move on from that: we need a new type of architecture. So, enter the transformer. What is this precisely? It comes from a 2017 paper.
The paper is called "Attention Is All You Need", and it completely changed the idea of how to train on sequential information. So in the previous models, how did you account for this? I hope you remember this problem of dependencies, right? When you have a sentence, you have these relationships: this is the noun, this is the verb. What did we do in the past? In order to account for these relationships, you need to follow the sequence. You need to model the word, its relationship to the previous word, and its relationship to the next word. For this example it's not really problematic; the size of the window is small. But in the end, for every additional word, you need to keep following the sequence. With the transformer, you can visualize the relationships between the words in the sentence as the model learns them. You can see that the darker the color, the more related the words are.

So this is the first layer, and you can check the different types of relationships that it learns. You can see over there the actual relationships between the words. And when you check one of the heads, you see what it has learned. Now you see that it makes sense, right? It's the relationship between words that are related to each other, and it is learned automatically from the data. Okay. So this is the whole architecture. As you see, you have a lot of things over there: multi-head attention, a fully connected neural net, and so on. The idea is that all of this is learned automatically. What you need to do is be careful about how you train the neural network. There are different tasks on which we can train the neural net.
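The attention weights visualized above (darker color, stronger relation) come from a computation that can be sketched in plain Python. This is a simplified illustration of scaled dot-product attention as defined in the 2017 paper, under loud assumptions: the word vectors are made up, and the same vectors are used as queries, keys, and values, whereas a real transformer first applies learned projections and uses several heads.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this word to every word in the sentence, all at once.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # New representation: weighted mix of all the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Made-up 2-dimensional vectors for a three-word sentence (assumption).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(words, words, words)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is one line of the attention picture: how much a word attends to every other word. Every pair of words is compared directly, however far apart they are, so no step-by-step pass through the sequence is needed.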
For example, take machine translation, starting from the sentence in English. What the transformer is going to do is create the representation of the sentence in English based on this idea of self-attention. You will create the same kind of representation for the sentence in Spanish, put both representations together, and then try to predict the translation. So we have millions of data points, millions of documents, and you are able to train on them. You don't have a single block; you have several stacked together, so typically these networks are dozens of levels deep, and you can have close to hundreds of layers in this deep neural network. Okay? But the idea is that training these types of neural nets is way faster than what we were able to do with the recurrent neural network. With ELMo, for example, we were not properly able to model these long-term relations. Remember that with the recurrent network you needed to go through all the words in the sentence, one after the other. With self-attention you don't need to do that, because you are able to directly model these relationships in the self-attention matrix, right? And, as I told you in advance, here you have multiple heads.

So in this context, what is the advantage of this transformer architecture compared to the recurrent neural network? Well, it is faster to train: several orders of magnitude faster than the previous models. And with it they trained a model called BERT. This was trained by Google in 2018. Remember that we talked about ELMo and its huge impact in natural language processing. Well, they did the same with BERT, and they trained it with what is called masked language modeling. For the sake of the example, you can think of it as predicting the missing word in a sentence. We already talked about this idea of predicting words. It might seem silly, right? But if you are able to do that, it is because you understand the language, because you understand all of these linguistic, lexical, and semantic relationships.

The point is that by using a neural network, you can actually do this. So here is what we do.
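To illustrate how masked language modeling gets its training examples for free, here is a minimal sketch that hides some words of a sentence and keeps the originals as the targets to predict. The function name and the example sentence are hypothetical. The 15% masking rate is the one reported for BERT; the actual recipe also sometimes keeps the word or swaps in a random one, which this simplified version leaves out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Hide some tokens; return (masked input, {position: original word})."""
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok  # the label comes from the text itself
    return masked, targets


sentence = "i would like to order a pizza with cheese and pepperoni".split()
masked, targets = mask_tokens(sentence, random.Random(0), mask_prob=0.3)

print(" ".join(masked))
print(targets)
```

No human annotation is involved: any raw text can be turned into (input, target) pairs this way, which is why billions of documents can be used for pre-training.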
You need to gather billions of documents to train this new model. But now you are able to use the data in those billions of documents, because you have a huge network that you can train. Of course, you will need a lot of money, right? I don't know if I have the numbers here, but it is around $250,000 per execution, so about $1,000,000 in total. So training this is not straightforward. Exactly. You don't need to remember the figures. You just need to reuse the pre-trained model for your project. More details are in the advanced materials that I included. One of them is the model that we use. What is the model that we're using? So now we have all the pieces that we need. We have this idea of pre-processing. We have this idea of training. I have some information that I would like to model, some text. I know how to preprocess it, and I know that by applying language modeling I will be able to create a representation that is good enough that, when I use it, I will be able to train a lot of things on top. The original language models were not good enough because of the limitations we talked about. Then we brought in deep learning: we proposed using a neural network, which is way better but still has some limitations in terms of training and the size of the corpus. And now we have the transformer. This architecture is able to take a lot of documents, learn from them, and create the representations. That's more or less the picture. Okay? So yes, please go ahead. That's a good question. In the pre-processing phase, we don't need pre-processing as detailed as what we explained before. Why? Because, if you remember, on this slide we had a table with the different types of pre-processing that you need when you are going to actually use the model.
So basically this pre-processing is really done for you, from the practical point of view. We will see that in practice you don't need to do almost anything. You just need to input the text and reformat it a little bit, but basically all the pre-processing will be done internally. So this is rather convenient, right? Because you don't need to bother with the syntax and so on, right? Not anymore, if you are going to rely on this model. Okay. But I still see a problem that I hope you also see. Okay, you have your pre-processed text. So imagine, for the sake of the example, that you want to do sentiment analysis. You have some text and you would like to classify it into positive or negative. You cannot train one of these models from scratch with your small dataset of 10,000 samples, and I mean, you don't have $250,000 to train one for your project, right? So why is this having such a big impact? Because again, training these models is difficult. They have billions of parameters, and you see the complexity of all of these models, and they keep scaling. We also have some numbers on the complexity. I told you $250,000, but to give you more details: what you see over there is the amount of CO2 emissions that you will generate in your lifetime by driving your car. Okay, Maggie, you have a regular gas car. Okay. If you want to take the train, I'm fine with that. You will generate 6 times less emissions than driving your car in your lifetime. So I hope this gives you a more or less intuitive idea of how complex this model is. Basically, you need a huge cluster of GPUs training the model for several months. Right? So it's not feasible. So what's the idea then?
Why do we need to use these models? Because we have something called transfer learning, which I will just introduce today. The idea of transfer learning is that you don't actually train the big model yourself. So this is the scenario, right? This is the picture. So you have, not you, but typically this is Google, right? Google does have a huge dataset. So they train BERT on that, on generating new text or whatever. And finally, thanks to their work, you are going to be able to have a fine representation of your data. So no, you had better leave it to Google to train BERT. But you can download BERT. Now you have a specific task, a specific dataset that you have for your sentiment analysis. So you have 10,000 samples annotated as positive and negative, and you somehow preprocess the dataset. And now you use BERT to create the representation of your dataset. Okay. So basically, the idea is that you take your preprocessed data and pass it through BERT. We will see how this is done in practice. Don't worry. And you will have an excellent representation of the dataset that you care about. Okay. So it's not that you are learning everything from this tiny dataset. No, you are not. What you are doing is just using BERT to represent your dataset. Okay. And now, as this representation is so excellent, whatever machine learning model you apply, even one as simple as a logistic regression classifier, will work. Okay. That's the idea of why these models are so important. In any case, as they are so complex, something that we have tried to do is to reduce the size of these models. And this is the concept of model distillation. The idea of model distillation is that, instead of using BERT as it is, you take a smaller model, say half the size of BERT, and you train this model to predict the output that BERT will give. So it's not that you train this model on language modeling again. No, I don't care about that.
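As a rough illustration of the distillation idea just introduced, the sketch below trains a small "student" to reproduce the output probabilities of a fixed "teacher", without using any human labels. Everything here is a toy assumption: the `teacher` function is a hypothetical stand-in for a large model such as BERT, the student is a tiny linear model, and the inputs are random two-dimensional vectors rather than text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher(x):
    # Hypothetical stand-in for the large model: returns P(positive) for input x
    return sigmoid(2.0 * x[0] - 1.5 * x[1] + 0.5 * x[0] * x[1])

def student(w, x):
    # Smaller model: a plain linear model with a bias term
    return sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])

def soft_cross_entropy(p_teacher, p_student):
    # Cross-entropy against the teacher's soft output instead of a hard label
    eps = 1e-12
    return -(p_teacher * math.log(p_student + eps)
             + (1.0 - p_teacher) * math.log(1.0 - p_student + eps))

random.seed(0)
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
targets = [teacher(x) for x in inputs]  # the teacher's outputs are the training targets

def mean_loss(w):
    return sum(soft_cross_entropy(t, student(w, x)) for x, t in zip(inputs, targets)) / len(inputs)

w = [0.0, 0.0, 0.0]
loss_before = mean_loss(w)
learning_rate = 0.5
for _ in range(300):
    grad = [0.0, 0.0, 0.0]
    for x, t in zip(inputs, targets):
        err = student(w, x) - t  # gradient of the loss with respect to the student's logit
        grad[0] += err * x[0]
        grad[1] += err * x[1]
        grad[2] += err
    w = [wi - learning_rate * g / len(inputs) for wi, g in zip(w, grad)]
loss_after = mean_loss(w)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The student never sees the original training task. It only sees pairs of input and teacher output, which is the point of the technique.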
It's that you train this model on the following: if I give this input to BERT, what is the output that BERT will generate? And the model has to learn that. Okay? And you know that because you have the original model, which is BERT. So you know what the output is, and you can train the small model to predict the same output. So actually, you have a model with a smaller number of parameters, and once it has been trained, it will give the same kind of performance, because in the end it was trained on generating the same responses that BERT would. This is the idea of model distillation, and the typical model that we use is DistilBERT. This is a model that has 95% of BERT's performance, but only half of the parameters. That is what model distillation is about. I just wanted to include this idea because, from the practical point of view, many times you would like to use a distilled version of the model, because the performance is more or less the same and the models are rather small. Okay. So this is one of the trends, right? To reduce the size of these models. And this is the idea of transfer learning: you have a representation that is so good that you can actually transfer it to a new task, even learning from a single example. In 2019 and 2020, with the GPT models, they wanted to redefine all natural language processing problems as language modeling problems, and this is why GPT-3 was so successful. If I give you a sentence, I want you to predict the next word, and predicting that next word is actually solving the task. For example, if I give you the task of translating English to French, I want you to predict the next word, but the next word needs to be a very specific thing, which is the answer to my query. This is zero-shot: you give zero examples. Perhaps you want to give the model one example, or you want to give it some more examples.
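The difference between giving zero, one, or a few examples is only a difference in the text of the prompt. The sketch below builds the three variants for the English-to-French example. The prompt layout (the `=>` separator, one example per line) and the helper `build_prompt` are assumptions made for illustration, not a format required by any particular model, and no model is called here.

```python
def build_prompt(task, examples, query):
    """Build a prompt: task description, optional solved examples, then the open query."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The model is expected to continue the text after the last "=>"
    lines.append(f"{query} =>")
    return "\n".join(lines)

task = "Translate English to French:"
examples = [("hello", "bonjour"), ("thank you", "merci"), ("good night", "bonne nuit")]

zero_shot = build_prompt(task, [], "cheese")           # no examples
one_shot = build_prompt(task, examples[:1], "cheese")  # a single example
few_shot = build_prompt(task, examples, "cheese")      # several examples

print(few_shot)
```

In all three cases the model's job is the same: predict the next words after the final `=>`. Nothing is retrained; only the input text changes.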
One use of machine learning is text classification. Previously with BERT, as I explained, and as we will see in the next session on text classification, first you do language modeling, predicting words, and then you use this representation and retrain it for text classification. Here you are not doing that. We are just doing the language modeling. You only ask the model to generate new sentences. Okay, so it's eliminating the need to train the model. In fact, if you think a little bit about that, this is astonishing. This is the same way we humans learn. If I want to tell you how to do something, I explain it to you and give you some examples, and you are able to do it yourself. So what AI is trying to replicate with GPT-3 is this way we humans learn. You will see the difference in the next session when we do text classification with BERT and similar approaches, which is basically a machine learning problem: you need to train the model. But if you want to do sentiment analysis with GPT-3, you are able to input a tweet and classify it into positive or negative without training the model. This is why GPT-3 is so successful. The model is several orders of magnitude larger than previous models. Another thing people are trying is to train a smaller model with more data. This is the idea behind Google's Bard model, which is similar in size to GPT-3 but trained with a lot of data. Apparently, they have to spend more time and data on the model. Okay, so in any case, we are in this race of larger models, or models with a lot of data, because the idea is that by scaling up these models, you can achieve great performance. This is a comparison of the performance with one-shot learning, in the example of GPT. You see that only when we have a model with the size of GPT are we able to achieve a level of performance that is similar to previous versions of GPT.
Not only that, but what you see is that when you evaluate the model on a specific task, such as question answering, once you reach the level of complexity of GPT, no matter whether you use zero-shot, one-shot, or few-shot learning, you are achieving a performance that is similar to what you would achieve by retraining the model. But if you train a smaller model, you see that the performance starts to plateau. It seems that there is still room for improvement by scaling up the model, and we will see future versions of GPT. The whole point of this is that it was crazy when they released ChatGPT. Everyone was talking about it, not just computer scientists or NLP practitioners. Even my wife, who has no expertise with computers, uses it for a lot of things. Basically, the model is rather similar, but bigger and bigger. Finally, I want to explain the difference between GPT-3 and ChatGPT. ChatGPT is trained to be a conversational agent. Basically, what they have done is take an already trained model, GPT-3, and ask it to generate an output for a prompt. For example, "explain reinforcement learning to a 6 year old". The model will generate some outputs, and a human annotator will tell which ones they like. Then they use this data to retrain the model a little bit, so that it is able to generate outputs like the ones that the annotators preferred. Then they train a reinforcement learning model on how to pick the right output from the model. So they are training a reward function, which basically learns which output is more likely to be relevant for a human. Then they create a policy for this model: you input a new prompt from a dataset, the policy generates an output, and the reward model produces the reward that is used to retrain the policy. The combination of GPT-3 and this policy is what makes it so powerful. Generating sentences that are cohesive and relevant, and that allow for conversation, can already be done with OpenAI's GPT-3.
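To give a feeling for what "training a reward function from human preferences" could look like, here is a very small sketch. It assumes one common formulation, a pairwise preference loss in which the preferred output should receive a higher score than the rejected one; the lecture does not specify the exact loss. The two features per output (simplicity, relevance), their values, and the linear reward are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, features):
    # Toy linear reward: weighted sum of hypothetical features [simplicity, relevance]
    return sum(wi * fi for wi, fi in zip(w, features))

# Each pair is (output the annotator preferred, output the annotator rejected)
pairs = [
    ([0.9, 0.8], [0.2, 0.7]),
    ([0.7, 0.9], [0.6, 0.1]),
    ([0.8, 0.6], [0.1, 0.3]),
]

w = [0.0, 0.0]
learning_rate = 0.5
for _ in range(200):
    for good, bad in pairs:
        # Probability that the reward model agrees with the annotator
        p = sigmoid(reward(w, good) - reward(w, bad))
        # Gradient step on -log(p): push the preferred output's score above the rejected one
        w = [wi + learning_rate * (1.0 - p) * (g - b) for wi, g, b in zip(w, good, bad)]

# Use the learned reward to pick among new candidate outputs for a prompt
candidates = {
    "simple and relevant": [0.9, 0.9],
    "jargon-heavy": [0.1, 0.8],
    "off-topic": [0.8, 0.1],
}
best = max(candidates, key=lambda name: reward(w, candidates[name]))
print(best)
```

In the real setting the reward model is itself a large neural network that reads the text, and its scores are then used as the reward signal when the policy is retrained with reinforcement learning.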
They have an OpenAI playground where you can play with GPT-3, and you will see that it has excellent performance. However, conversations like this cannot happen there, because it has not been retrained with the feedback. So, what they have done is to include the human in the loop. That is the main reason why ChatGPT is free for now: they are going to reuse all the conversations that we have with it to retrain it with that feedback. By the way, they have opened their premium version of ChatGPT in all the markets, including Europe. You now have premium access to it with no limits. So, what seems to be happening is that they are somehow limiting what you get unless you pay. In any case, if it is worth it for you, $20 per month is not much. Basically, they are using human annotators to tune GPT-3 and get this kind of conversational behavior. That is basically it. We are going to talk about these ideas, so hopefully now you have a better idea of why we are able to do this. It is a crazy new thing, but we have had this transformer idea since 2017-2018, and now we have more data and computational power to train these models. This transformer idea is nothing very fancy. It is just a new deep learning architecture that is able to implement this idea of language modeling, but now we are able to do it with proper performance. This transformer architecture is the basic architecture behind this unified model, if you want to call it that. We are going to talk about text classification, information retrieval, question answering, and whatever else you want to do now in terms of natural language processing. The only thing you need to do is to take one of these pre-trained models and train the model for your problem. Or even better, if it is a GPT, you don't need to do that. You just need to rely on these huge language models, which you don't need to retrain for anything, and just ask them.
If you want to do sentiment analysis, the only thing you need to do is ask it to classify a set of tweets. I do recommend that you do that. After the class, go to ChatGPT and play around. Take a couple of tweets and ask it to classify them: give me the main topics, classify them into positive and negative, answer this question. Yeah, that's basically it. On the rest of the slides, I have some more details on some other models, in case you want to deep dive into them. But the basic ideas are out there. In the following session, we will talk a little bit about that, and then in the next sessions we will put them into practice on different tasks. Do you have any further questions or anything else you would like to comment on before I close the call? It seems so. It is for a couple of reasons. Well, in this architecture, there are some tricks, some normalization, some pooling layers, in order to avoid overfitting. And in GPT, they do have similar tricks. The point is that your dataset is so big that even very complex models are not able to overfit it too much. So, overfitting with the neural net is not such a problem, because you are playing with a huge dataset. In any case, if you know the entire Internet and you overfit it, it's not such a bad thing. You have all the knowledge of the world at your disposal. So, what's the point? The main problem with overfitting is that the model is learning the noise of your specific dataset, which is not representative of the real problem. But if your dataset is so big that it is representative of the real world, then what's the problem? So, no, overfitting is still not a big issue. Okay, perfect. That's basically it. See you next week on the Forum.