data_dynamos4 / CORR_TEXT_ZOOM_NLP_PROCESSING.txt
Hello! How are you? Good evening! Let me share my screen, and let me know if it is visible for you. Perfect, it is coming through now. Let's wait a couple of minutes so everyone has enough time to join. Nice. Let's wait for the rest of the classmates to join and then we can kick off.
Last week we had our first session and, as I told you, that session was a bit more theoretical, just to talk a little about the different details and aspects related to natural language processing. I wanted to know what your opinion was. In fact, you already have a forum set up, at the bottom right. It would be nice if you could go over there and let me know if there is something that you would change or something you are missing. This week the session will be a bit more practical, and we will be alternating between these two types of sessions: one will be more theoretical, to talk about the ideas and details, and the next one will be more practical. So any comments that you have, just let me know in the forum so we can adapt, or if there is something in particular that you are missing.
Additionally, as you will see in the syllabus, there is a folder called additional materials, where you can find some materials and resources that I consider interesting. They are not mandatory to review, of course, and they are not part of the material that we are going to cover in natural language processing. If you go to Blackboard, you will also see a new folder with the session recordings. I will try to regularly upload all of the recordings of the classes there, so later on, if you need to double check something, you can do it.
Okay, so today I would like to start talking a little about the different building blocks that we will need for natural language processing, and then we will review how to actually apply them to specific tasks. Before starting, I would like to give you some time to check if there is any question, anything that you would like me to review from the previous session. Just to remind you, in the previous session we went through the basic ideas of natural language processing. We talked about what the natural language processing problem is, why it is difficult, what the different levels are that you need to cover in order to do natural language processing, what our starting point is, and what our ambition is. I don't know if there is anything that you would like to comment on or review. Okay. At any moment, if you feel that you are falling behind, let me know and we can devote some time to reviewing that. I think that we have enough time for that.
Today, I would like us to review many different ideas and concepts. The idea of this session is not that you understand all of them in detail by the end; I will give you the pointers and the ideas, and later on you can review them and go deeper. We are going to review many different ideas and concepts related to the ways of processing text for learning. Because we already have experience with data cleaning and feature engineering for machine learning, these ideas should be somewhat familiar. We try to do the same here: clean the dataset and construct the most meaningful features. But because we are dealing with natural language, it is a specific type of content that is not as simple as the nice, well-behaved datasets that we had for machine learning. So we need to understand how we can deal with it.
Today's session is going to be just about that. Don't be overwhelmed because we are going to review many different ideas and concepts. Again, the idea is not that you understand all of them in detail today, but that after reviewing the material, together with the practice session, you can understand them. In any case, what I am going to do is provide you with a list of resources. It is not that you need to apply them all: when you are doing natural language processing, you do not need to apply all of these pre-processing steps. In fact, as we will see in the next session focusing on language modeling, today, thanks to deep learning approaches, we don't actually need some, or even all, of these pre-processing steps.
I consider in any case that it is worth covering them, for a couple of reasons. The first one is that even if this is something that your neural net is going to do, it's good that you understand what's going on, what kind of aspects the neural net is able to model, and why you need it. It is not always the case that you just take a huge neural net and apply it out of the box. You may need to do some data pre-processing for it to work properly, even with modern language models. So, even if some of these techniques seem a little outdated today, because you can just rely on these big pre-trained neural models, I think they are worth covering. Since we are talking about natural language processing, I don't think that a natural language processing class would be complete if you don't know about pre-processing, stop words, lemmatization, and documents. The second reason is that, as you will see, in commercial scenarios there are still many solutions implemented following these ideas. So you need to understand them today, and perhaps you will need them on the job. Okay. So I guess that we can kick off.
We are going to have three sets of slides, including one for basic text pre-processing and another for parsing and tokenization. I will try to review them all. If at some point you feel that I'm covering the details too fast, let me know, because we can restructure the class. Even if we don't have enough time to cover them all, I can upload a video after the class to summarize some of them. Okay. So let's kick off.
We have text that contains information, and we want it to be understood by a machine. We will apply a feature engineering methodology to transform it into a representation that can be used for tasks such as text classification or machine translation. Today, we will focus on text processing. We will review the rest of the concepts in the next sessions. This representation is the most important concept. What all of these big language models give us is a feature representation that encodes all the different aspects of language in a proper way. In the past, we were not able to do this. So, that's the whole point. Now, we have these big models. They are not perfect. But today we will focus on text processing and review a set of ideas.
Before I start, I will explain some terminology that is typically used in natural language processing. A corpus is a dataset. It is composed of a set of documents. A document is the basic object in the dataset: each instance in the dataset is one document. It could be a news report, a paragraph, or whatever you decide. The features, in machine learning terms, come from the content of the documents. We typically talk about terms, which are different from words. The words are the raw information that you see in the document; the terms are what you have after pre-processing, such as removing unwanted words or normalizing words. So we have the same ideas as in machine learning: datasets, rows, columns, entities, and features.
The first step is to identify the words in the document. With the Titanic dataset, the features are already clear, right? You already have the columns. Well, here you need to build them yourself. So you need to be able to split the text into the different words. Okay. What do you think? Is this something simple, just an out-of-the-box function? Or can you think of some problems that you may have at this step?
Yeah, that's a good point. In fact, last year I taught part of a natural language processing course for people working in Saudi Arabia. We had a specific program with them, so I prepared the materials to deal with text in Arabic, because they had a specific set of documents in Arabic, of course. And as I understood, in Arabic, tokenization is rather painful, because it's not so simple. In English it is simple, right? It's basically the whitespaces: you split on the spaces and it works, more or less. It is not the same in Arabic, Chinese, or Japanese, or even in French or German, as I include here with some examples. So tokenization may seem to be simple in languages such as Spanish, but not in all languages. In fact, it's not only because of the language: you will also see here some specific cases in which tokenizing is not as simple as saying that where you see a whitespace you have a token boundary. So it might seem to be trivial, but actually it is not. It is not so simple to decide what the individual pieces of content are. And, as we will see later on, this is even more important nowadays in the world of deep learning, because the way in which we tokenize the text helps the neural net a lot in understanding what each one of the individual words is.
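To make the point about whitespace concrete, here is a minimal sketch, assuming plain Python and no NLP library. The function names, the regular expression, and the example sentences are all illustrative choices, not something shown in the session.

```python
import re

def whitespace_tokenize(text):
    # Naive approach: split on runs of whitespace only.
    return text.split()

def regex_tokenize(text):
    # Words (optionally with internal apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sentence = "Don't panic: the U.S. isn't far, is it?"
ws_tokens = whitespace_tokenize(sentence)
rx_tokens = regex_tokenize(sentence)

# Punctuation stays glued to the words when splitting on whitespace.
print(ws_tokens)
# The regex separates punctuation, but now breaks the abbreviation "U.S." apart.
print(rx_tokens)

# A language written without spaces comes back as a single "token".
print(whitespace_tokenize("我喜欢自然语言处理"))
```

Neither version is right in general, which is the point: even this first step involves decisions that depend on the language and on the data.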
Okay. What we do nowadays in deep learning, regarding splitting the sentences into the individual pieces of meaning, is that we don't actually focus on individual words. What we do is focus on sub-words. So, instead of splitting everything into words, we also try to split the words into different parts. For example, this one. We can have word-based tokenization: basically, each word, such as 'deep' or 'learning', is one token. At the extreme opposite, you can split everything by character. And in the middle we have this sub-word tokenization, which splits the words into smaller parts.
So why would we prefer this sub-word tokenization over the word-based one? What do you think, and why could this be relevant? Hmm. That's a good point. For example, the words 'learning' and 'learn': we may want them to be the same thing, because in the end it's the same concept. Okay, we may also want to capture the suffix, which is the '-ing'. But in the end, we would like to understand that the basic concept is 'learn'. You need to understand that if you input these two words to your natural language processing system, they are going to be analyzed as different words, as different terms. The problem, then, is that the processing system will not understand that this word and this word are similar, because they have different representations. What you will do is create a number to represent each word, so the machine learning system will see two different numbers, and it will not be able to recognize that the two have a similar meaning. With sub-word tokenization, it is able to do that.
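The three granularities can be sketched as follows. This is only an illustration under stated assumptions: the suffix list is hypothetical, the "##" marker for a word-internal piece is borrowed from a common convention, and real sub-word tokenizers learn their pieces from data rather than from a hand-written list.

```python
def word_tokens(text):
    # Word level: one token per whitespace-separated word.
    return text.lower().split()

def char_tokens(text):
    # Character level: one token per non-space character.
    return [c for c in text.lower() if not c.isspace()]

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "s")

def subword_tokens(text):
    # Sub-word level (toy version): split off a known suffix when the stem is long enough.
    pieces = []
    for word in word_tokens(text):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                pieces.extend([word[:-len(suffix)], "##" + suffix])
                break
        else:
            pieces.append(word)
    return pieces

def build_vocab(token_lists):
    # Assign an integer id to each distinct token, in order of first appearance.
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

# At word level, "learn" and "learning" get unrelated ids.
word_vocab = build_vocab([word_tokens("learn"), word_tokens("learning")])
print(word_vocab)

# At sub-word level, both share the piece "learn".
sub_learn = subword_tokens("learn")
sub_learning = subword_tokens("learning")
print(sub_learn, sub_learning)

# At character level, the sequence gets much longer.
print(len(char_tokens("deep learning")))
```

With the word-level vocabulary the model only sees two different numbers; with the sub-word split, the shared piece gives it a chance to relate the two words.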
Okay, it's able to do that. As you split the terms into sub-words, you are able to encode the semantics of the words. This tokenization is called BPE. We are not going to deep dive into how it works, but if you want more details, you can look into this approach for sub-word tokenization of your data. That is basically the idea. We also have the extreme opposite, which is to create an individual token for each one of the characters. Why do you think this is not a good approach? Why would you not want a character-based representation? It is slower, that's for sure, but we are talking about nanoseconds, so it's not a big deal. Another problem, as I show here, is the size of the representation. Now you have an original sentence and you are tokenizing it into, typically, hundreds of characters, and that involves creating a huge representation, which is not good, because the complexity will increase.
Could we tokenize based on linguistic units, such as the past tense, the plural, etc.? We can do that with stemmers and lemmatizers, and we will see later on how this is done. We would like to take each one of the individual pieces of meaning and see how they combine. We can do that, but we have seen through experimentation that sub-word tokenization works even better. If you review BPE tokenization, basically it creates a token such as 'learn' because it appears many different times in the dataset, so it is treated as an important concept. That's more or less the idea of how it works. Again, I encourage you to dive into it if you want more details.
Another thing that you may want to do is to normalize. What do I mean by normalize? Well, as in the examples that I put over there, we may have different lexical realizations of the same idea. For example, 'U.S.A.' is the same as 'US'.
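For those who want to look into the frequency-driven sub-word idea described in this part of the session, the following is a toy sketch of byte-pair-encoding-style merge learning, the best-known algorithm of this kind. The word frequencies are invented for illustration, and the function names are hypothetical; production tokenizers add many details (end-of-word markers, byte-level handling) that are left out here.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start from characters and repeatedly merge the most frequent adjacent pair.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Invented toy frequencies: "learn" in several forms is frequent, "deep" is rare.
word_freqs = {"learn": 5, "learning": 4, "learned": 3, "deep": 1}
merges, vocab = learn_bpe(word_freqs, num_merges=4)
print(merges)
print(vocab)
```

After four merges the frequent string "learn" has become a single token, purely because of counts, while the rare word "deep" is still split into characters. No linguistic knowledge is involved at any point.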
Again, if you input these to the system, it will create a different representation for each one of them, because in the end they are different strings. We are able to understand that they are the same, but the computer is not; it only sees two different things. Because of that, you would like to normalize them. There are many other examples. The point is that sometimes you would like to pre-process your dataset in order to normalize it, so that the same concept is recognized even if it is written in different forms. Sometimes you need to normalize a lot of the data for it to be usable. This normalization is rather manual; it's not something you can do out of the box. You need to define manual rules for the dataset you have. For example, if you are dealing with medical data, you may need to create a normalization dictionary stating that a given token is the same as another given token. Or, depending on the dataset, you may need to normalize it to proper English. These are different things. Some tokenization problems can be solved via normalization.
The most well-known strategy for normalization is to reduce everything to lower case, but be careful, because this default normalization can be problematic depending on the kind of data you have. For example, if you have many named entities, removing the capital letters can make it hard to tell whether 'general motors' refers to the company General Motors or to motors in general. In machine learning, there is no standard pipeline; it depends on your dataset and the problem you are dealing with. In natural language processing, this is even more true.
When it comes to pre-processing text, you can also use linguistic information to identify the root of the words. With BPE tokenization, we do this in a purely statistical way, counting the number of times something happens. BPE does not know whether 'learning' is the root of anything.
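A minimal sketch of dictionary-based normalization and of the lower-casing caveat, assuming a hand-written variant table. The table entries and the example sentences are invented for illustration; in practice the dictionary is built manually for the dataset at hand.

```python
# Hypothetical normalization dictionary mapping variants to one canonical form.
VARIANTS = {"U.S.A.": "USA", "U.S.": "USA", "US": "USA"}

def normalize(text, lowercase=False):
    # Map known variants first, then optionally lower-case everything.
    tokens = [VARIANTS.get(tok, tok) for tok in text.split()]
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    return tokens

# Two lexical realizations end up with the same tokens.
a = normalize("The U.S.A. economy")
b = normalize("The US economy")
print(a, b)

# Lower-casing makes the company name indistinguishable from the common nouns.
company = normalize("General Motors builds cars", lowercase=True)
generic = normalize("general motors are everywhere", lowercase=True)
print(company[:2], generic[:2])
```

The first pair shows why normalization helps; the second shows why applying lower-casing by default can throw away information about named entities.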
The only thing BPE knows is that 'learning' appears a lot of times, so it has to be something important, but it uses no linguistic information whatsoever. In fact, you can see here that the token 'deep' is split into two, something that does not make any sense from the linguistic point of view. In order to infuse this linguistic information, we use a stemmer or a lemmatizer. These are the tools for doing this splitting while taking into account lexical and morphological information. Okay. So in this example, we know the root of the word, which is 'interpret', we know that we have a suffix, which is related to the past tense, and we have a prefix that is actually modifying the meaning of the word. So we can do this splitting, and then we can apply it to tokenization, tokenizing based on these pieces, or we can apply it to normalization, for example by removing these affixes.
And how is this done? Well, both stemmers and lemmatizers rely on a set of rules and a set of dictionaries. All the stemming libraries that are out there are based on some simple rules. For example, when you see a word that ends in '-ed', you remove that ending, because it is usually the past tense suffix. The problem of doing that is that sometimes you end up with words that don't make sense, because you are removing a prefix or a suffix that doesn't make sense to remove. A correction of that is lemmatization. Lemmatization tries to do this normalization in a way that is a little more intelligent, if you want to call it that, than stemming, because it uses a dictionary of the words in your language and tries to find the word that makes the most sense. So, for example, it's able to normalize words like 'are' or 'is' to the root form of the word, which is the verb 'to be'. This is something that stemming will not be able to do, because it's based on rules and basically on chopping the word.
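The contrast between rules and dictionaries can be sketched with two toy functions. Both the suffix rules and the lemma dictionary below are invented for illustration and are far smaller than what real stemming and lemmatization libraries use.

```python
# Illustrative suffix-stripping rules (a real stemmer has many more, with conditions).
SUFFIX_RULES = ("ing", "ed", "s")

def toy_stem(word):
    # Rule-based: chop a known suffix if at least three characters remain.
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Illustrative lemma dictionary (a real lemmatizer relies on a full lexicon).
LEMMA_DICT = {"are": "be", "is": "be", "was": "be", "interpreted": "interpret"}

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself.
    word = word.lower()
    return LEMMA_DICT.get(word, word)

print(toy_stem("interpreted"))   # the rule works here
print(toy_stem("hundred"))       # the same rule produces a non-word here
print(toy_stem("are"))           # no rule applies, the irregular form is untouched
print(toy_lemmatize("are"), toy_lemmatize("is"))
```

The stemmer is fast and blind: the same '-ed' rule that handles 'interpreted' also mangles 'hundred'. The dictionary lookup handles irregular forms such as 'are' and 'is', at the cost of needing the dictionary.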
So as you see, lemmatization tries to create valid words in your language, because it's based on a dictionary. So if for your representation you need to end up with valid words in the language, you will apply lemmatization. If you don't care about that, you will apply a stemmer. Of course, lemmatization takes much more time than stemming. You can stem a large dataset of documents in some seconds; if you apply lemmatization, it could take 20-25 minutes. So, knowing that, can you think of a scenario in which you would like to apply stemming, where you don't care about having a representation based on real words in your language? And can you think of the other case? When would you like to apply lemmatization, because you would like to have valid words in your language? Any ideas?
For example, if you would like to offer a summary, you would like it to be based on real words. To understand what I mean, imagine that you want to create a summary about the misinterpretation of this news report. If you apply a stemmer, it would say something like 'this week talk interpret news report', right? That is not the kind of output you want to show. That's an example for the lemmatizer. And an example for a stemmer: imagine that you want to classify documents into politics and sports. The final output that the user sees is not the text itself; it's just the class, politics or sports. So you don't actually care whether the representation that you are creating is based on terms or tokens that are not valid words, because this is not what you will finally offer to the user.
But in general, in any task in which you would like to offer the final output of your model to the user, for example machine translation, you would like to offer a translation that is a valid sentence in the target language. So basically, when you are offering the final representation to the user, you would like this representation to be based on real words, so you will need lemmatization. If you don't care about that, you can just apply stemming because it's faster. That's more or less how you would base your decision between stemming and lemmatization. Okay. So to this point, we are able to find the individual units of meaning, or words, and normalize them. You can even use stemming and lemmatization to further normalize them.
Now, the next step. You have split all your documents into words, whatever your tokenization, but we don't want all of the words to be included in the final representation. For example, take the sentence 'the car is red'. Words like 'the' or 'is' are perhaps not very interesting. Imagine that you would like again to do this classification between positive and negative, or sports and politics. Do you think words like 'the' in a sentence are going to be useful for the classification? Not really, right? As I say here, it's not only that they are extremely common, which means that they appear in all the documents, so they are not discriminative, but also that they are not very informative. They are not giving you any information about the class that you need to pick or about the task that you want to solve. Typically, we refer to them as stop words, and you would like to remove them. In practice, this is straightforward.
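A minimal sketch of stop-word removal on the sentence from the example, assuming a tiny hand-written list. The list below is illustrative only; real stop-word lists are much longer and language specific.

```python
# Tiny illustrative stop-word list for English.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}

def remove_stop_words(text):
    # Lower-case, split on whitespace, and drop every token found in the list.
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

content = remove_stop_words("The car is red")
print(content)
```

What remains are the two words that actually say something about the content of the sentence.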
You have these lists of stop words for all the different languages, and what you do is take the list and remove those words from your dataset. Okay. Is this something that you would like to do always? What do you think? In what case would you not remove the stop words? Yeah, that could be one case. In the end, when you are removing stop words you are losing some information, and depending on the problem that you are trying to solve, you may want this information. Another case in which you would like to keep the stop words is the same as with stemming. Imagine that you would like to offer an output to the user because, for example, you are doing machine translation. If you have removed the stop words, the sentences that you create will not be coherent or grammatically correct, because you are not using all of this context. So again, there is no such thing as a one-size-fits-all pipeline. As I said to you at the beginning of the class, these are just different resources that could help you; based on the problem that you have on your hands, you need to decide which ones to apply. Removing stop words is something that you may want to do, or perhaps they are something that you will need to keep. So removing the stop words is not a default process, as we will see.
If you are taking a course on deep learning for natural language processing, you may be told that manual pre-processing steps are no longer necessary. This is partially true: if you have a lot of general, well-written text, you don't need them. However, if you are dealing with domain-specific language, or with text that carries little information, you still need to pre-process and normalize it for the machine learning model to make sense of it. Even if you have a collection of reports in English with billions of them, you may still need to do some pre-processing steps.
In the end, what we want to create is a feature vector that has the terms of the document.
So for example, if the document contains the word 'happy', you would like to have a feature that is called 'happy' and has a value of one. So this is the idea of creating a feature vector: you have a feature vector that has the terms of the document, and then you have a value for each one of the terms. Each feature is one of the terms that appear in the document. And what is the value that you have there? What could this value represent? For example, whether the term is present or not, or the proportion of the document that the term accounts for. We will see different ways of doing this representation, but that is the whole idea.
What I want is for you not to think that this is so different from the traditional machine learning schema that you are used to. Because in the end, remember that later on we would like to apply a machine learning model, so you need to be able to create the same kind of feature representation that we created for machine learning. This is no different; this is nothing new. Sometimes all of the complex apparatus around natural language processing can hide the actual thing that you are doing, which is not so different from what you already know how to do for machine learning: from your documents, your input data, create a representation that you can then input into your machine learning model. That's it. That's what we are trying to do here with these feature vectors. So what we are going to do in the following classes is review different ideas and ways of creating these features. We will start today with the most simplistic ones.
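The 'happy' example can be written down directly. This is a sketch under simple assumptions: the vocabulary and the sentence are invented, tokenization is a plain whitespace split, and the value is just presence or absence.

```python
def binary_features(document, vocabulary):
    # One feature per vocabulary term: 1 if the term occurs in the document, else 0.
    present = set(document.lower().split())
    return {term: int(term in present) for term in vocabulary}

vocabulary = ["happy", "sad", "car"]
features = binary_features("I am so happy today", vocabulary)
print(features)
```

The result is an ordinary row of named features, the same shape of input that any machine learning model expects.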
We will see their problems, and we will keep building on them in order to finally arrive at the current representations that we are using nowadays, in the systems that process language based on large language models. Okay. The whole point of ChatGPT and all of these approaches is that this part is done properly. And by properly, I mean that the representation encodes all of the lexical, morphological, syntactic, semantic, and pragmatic information that we saw in the previous class. It's able to create a representation encoding all of these aspects. That's it. That's the only advance that these approaches are making nowadays: creating a great representation, which is not trivial, of course; it requires tons of data, and it's very clever. But the only thing that we are doing is improving this part over here. That's why they are able to do all of these crazy things: because they are able to create a proper representation in the form of features. Okay. Is this clear? Is the analogy with the features in machine learning clear? Yes. Okay.
So let's see how we can start creating this representation. You have this document, and you would like to create a representation of it. You would like to create a feature vector in which, as we have agreed, each one of the features is one of the words that I have, with some weight for each word. The basic approach for doing that is what we call the bag-of-words representation. The bag-of-words representation follows this analogy: you take all of the words in a document. Remember that you have to decide how you would like to tokenize and normalize them, as we have seen before.
After you have done that, you put all of these terms into a bag and start taking out terms and counting the number of times that each one of these terms occurs. That's the basic analogy. So from the original document that you see on the left, you can create this kind of bag-of-words representation, a list of the different tokens, each one with its count. This is what we call the bag-of-words representation: the bag, plus the frequency of each term.
This is what you see here: the bag-of-words representation creates this structure. This structure is commonly referred to as the document-term matrix. Okay, what do you see on the screen? In the rows are the different documents that you have in your dataset. In this case, we are trying to create a representation for plays by Shakespeare, so each row is a different play. And what you see in the columns are the different terms appearing in these documents. So, for example, this cell over there means that the word 'Anthony' appears 157 times in the play 'Hamlet'. So, if you think about it, this is actually a feature vector. Now you can take each one of the plays and you will have a feature representation of it: for each one of them, feature one, feature two, and so on up to feature n. Take this idea and use the same analogy with machine learning and the passengers of the Titanic: in the rows, each one of the different passengers, and in the columns, the values of the features. The same idea. So this is no different from the kind of dataset that you have in machine learning. Now you can take that and input it into a machine learning system. Imagine that you have a target variable right over there.
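A document-term matrix with raw counts can be sketched in a few lines. The two mini documents below are invented and are not the counts from the slide; the function name is hypothetical.

```python
from collections import Counter

def document_term_matrix(docs):
    # Count the terms of each document (the "bag"), then align the counts on a shared vocabulary.
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, matrix

# Invented toy documents, one row per document.
docs = {
    "doc_a": "caesar brutus caesar",
    "doc_b": "brutus calpurnia",
}
vocab, matrix = document_term_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Each row is a feature vector for one document and each column is one term, so the result can be handed to a classifier together with a target column, exactly like a tabular dataset.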
That is, I don't know, class one, class two: one, two, one, two. So you can use this feature representation to learn this target variable. So you can actually use this representation for natural language processing tasks such as text classification, information retrieval, or whatever. That's the typical representation, the most basic one that we have in natural language processing. Of course, you can use an even more simplistic representation, which is based on binary weights, meaning that you only have one or zero: whether the word appears in the document or not. Okay. Is this representation clear? Is there any doubt? So we understand what we are trying to do here, and we understand that we are basically trying to create the same kind of representation as the one that we have in machine learning, so we can feed it to a machine learning algorithm.
Okay. Now let me ask you: considering that we are dealing with textual information, what limitations do you see in this representation? Any ideas? Well, that's good. So, for example, we would like to understand that 'Brutus' and 'Anthony' are somehow connected, and here you don't have this information. Think of a set of features in machine learning: feature one, feature two, and so on. What do we assume about the relationship between the features? Are they correlated? In fact, in machine learning you have learned some methodologies to avoid collinearity between the features. It is something that is not desired. In fact, if you input features that are correlated into most machine learning models, they will have a hard time dealing with that. So most machine learning models assume independence between the features. This does not hold in text: we would like to understand that some of the terms are related.
This representation considers 'Brutus' and 'Anthony' to be independent, right? What is the other thing that we are missing with this representation, and why are we missing it? How would you include the structure, or the meaning, of the document? So the syntax, meaning the relationships between the words. A little later in the class we will talk about the dependency parser as a way to capture that. And there is something more important. When I'm talking, I am using a language. Can you predict the next word that I'm going to say? We have this sequential relationship: one word is strongly related to the next. So we have information that depends on the order of the words, on the fact that they appear together in the document in consecutive positions. This is important information. We don't know if 'Brutus' and 'Caesar' appear together in the text. When we encode the text with this bag-of-words representation, we don't know that these two words are related. We only know that they both appear somewhere in the document; we don't know in what order. We are missing the order, which for natural language is very important information. So those are some limitations that we have with this representation.
We used the bag-of-words representation a lot prior to deep learning, because it's a rather simple approach, and for many scenarios it works particularly well. We are going to have our practice session on text classification, and we will see that this very basic approach works rather well, offering performance rather similar to that of the deep learning approach. So don't discard it by default and just apply an out-of-the-box deep learning solution, because this representation could be good enough.
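The loss of word order is easy to check on a pair of invented sentences. The bigram function at the end is only one simple way to look at consecutive positions; it is added here for contrast and is an assumption of this sketch, not something covered in the session.

```python
from collections import Counter

def bag_of_words(text):
    # Term counts only; positions are discarded.
    return Counter(text.lower().split())

def bigrams(text):
    # Pairs of consecutive tokens, which keep a little of the order.
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

bow_1 = bag_of_words("brutus killed caesar")
bow_2 = bag_of_words("caesar killed brutus")
same_bag = bow_1 == bow_2
print(same_bag)

same_bigrams = bigrams("brutus killed caesar") == bigrams("caesar killed brutus")
print(same_bigrams)
```

Two sentences with opposite meanings get exactly the same bag of words, while even a minimal view of consecutive tokens tells them apart.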
Of course, if you are trying to do something more advanced, such as machine translation, creating a chatbot, or question answering, this representation is not enough, because we are missing the order, which for natural language is very important information. So, as I told you, in the following sessions we will try to address how to incorporate this information that we have in text, the semantic and syntactic information, thanks to the use of large language models. Okay, but for now we are sticking to this representation. Today we are looking at this representation and we see that it has all of these limitations. That was the traditional representation before deep learning. Okay, and we are still missing some piece of information here. Look at this example. Here 'Caesar' appears 227 times. Here 'Caesar' appears 2 times. So we could assume that 'Caesar' is 100 times more important in this play than in this other play. Is this true?
I mean, it appears 100 times more. Is that because it is 100 times more important? Right? Nope. Can you see why? Do you agree now that it's 100 times more important? No? Why not? Exactly. You are seeing words in context, so a word may appear more times just because the document is longer. So we would like to account for that. In fact, this is what we call TF-IDF weighting. TF-IDF weighting is based on two different measures: the term frequency, which we have seen before, that is, how many times a word appears in a document, and what we call the inverse document frequency. The inverse document frequency basically accounts for how many of the other documents a given word appears in. If a word appears in many documents, it's probably not that important. So we will try to account for that.
So, what TF-IDF promotes are words that appear many times in a document, but not in the rest of the data set. They are actually rewarded for this. A word can appear many times just because the document is longer, or it can appear in many other documents because it is a rather common word. So we would like to focus on words that appear many times, but only in this document. For example, if we have a news report on a plane crash and we only focus on the frequency, it will give us words that are frequent. If we include the inverse document frequency, that is, if we favor words that appear in this document but not in the rest of the data, it will give us words that are important. We can use TF-IDF to better separate, for example, a data set of tweets about a given disaster, as the representation is better. However, this representation has some limitations, as it is not able to encode the order. On top of that, we are creating a representation that is rather sparse, meaning that 90% of the matrix is full of zeros.
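The weighting just described can be sketched as follows. This uses one common variant, assumed here for simplicity: term frequency divided by document length, multiplied by log(N / document frequency). Libraries often add smoothing, so their numbers will differ. The three toy documents and all names are invented for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: one long document, two short ones.
docs = {
    "long_play": "caesar caesar caesar caesar caesar caesar rome senate the the the the",
    "short_play": "caesar the storm",
    "report": "plane crash the runway",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears at least once.
document_frequency = Counter(
    term for tokens in tokenized.values() for term in set(tokens)
)

def tf_idf(name):
    """TF-IDF weights for one document (unsmoothed variant, see assumptions)."""
    tokens = tokenized[name]
    counts = Counter(tokens)
    return {
        # Relative frequency corrects for document length;
        # the log factor penalizes terms that occur in many documents.
        term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
        for term, count in counts.items()
    }

for name in docs:
    ranked = sorted(tf_idf(name).items(), key=lambda item: item[1], reverse=True)
    print(name, [(term, round(weight, 3)) for term, weight in ranked])
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while "crash" and "plane" get the highest weights in the report even though each appears only once.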
We can turn to language models to learn the representation automatically, as this document-term matrix is something that needs to be created for each set of documents. We can also use transfer learning to reuse them across domains. This is a projection of all the real dimensions based on the features into two dimensions, using something called t-SNE, which is similar to PCA: you have hundreds of dimensions and you try to reduce them to two dimensions so they can be visualized. Yes, it can. You just do this reduction to visualize them, and hopefully you are able to see in two dimensions some of the structure that you have in the higher dimensionality. Its name is t-SNE, and in case you want, I can tell you a little bit about it, because this is a way of reducing the dimensionality in case you want to plot the data and see its distribution. So, okay. Any questions so far? Because now I would like to introduce a couple more concepts related to this. Okay. So far, what we have done is create this document-term matrix. To do that, starting from our original documents, we have to split the text into individual terms, normalize them if needed, and remove the stop words, if we want to. Then, for each document in the matrix, we have the value of each term in that document. TF-IDF accounts for the number of times that a given term appears in the document, normalized by how often this token appears in the rest of the documents. Hopefully, we will be able to identify the relevant tokens for the document. That's our vector representation. But as we have discussed previously, we are not able to encode the syntactic information, whether one word is related to another and so on. In the past we had some clever ways of doing that. This is something that we are not doing so much anymore, because now we fully rely on neural nets in order to do that.
But I guess that this is something that I would like to introduce, because again, for some manual ways of solving natural language processing, this could be relevant. And also because, even if this is something that is done internally by the neural network, you would like to understand what's going on. You would like to understand why the neural net is able to do that. In fact, there is a very nice paper that I will introduce in the next session. They discovered that, in its layers, the model is learning how to do things like part-of-speech tagging and dependency parsing. Okay. So even if this is something that is done automatically by the machine learning model, it is something that you would like to understand. And in some scenarios, you would still like to do it yourself. So what is part-of-speech tagging? We are still in this phase: we have a corpus that we would like to represent. In fact, what we would like to do is enhance this document-term matrix based on TF-IDF that we have already created. Part-of-speech tagging is the idea of detecting the role of each one of the individual words in a given sentence of a given document. Okay. So a word could be a noun, it could be an adjective, it could be a number. What we see here is just a list of the different parts of speech that a word can have in a sentence, and some examples of each. Right? So you would like to do that when you analyze a text. You would like to understand the role of each word, because that will be relevant. For example, the word 'book'. It could be a noun or it could be a verb, depending on its role. If it is a noun, perhaps you are talking about the library. If it is a verb, perhaps you are talking about making a reservation. So you would like to understand which one this word is.
So this is what part-of-speech tagging is doing. Okay. Can you think of a way of doing that? Yes, a very simplistic way of doing that would be to use a dictionary structure. This would involve creating a dictionary that contains words and their associated parts of speech. This way, when you see a word, you can look it up in the dictionary and determine its part of speech. However, this approach has its limitations, as some words can have multiple roles depending on the context. To address this, you could use a probabilistic model that takes into account the probability of a given word having a certain part of speech, as well as the probability of a tag given the previous one. This way, you can create a model that better captures the context of a sentence.
We have a sentence with part-of-speech tagging. We are able to understand the role of each one of the individual words, right? This is a noun. This is a determiner. This is a preposition. Now, on top of that, I would like to understand the relationship between the words. So I understand that this is the verb, right, that this is the subject, and that I have two complements of the verb, so I can understand the structure of the sentence in order to understand its meaning. This is what we discussed previously, right? I don't know what the relationship between the words in my sentence is, because I'm using this bag-of-words representation, so I'm losing the order and the relationship between the words. I can try to recover that with dependency parsing. If I apply a dependency parser, I will know that this word is related to this other one. And I can use this relationship to include this information in the document-term matrix or, in case I don't want to create a document-term matrix, to use this information directly. For example, if you create a chatbot, you can use it to understand exactly what the user is asking for, the quantity, and so on. Okay.
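Going back to part-of-speech tagging, the dictionary lookup and the probabilistic refinement described above can be sketched like this. Everything here is an assumption for illustration: the tiny lexicon, the tag set, and the hand-set probabilities are invented, whereas a real tagger estimates them from an annotated corpus. The decoding is greedy (left to right), a simplification of what full taggers do.

```python
# Hypothetical scores for each word's possible tags (hand-set, not learned).
WORD_TAGS = {
    "i": {"PRON": 1.0},
    "the": {"DET": 1.0},
    "a": {"DET": 1.0},
    "read": {"VERB": 1.0},
    "flight": {"NOUN": 1.0},
    "book": {"NOUN": 0.6, "VERB": 0.4},  # ambiguous word from the lecture
}

# Hypothetical probability of a tag given the previous tag.
TRANSITION = {
    "<START>": {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05},
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.025, "PRON": 0.025},
    "PRON": {"VERB": 0.8, "NOUN": 0.1, "DET": 0.05, "PRON": 0.05},
    "VERB": {"DET": 0.6, "NOUN": 0.2, "PRON": 0.1, "VERB": 0.1},
    "NOUN": {"VERB": 0.5, "NOUN": 0.2, "DET": 0.2, "PRON": 0.1},
}

def dictionary_tag(words):
    """Baseline: always pick the word's most likely tag, ignoring context."""
    return [max(WORD_TAGS[w], key=WORD_TAGS[w].get) for w in words]

def contextual_tag(words):
    """Greedy tagging: combine the word's tag score with the transition
    probability from the previously chosen tag."""
    tags, prev = [], "<START>"
    for w in words:
        best = max(
            WORD_TAGS[w],
            key=lambda t: WORD_TAGS[w][t] * TRANSITION[prev].get(t, 0.0),
        )
        tags.append(best)
        prev = best
    return tags

for sentence in ("i book a flight", "i read the book"):
    words = sentence.split()
    print(sentence, dictionary_tag(words), contextual_tag(words))
```

With the plain dictionary, 'book' is always tagged as a noun. With the transition probabilities, it becomes a verb after a pronoun and stays a noun after a determiner.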
At the end of the slides you will find some information on how to do that, and on the different ways in which you can apply dependency parsing. But the very basic idea is that we now have this method of detecting the structure of the sentence. Okay. Again, how this is done is something that is included in the slides. But basically, again, we have a probabilistic model that is counting. As always, I have an annotated dataset, with sentences annotated with these relationships, and I'm learning from that, right? So I can learn a probabilistic model such that, when a given word appears in a sentence, it's more likely that the next word is a noun, and very likely that it is an object. And so you are learning this probabilistic model. I think that it's not very relevant to understand how this is implemented or how this is learned, because if you end up using parsing, you don't need to implement your own parser. In the forum session I will give you some resources on dependency parsing systems, which are also included in the slides, in case you want to apply them. As I told you previously, I don't know if you will end up using these so much, because if you are lucky enough to be dealing with a well-studied domain and a well-studied language, it's rather typical that you already have a parser and you don't need to take care of it. In any case, I don't think that you can call yourself a natural language processing expert if you don't know what part-of-speech tagging and dependency parsing are, even if you are not using them, and even if you are relying on an existing parser. Okay, I guess that this is still relevant. If we were 10 years ago, this is something that we would need to devote a couple of sessions to, because you would need to understand in detail how to implement these parsers, how to use them, and so on. Not anymore.
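To make the chatbot use of dependency relations concrete, here is a minimal sketch of how a parse could be consumed once a parser has produced it. The parse below is written by hand for an invented sentence and stands in for the output of a real parser. The labels follow the Universal Dependencies naming style (`nsubj`, `obj`, `nummod`), and the `Arc` type and `dependents` helper are hypothetical names introduced only for this example.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: int    # index of the head token (-1 marks the root of the sentence)
    dep: int     # index of the dependent token
    label: str   # relation between head and dependent

# Invented sentence with a hand-written parse (stand-in for parser output).
tokens = ["I", "want", "two", "pizzas"]
arcs = [
    Arc(-1, 1, "root"),    # "want" is the main verb
    Arc(1, 0, "nsubj"),    # "I" is the subject of "want"
    Arc(1, 3, "obj"),      # "pizzas" is the object of "want"
    Arc(3, 2, "nummod"),   # "two" modifies "pizzas" (the quantity)
]

def dependents(head, label):
    """Tokens attached to `head` through the relation `label`."""
    return [tokens[a.dep] for a in arcs if a.head == head and a.label == label]

root = next(a.dep for a in arcs if a.label == "root")
obj_index = next(a.dep for a in arcs if a.head == root and a.label == "obj")

request = {
    "action": tokens[root],
    "subject": dependents(root, "nsubj"),
    "item": tokens[obj_index],
    "quantity": dependents(obj_index, "nummod"),
}
print(request)
```

The structure gives what the bag of words cannot: which word is the verb, who the subject is, what is being asked for, and how many.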
Okay, again, in case you use it, you just use some library and delegate the parsing to it. Hopefully you see that this is already done for you for some tasks. Okay. You have the slides at your disposal, and you can review them. There you have more details on how to train these dependency parsing systems, some examples, and how to run them as well. Please review them, and if you have any doubt, just let me know. But as I told you, because there are a lot of materials and I don't want to overwhelm you, the only thing that I would like to do today is give you the pointers to each one of these concepts so you can review them on your own. Okay. We can use the forum session to discuss them.
For the assignment, you will have three options. You can implement a natural language processing application, review an existing language processing application, or research a field of natural language processing from the point of view of research ideas. It will be a group assignment, so you can self-enroll into groups. I think they are already open, so you can access them on your own. Remember that I expect the output of six or seven people working together. The quality and amount of work should not look like it was done by one person. I don't expect you to load a pre-trained model, apply it to a dataset, and send me a Python notebook with 25 lines of code. That's not what I'm asking for. Honestly, you can do that alone in a couple of weekends, right? No, I expect to see something bigger and more interesting. And please don't do this: 'Okay, we are seven people, I'm going to do it.' How you organize the assignment is up to you. But if this is the quality of the assignment you are going to send me, you can understand that the grade is not going to be very good. I don't expect you to implement Alexa or a GPT, but I expect you to implement something nice.
If you or your group feel that it's too much to implement, I would prefer that you read some papers, review a lot of information, and devote one month to doing that. Of course, you won't be working on it the whole month, but my point is that I expect you to put some hours into it and to produce the kind of research paper that I would expect to read. Once you read the guidelines and agree on what kind of thing you want to do, let me know what you are going to do: 'We will go for a natural language processing application, and this is the topic we are going to work on.' Then I can tell you whether it is too much or whether it is going to be enough, and I can give you some advice and ideas, based on the specific thing you are doing, on how it could be improved. So as soon as you have it clear, just let me know, I can validate it, and you can start working on it. During the process, let me know if you have any doubts or questions about what you are doing. So, that's basically it. Hopefully, you will be able to parallelize the work and do a nice job. But I don't recommend that you start working on anything until it has been validated.
Okay, I'm going to work on that and let you know. We have to let you know, and then you have to give us your okay? Well, I can tell you which one is the most interesting, which one has more potential. That's the idea. Okay, I will give you some details in the guidelines on some places where you can read and review papers, and you can replicate this structure. But yeah, that's the idea: read some papers. You won't have time to read 100 or 200 papers, which is what you do when you write this kind of review. But hopefully you will read some papers, some ideas, or some high-level blogs. The whole point of this research review is this: imagine that I don't know anything about the field. By reading your review, I should have enough to understand the current state of this field of research and which future lines would be worth following. Sure, I can do it. I mean, if you prefer that I release them, each day I will review the ones related to that day. For example, if it's Monday, I will review the things that are happening on Monday, and tomorrow the things that are happening on Tuesday. In this way I can properly focus on the actual topic of the day, if you prefer me to do that. Yeah, okay, that's fine. I can do it. So you have all the materials available. Anything else? Okay, right. I think we can call it a day. After the class, perhaps not now, but during this week, I do recommend that you go through the slides, read them, and try to understand more or less all the different ideas. And if at some point something is not clear, let me know. In any case, we will review this from the practical point of view. Okay, great. See you on Monday.