Hi! Hello! How are you? Sure. Well, it was quite an interesting challenge. I think this year you will enjoy it. I imagine that you have the necessary resources to be successful and get good results. Well, I do have to tell you that this is basically data science and machine learning. It's going to be very similar when you start working on a real-world problem: the process is going to be more complex than it looks. In a couple of cases, you need to sit down and think a lot about what you are doing. Machine learning involves a lot of time spent on experimentation. The only thing that we can offer as data scientists is a systematic way of dealing with the data, processing it in the right way and testing different hypotheses. But in the end, it's a rather time-consuming process, that's for sure. So yeah, well done for participating and being able to implement a reasonable solution. I think that it's something to be proud of, because, as you said, this is not the only thing that you are doing.
Okay. So, good afternoon, everyone. Welcome back to natural language processing. The idea today is to talk about information retrieval, a different problem from the one we reviewed last week, which was text classification. We will dive into this idea of information retrieval, and hopefully we will see the impact of many different solutions that you can implement in a usual search engine. We will play around a little bit, trying to implement our own information retrieval system. Before moving on, I would like to quickly review what we discussed last week about text classification. Okay. So after some weeks dealing with the basics, the building blocks of natural language processing, and after reviewing language modeling, which is now the representation that we mostly use for all the different natural language processing problems, we dived deep into the specific usage of all of these concepts in a particular task.
The task of text classification. It was not only about the classification, I guess. It was also about how to apply all of these ideas to a particular problem. So, hopefully you have a better understanding of text classification, both theoretically and practically. You are now more confident and more familiar with the concepts of transfer learning and fine-tuning, and with the actual usage of deep learning models for real-world problems. So, as we have discussed, the very basic idea nowadays in natural language processing is that you rely on some kind of pre-trained model. In practice, for example, we use a language model, based on the idea that these models have been trained to predict the next word or predict the next sentence. Well, this task could be interesting in itself in order to generate new text. But the idea is that by doing that you are going to generate a representation of the language that is very accurate, and then you can reuse it for many different tasks. This is called transfer learning. So the model that you trained for language modeling, you can now retrain, as we did for text classification. So, instead of predicting the next word, you need to understand the text. In this case, in the classification, you don't need a truly big dataset for training the model, because the model has already been trained on language modeling, on understanding the language, which is a difficult thing. So then, with a small supervised dataset such as the one that we used in the practice, with some thousands of documents already annotated, by fine-tuning the language model on it, you can get very good results. Remember that the idea is to retrain only a bit of the model, just slightly changing the way it works. You don't need to train it all from scratch. You want the model to keep the understanding of the language that it had.
But now you want to put this understanding into practice, and if tomorrow you want to take that model and use it for a task such as machine translation, that's going to be exactly the same. But instead of fine-tuning the model on text classification, you will fine-tune it on that task. If you want to dive a little bit more, I recommend that you start checking the Hugging Face repository that we discussed the other day, and that we were using. Over there, you will find models that are already pre-trained for a specific task and for a specific language and domain. So you may have Spanish, Japanese, or any other language. You can find them for financial information, medical information, or any other type of information, and even fine-tuned for a specific task such as classification or machine translation.
Actually, sometime in the future you may want to better understand, or dive deeper into, the concept of information retrieval. I strongly recommend that you read this book and refer to it. The nature of the information that you are finding is usually unstructured; for example, text or images, which do not have any kind of structure. By unstructured, I mean that the counterpart of information retrieval would be a well-structured database, with tables, records, and features describing the tables. Here we are talking about the size of the web or a large corporate database. We are not really talking about a handful of documents. We are talking about pieces of information such as text, video, and images. When you use Google or any other kind of search engine, you are retrieving information from a rather large collection of documents; you are looking for information on the web. There are other areas for applying information retrieval, such as searching for documents, finding the closest legal documents, or searching medical cases. In a corporate database, you would also like to find some information.
In the following sessions, we will try to construct and explain the fundamentals of information retrieval and then later try to implement those concepts.
A very common problem in large corporations is that there is a lot of information spread across many different systems, and it can be difficult to find it, even though what you need is usually in one document. And this is information that you will likely need. If I have a meeting, I would like to retrieve the information related to the project, no matter how organized the system that you are using is. Sometimes it is difficult to get access to it, because many people are collaborating on the project and generating information, and the information could be stored in different places. So, in short, whenever you have this kind of information need, you are dealing with information of an unstructured nature. What does this mean? Here I am just including an example of a structured versus an unstructured information need. For example, this is a SPARQL query. Here we are querying DBpedia, which is a database with the information we have from Wikipedia, and the language that you see over there is SPARQL, which is very similar to SQL. We are trying to find all the scientists, all the Spanish scientists, that we know about in Wikipedia. Right. So here we have a structured information need. The database is structured, and as I know the structure of the database, I'm able to query the database in that way. So: select person, where the person is of a given type. Well, this is great if you have the information structured in this way, and if you know how to query this information. But, as you may know, for those who have had experience with this kind of database, it is not straightforward to use this sort of database. Right? It's not always simple to understand what the actual query is that you need to perform. In contrast, if you go to Google, you can just use natural language to type.
I would like to know about Spanish scientists; I would like to know the names of Spanish scientists, and you will get the information. You will get the documents and the results. So of course, this is something that is way easier to use. It's what you would want to use, because you don't need to store the information in a specific way. You can just leverage whatever documents you may have and allow your final users to use natural language in order to search for these documents, and they don't need to learn any kind of complex language or guidelines. For example, take the scenario I presented about the corporate database, where we want to find documents relevant to a given project. You cannot expect your business stakeholders to learn SQL in order to retrieve documents or contracts for the next meeting. If you have this kind of method, it is rather simple to just search through the information in that way. Okay.
And here I have just another comparison between structured and unstructured information. I think the key point here is that now around 80% of enterprise data is in the form of unstructured data. This was not the case before. In the eighties and nineties, basically all the information that we collected from corporations was of the structured kind: tables, relational databases, and so on. So being able to store and retrieve this information by means of SQL databases, of relational databases, was very important. Nowadays, most of the information that we get from our corporations is of an unstructured nature: documents and all of this kind of information. It is key to have a way of storing and retrieving this unstructured information more or less seamlessly. That's why information retrieval is so important nowadays. So it is not only Google; there are many other scenarios for information retrieval.
It's a key component of many systems, in order to be able to retrieve the information relevant to a given task. Information retrieval is something that you can safely apply. Okay, so here we have presented the overall idea of information retrieval. I hope it's clear. If you have any doubts, let me know. But again, if you would like to refer back to the original example, you can understand Google as an information retrieval system.
Now, what I want to do is to introduce what is called the classical model of information retrieval. The idea is that, in a more or less theoretical way, you understand the different pieces of an information retrieval system, and not only the system itself, but also the different things that you need to consider when you are implementing an information retrieval system, because this is related to how the users are going to use it. Okay. So the first thing about information retrieval is that you have a problem, you have a task that you need to solve, and you don't know how to solve it because you don't have the information. So here in the example, I guess that this situation is familiar to you. You are programming in Python, either because of the data challenge that we were talking about or the natural language processing practice or an assignment, whatever you are trying to implement in Python, and you get stuck. You don't know what's going on, but you know that you need some help. Okay. So you would like to be able to conceptualize your information need. I would say not formalize, but conceptualize the information need. So you need to be able to move from the task to the information need that you have, which is: okay, it seems that there is something wrong with a dictionary, and I guess that it is about the key. So I need to understand how to solve this error with the key. So the first thing that you need to do is to conceptualize and formalize the information need.
Of course, this is outside of the information retrieval system per se, but you need to understand that this is going to happen. This is a process that the user must do, and it is a process that can limit the potential usage of your information retrieval system, if your users are not able to properly conceptualize and formalize the information need. Sorry. Okay. So, as I was telling you, you need to understand that sometimes your user is not going to be able to properly conceptualize and formalize the information need, and you may need to help the users with that. So, for the sake of the example, imagine that the user has been able to define the proper information need. The next thing that they need to be able to do is to properly formulate the query from the information need. What is the actual query that you need to introduce into the information retrieval system? There are some people who are quite good at this, right? They have been retrieving information for a while, and we see that in similar applications, because they are able to properly formalize the query. For example, this is something that differs quite a lot between users. When I'm using Google, I'm always using this keyword-like style, here something like "key error dictionary python". So I'm able to properly formalize the information that I need to retrieve.
Well, in contrast, when my parents are using the computer and Google, they write a whole sentence asking for information about a particular restaurant they would like to visit. It's a different way of formulating the query, and different users will have different ways of formulating it. If you are not able to formulate it correctly, you will not be able to retrieve the information.
This is something that is completely outside of the information retrieval system itself, but as a designer of the system, you need to understand that this is something that is going to happen. Perhaps you may want to offer suggestions on how to write the query, for example. We all know that search providers are doing this: when you type in a query, they give you suggestions about what is relevant. Once you have that, you can talk about what an information retrieval system is. Here the idea is that you are going to have a search engine, and it is something that we are going to try to explain and implement today. You will have some corpus in which you expect to have the answers, and if you are not happy with the results, you can always go back, refine the query, and start the process again. This is called query refinement, and it is something that we usually do. We go to Google, we get the results, we see that the results are not what we expect, and we change the query. So that part will be the actual information retrieval system. And this is the part that I will be explaining today and implementing. But the idea is that, as you see, the information retrieval system itself does not work in isolation. It works within this classical model of information retrieval, and you need to be able to take that into account. Okay, so far, so good.
So let's get our hands dirty. Let's try to actually implement the information retrieval system. Okay. So let's go back here, because I would like to ask you how to do information retrieval. I would like to remind you that you have this situation: you would like to implement this system in which you have some documents and some queries introduced by your users, and you need to be able to retrieve the documents that are relevant. How can you do that? What ideas do you have? Let me hear your thoughts. What are the different ways that you can do that? Imagine that you would like to find information about restaurants in a certain city.
If you introduce the query, what is the simplest way you can do it? You search for that, right? Actually, you can do it, but you need to review all the documents, and that's the main limitation of this approach.
With the term-document matrix, you just retrieve the documents in which Brutus and Caesar are a 1 and Calpurnia is a 0. You can do that because you have the vectors that tell you in which documents each word appears: the documents in which Brutus appears, the documents in which Caesar appears, so that you can apply a bitwise AND in order to retrieve the documents in which both appear. As we don't want Calpurnia to appear, because of the query, because we want documents in which only Brutus and Caesar appear, you negate this vector, and basically what you have over there, as a result of this AND, is the result vector. Okay. You decode it, and it tells you that the matching documents are document number one and document number four. If you go there, you see that these are actually the documents you wanted. So we can leverage the term-document matrix. We don't need to review all the documents, because we have this structure extracted from the documents. The only thing that we need to do is this logical AND, which is something that is rather fast too. So you see, we are solving the problem of searching for information in documents of an unstructured nature by actually providing some structure over these documents, in order to speed up the process. This is something that we can leverage. Okay. That works faster than the previous approach. But it still has some limitations. What limitations can you think of for this approach? It's actually related to one of the limitations, if you remember, of the term-document matrix. Any idea? That one, okay: in the binary case the frequency is not taken into account, but we can always use the frequency or TF-IDF. In fact, this is something that we will use, as you will see. Yeah, yeah, that is just for the sake of the example.
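The Boolean query just described (two terms that must appear, one that must not) can be sketched in a few lines of Python. This is only an illustration: the three incidence vectors below are assumed values for six documents, and the helper names `to_bits` and `matching_documents` are hypothetical, not part of any library.

```python
# Toy term-document incidence matrix: one 0/1 vector per term, one slot per document.
# ASSUMPTION: these values are illustrative, not data taken from the lecture slides.
incidence = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def to_bits(vector):
    """Pack a 0/1 list into an int so that Python's bitwise operators apply."""
    bits = 0
    for flag in vector:
        bits = (bits << 1) | flag
    return bits

def matching_documents(bits, n_docs):
    """Decode a packed result back into 1-based document numbers."""
    return [doc for doc in range(1, n_docs + 1)
            if (bits >> (n_docs - doc)) & 1]

n_docs = 6
all_docs = (1 << n_docs) - 1  # mask with one bit set per document

brutus = to_bits(incidence["brutus"])
caesar = to_bits(incidence["caesar"])
calpurnia = to_bits(incidence["calpurnia"])

# brutus AND caesar AND NOT calpurnia; the mask keeps the negation within n_docs bits.
result = brutus & caesar & (~calpurnia & all_docs)
print(matching_documents(result, n_docs))  # [1, 4]
```

With these assumed vectors, the query comes down to two bitwise operations, which is why it is fast; the cost is keeping the whole matrix, zeros included, in memory.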
I'm using the binary weighting just to make the concept clear. Hmm. Okay. But again, that could be a limitation. But this is something that has more to do with the actual representation of the information, not so much with the limitation of the approach based on the term-document matrix. Here the main limitation comes from the term-document matrix itself, from how the information is organized and stored. We are somehow limiting the performance. Okay. So let's think about this term-document matrix, and about one of the properties of these matrices, beyond this toy example that I had. If you remember, when we talked about the term-document matrix, we said that it was very sparse, right, meaning that most of the cells in this matrix are zeros, because most of the words do not appear in most of the documents. The documents are not so much related to each other. Right? So basically it's 95% zeros. So one of the limitations of the term-document matrix (and remember that we are going to search on a corpus with really a lot of documents) is that you are storing a lot of information that you don't need. Here you have some numbers. For example, for a rather small database of 1 million documents with 1,000 words per document, it will take around 60 gigabytes of data that you need to be able to store in memory with this type of matrix. It's basically trillions of zeros and ones, and it is extremely sparse. Basically you are storing a lot of information that you don't need. So even if this bitwise AND could be more or less fast, as you have this very big matrix in memory, the process itself is not performing as well as it should. So the solution, and what we are going to implement here, is that we are going to record only the ones in the matrix. We are going to record only the actual relationships between the words and the documents. Okay, not all the information.
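A back-of-the-envelope calculation makes the storage argument concrete. The number of documents and the words per document come from the example above; the vocabulary size is an assumption made here for illustration (the transcript does not state it), as is the choice of one bit per cell.

```python
n_docs = 1_000_000          # from the example above
words_per_doc = 1_000       # from the example above
vocabulary_size = 500_000   # ASSUMPTION: number of distinct terms, chosen for illustration

cells = n_docs * vocabulary_size
dense_gigabytes = cells / 8 / 1e9      # one bit per cell, 8 bits per byte
max_ones = n_docs * words_per_doc      # each word occurrence sets at most one cell to 1
min_zero_fraction = 1 - max_ones / cells

print(f"cells in the dense matrix:    {cells:.1e}")
print(f"dense size at 1 bit per cell: {dense_gigabytes:.1f} GB")
print(f"cells that can be 1, at most: {max_ones:.1e}")
print(f"share of zeros, at least:     {min_zero_fraction:.1%}")
```

Under these assumptions the dense matrix needs tens of gigabytes while at most about one billion cells can hold a 1, so recording only the ones is far cheaper. The exact figures move with the assumed vocabulary size.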
We are going to move to what we call the inverted index. Okay.
The inverted index is like a dictionary structure, where the keys of the dictionary are the terms. For each one of the keys, the values will be what we call the postings for that word. For example, for the word Brutus, what are the documents in which it appears? So you go to Brutus and you will see, for example, document number 1, where you had a 1 in the matrix. A document where you had a 0, meaning that the word Brutus does not appear in that document, is simply not in the list. Now you don't have this problem of the sparsity. By going to this, we are able to compress the representation, and in this compressed representation you will be able to quickly find the documents in which the words appear.
So now, what if you want to search for a phrase, not just a collection of keywords, but an actual sentence? You can use what is called phrase matching. I don't want documents that just contain the words; I want documents containing this same phrase, with the words in this same order. These queries are called phrase queries. Do you think that with this inverted index we will be able to retrieve the information for these queries? What are we missing? Yeah, that's what we are missing, right? The position of the words in the document, the order of the words in the document. So the extension of this is the positional index, in which you store the documents in which the terms appear, but you also store the positions in the document at which the word appears. Right? So it's like a kind of nested dictionary. The key is the term and the values are the document IDs; these are actually the keys of another dictionary, whose values are the positions. I hope that you see that, right. With the plain inverted index I mentioned, you don't know where in the document, at which position of the document, the words appear.
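The plain inverted index (terms as keys, lists of document IDs as postings, before any positions are added) can be sketched as follows. The three documents and the function names are hypothetical, made up for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical toy corpus; document IDs and texts are made up for illustration.
documents = {
    1: "best restaurants in madrid",
    2: "cheap hotels in madrid",
    3: "best hotels in paris",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it (its postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return the documents containing every query term (a Boolean AND)."""
    term_sets = [set(index.get(term, [])) for term in query.lower().split()]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))

index = build_inverted_index(documents)
print(index["madrid"])                         # [1, 2]
print(search_all_terms(index, "best madrid"))  # [1]
```

Only the (term, document) pairs that actually occur are stored, so a term that is missing from a document costs nothing. A real engine would also normalize and filter tokens, which this sketch skips.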
You don't know the exact order of the words, so you don't know whether the words appear in the order of the phrase, for example, the phrase ending with "in Madrid". Okay. So this is actually what we want to do. So now we have a positional index in order to answer this. It's a simple retrieval, because you can just go to the words and look for the documents in which both appear. In this case there is only one document in which they both appear, and you can see that one of them appears in position 16. So, okay, you now know that in that document these words do appear, and that they do appear in a row. Okay. As I told you, what we are going to try to create in the practice is this positional inverted index. It will allow us not only to answer keyword queries, but also to answer phrase queries. This is what we are going to do in the practice, and this is the basic structure that enables information retrieval. Then you can complement this structure with all the other features that we have discussed. At some point, instead of just recording whether the words appear, you can take into account the number of times that they appear. You may also want to use a better representation of the words. But the idea is to store this structure, whatever you use for the representation. So, actually, we are somehow structuring the data. At the very beginning, we started by explaining that information retrieval deals with unstructured information, and the best way to do that is to try to create some structure. Of course, it is not something as complex as SQL. But you see that in the end, in order to speed up the process, we need to create some kind of structure in the data. The phase in which you create this positional inverted index is called indexing.
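A positional index and a phrase query over it might look like the sketch below. It follows the nested-dictionary idea (term, then document ID, then positions); the two documents and the function names are hypothetical, and this is not how any particular library implements phrase matching.

```python
from collections import defaultdict

# Hypothetical toy corpus: both documents contain "best" and "restaurants",
# but only the first one contains them next to each other, in that order.
documents = {
    1: "the best restaurants in madrid are in the centre",
    2: "madrid has restaurants and the best museums",
}

def build_positional_index(docs):
    """Nested dictionary: term -> {doc_id -> [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return IDs of documents in which the phrase terms appear consecutively, in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents containing every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        # A start position works if the k-th term sits at start + k for every k.
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                matches.append(doc_id)
                break
    return matches

index = build_positional_index(documents)
print(phrase_search(index, "best restaurants"))  # [1]
```

A keyword-only search for the same two words would return both documents; the positions are what let the phrase query reject the second one.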
In fact, this is something that hopefully you do once. You create the index, and then any time that a user sends a new query, you just leverage it. So the indexing phase is something that is, however, slow. But you don't care, because this is something that you don't do very often. If you collect new documents, you can introduce them in batches the next day or something like that. So what if the actual indexing of the information takes long? You don't care, because what you want is to be fast at query time, at inference time. Right? You want this almost in real time. Okay, questions so far? Well, I mean, it depends on the size of the dataset. Okay. It could range from several hours to several days in practice. In the practical session, we are going to see that we are able to index around 5,000 documents in, I don't know, a couple of minutes.
Okay, so it's not painful at all. But of course, if you have a database of billions of documents, the first time you index them it's going to take a while. What's the point here? Once you have indexed them, you don't need to re-index them. Tomorrow, if you have more documents, you can just add them to the index by updating the index. Say you are talking about document number 8. For example, you have a new document; this is document number 8. The only thing you need to do is create another entry for the document number and extract the words and the positions. Now, for example, you have position 400. All the rest of the index stays as it is. Okay, you only need to update the index. The first time you index the documents it's going to be slow, but as for re-indexing, as I told you, if you don't want to keep launching indexing processes, you can just wait and add several documents at the same time. Okay, so let's say you create a SQL database.
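Adding a new document to an existing positional index, without rebuilding it, can be sketched like this. The functions `add_document` and `add_batch` and the sample texts are hypothetical; a production engine handles updates with its own on-disk structures, which this sketch does not attempt to model.

```python
def add_document(index, doc_id, text):
    """Add one document to a positional index of the form term -> {doc_id -> [positions]}."""
    for position, term in enumerate(text.lower().split()):
        index.setdefault(term, {}).setdefault(doc_id, []).append(position)

def add_batch(index, new_documents):
    """Add several documents at once, e.g. everything collected during the day."""
    for doc_id, text in new_documents.items():
        add_document(index, doc_id, text)

# Initial (slow, one-off) indexing of the existing collection.
index = {}
add_batch(index, {1: "best restaurants in madrid", 2: "cheap hotels in madrid"})

# Later, a new document arrives: only its own entries are created.
add_document(index, 8, "new restaurants in madrid")
print(index["restaurants"])  # {1: [1], 8: [1]}
print(index["hotels"])       # {2: [1]}
```

The postings of terms that do not occur in the new document, such as "hotels" here, are never touched, which is why an update is much cheaper than the first indexing pass.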
The first time you load the information into the SQL database, it's going to be slow. Updating the SQL database, not so much. It is the same with the index as with the SQL database. Okay, in order to put this into practice, we are going to leverage a library called Elasticsearch, which is built on Lucene, which is like the standard library for information retrieval. But the idea is to create an in-memory structure, which is basically a dictionary, right? It's not a Python dictionary, right? There are some tricks in order to store that in the computer memory. Okay, this dictionary will look like an index file of a given format, so it's easy for Elasticsearch, or for any other tool, to deal with it, but it's going to be a big file. Actually, it's going to be a large collection of files, in order to not have a file of 25 GB on your computer. It will create several files. Okay, and it's actually doing that because then you can easily parallelize the execution. You can move the index around. When you are searching for information, the only thing you are doing is connecting to these files, which is a simple way to speed up the process. Of course, you can load some parts of the index in memory, perhaps the part that you use the most or the part that you have used most recently. But how you want to manage your memory goes beyond this, and it is more in the realm of computer science and optimization. It's not that you need to sit down and define this; it is something that the library takes care of.
Okay, so let's move on to the next topic. Okay, so let me ask you. When you are using Google or any other website, you type something and you get the results. What is the first document that you check most of the time? The one that appears on the top. Why? Good point. Why are you going for this one? What makes it the most important? That's it. Okay, and that's something that you would like to be able to implement.
We are in. Okay, I'm not only going to give you the documents in which this word appears, but I would like to order them according to other elements. So if you have a very common word, you don't want to give it too much importance. So this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is 
what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this 
is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what this is what 
A very common word is not one whose repetition in a document I care about, because a word like "is" carries no meaning. That is why it is preferable to count repetitions only for words that are not stop words. Here we were talking about repetitions, but what if I have several words in the query, such as rest, on, see, moderate, or house? I would prefer a document with all of these words, and a document that does not contain a query word should score lower. We also need a way of accounting for the length of the document, because the number of times a word is repeated depends on how long the document is: a short note is not the same as a book. You need to be able to account for this. The more words the document has in common with the query, the better. This is more or less the idea behind BM25. The idea is to combine all of these signals so that you can decide which documents to return: you use an inverted index to retrieve the documents and the BM25 metric to rank them. You are going to create a system that, perhaps, is not the state of the art anymore, but it will be an excellent solution for almost any information retrieval problem. Okay.
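To make the combination of an inverted index and BM25 concrete, here is a minimal sketch in plain Python. It is not the implementation used in the practice: the tokenizer, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the scoring follows the commonly used BM25 formula.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.5, b=0.75):
    """Rank the documents that share at least one term with the query."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms get a higher weight than terms found in most documents.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalisation: long documents need more repetitions.
            norm = 1 - b + b * lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection (hypothetical documents).
docs = {
    "d1": "the house by the sea",
    "d2": "a restaurant near the house by the sea",
    "d3": "the the the the the",
}
index, lengths = build_index(docs)
ranking = bm25_search("restaurant house sea", index, lengths)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The document that contains all three query words comes first, the one with two of them second, and the document made only of a repeated common word is never retrieved because it shares no term with the query.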
And this is the final model that we will implement in our system. In the following session, we will play around with different concepts, such as Elasticsearch, and we will end up implementing an index based on BM25, and hopefully this index will give us good results. That is what we are going to try to implement. Elasticsearch is a tool that I recommend because it is powerful, rather simple to set up, and it scales to a very big cluster in production as well. Of course, you can use any other library; it does not have to be this one. In the practice, everything is based on Elasticsearch, which is an open source tool that you can use. You also have a Weka, which is more or less the same thing. Elasticsearch does the same job as Solr in a more commercial, simpler way. If you want to implement an information retrieval system, I recommend you go for Elasticsearch and check it out. But in any case, it is not the only one you can use, and I don't have anything to do with Elastic itself; it is a commercial platform. You can use some other open source tool if you prefer but, as we will see, it is not so difficult to play around with an inverted index with these tools. Basically, they abstract you from all the details. The only thing that you need to do is collect the documents, define how you want to index them, index them, and finally create a way of making queries to retrieve documents. Okay, so far so good? Fine. So now I would like to close by talking a little bit about how to evaluate an information retrieval system. Okay, precision and recall. Hopefully you are familiar with these two metrics from machine learning, right? Just a quick summary: precision is the fraction of the retrieved documents that match the user's information need. So if I give you 10 documents and 7 of them are relevant, precision is 70% of what I gave you.
Recall is the fraction of the relevant documents that are retrieved. So if I have 10 documents, and 7 of them are relevant, and I retrieve 5 of those 7, recall is 5 out of 7, which is 71%. So this is the idea of precision and recall. Another example: if there are fourteen relevant documents in total and you give me seven of them, recall is fifty percent. There are also measures that combine both. Okay, so my question to you is: what do you prefer in an information retrieval system, precision or recall? Any idea why? Okay, so let's think about both. Say two hundred documents are relevant. With high recall, I guarantee that those two hundred appear in the results, possibly with many others around them. With high precision, not all of them are going to appear, but I am completely sure that the ten I am giving you, the first ones, are relevant. Which one do you prefer? I would prefer precision, because I don't care about the rest, I just want a good one. Right, nobody is going to review all the results. You don't care that there are eight million documents that are actually relevant to your query, because you are not able to read them all. Well, you can, but what matters is that what appears on the first page, the top three or five documents, is relevant, because these are the ones that you are going to check. Okay, so for most information retrieval systems, I would say that precision is what you prefer. However, there are some cases where recall matters. Would you care about recall in the information retrieval system that we are implementing? In this case, you have a database and you want to retrieve information that could be relevant for a given meeting. Well, perhaps you don't care so much if the assistant also gives you some documents that are not really relevant.
But what you want is to get all the documents that are related to the meeting. You don't want to miss anything. So okay, you are giving me twenty-five results and only ten of them are the ones that I want, but at least you are giving me all the information that is relevant. Okay, so it depends. It actually depends on the situation you are in. Most of the time, we are searching a huge database of documents, and thousands of them are going to match what we want. I don't have the time to read them all. The only thing I have time for is to read the top two or three documents. So I want these two or three that you are giving me to be 100% about what I'm asking. I want the top ones, the ones on the first page, to be relevant, because these are the ones that I'm going to check and use. I don't actually care about the rest, as long as the ones that appear are the ones that I want. For this case, precision is what I prefer. So most of the time we use the two standard metrics together, precision and recall, because, as you can see, together they balance quite well the performance of the system. You could return a single relevant document and precision is 100%, but perhaps I would like to know a little bit more; I would like to see another two or three. And recall is rather simple to maximize: you just return the whole database, the whole web, the whole internet, right? On top of these, there are many other metrics based on precision and recall that we use in typical information retrieval systems. Then, what about accuracy? What is the limitation of accuracy that you know about that could affect this problem? For a given query, 99.9% of the documents in the collection are not going to be relevant. So basically I can tell you, okay, there is nothing relevant, and I'm going to be 99.9% accurate, right? So your accuracy can be high and still tell you nothing, because this is an unbalanced problem.
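The arithmetic behind precision, recall, and the accuracy trap can be checked with a few lines of Python. This is only an illustrative sketch: the collection size and the document identifiers are made up, and the numbers are chosen to reproduce the 70% precision and 50% recall examples mentioned in the lecture.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a result set, ignoring ranking order."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(retrieved, relevant, collection):
    """Fraction of documents whose retrieved/not-retrieved label is correct."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = sum((doc in retrieved) == (doc in relevant) for doc in collection)
    return correct / len(collection)


# Hypothetical collection: 1000 documents, 14 of them relevant to the query.
collection = [f"doc{i}" for i in range(1000)]
relevant = collection[:14]

# The system returns 10 documents, 7 of which are relevant.
retrieved = collection[7:17]
p, r = precision_recall(retrieved, relevant)
f1 = f_measure(p, r)

# A system that returns nothing still looks very "accurate".
empty_accuracy = accuracy([], relevant, collection)

print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy of returning nothing: {empty_accuracy:.3f}")
```

In this toy collection the useless system that returns nothing reaches 98.6% accuracy, which is why accuracy is not a helpful metric when almost every document is non-relevant.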
We can talk about how to maximize a given metric for the meeting example I gave before. Accuracy is not something we use here; what we look at is the trade-off between precision and recall. There is one more thing I want to comment on. Imagine you have two systems returning two result lists, where this result is okay and this one is not. The precision and recall for both systems are the same. You would prefer the system that puts the relevant results at the top positions, because those are the ones you are going to review. We try to measure this with a metric that takes into account not only the number of relevant results, but also their position. We can also use other metrics such as the F-measure, and we can do some kind of A/B testing. The most important thing is to integrate the user in the process, asking them if they like or don't like the results. This is the most difficult, but the best way to improve the effectiveness of an information retrieval system. Of course, most users won't have time to look through the documents and tell you, one by one, whether they were useful. This is called relevance feedback. However, even when the user is not telling me explicitly, they are telling me implicitly. For example, on Amazon I search for some sneakers and I get four results. I check two of them and end up buying one. So the system can infer that the one I bought was a good result. Or on Google, I give you ten results for restaurants and you open three of them and end up liking one. Or you enter a page and you just bounce after a couple of seconds, so the result was not that good. Or you refine your query, because none of the documents was what you wanted. So there are ways of trying to understand, in an implicit way, what users are looking at and which results they actually use. Around 10% of the queries on a search engine are handled with what is called query expansion, which is basically trying to understand the semantics of the query. Nowadays, thanks to language models, we are able to do that.
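The transcript does not preserve the name of the position-aware metric, so the sketch below uses average precision as one common example of such a metric; treating it as the metric meant in the lecture is an assumption. The two hypothetical systems return exactly the same documents, so their precision and recall are identical, but only one of them puts the relevant documents first.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position  # precision at this cut-off
    return total / len(relevant) if relevant else 0.0


relevant = {"a", "b"}
system_1 = ["a", "b", "x", "y"]  # relevant documents at the top
system_2 = ["x", "y", "a", "b"]  # same documents, relevant ones at the bottom

ap_1 = average_precision(system_1, relevant)
ap_2 = average_precision(system_2, relevant)
print(f"system 1: {ap_1:.3f}  system 2: {ap_2:.3f}")
```

Averaging this value over many queries gives mean average precision, which rewards systems that rank relevant results early.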
I recommend you read the post on the Google Blog about how they are using it to improve information retrieval. Something else that is relevant is to account for the importance of the documents. This is something that was done quite well with the PageRank calculation, which tries to account for the importance of each document by measuring how many other documents link to it. For example, in a corporate database, every document written by the CEO is more important than one written by me. Or, for a meeting, any document that is an actual technical review, documenting processes and investments, is more important than a note I wrote myself. This slide represents this concept. So you should always take into account what kind of importance you are trying to measure. Finally, imagine a document that is not well known or referred to by many others, for example my personal blog. No one links to my personal blog because no one knows me. So, as it doesn't have any incoming links, the content I have written is not going to be positioned as high in the ranking as this other one. Why? Because this one is referred to by many others. So you can consider the links as a way of expressing the importance of a document. If a person in a social network has many friends, it is because they are popular, so perhaps this is something you want to use to measure importance. If I am alone with no friends, it is the same idea the other way around. Here, with the links, the number of incoming links is what gives you importance. Of course, some people try to game this, and there are ways of actively looking for that and removing those web pages, but in any case it is a way of measuring importance. And of course it is not the only one. What would you do to measure the importance of a given document? Is there something that you can include in your information retrieval system to improve your ranking?
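The link-based idea of importance can be sketched with a small power-iteration version of PageRank. This is a simplified illustration, not the production algorithm of any search engine: the link graph, the page names, the damping factor of 0.85, and the fixed number of iterations are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its importance among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: nobody links to the personal blog.
links = {
    "news_site": ["reference_page"],
    "forum": ["reference_page"],
    "reference_page": ["news_site"],
    "personal_blog": ["reference_page"],
}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

The page that many others refer to ends up with the highest score, while the blog with no incoming links keeps only the small baseline share.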
Finally, there are some additional resources that could be relevant to you. I'd like to say a few words about the practice, but before that, is there anything else that needs to be clarified? Is there any concept you need me to review? Is there any final question? The first one is the Text REtrieval Conference, TREC. It has been running for decades, and over the years they have organized and recorded many different evaluation efforts for information retrieval. So, basically, everything that you want to know in terms of research and practical application of information retrieval is going to be over there; I do recommend you check it out. The TREC collection is a well-known standard, the dataset that we use for information retrieval. You can use it for your experiments. There is also what we call CLEF, the cross-language evaluation forum, which also has a conference. They also have some tasks and some experimental datasets, tasks you can participate in and data that you can use for your experimentation. So it is the analogue of TREC: TREC takes place in the States, and this one takes place in Europe. Okay. And this is basically what you can find there. Okay. So let me explain the practice. What we are going to do is leverage Elasticsearch to implement a simple information retrieval system. So the first thing that you need to do is install Elasticsearch, following the link I have in the notebook. It's very simple: you just download the zip file with Elasticsearch and decompress it. You will have a folder, and in this folder you will have an executable to run Elasticsearch. Okay, this is something that I have explained in the notebook, but if you have any doubt, let me know. So, long story short: download this, install Elasticsearch on your computer, and run the server.
They give you a binary to have the Elasticsearch server up and running. Okay. So it's a server that you start; you will have it listening on a port of your computer, and you can start making requests to this server. In this server you will store the indexes with the stored information, and it will be the one that we are going to query. In the real world, of course, this server will not be on your laptop. You will have a cluster of computers, with some dedicated servers on which you install it, but for the practice your laptop is enough. I do not recommend executing this code on Google Colab, because then you would need to configure the server on that platform. It's better to execute it on your laptop. Okay. So you just download the notebook from the repository; in the notebook you have the link to Elasticsearch. Install Elasticsearch on your computer and run the server there, and then you will have the server on a port listening for connections. Once you have the server up and running, you can index the documents. To do that you need to connect to a REST API. To save you the burden of connecting to this REST API directly, there are wrappers in different programming languages; I'm going to use a wrapper for Python, so you will be able to connect to the server with a Python call. Okay. As you will see, the code is not very straightforward, so I'm giving you some functions to help you with that. And, of course, if you are thinking about implementing this for your application, this is not the way your users will interact with the system. They will not have access to a notebook where they type some Python code. Of course, we would create an application on top of that, and so on, but this is out of the scope of the practice.
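To show what talking to the server's REST API looks like underneath a Python wrapper, here is a minimal sketch using only the standard library. It assumes an unsecured local server on the default port 9200 (recent Elasticsearch versions enable HTTPS and authentication by default, so a fresh install may need extra configuration), and the index name `lecture_docs` and field name `content` are hypothetical. The request is only built here; `run_search` sends it and needs a running server.

```python
import json
import urllib.request

# Assumption: a local, unsecured Elasticsearch server on the default port.
ES_URL = "http://localhost:9200"


def build_search_request(index_name, field, text, size=10):
    """Build (but do not send) a full-text `match` query for the _search endpoint."""
    body = {"size": size, "query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index_name}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and return the ids of the matching documents."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [hit["_id"] for hit in payload["hits"]["hits"]]


# Hypothetical index and field names, for illustration only.
request = build_search_request("lecture_docs", "content", "information retrieval")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With the server running: print(run_search(request))
```

A client library hides exactly this kind of HTTP call behind ordinary Python functions, which is why the practice relies on one.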
So, I'm going to give you some functions to help you index the information, evaluate the methods, and so on. Don't worry about that. We are going to create a simple index and run simple queries against the index. One important part of creating the index is defining its configuration. In this configuration, you define what to do with the stop words, how you want to organize the data, and whether you want to deal with English, Spanish, or any other language. Okay, so this is what you define. The configuration takes the form of a file, but we will write it in Python. The syntax is not very straightforward, but I'm giving you some examples. The only thing you need to do alongside these examples is to consult the Elasticsearch documentation. Finally, you will need to evaluate the different systems that you have created. I have provided you with a list of functions to evaluate the model, right? Don't worry. The only thing you need to do is call the final function, called evaluate. This function will compute precision, recall, and so on. So don't worry when you see these functions; don't try to understand them in detail. The only thing they are doing is computing precision, recall, and so on. Okay, don't get discouraged. If at some point you get a little bit lost in this practice, it's completely normal. Just let me know, I'm here. I tried to make it as easy as possible, and in any case I will be posting the solutions every day. So, if you don't understand the implementation of the practice, you can just wait to see my solution and work with it, change things, and implement your own. Okay, it's completely fine. That is going to be the practice. So, if you don't have anything else to comment, I guess I'll see you next Monday for the practice.
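As a reference for the index configuration mentioned for the practice, the sketch below writes one possible configuration as a Python dictionary. It is an assumption about what such a configuration could look like, not the one from the course notebook: the analyzer name `english_no_stopwords` and the fields `title` and `content` are hypothetical. It uses the standard analyzer with the built-in English stop word list; a Spanish collection could use `_spanish_` instead.

```python
import json

# Hypothetical index configuration: analyzer and field names are illustrative.
index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_no_stopwords": {
                    "type": "standard",
                    # Built-in English stop word list.
                    "stopwords": "_english_",
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english_no_stopwords"},
            "content": {"type": "text", "analyzer": "english_no_stopwords"},
        }
    },
}

# The same structure is what gets sent as JSON when the index is created.
print(json.dumps(index_config, indent=2))
```

How this dictionary is passed to the server depends on the client library and its version, so the exact call is best checked in the Elasticsearch documentation. Elasticsearch ranks `text` fields with BM25 by default, so no extra similarity setting is needed for this sketch.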