Course Search System: My Implementation Journey
Data Gathering
I started by figuring out how to scrape data from the Analytics Vidhya website. I used BeautifulSoup (bs4) to extract information such as course titles, descriptions, and prerequisites. Once the fetching code was in place, I verified that the data was being scraped correctly and stored in a structured format for later use.
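To give a sense of this step, here is a minimal scraping sketch. The listing URL and the `course-card` CSS class are illustrative assumptions; the real selectors depend on the page's actual markup.

```python
import requests
from bs4 import BeautifulSoup

def scrape_courses(listing_url):
    """Fetch a course listing page and extract title/description pairs."""
    response = requests.get(listing_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    courses = []
    # "course-card" is an assumed class name for illustration only.
    for card in soup.find_all("div", class_="course-card"):
        title = card.find("h3")
        description = card.find("p")
        courses.append({
            "title": title.get_text(strip=True) if title else "",
            "description": description.get_text(strip=True) if description else "",
        })
    return courses
```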
Choosing the Right Tools
For processing the text data, I selected the language model all-MiniLM-L6-v2 to encode the relevant data into vector representations. The model converts text into numerical representations that computers can process and compare efficiently.
all-MiniLM-L6-v2 is an efficient choice for course search: it produces high-quality 384-dimensional embeddings while staying lightweight (about 80 MB), giving it a good performance-to-resource ratio for semantic similarity tasks.
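A short sketch of the encoding step using the sentence-transformers library (the example course texts are placeholders, not real scraped data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

course_texts = [
    "Introduction to Machine Learning: learn regression and classification.",
    "Deep Learning with PyTorch: build and train neural networks.",
]
# Normalizing makes inner-product search equivalent to cosine similarity.
embeddings = model.encode(course_texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```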
To store and search these vectors efficiently, I used FAISS, a vector search library designed to handle large collections of vectors and quickly find the most similar ones.
I would have preferred Pinecone, a cloud-based vector database that is efficient as well as clean and easy to interpret. However, recent updates to Pinecone's servers were causing issues, so with the deadline in mind I went with the next best option: FAISS.
FAISS is a solid choice for local vector search, as it is fast, lightweight, and fairly simple to work with for straightforward tasks.
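A minimal FAISS sketch of how indexing and searching might look; the random vectors stand in for the real course embeddings produced above:

```python
import faiss
import numpy as np

dimension = 384  # matches all-MiniLM-L6-v2's embedding size
index = faiss.IndexFlatIP(dimension)  # inner product over normalized vectors

# Stand-in for the float32 embedding matrix from the encoding step.
embeddings = np.random.rand(100, dimension).astype("float32")
faiss.normalize_L2(embeddings)
index.add(embeddings)

query = np.random.rand(1, dimension).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)  # top-5 most similar courses
print(ids[0], scores[0])
```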
Building the System
I designed my system to be flexible and scalable. Here's a breakdown of its key components:
- Data Ingestion: collects and prepares the course data.
- Embedding: the language model processes the course descriptions and creates vectors.
- Vector Database: stores the vectors for efficient searching.
- Search API: allows users to query the system and get relevant results (see the sketch after this list).
- User Interface: the front end where users interact with the system.
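To show how the pieces connect, here is a sketch of the search path. It assumes the model and index built in the earlier sketches and a hypothetical `courses` list aligned with the index order:

```python
import numpy as np

def search_courses(query, model, index, courses, k=5):
    """Embed the query, search the FAISS index, and return matching courses."""
    query_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    return [
        {"course": courses[i], "score": float(s)}
        for i, s in zip(ids[0], scores[0])
        if i != -1  # FAISS returns -1 when fewer than k results exist
    ]
```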
I deployed the system on Hugging Face Spaces to ensure it's reliable and can handle increasing user demand. Each component runs in its own container, making it easy to manage and update.
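For the front end, a Gradio app is a common choice on Hugging Face Spaces; the sketch below assumes the `search_courses` helper and objects from the earlier sketches, since the exact UI framework is not specified here:

```python
import gradio as gr

def handle_query(query):
    # Format the top matches as one line per result.
    results = search_courses(query, model, index, courses, k=5)
    return "\n".join(f"{r['score']:.3f}  {r['course']['title']}" for r in results)

demo = gr.Interface(
    fn=handle_query,
    inputs=gr.Textbox(label="Search courses"),
    outputs=gr.Textbox(label="Top matches"),
    title="Course Search",
)
demo.launch()
```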
Challenges and Solutions
Data Quality: Ensuring data consistency and accuracy was a big challenge. I addressed this by carefully cleaning and standardizing the data while scraping. Since scraping text accurately is a tedious task, it took me some time to get it right.
Model Performance: Choosing the right language model was crucial. I experimented with different models, and given the use case, I decided it was best to keep the model lightweight and fast so that search results come back quickly.
User Experience: I focused on making the system user-friendly. I conducted many tests and made improvements to the interface and search algorithm.
Overall, this project was a great learning experience. I'm proud of what I accomplished within the deadline and believe I could have improved it further given more time. I look forward to discussing this project in an interview with you.