---
title: Multilingual Search Quora Similar Questions
emoji: 🔍🌐💬
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 4.41.0
app_file: app.py
pinned: false
license: mit
---

# Semantic Search App for Quora Dataset

This application enables semantic search across the Quora question dataset in multiple languages using advanced machine learning techniques. Unlike traditional keyword search, semantic search considers the meaning of the query to generate more relevant results.

## Features

- **Multilingual Semantic Search**: The app allows users to search for similar questions in different languages.
- **Model**: Uses the `paraphrase-multilingual-mpnet-base-v2` model from Sentence Transformers to generate embeddings for the queries and questions.
- **Vector Database**: Embeddings are stored and retrieved from the Pinecone Vector Database, allowing for efficient similarity searches.
- **Cosine Similarity**: Search results are ranked by cosine similarity scores, from highest to lowest, showing how closely related each question is to the query.
- **Dynamic Query**: Users can adjust the number of similar questions retrieved using a slider.

## How It Works

1. **Embedding Generation**: The app uses the `paraphrase-multilingual-mpnet-base-v2` model to encode both the query and the questions from the Quora dataset into 768-dimensional embeddings.
2. **Search Query**: When a user inputs a search query, the app generates an embedding for the query.
3. **Similarity Search**: The query embedding is then compared with the stored question embeddings in Pinecone using cosine similarity. The top K most similar questions are retrieved and displayed.
4. **Results Display**: The results are shown in a table, with each row displaying the question ID, the question text, and the similarity score.

## Usage

1. **Input Your Query**: Enter your search query in the text box provided.
2. **Adjust Number of Results**: Use the slider to select how many similar questions you want to retrieve (between 3 and 10).
3. **View Results**: After clicking the "Search" button, the app will display the most similar questions along with their similarity scores in a table.

## Technology Stack

- **Gradio**: Used to build the interactive user interface.
- **Pinecone**: Vector database for storing and querying embeddings.
- **Sentence Transformers**: Used for generating embeddings with the `paraphrase-multilingual-mpnet-base-v2` model.
- **Pandas**: Used for handling and displaying the results in a tabular format.
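
The similarity search step described under "How It Works" can be illustrated with a minimal, dependency-free sketch. This is not the code in `app.py`: in the app, Pinecone performs the comparison and the vectors are 768-dimensional model embeddings. Here the function names (`cosine_similarity`, `top_k_similar`), the question IDs, and the toy 3-dimensional vectors are all hypothetical, chosen only to show how ranking by cosine similarity and keeping the top K rows works.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, stored, k=3):
    """Return the k (id, text, score) rows with the highest similarity."""
    scored = [
        (qid, text, cosine_similarity(query_vec, vec))
        for qid, text, vec in stored
    ]
    # Highest score first, matching the ordering the app displays.
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]


# Hypothetical stand-ins for stored question embeddings (toy 3-d vectors,
# not real model output).
stored_questions = [
    ("q1", "How do I learn Python?", [0.9, 0.1, 0.0]),
    ("q2", "What is the best way to study Python?", [0.8, 0.2, 0.1]),
    ("q3", "How do I bake bread?", [0.0, 0.1, 0.9]),
    ("q4", "¿Cómo aprendo Python?", [0.85, 0.15, 0.05]),
]

# Hypothetical query embedding; in the app this comes from the model.
query_embedding = [1.0, 0.0, 0.0]

results = top_k_similar(query_embedding, stored_questions, k=3)
for qid, text, score in results:
    print(f"{qid}\t{score:.3f}\t{text}")
```

With these toy vectors the unrelated question (`q3`) falls outside the top three, and the Spanish question ranks alongside its English counterparts, which is the behavior a multilingual embedding model is meant to provide. The `k` argument plays the role of the results slider.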