---
title: LegalKit Retrieval
emoji: 📖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: true
header: mini
license: apache-2.0
short_description: A binary Search with Scalar Rescoring through legal codes
---

# LegalKit Retrieval, a binary Search with Scalar (int8) Rescoring through French legal codes

This Space showcases the [tsdae-lemone-mbert-base](https://huggingface.co/louisbrulenaudet/tsdae-lemone-mbert-base) model by Louis Brulé Naudet, a BERT-based sentence embedding model. It was trained with the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) method for unsupervised sentence embedding learning, with one objective: French legal domain adaptation.

The retrieval process, a binary search followed by scalar (int8) rescoring, is designed to be fast and memory efficient: the binary index is small enough to fit in memory, and the int8 index is loaded as a view to save memory. In total, the process requires keeping:

1. the model in memory,
2. the binary index in memory, and
3. the int8 index on disk.

The binary index is also much faster to search than a float32 index (up to 32x), and the rescoring step is inexpensive. Together, this allows for fast, scalable, cheap, and memory-efficient retrieval.

Notes:

- The SentenceTransformer model currently in use is in beta and may not be suitable for direct use in production.

## Dependencies

### Libraries Used:

- **Accelerate** (v0.29.1): A Hugging Face library for running PyTorch code across CPUs, GPUs, and distributed setups with minimal code changes.
- **Faiss-GPU** (v1.7.2): A GPU-accelerated library for efficient similarity search and clustering of dense vectors.
- **Gradio** (v4.25.0): A library for building web interfaces around machine learning models, which simplifies model deployment and interaction.
- **Polars** (v0.20.18): A fast DataFrame library written in Rust, with Python bindings, for efficient manipulation of large datasets.
- **Sentence-Transformers** (v2.6.1): A library for generating sentence embeddings, used for natural language processing tasks such as semantic similarity and text classification.
- **Spaces** (v0.25.0): A utility library for managing GPU resources in GPU-based computing environments.
- **Usearch** (v2.10.5): A library for fast approximate nearest neighbor search.

### Installation Guide

To install all the dependencies, use the following command:

```shell
pip3 install accelerate faiss-gpu gradio polars sentence-transformers spaces usearch
```

Note: Ensure that Python is installed on your system before installing these libraries.

## Citing this project

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {LegalKit Retrieval, a binary Search with Scalar (int8) Rescoring through French legal codes},
  howpublished = {\url{https://huggingface.co/spaces/louisbrulenaudet/legalkit-retrieval}},
  year = {2024}
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).
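
## Illustrative sketch: binary search with int8 rescoring

The two-stage retrieval described at the top of this README can be illustrated with a small, self-contained toy. This is only a sketch under stated assumptions, not the code in `app.py`: it uses the Python standard library and random vectors instead of real embeddings, and every name in it (`to_binary`, `to_int8`, `search`, `oversample`, the corpus size and dimension) is hypothetical. Stage one ranks documents by Hamming distance between 1-bit codes; stage two rescores a short candidate list with the float query against int8 document vectors.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def to_binary(vec):
    """Pack the sign of each dimension into one Python int (1 bit per dimension)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | int(x > 0)
    return bits


def to_int8(vec, scale):
    """Scalar-quantize floats to integers in [-127, 127] using a shared scale."""
    return [max(-127, min(127, round(x / scale * 127))) for x in vec]


def hamming(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")


def search(query, binary_index, int8_index, top_k=3, oversample=4):
    """Binary search over all documents, then int8 rescoring of the candidates."""
    # Stage 1: cheap Hamming-distance ranking, keeping top_k * oversample candidates.
    q_bits = to_binary(query)
    by_distance = sorted(
        range(len(binary_index)), key=lambda i: hamming(q_bits, binary_index[i])
    )
    candidates = by_distance[: top_k * oversample]
    # Stage 2: rescore only the candidates (float query dot int8 document).
    scored = [
        (sum(q * d for q, d in zip(query, int8_index[i])), i) for i in candidates
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy corpus: 200 random unit vectors of dimension 64 (stand-ins for embeddings).
rng = random.Random(0)
corpus = [normalize([rng.gauss(0, 1) for _ in range(64)]) for _ in range(200)]

# Shared quantization scale: largest absolute value seen in the corpus.
scale = max(abs(x) for vec in corpus for x in vec)

binary_index = [to_binary(vec) for vec in corpus]      # kept in memory
int8_index = [to_int8(vec, scale) for vec in corpus]   # would live on disk

# Query with a document's own vector; that document should rank first.
print(search(corpus[42], binary_index, int8_index))
```

The `oversample` factor is the main knob in this sketch: a larger value sends more candidates to the rescoring stage, which costs more time but reduces the chance that the coarse binary stage drops a relevant document.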