metadata
title: Fineweb-edu-fortified Semantic Search Demo
emoji: 📚
sdk: gradio
sdk_version: 4.41.0
app_file: app.py
pinned: false
datasets:
  - airtrain-ai/fineweb-edu-fortified
  - HuggingFaceFW/fineweb-edu
models:
  - TaylorAI/bge-micro
license: apache-2.0

Semantic Search on Fineweb-edu-fortified sample

This demo performs semantic search over a single crawl ({{CRAWL_DUMP}}) from Fineweb-edu-fortified. It is intended to illustrate the contents of fineweb-edu and fineweb-edu-fortified. To explore Fineweb-edu-fortified further, you can view automatic clustering, embedding projections, and more for a 500k-row sample using this Airtrain dashboard.

The embeddings are the ones already present in the dataset, and the same embedding model is used to embed your search phrase. The search returns the 15 rows whose embedding vectors are closest to the embedding of the search phrase.
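As a rough illustration of the retrieval step described above, here is a minimal sketch of a top-k nearest-neighbor lookup. It is not taken from `app.py`: the function names, the `(text, embedding)` row layout, and the toy vectors are hypothetical, and cosine similarity is an assumed notion of "closest", since this README does not say which metric the demo uses.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k_rows(query_embedding, rows, k=15):
    """Return the k (score, text) pairs most similar to the query, best first.

    `rows` is an iterable of (text, embedding) pairs (a hypothetical layout).
    Cosine similarity is an assumption; the demo's actual metric may differ.
    """
    scored = ((cosine_similarity(query_embedding, emb), text) for text, emb in rows)
    return heapq.nlargest(k, scored)


# Toy 2-dimensional embeddings, made up for illustration only.
rows = [
    ("photosynthesis in plants", [0.9, 0.1]),
    ("history of the Roman empire", [0.1, 0.9]),
    ("how leaves make sugar", [0.8, 0.3]),
]
results = top_k_rows([1.0, 0.0], rows, k=2)
```

In a real setup the query vector would come from the same embedding model that produced the stored vectors, so that the query and the rows live in the same vector space.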

The search data is lazily loaded, so shortly after the space is launched the full corpus of text from that crawl may not yet be available for search. Refer to 'Rows searched' to see how many rows were searched to retrieve the results.
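The effect of lazy loading on results can be sketched as follows. This is an assumption-laden toy, not the Space's actual code: the `LazyCorpus` class, its method names, and the dot-product scoring are all hypothetical. It only shows why an early search can cover fewer rows than a later one, and why reporting a row count alongside the results is useful.

```python
import threading


class LazyCorpus:
    """Holds the rows loaded so far; more rows may arrive while searches run."""

    def __init__(self):
        self._rows = []  # (text, embedding) pairs loaded so far
        self._lock = threading.Lock()

    def add_batch(self, batch):
        """Called by a loader (for example a background thread) as data arrives."""
        with self._lock:
            self._rows.extend(batch)

    def search(self, query_embedding, k=15):
        """Search only the rows loaded so far and report how many that was."""
        with self._lock:
            snapshot = list(self._rows)
        # Dot-product scoring is an assumption made to keep the sketch short.
        scored = sorted(
            snapshot,
            key=lambda row: sum(q * x for q, x in zip(query_embedding, row[1])),
            reverse=True,
        )
        return [text for text, _ in scored[:k]], len(snapshot)


corpus = LazyCorpus()
corpus.add_batch([("row a", [1.0, 0.0]), ("row b", [0.0, 1.0])])
early_results, early_count = corpus.search([1.0, 0.0], k=1)

# A later batch can change the best match for the same query.
corpus.add_batch([("row c", [2.0, 0.0])])
later_results, later_count = corpus.search([1.0, 0.0], k=1)
```

Here the second element returned by `search` plays the role of 'Rows searched': it grows as batches arrive, and the same query can return different rows once more of the corpus is loaded.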