---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-en-20241020
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB embeddings + 25GB documents) for the [English edition of Wikipedia](https://en.wikipedia.org/).

Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.

This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020).

The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to match only commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) (for example with pip) to use this index.

## Example code

```python
import json

from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search and print the text of the top result
for x in embeddings.search("Bob Dylans second album", 1):
    print(x["text"])

# Run a search and filter on popular results (page views)
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```

## Example output

```json
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
```

## Data source

- https://dumps.wikimedia.org/enwiki/
- https://dumps.wikimedia.org/other/pageview_complete/
- https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020
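
## Example: percentile filter for RAG context

The `percentile` filter and the RAG use case described above can be combined in two small helpers. This is a minimal sketch, not part of txtai or of this index: `popular_search` and `build_context` are hypothetical names, the default values are arbitrary, and the code assumes an installed txtai version whose `Embeddings.search` accepts the `parameters` argument for SQL bind parameters.

```python
def popular_search(embeddings, question, min_percentile=0.99, limit=3):
    """Search the index, keeping only pages at or above a page-view percentile.

    `embeddings` is assumed to be a loaded txtai Embeddings instance.
    """
    if not 0.0 <= min_percentile <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")

    # Bind parameters (assumed to be supported by the installed txtai version)
    # keep quotes in the question from breaking the SQL string.
    query = (
        "SELECT id, text, score, percentile FROM txtai "
        "WHERE similar(:question) AND percentile >= :min_percentile"
    )
    return embeddings.search(
        query,
        limit,
        parameters={"question": question, "min_percentile": min_percentile},
    )


def build_context(results, max_chars=2000):
    """Join result rows into one context string, truncating each article text."""
    parts = []
    for row in results:
        # Each row holds a full article, so cap its length before prompting
        parts.append(f"{row['id']}: {row['text'][:max_chars]}")
    return "\n\n".join(parts)
```

With the index loaded as in the example code, `build_context(popular_search(embeddings, "Where in the World Is Carmen Sandiego?"))` would return a single string that can be placed in an LLM prompt as context. Truncating each article is a design choice made here because the index stores full articles.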