---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-en-20240320
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB embeddings + 25GB documents) for the [English edition of Wikipedia](https://en.wikipedia.org/).

Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors, where similar concepts produce similar vectors.
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.

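To illustrate the idea that similar concepts produce similar vectors, here is a toy sketch of cosine similarity, a common way to compare embeddings vectors. The three-dimensional vectors are hand-written for illustration and are not output from any real model; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written vectors (not produced by a real model)
vectors = {
    "album": [0.9, 0.1, 0.0],
    "record": [0.8, 0.2, 0.1],
    "volcano": [0.0, 0.1, 0.9],
}

# Compare every vector against the "album" vector
query = vectors["album"]
for name, vector in vectors.items():
    print(name, round(cosine_similarity(query, vector), 3))
```

In this toy data, "record" scores much closer to "album" than "volcano" does, which is the property semantic search relies on.
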
This index is built from the [Wikipedia March 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320).
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to match only commonly visited pages.

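This README does not describe how `percentile` is calculated. As a rough sketch of the idea, assuming it is a percentile rank over page view counts, the snippet below ranks a few made-up pages. The titles, the counts and the `percentile_ranks` helper are all hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(pageviews):
    """Map each title to the fraction of pages with views <= its own."""
    counts = sorted(pageviews.values())
    total = len(counts)
    return {
        title: bisect_right(counts, views) / total
        for title, views in pageviews.items()
    }


# Made-up page view counts, for illustration only
pageviews = {"Page A": 120000, "Page B": 4500, "Page C": 300, "Page D": 12}

ranks = percentile_ranks(pageviews)

# Keep only the most visited pages, similar in spirit to "percentile >= 0.99"
popular = [title for title, rank in ranks.items() if rank >= 0.75]
print(popular)
```

With a field like this stored next to each article, a threshold such as `percentile >= 0.99` keeps only the top 1% most visited pages.
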
txtai must be [installed](https://neuml.github.io/txtai/install/) (for example, with `pip install txtai`) to use this index.

## Example code

```python
from txtai.embeddings import Embeddings
import json

# Load the index from the Hugging Face Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search and print the text of the top result
for x in embeddings.search("Bob Dylans second album", 1):
    print(x["text"])

# Run a search and filter on popular results (page views)
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```

## Example output

```json
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)",
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
```
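
For the RAG use case, search results shaped like the output above can be joined into a context block for a language model prompt. The sketch below is one possible approach and is not part of txtai: `build_prompt` and the prompt wording are hypothetical, and the `results` list is a hand-written, abbreviated stand-in for the return value of `embeddings.search`.

```python
def build_prompt(question, results, max_chars=2000):
    """Join retrieved texts into a context block followed by the question."""
    # Truncate each article so the prompt stays a manageable size
    context = "\n\n".join(r["text"][:max_chars] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-in for embeddings.search(...) output (hand-written, abbreviated)
results = [
    {
        "text": "The Freewheelin' Bob Dylan is the second studio album by "
                "American singer-songwriter Bob Dylan, released on May 27, "
                "1963 by Columbia Records."
    },
]

prompt = build_prompt("What was Bob Dylan's second album?", results)
print(prompt)
```

Because the index stores full articles, truncating each text (here with the assumed `max_chars` parameter) helps keep the prompt within a model's context window.
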

## Data source

https://dumps.wikimedia.org/enwiki/

https://dumps.wikimedia.org/other/pageview_complete/

https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320