---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-en-20240320
---
# Wikipedia txtai embeddings index
This is a [txtai](https://github.com/neuml/txtai) embeddings index (5 GB embeddings + 25 GB documents) for the [English edition of Wikipedia](https://en.wikipedia.org/).
Embeddings is the engine that delivers semantic search. Data is transformed into embedding vectors, so that similar concepts produce similar vectors.
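To illustrate what "similar vectors" means, here is a minimal sketch that compares vectors by cosine similarity. The three-dimensional vectors are invented for illustration; real embedding vectors have hundreds of dimensions, and this README does not state which similarity measure the index uses internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not produced by any model).
album = [0.9, 0.1, 0.0]
record = [0.8, 0.2, 0.1]
volcano = [0.0, 0.2, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(album, record) > cosine_similarity(album, volcano))
```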
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
This index is built from the [Wikipedia March 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320).
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to restrict matches to commonly visited pages.
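This README does not describe how `percentile` is derived from the page view counts. As a rough sketch of one plausible approach (an assumption, not the confirmed method), each page could be ranked by the fraction of pages whose view count is less than or equal to its own. The titles and counts below are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(page_views):
    """Map each title to the fraction of pages with a view count <= its own."""
    counts = sorted(page_views.values())
    total = len(counts)
    return {
        title: bisect_right(counts, views) / total
        for title, views in page_views.items()
    }


# Hypothetical view counts, for illustration only.
views = {"A": 10, "B": 200, "C": 3000, "D": 45000}
ranks = percentile_ranks(views)

# Keep only pages at or above the 75th percentile.
popular = [title for title, rank in ranks.items() if rank >= 0.75]
print(popular)
```

Under this definition, a filter such as `percentile >= 0.99` would keep roughly the top 1% of pages by views.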
txtai must be [installed](https://neuml.github.io/txtai/install/) (for example with pip) to use this index.
## Example code
```python
import json

from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search
for x in embeddings.search("Bob Dylans second album", 1):
    print(x["text"])

# Run a search and filter on popular results (page views).
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```
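When the search text comes from user input, embedding it directly in the SQL string can break on quotes. The sketch below builds the same filtered query with a bind parameter instead. `popular_query` is a hypothetical helper, not part of txtai, and the commented-out call assumes a txtai version whose `search` method accepts a `parameters` argument.

```python
def popular_query(text, min_percentile=0.99):
    """Return (sql, parameters) for a similarity search limited to popular pages."""
    if not 0.0 <= min_percentile <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")
    # The threshold is coerced to float before formatting, so only a number
    # can reach the SQL text. The search text is passed as a bind parameter.
    sql = (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar(:query) AND percentile >= {float(min_percentile)}"
    )
    return sql, {"query": text}


sql, params = popular_query("Where in the World Is Carmen Sandiego?")
print(sql)

# With a loaded index (assumes `search` supports `parameters`):
# results = embeddings.search(sql, 1, parameters=params)
```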
## Example output
```json
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)",
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
```
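Since each result carries a full article in its `text` field, results shaped like the output above can be trimmed and joined into a context block for RAG. The following is a minimal sketch: `build_prompt`, the prompt wording, and the `max_chars` cutoff are illustrative assumptions, and the sample result is made up.

```python
def build_prompt(question, results, max_chars=2000):
    """Join the `text` of search results into a prompt, truncating each article."""
    context = "\n\n".join(r["text"][:max_chars] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up result with the same keys as the search output.
results = [{"id": "Example article", "text": "Example article text.", "score": 0.9}]
prompt = build_prompt("What is the example about?", results)
print(prompt)
```

The resulting string can be passed to whichever language model is used for generation. Truncating per article keeps a few long pages from filling the model's context window.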
## Data source
https://dumps.wikimedia.org/enwiki/
https://dumps.wikimedia.org/other/pageview_complete/
https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320