|
---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- NeuML/wikipedia-20240901
---
|
|
|
# Wikipedia txtai embeddings index
|
|
|
This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).

This index is built from the [Wikipedia September 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240901). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
|
|
|
It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to only match commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.
|
|
|
## Example

See the example below. This index requires txtai >= 7.4.
|
|
|
```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the Top 1% of articles
embeddings.search("""
SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
percentile >= 0.99
""")
```
|
|
|
## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
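
The sketch below shows one way to wire that up. It only uses this index for retrieval; the question, the top-3 cutoff and the prompt template are illustrative choices, and the final `prompt` string is meant to be passed to whatever LLM client you already use (the LLM call itself is not part of this index).

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

question = "When was the Roman Empire founded?"

# Retrieve the top matching article abstracts to use as context
results = embeddings.search(question, 3)
context = "\n".join(result["text"] for result in results)

# Build a RAG prompt - pass this string to the LLM of your choice
prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
"""
```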
|
|
|
See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.
|
|
|
## Evaluation Results

Performance was evaluated using the [NDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) and MAP@10 scores with a [custom question-answer evaluation set](https://github.com/neuml/txtchat/tree/master/datasets/wikipedia). Results are shown below.
|
|
|
| Model                                                     | NDCG@10    | MAP@10     |
| --------------------------------------------------------- | ---------- | ---------- |
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5)   | 0.6320     | 0.5485     |
| [**e5-base**](https://hf.co/intfloat/e5-base)             | **0.7021** | **0.6517** |
| [gte-base](https://hf.co/thenlper/gte-base)               | 0.6775     | 0.6350     |
|
|
|
`e5-base` is the best-performing model on this evaluation set. This highlights the importance of testing models on your own data, as `e5-base` is far from the leading model on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Benchmark datasets are only a guide.
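
For reference only, the sketch below shows how an NDCG@10 number can be computed for a single query from graded relevance labels and retriever scores using scikit-learn. The labels and scores here are made up; this is not the actual evaluation harness, which lives in the [txtchat](https://github.com/neuml/txtchat) repository.

```python
from sklearn.metrics import ndcg_score

# Hypothetical graded relevance labels for 10 retrieved documents (higher = more relevant)
relevance = [[3, 2, 3, 0, 1, 2, 0, 0, 1, 0]]

# Hypothetical scores the retriever assigned to those same documents
scores = [[0.95, 0.92, 0.88, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]]

# NDCG@10 for this single query; reported results average over the full evaluation set
print(ndcg_score(relevance, scores, k=10))
```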
|
|
|
## Build the index

The following steps show how to build this index. These scripts use the latest data available as of 2024-09-01; update the dates as appropriate.
|
|
|
- Install required build dependencies

```bash
pip install txtchat mwparserfromhell datasets
```
|
|
|
- Download and build pageviews database

```bash
mkdir -p pageviews/data
wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-08/pageviews-202408-user.bz2
python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
```
|
|
|
- Build Wikipedia dataset

```python
from datasets import load_dataset

# Data dump date from https://dumps.wikimedia.org/enwiki/
date = "20240901"

# Build and save dataset
ds = load_dataset("neuml/wikipedia", language="en", date=date)
ds.save_to_disk(f"wikipedia-{date}")
```
|
|
|
- Build txtai-wikipedia index

```bash
python -m txtchat.data.wikipedia.index \
    -d wikipedia-20240901 \
    -o txtai-wikipedia \
    -v pageviews/pageviews.sqlite
```
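
Once the build finishes, the local index can be loaded the same way as the published one. Assuming the `txtai-wikipedia` output directory from the command above, a quick sanity check could look like this:

```python
from txtai.embeddings import Embeddings

# Load the locally built index from the output directory
embeddings = Embeddings()
embeddings.load("txtai-wikipedia")

print(embeddings.search("Roman Empire", 1))
```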
|
|