|
--- |
|
inference: false |
|
language: sv |
|
license: |
|
- cc-by-sa-3.0 |
|
- gfdl |
|
library_name: txtai |
|
tags: |
|
- sentence-similarity |
|
datasets: |
|
- NeuML/wikipedia-20240220
|
--- |
|
|
|
# Wikipedia txtai embeddings index |
|
|
|
This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
|
|
|
This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240220). Only the first two paragraphs from each article are included.
|
|
|
It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to restrict matches to commonly visited pages.
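As a rough illustration of what the `percentile` field represents, the sketch below derives a rank-based percentile from raw page view counts. The article names, view counts, and the simple rank formula are illustrative assumptions; the actual index is built from the Wikipedia page view dumps linked above.

```python
# Hypothetical page view counts (illustrative values only)
views = {"Stockholm": 125000, "Göteborg": 40000, "Fjällko": 350, "Abisko": 9000}

# Rank pages from least to most viewed
ranked = sorted(views, key=views.get)
n = len(ranked)

# Rank-based percentile: most-viewed page gets 1.0, least-viewed gets 0.0
percentile = {page: rank / (n - 1) for page, rank in zip(ranked, range(n))}
```

A filter such as `percentile >= 0.99` then matches only the most commonly visited articles, as shown in the example below.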
|
|
|
txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model. |
|
|
|
## Example |
|
|
|
```python |
|
from txtai.embeddings import Embeddings |
|
|
|
# Load the index from the HF Hub |
|
embeddings = Embeddings() |
|
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia") |
|
|
|
# Run a search |
|
embeddings.search("Roman Empire") |
|
|
|
# Run a search matching only the Top 1% of articles |
|
embeddings.search(""" |
|
SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND |
|
percentile >= 0.99 |
|
""") |
|
``` |
|
|
|
## Source
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json |
|
|
|
https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-02/pageviews-202402-user.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream1.xml-p1p153415.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream2.xml-p153416p666977.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream3.xml-p666978p1690769.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p1690770p3190769.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p3190770p3794371.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p3794372p5294371.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p5294372p6319736.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p6319737p7819736.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p7819737p8827284.bz2 |
|
|
|
## Use Cases |
|
|
|
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or any dependencies outside of a Python installation.
|
|
|
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions. |
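To make the RAG flow concrete, the sketch below assembles an LLM prompt from search results. The `results` list mirrors the shape of rows returned by `embeddings.search()`; the prompt template and the question are illustrative assumptions, not part of this model.

```python
# Hypothetical search results, shaped like txtai's output rows
results = [
    {"id": "Romerska riket", "text": "The Roman Empire was the post-Republican state of ancient Rome.", "score": 0.83},
    {"id": "Rom", "text": "Rome was the capital city of the Roman Empire.", "score": 0.71},
]

# Concatenate the retrieved passages into a context block
context = "\n".join(row["text"] for row in results)

# Build a prompt that grounds the LLM's answer in the retrieved context
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What was the capital of the Roman Empire?"
)
```

The resulting `prompt` string can then be passed to any LLM; the model answers from the retrieved Wikipedia passages rather than from its parametric memory alone.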
|
|
|
See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model. |
|
|