---
inference: false
language: sv
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-sv-20240220
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).

This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220). Only the first two paragraphs of each article are included.

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to restrict matches to commonly visited pages.
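
How `percentile` was derived for this index is not documented here. As a rough illustration only, assuming the value is a percentile rank of each page's view count (higher means more visited), it could be computed along these lines. The function name `percentile_ranks` and the sample counts are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(views):
    """Map each title to the fraction of pages whose view count is <= its own."""
    counts = sorted(views.values())
    total = len(counts)
    return {title: bisect_right(counts, count) / total for title, count in views.items()}


# Hypothetical page view counts, not taken from the real dump
sample = {"A": 10, "B": 200, "C": 3000, "D": 40000}
ranks = percentile_ranks(sample)

# Keep only pages at or above the chosen percentile
popular = [title for title, rank in ranks.items() if rank >= 0.75]
print(popular)
```

Under this assumption, a filter such as `percentile >= 0.99` would keep roughly the most visited 1% of pages.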

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

## Example

```python
from txtai.embeddings import Embeddings
import json

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-sv-wikipedia")

# Run a search
for x in embeddings.search("I vilken stad ligger Liseberg?", 1):
    print(json.dumps(x, indent=2))

# Run a search and filter on popular results (page views)
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('I vilken stad ligger Liseberg?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```
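
When the question comes from user input, embedding it directly in the SQL string breaks as soon as it contains a single quote. The sketch below builds the same kind of percentile-filtered query with a bind parameter instead. It assumes a txtai version whose `search` method accepts a `parameters` argument; the helper name `popular_query` is hypothetical.

```python
def popular_query(question, min_percentile=0.99):
    """Build a txtai SQL query and bind parameters for a percentile-filtered search."""
    threshold = float(min_percentile)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")
    sql = (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar(:question) AND percentile >= {threshold}"
    )
    return sql, {"question": question}


sql, params = popular_query("Vad är Sveriges huvudstad?", 0.95)
print(sql)

# With the index loaded as in the example above (assumed, not run here):
# results = embeddings.search(sql, 3, parameters=params)
```

Lowering `min_percentile` widens the search to less visited pages, at the cost of more obscure matches.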

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for retrieval-augmented generation (RAG). In other words, search results from this model can be passed to an LLM prompt as the context used to answer a question.
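
As a minimal sketch of that idea, the helper below joins the `text` of each search result into a context block and wraps it in a prompt. The name `build_prompt`, the prompt wording, and the sample result are hypothetical; the sample stands in for real output from `embeddings.search` and is not taken from the index.

```python
def build_prompt(question, results):
    """Join the text of search results into a context block for an LLM prompt."""
    context = "\n".join(f"- {r['text']}" for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical stand-in for the output of embeddings.search(question, 3)
results = [
    {"id": "Liseberg", "text": "Liseberg är en nöjespark i Göteborg.", "score": 0.9},
]
prompt = build_prompt("I vilken stad ligger Liseberg?", results)
print(prompt)
```

The resulting string can be sent to any LLM. Restricting the search to high `percentile` values first is one way to keep obscure articles out of the context.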

See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.

## Source

https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json

https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-02/pageviews-202402-user.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream1.xml-p1p153415.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream2.xml-p153416p666977.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream3.xml-p666978p1690769.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p1690770p3190769.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p3190770p3794371.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p3794372p5294371.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p5294372p6319736.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p6319737p7819736.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p7819737p8827284.bz2