|
--- |
|
inference: false |
|
language: sv |
|
license: |
|
- cc-by-sa-3.0 |
|
- gfdl |
|
library_name: txtai |
|
tags: |
|
- sentence-similarity |
|
datasets: |
|
- NeuML/wikipedia-20240220
|
--- |
|
|
|
# Wikipedia txtai embeddings index |
|
|
|
This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
|
|
|
This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240220). Only the first two paragraphs from each article are included.
|
|
|
It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to restrict matches to commonly visited pages.
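As a rough illustration of what the `percentile` field represents, the sketch below derives a rank-based percentile from raw page view counts. The article names, view counts, and the simple rank formula are illustrative assumptions; the actual index is built from the Wikipedia page view dumps linked above.

```python
# Hypothetical page view counts (illustrative values only)
views = {"Stockholm": 125000, "Göteborg": 40000, "Fjällko": 350, "Abisko": 9000}

# Rank pages from least to most viewed
ranked = sorted(views, key=views.get)
n = len(ranked)

# Rank-based percentile: most-viewed page gets 1.0, least-viewed gets 0.0
percentile = {page: rank / (n - 1) for page, rank in zip(ranked, range(n))}
```

A filter such as `percentile >= 0.99` then matches only the most commonly visited articles, as shown in the example below.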
|
|
|
txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model. |
|
|
|
## Example |
|
|
|
```python |
|
from txtai.embeddings import Embeddings |
|
|
|
# Load the index from the HF Hub |
|
embeddings = Embeddings() |
|
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia") |
|
|
|
# Run a search |
|
embeddings.search("Roman Empire") |
|
|
|
# Run a search matching only the Top 1% of articles |
|
embeddings.search(""" |
|
SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND |
|
percentile >= 0.99 |
|
""") |
|
``` |
|
|
|
## Source
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json |
|
|
|
https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-02/pageviews-202402-user.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream1.xml-p1p153415.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream2.xml-p153416p666977.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream3.xml-p666978p1690769.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p1690770p3190769.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p3190770p3794371.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p3794372p5294371.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p5294372p6319736.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p6319737p7819736.bz2 |
|
|
|
https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p7819737p8827284.bz2 |
|
|
|
## Use Cases |
|
|
|
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or any dependencies outside of a Python installation.
|
|
|
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions. |
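To make the RAG flow concrete, the sketch below assembles an LLM prompt from search results. The `results` list mirrors the shape of rows returned by `embeddings.search()`; the prompt template and the question are illustrative assumptions, not part of this model.

```python
# Hypothetical search results, shaped like txtai's output rows
results = [
    {"id": "Romerska riket", "text": "The Roman Empire was the post-Republican state of ancient Rome.", "score": 0.83},
    {"id": "Rom", "text": "Rome was the capital city of the Roman Empire.", "score": 0.71},
]

# Concatenate the retrieved passages into a context block
context = "\n".join(row["text"] for row in results)

# Build a prompt that grounds the LLM's answer in the retrieved context
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What was the capital of the Roman Empire?"
)
```

The resulting `prompt` string can then be passed to any LLM; the model answers from the retrieved Wikipedia passages rather than from its parametric memory alone.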
|
|
|
See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model. |
|
|