# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).

Embeddings are the engine that delivers semantic search: data is transformed into vectors, and similar concepts produce similar vectors.
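
As a quick illustration of that idea, here is a minimal sketch (the model name is an assumption for illustration, not the model behind this index):

```python
from txtai.embeddings import Embeddings

# Assumed multilingual model, for illustration only
embeddings = Embeddings(path="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Similar concepts score close together: "Stockholm" ranks above "banan"
print(embeddings.similarity("Sveriges huvudstad", ["Stockholm", "banan"]))
```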

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
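
Because the index is a single self-contained artifact, it round-trips through a plain file. A minimal sketch, assuming any small sentence-transformers model:

```python
from txtai.embeddings import Embeddings

# Build a tiny index and save it to a single file; no server process involved
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([(0, "txtai builds fully encapsulated embeddings indexes", None)])
embeddings.save("index.tar.gz")

# Reload it anywhere Python and txtai are installed
loaded = Embeddings()
loaded.load("index.tar.gz")
print(loaded.search("encapsulated index", 1))
```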

This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220).

Only the first two paragraphs of each article are included. The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG).
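
For reference, an index like this could be rebuilt from the dataset along these lines (a sketch: the column names and model are assumptions, so check the dataset schema before running):

```python
from datasets import load_dataset
from txtai.embeddings import Embeddings

# Load the source dataset (assumed "id" and "text" columns)
dataset = load_dataset("burgerbee/wikipedia-sv-20240220", split="train")

# Assumed multilingual model; content=True stores text for SQL queries
embeddings = Embeddings(path="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", content=True)
embeddings.index((row["id"], row["text"], None) for row in dataset)
embeddings.save("wikipedia-sv")
```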

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to only match commonly visited pages.

For example, the query below returns only matches from the most visited pages (the container name and the `similar()` text are illustrative placeholders):

```python
import json

from txtai.embeddings import Embeddings

# Load the index from the Hugging Face Hub (placeholder: use this repository's id)
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/wikipedia-sv")

# SQL search restricted to the top 1% most visited pages
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Stockholm') AND percentile >= 0.99"):
    print(json.dumps(x, indent=2))
```

# Source

https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json