burgerbee committed · verified · Commit 70eeb6d · 1 Parent(s): ead5e9e

Update README.md
Files changed (1): README.md (+4, -9)
README.md CHANGED
@@ -14,8 +14,11 @@ datasets:
 # Wikipedia txtai embeddings index
 
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
+Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts produce similar vectors.
+An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
 
-This index is built from the [Wikipedia Februari 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220). Only the first two paragraph from each article is included.
+This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220).
+Only the first two paragraphs from each article are included. The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG).
 
 It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
@@ -41,14 +44,6 @@ for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE
 print(json.dumps(x, indent=2))
 ```
 
-## Use Cases
-
-An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
-
-The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
-
-See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.
-
 # Source
 
 https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json
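
For reference, the loading and `percentile` filtering described in the README above can be exercised with txtai roughly as follows. This is a minimal sketch, not the repository's own snippet: the `container` id is a placeholder (this commit does not show the index's Hub path), and the query strings and the `percentile >= 0.99` threshold are illustrative.

```python
import json

from txtai.embeddings import Embeddings

# Load the fully encapsulated index from the Hugging Face Hub;
# no database server is required.
# NOTE: the container id below is a placeholder -- use this repo's actual id.
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/wikipedia-sv-index")

# Plain semantic search: the query is matched by meaning, not keywords.
for result in embeddings.search("Nobelpriset i litteratur", limit=3):
    print(json.dumps(result, indent=2, ensure_ascii=False))

# SQL-style search that also filters on the page-view `percentile` field,
# keeping only commonly visited articles (here the top 1%).
query = (
    "SELECT id, text, score, percentile FROM txtai "
    "WHERE similar('Sveriges historia') AND percentile >= 0.99"
)
for result in embeddings.search(query):
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Results are plain Python dicts, so they can be passed directly into an LLM prompt as RAG context, as the README notes.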