KennethTM committed · Commit af04a3a · verified · 1 Parent(s): 814daff

Update README.md

Files changed (1)
  1. README.md +7 -17
README.md CHANGED
@@ -6,33 +6,23 @@ tags:
  - sentence-similarity
  license: mit
  datasets:
- - sentence-transformers/embedding-training-data
- - clips/mfaq
  - squad
  - eli5
+ - sentence-transformers/embedding-training-data
  language:
  - da
  library_name: sentence-transformers
  ---

- **Work in progress**
-
  # MiniLM-L6-danish-encoder

  This is a lightweight (~22 M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP: It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

- The maximum sequence length is 128 tokens.
-
- The model was not pre-trained from scratch but adapted from the English version with a [tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish) trained on Danish text.
+ The maximum sequence length is 512 tokens.

- When using the model to retrieve relevant passages for a given query - "Query: " should be added to the query:
+ The model was not pre-trained from scratch but adapted from the English version with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

- ```python
- query = "Kan man cykle på en vej?"
- query_template = f"Query: {query}"
-
- #query_template kan now be embedded and similarity compared to other passages
- ```
+ Trained on ELI5 and SQUAD data machine translated from English to Danish.

  # Usage (Sentence-Transformers)

@@ -45,7 +35,7 @@ Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
- sentences = ["Query: Kører der cykler på vejen?", "En mand løber på vejen.", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]
+ sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

  model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
  embeddings = model.encode(sentences)
@@ -66,7 +56,7 @@ def mean_pooling(model_output, attention_mask):
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  # Sentences we want sentence embeddings for
- sentences = ["Query: Kører der cykler på vejen?", "En mand løber på vejen.", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]
+ sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

  # Load model from HuggingFace Hub
  tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
@@ -87,4 +77,4 @@ sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

  print("Sentence embeddings:")
  print(sentence_embeddings)
- ```
+ ```
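
Note on using the embeddings: the README says the model can be used for semantic search, but the snippets in this diff stop at printing the vectors. The following is a minimal sketch, not part of the commit, of ranking passages against a query by cosine similarity. The helper names `cosine_similarity` and `rank_passages` are hypothetical, and the toy 3-dimensional vectors stand in for the 384-dimensional embeddings; in practice the vectors would presumably come from `model.encode(sentences)` as in the README snippet.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending similarity to the query."""
    scores = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): placeholders for real sentence embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]

ranking = rank_passages(toy_query, toy_passages)
for index, score in ranking:
    print(f"passage {index}: {score:.3f}")
```

With the README example, the first element of `sentences` would play the role of the query and the remaining elements the passages. Since this commit removes the "Query: " prefix from the examples, the sketch passes the query text through unchanged.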