---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
- KennethTM/gooaq_pairs_danish
- sentence-transformers/gooaq
- KennethTM/squad_pairs_danish
- KennethTM/eli5_question_answer_danish
language:
- da
library_name: sentence-transformers
---

# Note

*This is an updated version of [KennethTM/MiniLM-L6-danish-encoder](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder). This version is trained on more data (the [GooAQ dataset](https://huggingface.co/datasets/sentence-transformers/gooaq) translated to [Danish](https://huggingface.co/datasets/KennethTM/gooaq_pairs_danish)) and is otherwise the same.*

# MiniLM-L6-danish-encoder

This is a lightweight (~22M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch but adapted from the English [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

The model was trained on ELI5 and SQuAD data machine-translated from English to Danish and, in this version, on the Danish GooAQ pairs (see the note above).
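
Both properties can be verified directly on the loaded model. A minimal sketch, assuming the `sentence-transformers` package is installed:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")

# Embedding dimensionality (384) and maximum sequence length (512)
print(model.get_sentence_embedding_dimension())
print(model.max_seq_length)
```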

# Usage (Sentence-Transformers)

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Given a query ("Are there bicycles on the road?")
query = ['Kører der cykler på vejen?']

# And two passages (the first about cyclists on Danish roads, the second about nice weather)
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Compute embeddings
model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")
query_embeddings = model.encode(query)
passage_embeddings = model.encode(passage)

# To find the most relevant passage for the query (closer to 1 means more similar)
cosine_scores = cos_sim(query_embeddings, passage_embeddings)
print(cosine_scores)
```
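
For retrieval over many passages, the `semantic_search` helper in sentence-transformers ranks a corpus of embeddings for each query. A minimal sketch reusing the embeddings computed above (`top_k` simply limits the number of hits returned):

```python
from sentence_transformers.util import semantic_search

# Rank the passages for each query by cosine similarity
hits = semantic_search(query_embeddings, passage_embeddings, top_k=2)

# One list of hits per query, each hit holding a corpus index and a score
print(hits[0])
```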

# Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")
model = AutoModel.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")

# Given a query ("Are there bicycles on the road?")
query = ['Kører der cykler på vejen?']

# And two passages (the first about cyclists on Danish roads, the second about nice weather)
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Tokenize sentences
query_encoded = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_encoded = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    query_features = model(**query_encoded)
    passage_features = model(**passage_encoded)

# Perform pooling
query_embeddings = mean_pooling(query_features, query_encoded['attention_mask'])
passage_embeddings = mean_pooling(passage_features, passage_encoded['attention_mask'])

# To find the most relevant passage for the query (closer to 1 means more similar)
cosine_scores = F.cosine_similarity(query_embeddings, passage_embeddings)
print(cosine_scores)
```
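
If you want to score embeddings with a plain dot product instead (for example in a vector index), you can L2-normalize them first; after normalization, dot product and cosine similarity give the same values. A small sketch continuing from the code above:

```python
# L2-normalize the pooled embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# Dot product now equals the cosine scores computed above
dot_scores = query_embeddings @ passage_embeddings.T
print(dot_scores)
```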