NghiemAbe committed on
Commit 58d0c99 · verified · 1 Parent(s): b5f2623

Update README.md

Files changed (1)
  1. README.md +3 -101
README.md CHANGED
@@ -20,28 +20,13 @@ widget:
  học Bách khoa Hà Nội giúp họ thích nghi nhanh chóng.
  - Hà Nội có khí hậu mát mẻ vào mùa thu.
  - Các món ăn ở Hà Nội rất ngon và đa dạng.
- license: apache-2.0
  ---

- # bkai-foundation-models/vietnamese-bi-encoder
+ # NghiemAbe/sami-sbert-CT

  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

- We train the model on a merged training dataset that consists of:
- - MS MARCO (translated into Vietnamese)
- - SQuAD v2 (translated into Vietnamese)
- - 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge
-
- We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.
-
- Here are the results on the remaining 20% of the training set from the Legal Text Retrieval Zalo 2021 challenge:
-
- | Pretrained Model | Trained Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
- |------------------|------------------|:-----:|:------:|:-------:|:------:|:------:|
- | [Vietnamese-SBERT](https://huggingface.co/keepitreal/vietnamese-sbert) | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
- | PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
- | PhoBERT-base-v2 | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |
-
+ I use the pretrained model [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) and train it on the SAMI dataset.

  <!--- Describe your model here -->

@@ -61,94 +46,11 @@ from sentence_transformers import SentenceTransformer
  # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
  sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

- model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
+ model = SentenceTransformer('NghiemAbe/sami-sbert-CT')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
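As a quick illustration of the semantic-search use case mentioned above (this follow-up is not part of the original card), the two embeddings can be compared with cosine similarity; `util.cos_sim` is assumed to be available in the installed sentence-transformers version:

```python
from sentence_transformers import util

# Cosine similarity between the two sentence embeddings computed above
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)  # a 1x1 tensor; larger values indicate more similar sentences
```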

-
- ## Usage (Widget HuggingFace)
- The widget uses a custom pipeline on top of the default pipeline, adding a word segmenter before PhobertTokenizer, so you do not need to segment words before using the API.
-
- An example can be seen in the Hosted inference API.
-
-
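Outside the widget, the input still has to be word-segmented before encoding. As an illustrative sketch only (not part of the original card), pyvi, one of the segmenters suggested in the snippet below, could be used to produce the expected underscore-joined format:

```python
from pyvi import ViTokenizer

raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]

# ViTokenizer.tokenize joins multi-syllable Vietnamese words with underscores,
# which is the word-segmented input format PhoBERT-based models expect.
segmented = [ViTokenizer.tokenize(s) for s in raw_sentences]
print(segmented)  # e.g. ['Cô ấy là một người vui_tính .', ...]
```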
- ## Usage (HuggingFace Transformers)
-
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
-
- # Mean Pooling - take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
-
- # Sentences we want sentence embeddings for; we could use pyvi, underthesea, or RDRSegmenter to segment words
- sentences = ['Cô ấy là một người vui_tính .', 'Cô ấy cười nói suốt cả ngày .']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
- model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```
-
- ## Training
-
- The model was trained with the parameters:
-
- **DataLoader**:
-
- `torch.utils.data.dataloader.DataLoader` of length 17584 with parameters:
-
- ```
- {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```
-
- **Loss**:
-
- `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
-
- ```
- {'scale': 20.0, 'similarity_fct': 'cos_sim'}
- ```
-
- Parameters of the fit()-Method:
-
- ```
- {
-     "epochs": 15,
-     "evaluation_steps": 0,
-     "evaluator": "NoneType",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 1000,
-     "weight_decay": 0.01
- }
- ```
-
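To make these hyperparameters concrete, here is a minimal, illustrative reconstruction of such a run with the sentence-transformers `fit()` API. It is a sketch, not the authors' actual training script: the starting checkpoint (`vinai/phobert-base-v2`) and the placeholder training pair are assumptions, and real training data would be word-segmented (query, relevant passage) pairs from the datasets listed above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed starting checkpoint (the card names phobert-base-v2 as the backbone)
model = SentenceTransformer('vinai/phobert-base-v2')

# Placeholder pair; real data would be word-segmented (query, relevant passage) pairs
train_examples = [
    InputExample(texts=['câu truy_vấn ...', 'đoạn văn liên_quan ...']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch-negatives loss with the listed scale and cosine similarity
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    scheduler='WarmupLinear',
    warmup_steps=1000,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```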
  ## Full Model Architecture

  ```
 