NghiemAbe committed
Commit 9cc8a7f · verified · 1 Parent(s): 74d4966

Pushing model

Model trained on SAMI dataset using CT loss

Files changed (1): README.md (+158, -0)

README.md ADDED

---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
library_name: generic
language:
- vi
widget:
- source_sentence: Làm thế nào Đại học Bách khoa Hà Nội thu hút sinh viên quốc tế?
  sentences:
  - >-
    Đại học Bách khoa Hà Nội đã phát triển các chương trình đào tạo bằng tiếng
    Anh để làm cho việc học tại đây dễ dàng hơn cho sinh viên quốc tế.
  - >-
    Môi trường học tập đa dạng và sự hỗ trợ đầy đủ cho sinh viên quốc tế tại Đại
    học Bách khoa Hà Nội giúp họ thích nghi nhanh chóng.
  - Hà Nội có khí hậu mát mẻ vào mùa thu.
  - Các món ăn ở Hà Nội rất ngon và đa dạng.
license: apache-2.0
---

# bkai-foundation-models/vietnamese-bi-encoder

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

We train the model on a merged training dataset that consists of:
- MS MARCO (translated into Vietnamese)
- SQuAD v2 (translated into Vietnamese)
- 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge

We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.

Here are the results on the remaining 20% of the training set from the Legal Text Retrieval Zalo 2021 challenge:

| Pretrained Model | Trained Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
|------------------|------------------|:-----:|:------:|:-------:|:------:|:------:|
| [Vietnamese-SBERT](https://huggingface.co/keepitreal/vietnamese-sbert) | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
| PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
| PhoBERT-base-v2 | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |

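As an illustration of how such retrieval metrics are computed, the short sketch below derives Acc@k and MRR@10 from ranked retrieval results. It uses hypothetical `ranked_ids_per_query` / `relevant_ids_per_query` inputs and is not the challenge's official scorer.

```python
def evaluate(ranked_ids_per_query, relevant_ids_per_query, k_values=(1, 10, 100)):
    """Compute Acc@k and MRR@10 (in percent) from ranked retrieval results.

    ranked_ids_per_query: list of ranked document-id lists, one per query.
    relevant_ids_per_query: list of sets of relevant document ids, one per query.
    """
    n = len(ranked_ids_per_query)
    acc = {k: 0 for k in k_values}
    mrr = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # Acc@k: query counts as correct if any relevant id appears in the top k
        for k in k_values:
            if any(doc_id in relevant for doc_id in ranked[:k]):
                acc[k] += 1
        # MRR@10: reciprocal rank of the first relevant id within the top 10
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in relevant:
                mrr += 1.0 / rank
                break
    results = {f"Acc@{k}": 100.0 * acc[k] / n for k in k_values}
    results["MRR@10"] = 100.0 * mrr / n
    return results


print(evaluate([[3, 7, 1]], [{7}]))
# {'Acc@1': 0.0, 'Acc@10': 100.0, 'Acc@100': 100.0, 'MRR@10': 50.0}
```
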
## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# INPUT TEXT MUST ALREADY BE WORD-SEGMENTED!
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```

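Because the PhoBERT backbone expects word-segmented input, raw text should be run through a Vietnamese word segmenter first. A minimal sketch using pyvi follows; pyvi is one of several options (underthesea and RDRSegmenter also work), and its use here is an illustration, not a requirement of the model.

```python
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer

# Segment raw text so multi-syllable words are joined with underscores,
# e.g. "vui tính" -> "vui_tính", matching what the PhoBERT tokenizer expects.
raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]
segmented = [ViTokenizer.tokenize(s) for s in raw_sentences]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(segmented)
print(embeddings.shape)  # (2, 768)
```
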
## Usage (HuggingFace Widget)

The widget uses a custom pipeline on top of the default one: a word segmenter is applied before the PhobertTokenizer, so you do not need to segment words yourself before calling the API. An example can be seen in the Hosted Inference API on the model page, and a sketch of calling that API directly follows below.

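For reference, here is a minimal sketch of calling the hosted Inference API with the `requests` library. The endpoint URL and payload shape follow the standard Hugging Face sentence-similarity format and are assumptions of this sketch; you also need to supply your own API token.

```python
import requests

# Assumed endpoint, following the standard Hugging Face Inference API URL scheme
API_URL = "https://api-inference.huggingface.co/models/bkai-foundation-models/vietnamese-bi-encoder"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # replace with your own token

# Standard sentence-similarity payload: one source sentence vs. candidate sentences
payload = {
    "inputs": {
        "source_sentence": "Làm thế nào Đại học Bách khoa Hà Nội thu hút sinh viên quốc tế?",
        "sentences": [
            "Đại học Bách khoa Hà Nội đã phát triển các chương trình đào tạo bằng tiếng Anh.",
            "Hà Nội có khí hậu mát mẻ vào mùa thu.",
        ],
    }
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())  # expected: one similarity score per candidate sentence
```
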
## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want embeddings for; the input must already be word-segmented
# (e.g. with pyvi, underthesea, or RDRSegmenter)
sentences = ['Cô ấy là một người vui_tính .', 'Cô ấy cười nói suốt cả ngày .']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

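The pooled embeddings are typically compared with cosine similarity. A small follow-up sketch, continuing from the variables defined above (the normalization step is a common convention, not something the card prescribes):

```python
import torch.nn.functional as F

# L2-normalize so that the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
score = normalized[0] @ normalized[1]
print(f"Cosine similarity: {score.item():.4f}")
```
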
## Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 17584 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the `fit()` method:

```
{
    "epochs": 15,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
```

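For readers who want to reproduce a comparable setup, the sketch below shows how this configuration maps onto the sentence-transformers `fit()` API. It is an illustration assembled from the parameters above, not the authors' actual training script; the backbone checkpoint id (`vinai/phobert-base-v2`) and the placeholder training pairs are assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# PhoBERT backbone + mean pooling, matching the architecture reported below
# ('vinai/phobert-base-v2' is the assumed Hub id of the backbone)
word_embedding_model = models.Transformer('vinai/phobert-base-v2', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder (query, relevant passage) pairs; real training used the merged dataset above
train_examples = [
    InputExample(texts=['câu hỏi đã tách_từ', 'đoạn văn liên_quan đã tách_từ']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss with the reported scale and cosine similarity
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    scheduler='WarmupLinear',
    warmup_steps=1000,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```

With MultipleNegativesRankingLoss, the other passages in each batch act as in-batch negatives, which is why a relatively large batch size (32 here) matters.
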
## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```