Katsumata420 committed on
Commit ffd4f7c · verified · 1 Parent(s): 5547ec3

Fix bugs for bias

Fix the bias of up_proj in the model weights converted from Megatron-LM checkpoints.
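
To make the bug concrete, here is a minimal sketch of how a mistake of this kind could arise. It assumes the Megatron-LM checkpoint stores the SwiGLU gate and up projections as one fused bias vector laid out as `[gate; up]`. That layout, the function names, and the toy values are assumptions for illustration only; this is not the actual conversion script.

```python
# Illustrative sketch only: the fused [gate; up] layout and all names here
# are assumptions, not the actual RetrievaBERT conversion code.

def split_fused_bias(fused_bias):
    """Split a fused bias of length 2 * ffn_size into (gate, up) halves."""
    if len(fused_bias) % 2 != 0:
        raise ValueError("fused bias length must be even")
    ffn_size = len(fused_bias) // 2
    gate_bias = fused_bias[:ffn_size]
    up_bias = fused_bias[ffn_size:]
    return gate_bias, up_bias


def split_fused_bias_buggy(fused_bias):
    """Hypothetical faulty variant: up_proj receives the gate's bias."""
    ffn_size = len(fused_bias) // 2
    gate_bias = fused_bias[:ffn_size]
    up_bias = fused_bias[:ffn_size]  # wrong slice: reuses the gate half
    return gate_bias, up_bias


fused = [0.1, 0.2, 0.3, -1.0, -2.0, -3.0]  # toy values, ffn_size = 3
gate, up = split_fused_bias(fused)
buggy_gate, buggy_up = split_fused_bias_buggy(fused)
```

Both variants return vectors of the same length, so a slip like this would change the stored values without changing tensor shapes. That is consistent with `model.safetensors` keeping its size of 2602880000 bytes in this commit while its hash changes.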

Files changed (2)
  1. README.md +20 -15
  2. model.safetensors +1 -1
README.md CHANGED
@@ -10,6 +10,11 @@ language:
 The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
+## What's New
+
+- November 2024 (`v1.0.1`): Bug fix for the model parameters.
+- The up_proj's bias was initialized with the gate's one. This bug was fixed.
+
 ## Model Details
 
 ### Model Description
@@ -19,12 +24,12 @@ The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
 This model offers several advanced features compared to traditional BERT models:
-- **PreNorm**: Improved stability during training.
-- **SwiGLU**: Enhanced activation function for better performance.
-- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
-- **Max Sequence Length**: 2048 tokens, allowing for longer context.
-- **Parameters**: 1.3 billion parameters.
-- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
+- **PreNorm**: Improved stability during training.
+- **SwiGLU**: Enhanced activation function for better performance.
+- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
+- **Max Sequence Length**: 2048 tokens, allowing for longer context.
+- **Parameters**: 1.3 billion parameters.
+- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
 - **Token Type IDs**: Not used in this model.
 
 ### Model Sources
@@ -44,9 +49,9 @@ Depending on your use case, follow the appropriate section below.
 
 This model is pre-trained using Masked Language Modeling.
 The mask token used is `<MASK|LLM-jp>`.
-Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
-
-Example code for direct use:
+Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
+
+Example code for direct use:
 
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
@@ -98,7 +103,7 @@ The model was trained on the following hyperparameters.
 - Floating point expression: BF16
 
 ## Evaluation
-We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
+We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
 We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
 
 | Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
@@ -106,7 +111,7 @@ We adjusted the learning rate and training epochs for each model and task in acc
 | tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
 | tohoku-nlp/bert-large-japanese-v2| 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
 | ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
-| retrieva-jp/bert-1.3b | 0.952 | 0.916 | 0.877 | 0.896 | 0.916 | 0.879 | 0.815 |
+| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
 
 
 ## Technical Specifications
@@ -121,9 +126,9 @@ The RetrievaBERT model is based on BERT with the following hyperparameters:
 - Maximum length of position embeddings: 2048
 
 As mentioned earlier, the main differences from the original BERT are:
-- PreNorm: Improved stability during training.
-- SwiGLU: Enhanced activation function for better performance.
-- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
+- PreNorm: Improved stability during training.
+- SwiGLU: Enhanced activation function for better performance.
+- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
 
 
 ### Compute Infrastructure
@@ -145,4 +150,4 @@ https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
 Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
 
 ## Model Card Contact
-pr@retrieva.jp
+pr@retrieva.jp
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c42e82e5fd0d4fd37e4b158b8669abfc465c5d16483e3e63ffa2fd7616592ad7
+oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
 size 2602880000
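
The `oid sha256:` field of a Git LFS pointer is the SHA-256 digest of the file contents, so a downloaded `model.safetensors` can be checked against the new pointer to confirm it is the fixed checkpoint. A minimal sketch follows; the helper names and the local file path in the usage note are hypothetical.

```python
import hashlib

# Pointer contents of the fixed weights, copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
size 2602880000
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, pointer_text):
    """Return True if the file's SHA-256 digest equals the pointer's oid."""
    fields = parse_lfs_pointer(pointer_text)
    algorithm, _, expected = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm}")
    return sha256_of_file(path) == expected
```

With the weights saved locally, `matches_pointer("model.safetensors", POINTER)` should return `True` for the fixed file and `False` for the pre-fix file, whose oid began with `c42e82e5`.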