Katsumata420
committed on
Fix bugs for bias
Fix bugs for the bias of up_proj, converted from Megatron-LM's checkpoints.
- README.md +20 -15
- model.safetensors +1 -1
README.md
CHANGED
@@ -10,6 +10,11 @@ language:
 The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
+## What's New
+
+- November 2024 (`v1.0.1`): Bug fix for the model parameters.
+- The up_proj's bias was initialized with the gate's one. This bug was fixed.
+
 ## Model Details
 
 ### Model Description
@@ -19,12 +24,12 @@ The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
 This model offers several advanced features compared to traditional BERT models:
-- **PreNorm**: Improved stability during training.
-- **SwiGLU**: Enhanced activation function for better performance.
-- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
-- **Max Sequence Length**: 2048 tokens, allowing for longer context.
-- **Parameters**: 1.3 billion parameters.
-- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
+- **PreNorm**: Improved stability during training.
+- **SwiGLU**: Enhanced activation function for better performance.
+- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
+- **Max Sequence Length**: 2048 tokens, allowing for longer context.
+- **Parameters**: 1.3 billion parameters.
+- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
 - **Token Type IDs**: Not used in this model.
 
 ### Model Sources
@@ -44,9 +49,9 @@ Depending on your use case, follow the appropriate section below.
 
 This model is pre-trained using Masked Language Modeling.
 The mask token used is `<MASK|LLM-jp>`.
-Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
-
-Example code for direct use:
+Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
+
+Example code for direct use:
 
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
@@ -98,7 +103,7 @@ The model was trained on the following hyperparameters.
 - Floating point expression: BF16
 
 ## Evaluation
-We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
+We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
 We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
 
 | Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
@@ -106,7 +111,7 @@ We adjusted the learning rate and training epochs for each model and task in acc
 | tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
 | tohoku-nlp/bert-large-japanese-v2| 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
 | ku-nlp/deberta-v3-base-japanese  | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
-| retrieva-jp/bert-1.3b            | 0.
+| retrieva-jp/bert-1.3b            | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
 
 
 ## Technical Specifications
@@ -121,9 +126,9 @@ The RetrievaBERT model is based on BERT with the following hyperparameters:
 - Maximum length of position embeddings: 2048
 
 As mentioned earlier, the main differences from the original BERT are:
-- PreNorm: Improved stability during training.
-- SwiGLU: Enhanced activation function for better performance.
-- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
+- PreNorm: Improved stability during training.
+- SwiGLU: Enhanced activation function for better performance.
+- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
 
 
 ### Compute Infrastructure
@@ -145,4 +150,4 @@ https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
 Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
 
 ## Model Card Contact
-pr@retrieva.jp
+pr@retrieva.jp
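Reviewer note on the `What's New` entry in the README diff: the sketch below illustrates why copying the gate's bias into `up_proj` changes a SwiGLU layer's output. It is a toy illustration only, not the RetrievaBERT implementation or the Megatron-LM conversion script. The function names, the `gate_proj` naming, and all numeric values are assumptions made for the example.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x, weight, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def swiglu_hidden(x, gate_w, gate_b, up_w, up_b):
    """Element-wise SiLU(gate_proj(x)) * up_proj(x), before any down projection."""
    gate = linear(x, gate_w, gate_b)
    up = linear(x, up_w, up_b)
    return [silu(g) * u for g, u in zip(gate, up)]


# Toy parameters (hypothetical values; input size 2, intermediate size 2).
x = [1.0, 2.0]
gate_w, gate_b = [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]
up_w, up_b = [[1.0, 1.0], [1.0, -1.0]], [0.1, 0.2]

# Intended behaviour: each projection uses its own bias.
correct = swiglu_hidden(x, gate_w, gate_b, up_w, up_b)

# Behaviour described in the changelog: up_proj carries the gate's bias.
buggy = swiglu_hidden(x, gate_w, gate_b, up_w, gate_b)

print("correct:", correct)
print("buggy:  ", buggy)
```

The two outputs differ whenever the two bias vectors differ, which is consistent with the commit replacing `model.safetensors` rather than changing only the README.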
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
 size 2602880000
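Reviewer note on the `model.safetensors` pointer: the new `oid` and `size` fields can be used to check that a downloaded copy is the fixed checkpoint. The following is a minimal sketch using only the standard library. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the check is demonstrated on a small temporary file rather than the real 2.6 GB weights.

```python
import hashlib
import os
import tempfile

# Pointer contents as they appear on the "+" side of the diff.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
size 2602880000
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: str, pointer: dict) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    algo, _, expected = pointer["oid"].partition(":")
    if os.path.getsize(path) != int(pointer["size"]):
        return False
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


pointer = parse_lfs_pointer(POINTER)

# Rough upper bound on the parameter count, assuming 2-byte (BF16) weights;
# the file also holds a safetensors header, so the true count is slightly lower.
approx_params = int(pointer["size"]) // 2

# Demonstrate the check on a toy file with a matching toy pointer.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.bin")
    with open(toy_path, "wb") as f:
        f.write(b"hello")
    toy_pointer = {
        "oid": "sha256:" + hashlib.sha256(b"hello").hexdigest(),
        "size": "5",
    }
    toy_ok = matches_pointer(toy_path, toy_pointer)
```

To verify a real download, one would call `matches_pointer("model.safetensors", pointer)` on the local file. The unchanged `size` across both sides of the diff fits a fix that rewrites tensor values without changing their shapes.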