Migrate model card from transformers-repo
Read the announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/voidful/albert_chinese_base/README.md
README.md
ADDED
@@ -0,0 +1,43 @@
---
language: zh
---

# albert_chinese_base

This is the albert_chinese_base model from [Google's GitHub repository](https://github.com/google-research/ALBERT),
converted with Hugging Face's [conversion script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).

## Attention

The albert_chinese_base model does not use sentencepiece, so AlbertTokenizer cannot load its vocabulary.
You have to load the tokenizer with BertTokenizer instead of AlbertTokenizer!
A MaskedLM prediction can be used to verify that this approach is correct.

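To illustrate why a BERT-style tokenizer fits here, the sketch below mimics, in plain Python, the kind of lookup a tokenizer performs against a plain-text vocabulary (one token per line), which is what BertTokenizer reads, whereas AlbertTokenizer expects a sentencepiece model file. Special tokens such as `[MASK]` are kept whole, every other character becomes its own token, and each token maps to its index in the vocabulary. The tiny vocabulary and the `toy_tokenize` and `toy_encode` helpers are hypothetical stand-ins, not the library's implementation.

```python
import re

# Hypothetical toy vocabulary; a real vocabulary file has thousands of entries.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "今", "天", "情", "很", "好"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}

# Match a bracketed special token first, otherwise any single non-space character.
TOKEN_PATTERN = re.compile(r"\[[A-Z]+\]|\S")


def toy_tokenize(text):
    """Split text into special tokens and single characters, adding [CLS]/[SEP]."""
    return ["[CLS]"] + TOKEN_PATTERN.findall(text) + ["[SEP]"]


def toy_encode(text):
    """Map each token to its vocabulary index, falling back to [UNK]."""
    unk_id = TOKEN_TO_ID["[UNK]"]
    return [TOKEN_TO_ID.get(token, unk_id) for token in toy_tokenize(text)]


tokens = toy_tokenize("今天[MASK]情很好")
ids = toy_encode("今天[MASK]情很好")
# Locate the masked position by token ID rather than a hard-coded number.
mask_position = ids.index(TOKEN_TO_ID["[MASK]"])
print(tokens)
print(ids, mask_position)
```

Looking up the mask position through the vocabulary, rather than through a fixed ID, keeps the sketch independent of where `[MASK]` happens to sit in a given vocabulary file.
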
## Justify (validation)

[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
import torch
from torch.nn.functional import softmax
from transformers import AlbertForMaskedLM, BertTokenizer

pretrained = 'voidful/albert_chinese_base'
# Load the vocabulary with BertTokenizer, not AlbertTokenizer (see the note above).
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
model.eval()  # disable dropout for inference

inputtext = "今天[MASK]情很好"

# Encode once and locate the [MASK] token by ID instead of hard-coding 103.
token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = token_ids.index(tokenizer.mask_token_id)

input_ids = torch.tensor(token_ids).unsqueeze(0)  # Batch size 1
with torch.no_grad():
    # Index 0 holds the prediction scores (logits) when no labels are passed.
    prediction_scores = model(input_ids)[0]

# Turn the logits at the masked position into probabilities over the vocabulary.
logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])
```
Result: `感 0.36333346366882324`
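
The probability printed above is the softmax of the model's scores at the masked position. As a rough sketch of that arithmetic, the pure-Python helper below converts a list of scores into probabilities and returns the k most likely entries. The function names, the toy scores, and the candidate tokens are made up for illustration and are not model output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k=3):
    """Return the k highest-probability (index, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for a four-token vocabulary, not real model output.
toy_tokens = ["感", "心", "事", "天"]
toy_scores = [2.0, 1.5, 0.1, -1.0]

for index, prob in top_k(toy_scores, k=2):
    print(toy_tokens[index], round(prob, 4))
```

With the model loaded as in the example above, the same ranking could be obtained by passing `prediction_scores[0, maskpos].tolist()` as `scores` and mapping the returned indices through `tokenizer.convert_ids_to_tokens`.
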