File size: 1,711 Bytes
3cb884d
 
0a8fa99
 
 
3cb884d
 
 
 
 
 
 
0a8fa99
d87e830
 
3cb884d
 
 
 
 
 
 
 
 
 
 
0a8fa99
3cb884d
 
 
549e8a0
0a8fa99
3cb884d
 
 
 
 
 
 
0a8fa99
3cb884d
0a8fa99
3cb884d
 
0a8fa99
3cb884d
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
---
language: zh
pipeline_tag: fill-mask
widget:
- text: "今天[MASK]情很好"
---

# albert_chinese_base

This a albert_chinese_base model from [Google's github](https://github.com/google-research/ALBERT)  
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)


## Notice
*Support AutoTokenizer*

Since sentencepiece is not used in albert_chinese_base model   
you have to call BertTokenizer instead of AlbertTokenizer !!!    
we can eval it using an example on MaskedLM   
   
由於 albert_chinese_base 模型沒有用 sentencepiece   
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!   
我們可以跑MaskedLM預測來驗證這個做法是否正確   
   
## Justify (驗證有效性)
```python
from transformers import AutoTokenizer, AlbertForMaskedLM
import torch
from torch.nn.functional import softmax

pretrained = 'voidful/albert_chinese_base'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)

input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos],dim=-1).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])
```
Result: `感 0.36333346366882324`