---
license: mit
language:
- ja
- zh
pipeline_tag: translation
---

# Release Notes

* This model is fine-tuned from mt5-translation-ja_zh.

* Why this model was made<br>
I was testing various models for translating Japanese game text into Chinese, and the original model had several issues in production, so I did some "supervised" training just to fix them.<br>
Game text has very characteristic correspondence constraints: some tokens must be translated, some tokens must be kept exactly as-is, and the line count of the text must be preserved (see the sketch at the end of this section).<br>
Because mT5's pre-training covers this kind of correspondence, it performs comparatively well here.<br>
Since the translation pre-training had already been done by the original author, I fine-tuned directly on top of it,<br>
fixing some positional problems in the translated output and training in some vocabulary the translations need.<br>

* Limitations of this model<br>
Only an mt5-large version has been made so far; it needs roughly 8 GB or more of VRAM, which is considerably more than the task requires.<br>
For ease of use it is set up to push large batches through in one pass, which makes full use of the GPU, but it never looks at surrounding context, which I consider a major drawback.<br>
The dataset contains too little fixed translation vocabulary, so many translations come back in another language the model knows (usually English).<br>
After some corrective effort, it now gives you zero-shot phonetic "soramimi" renderings (when this zero-shot behavior first appeared, our whole translation team lost it).
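
Because some tokens have to survive translation byte-for-byte, a common workaround is to mask them with sentinels before translation and restore them afterwards. The sketch below only illustrates that idea; the placeholder pattern and sentinel format are hypothetical, and `translate_batch` refers to the helper defined in the usage guide below.

```python
import re

# Hypothetical pattern for tokens that must not be translated,
# e.g. printf-style codes or {NAME}-style engine placeholders.
PROTECTED = re.compile(r'%[sd]|\{[A-Za-z_]+\}')

def mask_protected(line):
    """Swap protected tokens for numbered sentinels, remembering the originals."""
    saved = []
    def repl(match):
        saved.append(match.group(0))
        return f'[{len(saved) - 1}]'  # sentinel the model should copy through
    return PROTECTED.sub(repl, line), saved

def unmask_protected(line, saved):
    """Put the original protected tokens back into the translated line."""
    for i, token in enumerate(saved):
        line = line.replace(f'[{i}]', token)
    return line

masked, saved = mask_protected('{PLAYER}のHPが%d回復した!')
# translated = translate_batch([masked])[0]  # helper from the usage guide
# print(unmask_protected(translated, saved))
```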

# Usage Guide

```python
from transformers import pipeline

# pipe = pipeline("translation", model="mt5-trained", tokenizer="mt5-trained",
#                 repetition_penalty=1.4, batch_size=1, max_length=256)
pipe = pipeline(
    "translation",
    model="mt5-translation-ja_zh-game-v0.1",
    repetition_penalty=1.4,
    batch_size=1,
    max_length=256,
)


def translate_batch(batch, language='<-ja2zh->'):
    """Translate a list of strings, prepending the language tag the model expects."""
    tagged = [f'{language} {line}' for line in batch]
    translated = pipe(tagged)
    return [item['translation_text'] for item in translated]


inputs = []  # fill with Japanese source lines

print(translate_batch(inputs))
```
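
The release notes mention pushing large batches through in one pass to keep the GPU busy; with the `pipeline` API that is just a larger `batch_size`. The value below is an arbitrary illustration, not a recommendation:

```python
# Sketch: trade VRAM for throughput by batching many lines per forward pass.
# batch_size=32 is an assumption; tune it to your GPU's memory.
pipe = pipeline(
    "translation",
    model="mt5-translation-ja_zh-game-v0.1",
    repetition_penalty=1.4,
    batch_size=32,
    max_length=256,
)
```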

# Simple WebUI (temporary)

# Roadmap

Train mt5-small and RWKV (RWKV can read surrounding context).<br>
Make a LoRA training script and UI.<br>
Build a mechanism that saves low-confidence translations into a database for manual correction (see the sketch below).<br>
Search that manually corrected database with sentencepiece-based retrieval to pull up similar "previous translations", greatly improving the consistency of the model's word choices.<br>
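
A rough sketch of the last two roadmap items. Everything here is an assumption rather than shipped code: the SQLite schema, the confidence threshold, and the Jaccard token-overlap scoring are all placeholders, and the model's own tokenizer is borrowed purely for its sentencepiece vocabulary.

```python
import sqlite3
from transformers import AutoTokenizer

# Borrow the model's sentencepiece tokenizer for retrieval (an assumption;
# any sentencepiece vocabulary would do).
tok = AutoTokenizer.from_pretrained("mt5-translation-ja_zh-game-v0.1")

db = sqlite3.connect("corrections.db")
db.execute("CREATE TABLE IF NOT EXISTS tm (source TEXT, target TEXT)")

def save_for_review(source, target, confidence, threshold=0.5):
    """Queue a translation for manual correction when confidence is low.
    How `confidence` is computed (e.g. from generation scores) is left open."""
    if confidence < threshold:
        db.execute("INSERT INTO tm VALUES (?, ?)", (source, target))
        db.commit()

def find_previous(source, top_k=3):
    """Return the stored pairs whose sentencepiece tokens overlap most with
    `source`, scored with simple Jaccard similarity."""
    query = set(tok.tokenize(source))
    scored = []
    for src, tgt in db.execute("SELECT source, target FROM tm"):
        cand = set(tok.tokenize(src))
        union = query | cand
        score = len(query & cand) / len(union) if union else 0.0
        scored.append((score, src, tgt))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```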