---
license: mit
language:
- ja
- zh
pipeline_tag: translation
---

# Release Notes

* This model is fine-tuned from mt5-translation-ja_zh.

* Why this model was made<br>
I was testing various models for translating Japanese game text into Chinese, and the original model had several issues in production, so I did some "supervised" training just to fix them.<br>
Game text has very characteristic correspondence constraints: some tokens must be translated, some tokens must be kept exactly as-is, and the line count of the text must be preserved (see the sketch at the end of this section).<br>
Because mT5's pre-training covers this kind of correspondence, it performs comparatively well here.<br>
Since the translation pre-training had already been done by the original author, I fine-tuned directly on top of it,<br>
fixing some positional problems in the translated output and training in some vocabulary the translations need.<br>

* Limitations of this model<br>
Only an mt5-large version has been made so far; it needs roughly 8 GB or more of VRAM, which is considerably more than the task requires.<br>
For ease of use it is set up to push large batches through in one pass, which makes full use of the GPU, but it never looks at surrounding context, which I consider a major drawback.<br>
The dataset contains too little fixed translation vocabulary, so many translations come back in another language the model knows (usually English).<br>
After some corrective effort, it now gives you zero-shot phonetic "soramimi" renderings (when this zero-shot behavior first appeared, our whole translation team lost it).
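
Because some tokens have to survive translation byte-for-byte, a common workaround is to mask them with sentinels before translation and restore them afterwards. The sketch below only illustrates that idea; the placeholder pattern and sentinel format are hypothetical, and `translate_batch` refers to the helper defined in the usage guide below.

```python
import re

# Hypothetical pattern for tokens that must not be translated,
# e.g. printf-style codes or {NAME}-style engine placeholders.
PROTECTED = re.compile(r'%[sd]|\{[A-Za-z_]+\}')

def mask_protected(line):
    """Swap protected tokens for numbered sentinels, remembering the originals."""
    saved = []
    def repl(match):
        saved.append(match.group(0))
        return f'[{len(saved) - 1}]'  # sentinel the model should copy through
    return PROTECTED.sub(repl, line), saved

def unmask_protected(line, saved):
    """Put the original protected tokens back into the translated line."""
    for i, token in enumerate(saved):
        line = line.replace(f'[{i}]', token)
    return line

masked, saved = mask_protected('{PLAYER}のHPが%d回復した!')
# translated = translate_batch([masked])[0]  # helper from the usage guide
# print(unmask_protected(translated, saved))
```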

# Usage Guide

```python
from transformers import pipeline

# pipe = pipeline("translation", model="mt5-trained", tokenizer="mt5-trained",
#                 repetition_penalty=1.4, batch_size=1, max_length=256)
pipe = pipeline(
    "translation",
    model="mt5-translation-ja_zh-game-v0.1",
    repetition_penalty=1.4,
    batch_size=1,
    max_length=256,
)


def translate_batch(batch, language='<-ja2zh->'):
    """Translate a list of strings, prepending the language tag the model expects."""
    tagged = [f'{language} {line}' for line in batch]
    translated = pipe(tagged)
    return [item['translation_text'] for item in translated]


inputs = []  # fill with Japanese source lines

print(translate_batch(inputs))
```
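
The release notes mention pushing large batches through in one pass to keep the GPU busy; with the `pipeline` API that is just a larger `batch_size`. The value below is an arbitrary illustration, not a recommendation:

```python
# Sketch: trade VRAM for throughput by batching many lines per forward pass.
# batch_size=32 is an assumption; tune it to your GPU's memory.
pipe = pipeline(
    "translation",
    model="mt5-translation-ja_zh-game-v0.1",
    repetition_penalty=1.4,
    batch_size=32,
    max_length=256,
)
```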

# Simple WebUI (temporary)

# Roadmap

Train mt5-small and RWKV (RWKV can read surrounding context).<br>
Make a LoRA training script and UI.<br>
Build a mechanism that saves low-confidence translations into a database for manual correction (see the sketch below).<br>
Search that manually corrected database with sentencepiece-based retrieval to pull up similar "previous translations", greatly improving the consistency of the model's word choices.<br>
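
A rough sketch of the last two roadmap items. Everything here is an assumption rather than shipped code: the SQLite schema, the confidence threshold, and the Jaccard token-overlap scoring are all placeholders, and the model's own tokenizer is borrowed purely for its sentencepiece vocabulary.

```python
import sqlite3
from transformers import AutoTokenizer

# Borrow the model's sentencepiece tokenizer for retrieval (an assumption;
# any sentencepiece vocabulary would do).
tok = AutoTokenizer.from_pretrained("mt5-translation-ja_zh-game-v0.1")

db = sqlite3.connect("corrections.db")
db.execute("CREATE TABLE IF NOT EXISTS tm (source TEXT, target TEXT)")

def save_for_review(source, target, confidence, threshold=0.5):
    """Queue a translation for manual correction when confidence is low.
    How `confidence` is computed (e.g. from generation scores) is left open."""
    if confidence < threshold:
        db.execute("INSERT INTO tm VALUES (?, ?)", (source, target))
        db.commit()

def find_previous(source, top_k=3):
    """Return the stored pairs whose sentencepiece tokens overlap most with
    `source`, scored with simple Jaccard similarity."""
    query = set(tok.tokenize(source))
    scored = []
    for src, tgt in db.execute("SELECT source, target FROM tm"):
        cand = set(tok.tokenize(src))
        union = query | cand
        score = len(query & cand) / len(union) if union else 0.0
        scored.append((score, src, tgt))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```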