Randeng-PPVAE-1.2B-Augmentation-Chinese

Main Page: Fengshenbang
GitHub: Fengshenbang-LM

Brief Introduction

PPVAE (Pre-train and Plug-in Variational Auto-Encoder) can generate a large number of augmented samples for a category after being trained on only a small number of texts from that category. PPVAE is a hierarchical framework made up of two VAEs. In the pre-trained VAE, the encoder maps text into a global latent (hidden) space and the decoder decodes latent vectors back into text. PluginVAE is a lightweight VAE that learns the mapping, in both directions, between the global latent space and a conditional latent space; this mapping can be trained with only a small amount of conditional text.
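The two-level structure described above can be written compactly. The block below is a sketch in standard VAE notation, not the exact training objective from the paper. The symbols x (text), z (global latent vector) and z_c (conditional latent vector) are chosen here for illustration, and the standard normal prior on z_c is an assumption.

```latex
% Pre-trained VAE: text x <-> global latent z (standard evidence lower bound)
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

% Generation for one category (assumed prior on z_c):
% sample z_c, map it to the global space with the PluginVAE decoder,
% then decode to text with the pre-trained decoder
z_c \sim \mathcal{N}(0, I), \qquad
z = \mathrm{Dec}_{\mathrm{plugin}}(z_c), \qquad
x \sim p_\theta(x \mid z)
```

Only the PluginVAE is fitted to the category texts, so the expensive text encoder and decoder are reused unchanged across categories.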

PPVAE follows the paper Pre-train and Plug-in: Flexible Conditional Text Generation with Variational Auto-Encoders.

Model Taxonomy

Demand	Task	Series	Model	Parameters	Extra
Data Augmentation	Natural Language Generation (NLG)	Randeng	VAE	1.2B	pluginVAE

Model Information

Pretrained VAE:

Training corpus: Wudao Corpus (280G version)

Reference model: Randeng-DAVAE-1.2B-General-Chinese

PluginVAE:

Encoder: a three-layer MLP that maps latent vectors from the global latent space to the category latent space.

Decoder: a three-layer MLP that maps latent vectors from the category latent space back to the global latent space.

Training corpus: a small number of texts from the target category.
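To make the PluginVAE description concrete, here is a minimal structural sketch. It is not the Fengshenbang implementation: the class name `TinyPluginVAE`, the layer sizes, the ReLU activations and the standard normal prior are all assumptions chosen for illustration. It uses plain Python so that it runs without dependencies; a real implementation would more likely be built on `torch.nn`.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp(x, layers):
    """Apply the layers in order, with ReLU between (not after) them."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


class TinyPluginVAE:
    """Hypothetical stand-in for PluginVAE; all sizes are illustrative."""

    def __init__(self, global_dim=8, hidden_dim=6, cond_dim=2, seed=0):
        self.rng = random.Random(seed)
        self.cond_dim = cond_dim
        # Four sizes give three linear layers, i.e. a three-layer MLP.
        enc_dims = [global_dim, hidden_dim, hidden_dim, 2 * cond_dim]
        dec_dims = [cond_dim, hidden_dim, hidden_dim, global_dim]
        self.encoder = [init_linear(a, b, self.rng) for a, b in zip(enc_dims, enc_dims[1:])]
        self.decoder = [init_linear(a, b, self.rng) for a, b in zip(dec_dims, dec_dims[1:])]

    def encode(self, z_global):
        """Global latent vector -> (mean, log-variance) in the category space."""
        out = mlp(z_global, self.encoder)
        return out[:self.cond_dim], out[self.cond_dim:]

    def reparameterize(self, mu, logvar):
        """Sample z_c = mu + sigma * eps with eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

    def decode(self, z_cond):
        """Category latent vector -> global latent vector."""
        return mlp(z_cond, self.decoder)

    def sample_global(self):
        """Draw z_c from the assumed N(0, I) prior and map it to the global space."""
        z_cond = [self.rng.gauss(0.0, 1.0) for _ in range(self.cond_dim)]
        return self.decode(z_cond)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


vae = TinyPluginVAE()
z_global = [0.1] * 8                      # placeholder for a vector from the pre-trained encoder
mu, logvar = vae.encode(z_global)
z_rec = vae.decode(vae.reparameterize(mu, logvar))
print(len(mu), len(z_rec), len(vae.sample_global()))
```

In this sketch, a vector returned by `sample_global` plays the role of the global latent vector that the pre-trained decoder would then turn into text.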

Usage

git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
cd Fengshenbang-LM
pip install --editable .

import torch
from transformers import BertTokenizer,T5Tokenizer
from fengshen.models.PPVAE.pluginVAE import PPVAEModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_texts = [
    "非常好的一个博物馆，是我所有去过的博物馆里感觉最正规的一家.", 
    "这是我来长沙最最期待的一定要去的地方，总算今天特地去瞻仰千古遗容了，真好。", 
    "地方很大 很气派~~可以逛很久~~~去的时候是免费的~不过要安检~~~里面的马王堆~幸追夫人~还是很不错的",
    "绝对不虚此行！相当震撼的展览！原来古人也化妆，还有假发。记忆最深的是那个藕汤。可惜真颜已不得见。", 
    "去过三次，个人认为这是长沙最值得去的地方.", 
    "非常喜欢的一家博物馆，里面可看的东西很多，当然最吸引我的就是那个辛追夫人和“素纱单衣”，果然不是盖的~赞~~~", 
    "这两年也有很多机会去博物馆。。。不过还是想说湖南省博物馆是非常有特色的。。。真是上了一节很生动的历史课。",
    "网上订票去的，还是很顺利的就进去了，里面挺清净的，外围的环境也不错，还有鸽子可以喂。",
]
# Load the encoder (BERT) and decoder (T5) tokenizers from the model checkpoint.
encoder_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Randeng-PPVAE-1.2B-Augmentation-Chinese")
decoder_tokenizer = T5Tokenizer.from_pretrained("IDEA-CCNL/Randeng-PPVAE-1.2B-Augmentation-Chinese", eos_token='<|endoftext|>', pad_token='<pad>', extra_ids=0)
decoder_tokenizer.add_special_tokens({'bos_token': '<bos>'})
ppvae_model = PPVAEModel.from_pretrained("IDEA-CCNL/Randeng-PPVAE-1.2B-Augmentation-Chinese").to(device)
# Train the PluginVAE on the small set of category texts defined above.
ppvae_model.train_plugin(encoder_tokenizer, decoder_tokenizer, input_texts, negative_samples=None)
# n: number of samples to generate
texts = ppvae_model.generate(n=5)
print(texts)
# Example output:
# ['同学很推荐那里,自然会有好的风景.那里物价很便宜,真的不错。', 
# '同学说一会去盛国,可能是我去的比较多!故居真的很漂亮,夜景也特别好看。',
# '我的第一次旅行没有白来,最后领略了有些风吹草低见牛羊的味道,谢谢本次疗养。', 
# '同学一打听:这里距离世纪公园,还有最近的香山营不过200米,海拔也才四千米。', 
# '我发现那边很文艺!!有机会去过的,真是土耳其当地口音~还是很干净!。', ]
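Once samples have been generated, they still need to be merged with the original texts before they can be used as augmented training data. The helper below is a hypothetical sketch of that step and is not part of Fengshenbang-LM. The function name `build_augmented_dataset` and the `"positive"` label are assumptions for illustration, and the short English strings stand in for `input_texts` and the list returned by `generate`.

```python
def build_augmented_dataset(original_texts, generated_texts, label):
    """Merge original and generated texts into (text, label) pairs.

    Strips surrounding whitespace, drops empty strings and exact
    duplicates, and lists originals first so that a generated text
    identical to an original is discarded.
    """
    seen = set()
    dataset = []
    for text in list(original_texts) + list(generated_texts):
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            dataset.append((cleaned, label))
    return dataset


# Placeholder data: in practice, pass `input_texts` and the generated `texts`.
examples = build_augmented_dataset(
    ["good museum", "worth a visit"],
    ["worth a visit", "  very impressive  ", ""],
    label="positive",
)
print(examples)
```

Exact-match deduplication is the simplest possible filter. Generated samples can be noisy, as the example output above shows, so a real pipeline may also want a quality or similarity check before training on them.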

Citation

If you use our model in your work, please cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}