---
license: other
language:
- zh
- en
---
# BlueLM

<p align="center">
🖥 <a href="https://github.com/vivo-ai-lab/BlueLM" target="_blank">github</a> • 📜 <a href="https://huggingface.co/vivo-ai/BlueLM-7B-Base/blob/main/MODEL_LICENSE" target="_blank">LICENSE</a> • 🎯 <a href="https://developers.vivo.com/product/ai/bluelm" target="_blank">vivo Developers</a> • 🗨 <a href="https://github.com/vivo-ai-lab/BlueLM/blob/main/resources/wechat.png" target="_blank">WeChat</a>
</p>

## 模型介绍/Introduction

BlueLM 是由 vivo AI 全球研究院自主研发的大规模预训练语言模型,本次发布包含 7B 基础模型和 7B 对话模型,同时我们开源了支持 **32K** 的长文本基础模型和对话模型。

- **更大量的优质数据**:基于高质量语料库进行训练,规模达到了 **2.6 万亿** token,该语料库包含中文、英文以及少量日韩数据。
- **更优的效果**:其中 BlueLM-7B-Chat 在 **C-Eval** 和 **CMMLU** 上均取得领先结果,对比同尺寸开源模型具有较强的竞争力。
- **长文本支持**:BlueLM-7B-Base-32K 和 BlueLM-7B-Chat-32K 均支持 **32K** 长文本,在保持基础能力相当的情况下,能够支持更长上下文的理解。
- **协议说明**:BlueLM 系列欢迎开发者进行学术研究和商业应用。

BlueLM is a large-scale open-source language model independently developed by vivo AI Lab. This release includes 2K and 32K context-length versions of both the Base and Chat models.

- **High-quality Data**: BlueLM is trained on a high-quality corpus of 2.6 trillion tokens, consisting mainly of Chinese and English data with a small amount of Japanese and Korean data.
- **Stronger Performance**: BlueLM-7B-Chat achieves strong, competitive results on the C-Eval and CMMLU benchmarks compared with open-source models of the same size.
- **Longer Context**: We have extended the context length of both BlueLM-7B-Base-32K and BlueLM-7B-Chat-32K from 2K to 32K, so the models support longer-context understanding while maintaining the same basic capabilities.
- **Model License**: BlueLM weights are open for academic research and commercial use.

本次发布的模型版本和下载链接见下表:

The release versions and Hugging Face download links are listed in the table below:

|        | Base Model | Chat Model | 4bits Quantized Chat Model |
|:------:|:----------:|:----------:|:--------------------------:|
| 7B-2K  | [BlueLM-7B-Base](https://huggingface.co/vivo-ai/BlueLM-7B-Base) | [BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat) | [BlueLM-7B-Chat-4bits](https://huggingface.co/vivo-ai/BlueLM-7B-Chat-4bits) |
| 7B-32K | [BlueLM-7B-Base-32K](https://huggingface.co/vivo-ai/BlueLM-7B-Base-32K) | [BlueLM-7B-Chat-32K](https://huggingface.co/vivo-ai/BlueLM-7B-Chat-32K) | - |

## 评测结果/Benchmark Results

我们在 LongBench 评测集上对 BlueLM-7B-Chat-32K 模型进行了测试,具体结果如下表所示:

We tested BlueLM-7B-Chat-32K on the LongBench benchmark; the results are shown in the table below:

| Model | Average | Summary | Single-Doc QA | Multi-Doc QA | Code | Few-shot | Synthetic |
|:-------------------|:--------|:--------|:--------------|:-------------|:-----|:---------|:----------|
| BlueLM-7B-Chat-32K | 41.2 | 18.8 | 35.6 | 36.2 | 54.2 | 56.9 | 45.5 |

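A quick sanity check (a minimal sketch; the category names follow the table above) confirms that the Average column is the unweighted mean of the six category scores:

```python
# LongBench category scores for BlueLM-7B-Chat-32K, copied from the table above
scores = {
    "Summary": 18.8,
    "Single-Doc QA": 35.6,
    "Multi-Doc QA": 36.2,
    "Code": 54.2,
    "Few-shot": 56.9,
    "Synthetic": 45.5,
}

# Unweighted mean over the six categories
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 41.2, matching the Average column
```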
## 推理部署/Inference and Deployment

```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Base-32K", trust_remote_code=True, use_fast=False)
>>> model = AutoModelForCausalLM.from_pretrained("vivo-ai/BlueLM-7B-Base-32K", device_map="cuda:0", torch_dtype=torch.bfloat16, trust_remote_code=True)
>>> model = model.eval()
>>> inputs = tokenizer("儒林外史->吴敬梓\n隋唐演义->褚人获\n红楼梦->", return_tensors="pt")
>>> inputs = inputs.to("cuda:0")
>>> pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
儒林外史->吴敬梓
隋唐演义->褚人获
红楼梦->曹雪芹
三国演义->罗贯中
水浒传->施耐庵
西游记->吴承恩
聊斋志异->蒲松龄
金瓶梅->兰陵笑笑生
封神演义->许仲琳
三言二拍->冯梦龙
东周列国志->冯梦龙
```
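The prompt in the example above is a completion-style few-shot pattern: several `title->author` pairs followed by an unfinished pair for the model to complete. A small helper for building such prompts — a hypothetical convenience function for illustration, not part of the BlueLM or Transformers API:

```python
def build_few_shot_prompt(examples, query, sep="->"):
    """Build a completion-style few-shot prompt: one 'title->author' pair
    per line, ending with 'query->' for the model to complete."""
    lines = [f"{title}{sep}{author}" for title, author in examples]
    lines.append(f"{query}{sep}")
    return "\n".join(lines)

# Reproduces the prompt string used in the example above
prompt = build_few_shot_prompt(
    [("儒林外史", "吴敬梓"), ("隋唐演义", "褚人获")], "红楼梦"
)
assert prompt == "儒林外史->吴敬梓\n隋唐演义->褚人获\n红楼梦->"
```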

更多使用说明,请参考我们的 [Github 仓库](https://github.com/vivo-ai-lab/BlueLM)。

For more instructions, please refer to our [GitHub repo](https://github.com/vivo-ai-lab/BlueLM).

## 协议/License

社区使用代码依照 [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) 协议开源,且使用 BlueLM 模型权重需要遵循 [vivo_BlueLM模型许可协议](https://huggingface.co/vivo-ai/BlueLM-7B-Base/blob/main/MODEL_LICENSE)。

Our code is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, and use of the BlueLM model weights must comply with the [Community License for BlueLM Model](https://huggingface.co/vivo-ai/BlueLM-7B-Base/blob/main/MODEL_LICENSE).