Update README.md
README.md
CHANGED
@@ -21,82 +21,4 @@ extra_gated_fields:
## Model Description

The development of Vistral-7B-Chat involves the following steps:

1. Extend the tokenizer of Mistral 7B to better support Vietnamese.
2. Perform continual pre-training of Mistral on a diverse dataset of Vietnamese texts that are meticulously cleaned and deduplicated.
3. Perform supervised fine-tuning of the model using diverse instruction data. We design a set of instructions to align the model with the safety criteria in Vietnam.
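The effect of step 1 can be pictured with a toy example. The sketch below is not the actual Mistral BPE pipeline — the vocabularies and the greedy longest-match rule are simplifications chosen for clarity — but it shows why adding Vietnamese-specific tokens shrinks token counts:

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary.

    A toy stand-in for BPE: at each position, consume the longest
    vocabulary entry that matches; fall back to single characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for size in range(len(text) - i, 1, -1):
            if text[i:i + size] in vocab:
                match = text[i:i + size]
                break
        tokens.append(match)
        i += len(match)
    return tokens

base_vocab = {"ng", "on"}                        # hypothetical base merges
extended_vocab = base_vocab | {"tiếng", "việt"}  # add Vietnamese word tokens

print(tokenize("tiếng việt", base_vocab))      # nine small pieces
print(tokenize("tiếng việt", extended_vocab))  # ['tiếng', ' ', 'việt']
```

In practice, with Hugging Face `transformers`, new tokens are typically added via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` before continual pre-training.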
GGUF version: run Vistral on **your local computer** [here](https://huggingface.co/chiennv/Vistral-7B-Chat-gguf).

**Note**: To deploy Vistral locally (e.g. on LM Studio), make sure that you use the specified chat template, available for download [here](https://huggingface.co/uonlp/Vistral-7B-Chat-gguf/blob/main/template_chat.json). This step is crucial for Vistral to generate accurate answers.
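The template matters because the model was fine-tuned on conversations rendered into one specific string format, and any deviation at inference time degrades answers. The sketch below assumes a Llama-2-style `[INST]`/`<<SYS>>` layout; the authoritative template for Vistral is the `template_chat.json` linked above, so treat these exact tokens as an illustrative assumption:

```python
def render_chat(messages, bos="<s>", eos="</s>"):
    """Render a conversation into a single prompt string using a
    Llama-2-style [INST] template (assumed format, for illustration;
    Vistral ships its authoritative template in template_chat.json)."""
    system = ""
    prompt = ""
    for msg in messages:
        if msg["role"] == "system":
            system = f"<<SYS>>\n{msg['content']}\n<</SYS>>\n\n"
        elif msg["role"] == "user":
            prompt += f"{bos}[INST] {system}{msg['content']} [/INST]"
            system = ""  # the system prompt is folded into the first user turn
        elif msg["role"] == "assistant":
            prompt += f" {msg['content']}{eos}"
    return prompt

conv = [
    {"role": "system", "content": "Bạn là một trợ lí hữu ích."},
    {"role": "user", "content": "Xin chào!"},
]
print(render_chat(conv))
```

When using `transformers`, prefer `tokenizer.apply_chat_template(conversation)`, which applies the template shipped with the model instead of a hand-rolled string like this.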
### Acknowledgement

We thank Hessian AI and LAION for their support and compute for training this model. In particular, we gratefully acknowledge LAION for providing access to compute budget granted by the Gauss Centre for Supercomputing e.V. and by the John von Neumann Institute for Computing (NIC) on the supercomputers JUWELS Booster and JURECA at the Jülich Supercomputing Centre (JSC).
### Data

We will release the data after we publish the technical report for this model. However, part of the data is already available in our [CulturaY](https://huggingface.co/datasets/ontocord/CulturaY) and [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) datasets.
## Usage

To enable single/multi-turn conversational chat with `Vistral-7B-Chat`, you can use the default chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Vietnamese system prompt. In English: "You are an enthusiastic and honest
# Vietnamese assistant. Always answer as helpfully as possible while staying
# safe. Your answers should not contain any harmful, racist, sexist, toxic,
# dangerous, or illegal content. Make sure your answers are socially unbiased
# and positive. If a question does not make sense or is not factually coherent,
# explain why instead of answering something incorrect. If you do not know the
# answer to a question, say so, and please do not share false information."
system_prompt = "Bạn là một trợ lí Tiếng Việt nhiệt tình và trung thực. Hãy luôn trả lời một cách hữu ích nhất có thể, đồng thời giữ an toàn.\n"
system_prompt += "Câu trả lời của bạn không nên chứa bất kỳ nội dung gây hại, phân biệt chủng tộc, phân biệt giới tính, độc hại, nguy hiểm hoặc bất hợp pháp nào. Hãy đảm bảo rằng các câu trả lời của bạn không có thiên kiến xã hội và mang tính tích cực."
system_prompt += "Nếu một câu hỏi không có ý nghĩa hoặc không hợp lý về mặt thông tin, hãy giải thích tại sao thay vì trả lời một điều gì đó không chính xác. Nếu bạn không biết câu trả lời cho một câu hỏi, hãy trả lời là bạn không biết và vui lòng không chia sẻ thông tin sai lệch."

tokenizer = AutoTokenizer.from_pretrained('Viet-Mistral/Vistral-7B-Chat')
model = AutoModelForCausalLM.from_pretrained(
    'Viet-Mistral/Vistral-7B-Chat',
    torch_dtype=torch.bfloat16,  # change to torch.float16 if you're using a V100
    device_map="auto",
    use_cache=True,
)

conversation = [{"role": "system", "content": system_prompt}]
while True:
    human = input("Human: ")
    if human.lower() == "reset":
        conversation = [{"role": "system", "content": system_prompt}]
        print("The chat history has been cleared!")
        continue

    conversation.append({"role": "user", "content": human})
    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

    out_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=768,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        temperature=0.1,
        repetition_penalty=1.05,
    )
    assistant = tokenizer.batch_decode(out_ids[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()
    print("Assistant: ", assistant)
    conversation.append({"role": "assistant", "content": assistant})
```
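The `generate` call above stacks several decoding controls: `temperature` sharpens or flattens the next-token distribution, `top_k` keeps only the k most probable tokens, `top_p` keeps the smallest set of tokens whose cumulative probability reaches p, and `repetition_penalty` down-weights tokens that already appeared. A self-contained sketch with toy logits (not actual model outputs) shows how the first three interact:

```python
import math

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k and top-p (nucleus) filtering;
    return the surviving tokens with renormalized probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens by probability, highest first; apply top-k cutoff.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Nucleus: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

logits = [2.0, 1.0, 0.2, -1.0]
print(filter_logits(logits, top_p=0.95))                      # three tokens survive
print(filter_logits(logits, temperature=0.1, top_p=0.95))     # only token 0 survives
```

In this toy example, `temperature=0.1` makes the softmax so sharp that the nucleus collapses to a single token, which is why the low-temperature settings above yield focused, low-variance answers.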
## Performance

We evaluated our Vistral model on the [VMLU leaderboard](https://vmlu.ai/leaderboard), a reliable framework for evaluating large language models in Vietnamese across various tasks. These tasks involve multiple-choice questions in STEM, Humanities, Social Sciences, and more. Our model achieved an average score of 50.07%, significantly surpassing ChatGPT's score of 46.33%.

<p align="center"> <img src="official_vmlu.png" width="650" /> </p>

**Disclaimer: Despite extensive red teaming and safety alignment efforts, our model may still pose potential risks, including but not limited to hallucination, toxic content, and bias. We strongly encourage researchers and practitioners to fully acknowledge these potential risks and to meticulously assess and secure the model before incorporating it into their work. Users are responsible for adhering to and complying with applicable governance and regulations. The authors retain the right to disclaim any accountability for potential damages or liability resulting from the use of the model.**
## Citation

If you find our project useful, we hope you will kindly star our repo and cite our work as follows. Corresponding authors: huu@ontocord.ai, chienn@uoregon.edu, nguyenhuuthuat09@gmail.com, and thienn@uoregon.edu.

```
@article{chien2023vistral,
  author = {Chien Van Nguyen and Thuat Nguyen and Quan Nguyen and Huy Nguyen and Björn Plüster and Nam Pham and Huu Nguyen and Patrick Schramowski and Thien Nguyen},
  title  = {Vistral-7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese},
  year   = 2023,
}
```
## Model Description

Clone of Viet-Mistral/Vistral-7B-Chat