|
--- |
|
language: |
|
- vi |
|
license: mit |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- LLMs |
|
- NLP |
|
- Vietnamese |
|
base_model: Viet-Mistral/Vistral-7B-Chat |
|
datasets: |
|
- Tamnemtf/hcmue_qa |
|
--- |
|
|
|
# Model Card
|
|
|
Chatbots can serve several purposes:

- **Answer questions.** With a large knowledge base, chatbots can answer user questions on a variety of topics, providing facts, data, explanations, definitions, and more.

- **Complete tasks.** Integrated with other systems and APIs, chatbots can act on the user's behalf, and based on a user's preferences and past interactions they can suggest relevant products, services, and content.

- **Provide customer service.** Chatbots can handle many simple customer service interactions, such as answering questions, handling complaints, and processing returns, freeing human agents to focus on more complex issues.

- **Generate conversational responses.** Using NLP and machine learning, chatbots can understand natural language and generate conversational responses, creating fluent interactions.
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
- **Model type:** Mistral |
|
- **Language(s) (NLP):** Vietnamese |
|
- **Finetuned from model:** [Viet-Mistral/Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat)
|
|
|
### Purpose |
|
This model is an improved version of the previous release. It ships a new `tokenizer_config.json` that registers `<|im_start|>` and `<|im_end|>` as additional special tokens.
|
|
|
### Training Data |
|
|
|
Our dataset was built from our university's student handbook. It covers majors, university regulations, and other information about our university.
|
[hcmue_qa](https://huggingface.co/datasets/Tamnemtf/hcmue_qa) |
|
|
|
## Instruction Format |
|
In order to leverage instruction fine-tuning, your prompt should be wrapped in `<|im_start|>` and `<|im_end|>` tokens. The very first instruction should begin with the beginning-of-sentence (BOS) token id; subsequent instructions should not. The assistant generation ends with the end-of-sentence (EOS) token id.
|
|
|
E.g. |
|
```python |
|
role = "user"

prompt = "hi"

chatml = f"<|im_start|>{role}\n{prompt}<|im_end|>\n"
|
``` |
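For a multi-turn conversation, the same pattern repeats for each message, with the assistant turn opened at the end. A minimal sketch (the `format_chatml` helper is illustrative, not part of this repository; the BOS/EOS token ids are added by the tokenizer, so only the text layout is shown here):

```python
# Sketch: wrap each message of a conversation in ChatML markers.
def format_chatml(messages, add_generation_prompt=True):
    """Build the ChatML-formatted prompt text for a list of messages."""
    text = ""
    for message in messages:
        # Each turn is delimited by <|im_start|>role ... <|im_end|>
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text

conversation = [{"role": "user", "content": "hi"}]
print(format_chatml(conversation))
# <|im_start|>user
# hi
# <|im_end|>
# <|im_start|>assistant
```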
|
Here is the [dataset](https://huggingface.co/datasets/Tamnemtf/hcmue-new-template) after adding this format. |
|
|
|
### Training Procedure |
|
|
|
```python
# Load LoRA configuration
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # conventional
    task_type="CAUSAL_LM",
)
```

The updated chat template, set in `tokenizer_config.json`:

```json
{
  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
```
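The chat template above can be checked outside of `transformers` by rendering it with Jinja2 (the engine `transformers` uses for chat templates). This is a sketch assuming `jinja2` is installed; in practice you would call `tokenizer.apply_chat_template` instead:

```python
# Render the ChatML chat template directly with Jinja2 to
# verify the text it produces for a single user message.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

messages = [{"role": "user", "content": "hi"}]
rendered = Template(CHAT_TEMPLATE).render(
    messages=messages, add_generation_prompt=True
)
print(rendered)
# <|im_start|>user
# hi
# <|im_end|>
# <|im_start|>assistant
```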
|
|
|
## Run |
|
A free [Colab notebook](https://colab.research.google.com/drive/1qLj6vP0HHQjOLvNbsCD4h6xOlbVVrHAV?usp=sharing) is available to run the model.
|
|
|
## Training report |
|
[report](https://api.wandb.ai/links/tamtf/11hb5ssm) |
|
|
|
## Contact |
|
|
|
nguyndantdm6@gmail.com |