metadata

license: other
datasets:
  - vietgpt/wikipedia_vi
language:
  - vi
  - en
pipeline_tag: text-generation

URA-LLaMa

Model Details

Model Description

URA-LLaMa is a collaborative effort by Vietnamese researchers from Ho Chi Minh City University of Technology (HCMUT) - Vietnam National University HCMC and Stanford University to improve the quality of large language models for the Vietnamese language. We fine-tuned Meta LLaMa-2 models on Vietnamese articles from Wikipedia and online news websites. To support community progress, we offer our models free of charge for research purposes. Details of the models and our research are provided below.

Developed by:
- Duc Q. Nguyen
- Sang T. Truong
- Toan D. V. Nguyen
- Dong D. Le
- Nhi N. Truong
- Tho Quan
Model type: Text generation
Languages: Vietnamese, English
License:
- Custom license available at LICENSE
- To access our model, please fill in the license above and email a scanned copy to nqduc@hcmut.edu.vn (CC sttruong@cs.stanford.edu, qttho@hcmut.edu.vn)
Finetuned from model: Meta LLaMa-2

Model Sources

We publicly provide starter source code and access to a playground for URA-LLaMa 7B. The research paper is coming soon.

Repository: URA-LLaMa Github
Paper: [Coming soon]
Demo:
- Huggingface Playground: https://huggingface.co/spaces/ura-hcmut/ura-llama-playground
- URA Playground: https://www.ura.hcmut.edu.vn/llama-vi/

Uses

This model is primarily designed for text generation. Like other language models, it can also serve as an encoder for various downstream tasks. Its use cases are detailed below.

Direct Use

You can use our models to perform various tasks, including:

Question answering (with context)
Summarization
Language modelling
Text classification
Translation

Downstream Use

This model can serve as an encoder for a wide range of downstream tasks, spanning from pure natural language processing to combinations of natural language processing with computer vision or speech processing.
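
One common way to use a language model as an encoder is to pool its per-token hidden states into a single vector. The sketch below shows masked mean pooling on plain Python lists; it is only an illustration under stated assumptions, since this card does not specify a pooling strategy. The function name `masked_mean_pool` and the toy vectors are hypothetical, and in practice the hidden states would come from the model itself.

```python
from typing import List, Sequence


def masked_mean_pool(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("one mask entry per token is required")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states"; the last row stands for a padding token.
    states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(masked_mean_pool(states, [1, 1, 0]))
```

The pooled vector could then be fed to a task-specific classifier or regressor. Other choices, such as taking the last token's state, are equally plausible for a decoder-only model.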

Out-of-Scope Use

While our models have undergone fine-tuning using extensive Vietnamese datasets, they may not perform optimally in specialized domains necessitating profound domain expertise, such as medicine, politics, chemistry, etc. We kindly request that you refrain from employing our models for political purposes or any endeavors that may cause harm to individuals or compromise the sovereignty and territorial integrity of Vietnam.

Bias, Risks, and Limitations

Unless required by applicable law, the URA-LLaMa materials and any output and results therefrom are provided on an "as is" basis, without warranties of any kind, either express or implied, including, without limitation, any warranties of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the URA-LLaMa materials and assume any risks associated with your use of the URA-LLaMa materials and any output and results.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. For the model to work well, you may need to perform prompt engineering to create appropriate prompts before inference.

How to Get Started with the Model

Use the code below to get started with the model.

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

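# Generation settings. top_k=1 keeps only the most likely token at each step,
# so decoding is effectively greedy.
pipeline_kwargs = {
    "temperature": 1.0,
    "max_new_tokens": 250,
    "top_k": 1,
    "repetition_penalty": 1.1,
}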
  
if __name__ == "__main__":
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "ura-hcmut/ura-llama-7b-r64",
        device_map="auto"
    )
    model.config.pretraining_tp = 1
    model.eval()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "ura-hcmut/ura-llama-7b-r64", 
        trust_remote_code=True
    )
    tokenizer.pad_token = tokenizer.eos_token
  
    pipeline = transformers.pipeline(
        model=model, 
        tokenizer=tokenizer,
        return_full_text=False,
        task='text-generation',
        **pipeline_kwargs
    )
  
    # LLaMa-2 chat convention: the system block is closed with <</SYS>>.
    query_template = "[INST] <<SYS>> Bạn là một trợ lý thông minh. <</SYS>> Hãy trả lời câu hỏi sau.\nCâu hỏi: {query}\nTrả lời:[/INST] "
  
    while True:
        query = input("Query: ")
        if query == "exit":
            break
      
        query = query_template.format(query=query)
        answer = pipeline(query)[0]["generated_text"]
        print(answer)
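
The template in the starter script can be factored into a small helper when experimenting with prompts. The sketch below is a minimal illustration, not the authors' implementation: the function name `build_prompt`, its parameters, and the way few-shot examples are packed into a single [INST] block are all assumptions. It also assumes the system block is closed with `<</SYS>>`, following the LLaMa-2 chat convention.

```python
from typing import Sequence, Tuple

# Defaults copied from the Vietnamese wording of the starter script.
DEFAULT_SYSTEM = "Bạn là một trợ lý thông minh."
DEFAULT_INSTRUCTION = "Hãy trả lời câu hỏi sau."


def build_prompt(
    query: str,
    examples: Sequence[Tuple[str, str]] = (),
    system: str = DEFAULT_SYSTEM,
    instruction: str = DEFAULT_INSTRUCTION,
) -> str:
    """Return a single-turn prompt with optional few-shot question/answer pairs."""
    # Each demonstration repeats the "Câu hỏi / Trả lời" layout of the final query.
    shots = "".join(f"Câu hỏi: {q}\nTrả lời: {a}\n\n" for q, a in examples)
    return (
        f"[INST] <<SYS>> {system.strip()} <</SYS>> {instruction}\n"
        f"{shots}Câu hỏi: {query.strip()}\nTrả lời:[/INST] "
    )


if __name__ == "__main__":
    demo = build_prompt(
        "Thủ đô của Việt Nam là gì?",
        examples=[("1 + 1 bằng mấy?", "2")],
    )
    print(demo)
```

The returned string can be passed to the pipeline in place of the formatted query in the script above.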

Finetuning Details

Finetuning Data

List of datasets used for finetuning:

Vietnamese Wikipedia: https://huggingface.co/datasets/vietgpt/wikipedia_vi
Binhvq News Corpus: https://huggingface.co/datasets/vietgpt/binhvq_news_vi

Finetuning Procedure

We use the causal language modelling (next-token prediction) procedure to finetune our models. A tutorial is available at https://huggingface.co/docs/transformers/tasks/language_modeling.
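
To make the next-token prediction objective concrete, the sketch below builds (context, target) pairs from a toy token sequence and computes the mean negative log-likelihood that training minimizes. It is a pure-Python illustration, not the training code used for URA-LLaMa; the helper names, the toy token ids, and the uniform "model" are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def next_token_pairs(token_ids: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Pair every non-empty prefix of the sequence with the token that follows it."""
    return [(list(token_ids[:i]), token_ids[i]) for i in range(1, len(token_ids))]


def causal_lm_loss(
    step_probs: Sequence[Dict[int, float]],
    targets: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the target tokens.

    step_probs[i] maps token id -> predicted probability at position i.
    """
    if not targets or len(step_probs) != len(targets):
        raise ValueError("one distribution per target is required")
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, targets)]
    return sum(nll) / len(nll)


if __name__ == "__main__":
    tokens = [5, 8, 2, 9]  # toy token ids, not real vocabulary entries
    pairs = next_token_pairs(tokens)
    targets = [target for _, target in pairs]
    # Toy model that spreads probability uniformly over a 4-token vocabulary.
    uniform = [{5: 0.25, 8: 0.25, 2: 0.25, 9: 0.25} for _ in targets]
    print(pairs)
    print(causal_lm_loss(uniform, targets))  # equals log(4) for the uniform toy model
```

A real model produces these distributions for all positions in one forward pass, with a causal mask ensuring each position only sees the tokens before it.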

Finetuning Hyperparameters

Training regime: BFloat16 mixed precision
Quantization: 4-bit NormalFloat (NF4)
LoRA rank: 64
Batch size: 128
Optimizer: Paged AdamW 32-bit
Learning rate: 1e-5
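
The hyperparameters above resemble a typical QLoRA-style setup. As a rough sketch, they could be expressed as keyword arguments for `transformers.BitsAndBytesConfig`, `peft.LoraConfig`, and `transformers.TrainingArguments`. This is not the authors' training configuration: `lora_alpha`, `lora_dropout`, the per-device batch size, and the use of gradient accumulation to reach a batch size of 128 are assumptions that this card does not state.

```python
# Keyword arguments named after real parameters of transformers.BitsAndBytesConfig,
# peft.LoraConfig and transformers.TrainingArguments. Values marked "assumed" are
# NOT stated in this model card.

QUANTIZATION_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
    "bnb_4bit_compute_dtype": "bfloat16",  # string form of torch.bfloat16
}

LORA_KWARGS = {
    "r": 64,               # LoRA rank from the list above
    "lora_alpha": 16,      # assumed
    "lora_dropout": 0.05,  # assumed
    "task_type": "CAUSAL_LM",
}

PER_DEVICE_BATCH = 4  # assumed; chosen only to illustrate accumulation
NUM_DEVICES = 1       # a single GPU is listed under Environmental Impact


def accumulation_steps(target_batch: int, per_device: int, devices: int) -> int:
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    per_step = per_device * devices
    if target_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of per_device * devices")
    return target_batch // per_step


TRAINING_KWARGS = {
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": accumulation_steps(128, PER_DEVICE_BATCH, NUM_DEVICES),
    "learning_rate": 1e-5,
    "bf16": True,
    "optim": "paged_adamw_32bit",
}

if __name__ == "__main__":
    print(TRAINING_KWARGS)
```

With the libraries installed, each dictionary would be unpacked into the corresponding constructor, for example `BitsAndBytesConfig(**QUANTIZATION_KWARGS)`.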

Evaluation

Our models are tested on various tasks. Details of the evaluation process are coming soon.

Testing Data, Factors & Metrics

Testing Data

[Coming soon]

Factors

Effects of prompt engineering
Effects of few-shot learning
Effects of chain-of-thought
Effects of choice orders
Ability to deal with typos (robustness)
Ability to deal with unfair situations (fairness)
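
As an illustration of how one of these factors might be probed, the sketch below enumerates every ordering of the answer choices for a multiple-choice question and tracks where the correct answer ends up. The evaluation protocol for URA-LLaMa has not been published yet, so everything here, including the function name `choice_order_variants` and the prompt layout, is a hypothetical example rather than the authors' method.

```python
from itertools import permutations
from typing import Iterator, Sequence, Tuple

LABELS = "ABCDEFGH"


def choice_order_variants(
    question: str,
    choices: Sequence[str],
    answer_index: int,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, correct_label) for every ordering of the choices."""
    if len(choices) > len(LABELS):
        raise ValueError("too many choices for the available labels")
    for order in permutations(range(len(choices))):
        lines = [f"{LABELS[pos]}. {choices[idx]}" for pos, idx in enumerate(order)]
        # The correct label follows the correct choice to its new position.
        correct = LABELS[order.index(answer_index)]
        yield question + "\n" + "\n".join(lines), correct


if __name__ == "__main__":
    for prompt, label in choice_order_variants("2 + 2 = ?", ["3", "4", "5"], 1):
        print(prompt, "->", label, end="\n\n")
```

Comparing accuracy across the variants would indicate how sensitive a model is to the position of the correct choice.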

Metrics

[Coming soon]

Results

[Coming soon]

Summary

[Coming soon]

Environmental Impact

Carbon emissions are estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: 1 x RTX6000 24GB
Hours used: ~2000h
Carbon Emitted: ~23 kg CO2 eq.
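
Calculators of this kind generally multiply the energy drawn by the hardware by the carbon intensity of the local electricity grid. The sketch below shows that arithmetic in its simplest form. It is an assumption about the general shape of the estimate, not the calculator's exact implementation, and the inputs in the demo are toy values: this card does not list the power draw or grid intensity behind the figures above.

```python
def estimate_co2_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    if min(power_watts, hours, kg_co2_per_kwh) < 0:
        raise ValueError("inputs must be non-negative")
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * kg_co2_per_kwh


if __name__ == "__main__":
    # Toy inputs only: a 100 W device running for 10 hours on a hypothetical
    # grid emitting 0.4 kg CO2 eq. per kWh.
    print(estimate_co2_kg(power_watts=100.0, hours=10.0, kg_co2_per_kwh=0.4))
```

Reproducing the reported estimate would require the actual hardware power draw and the carbon intensity of the region where training ran.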

Citation

If you use URA-LLaMa materials in your research, please cite our model(s) as below.

BibTeX:

@online{ura-llama,
  author = {Duc Q. Nguyen and Sang T. Truong and Toan D. V. Nguyen and Dong D. Le and Nhi N. Truong and Tho Quan},
  title = {URA-LLaMa: UniveRsal Adapted Large Language Model for Vietnamese},
  year = 2023,
  url = {https://github.com/martinakaduc/ura-llama-public}
}

Model Card Authors

Contact

Mr. Duc Q. Nguyen: nqduc@hcmut.edu.vn
Mr. Sang T. Truong: sttruong@cs.stanford.edu
Assoc. Prof. Tho Quan: qttho@hcmut.edu.vn