CRIA v1.3

💡 Article | 💻 Github | 📔 Colab 1,2

What is CRIA?

krē-ə plural crias. : a baby llama, alpaca, vicuña, or guanaco.

Or, as ChatGPT suggests: "Crafting a Rapid prototype of an Intelligent llm App using open source resources".

The initial objective of the CRIA project is to develop a comprehensive end-to-end chatbot system, starting from the instruction-tuning of a large language model and extending to its deployment on the web using frameworks such as Next.js.

Specifically, we have fine-tuned the llama-2-7b-chat-hf model with QLoRA (4-bit precision) using the mlabonne/CodeLlama-2-20k dataset. This fine-tuned model serves as the backbone for the CRIA chat platform.

📦 Model Release

CRIA v1.3 comes with several variants.

This model was converted from the q4_0 GGML version of CRIA v1.3 using llama.cpp's convert-llama-ggml-to-gguf.py script.

🔧 Training

It was trained on a Google Colab notebook with a T4 GPU and high RAM.
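As a rough illustration of why 4-bit loading matters on a T4, the sketch below estimates the memory taken by the base model weights alone. The parameter count (a nominal 7 billion) and the T4 memory figure (a nominal 16 GB) are assumptions, not values from the training notebook, and the estimate ignores activations, LoRA adapter weights, optimizer state, and quantization overhead.

```python
# Back-of-the-envelope estimate of weight memory only.
# N_PARAMS and T4_MEMORY_GIB are nominal assumptions, not measured values.
GIB = 2**30

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to hold n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / GIB

N_PARAMS = 7e9        # assumption: nominal size of a "7B" model
T4_MEMORY_GIB = 16    # assumption: nominal T4 memory

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
print(f"Headroom on a T4 with 4-bit weights: {T4_MEMORY_GIB - nf4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone would take most of the card's memory, while the 4-bit weights leave room for the rest of the training state.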

Training procedure

The following bitsandbytes quantization config was used during training:

  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: False
  • bnb_4bit_compute_dtype: float16
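The settings listed above map directly onto the keyword arguments of `BitsAndBytesConfig` in transformers. The following is a minimal sketch of how they could be reproduced, not the original training code; the names `BNB_CONFIG_KWARGS` and `build_bnb_config` are illustrative, and the import is deferred so that the dictionary can be inspected without transformers installed.

```python
# Quantization settings copied from the list above.
# BNB_CONFIG_KWARGS and build_bnb_config are hypothetical names for illustration.
BNB_CONFIG_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "float16",  # resolved to torch.float16 by transformers
}

def build_bnb_config():
    """Build a BitsAndBytesConfig from the settings above.

    Requires transformers, torch, and bitsandbytes to be installed.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**BNB_CONFIG_KWARGS)
```

The resulting object would typically be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained` when loading the base model for QLoRA fine-tuning.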

Framework versions

  • PEFT 0.4.0

💻 Usage

This model was converted to MLX format from davzoku/cria-llama2-7b-v1.3.

Use with mlx

# pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("davzoku/cria-llama2-7b-v1.3-q4-mlx")
response = generate(model, tokenizer, prompt="hello", verbose=True)

Original Usage

# pip install transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "davzoku/cria-llama2-7b-v1.3"
prompt = "What is a cria?"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    f'<s>[INST] {prompt} [/INST]',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
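The example above wraps the question in the Llama 2 chat markers by hand. A small helper can make that formatting reusable. This is a sketch only: `build_llama2_prompt` is a hypothetical name, and the optional system-prompt wrapper follows the general Llama 2 chat convention; the source does not say whether CRIA was fine-tuned with system prompts.

```python
from typing import Optional

def build_llama2_prompt(
    user_message: str,
    system_prompt: Optional[str] = None,
    include_bos: bool = True,
) -> str:
    """Format a single-turn prompt in the Llama 2 chat style.

    Hypothetical helper for illustration. Set include_bos=False if the
    tokenizer already prepends the <s> token.
    """
    bos = "<s>" if include_bos else ""
    if system_prompt:
        # Llama 2 chat convention: the system prompt sits inside <<SYS>> tags
        # at the start of the first [INST] block.
        return (
            f"{bos}[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"{bos}[INST] {user_message} [/INST]"

# Reproduces the prompt string used in the example above.
print(build_llama2_prompt("What is a cria?"))
```

The returned string can be passed as the first argument to the `pipeline` call above, or as `prompt` to `generate` in the MLX example.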

References

We'd like to thank:

  • mlabonne for his article and resources on implementing instruction tuning.
  • TheBloke for his script for LLM quantization.

Model size: 1.16B params (Safetensors). Tensor types: FP16, U32.