metadata

license: apache-2.0
datasets:
  - cerebras/SlimPajama-627B
  - bigcode/starcoderdata
  - OpenAssistant/oasst_top1_2023-08-25
language:
  - en

GGUF Quantized version of TinyLlama on Sept 27th 2023

The model is not completed training yet, but still performs well.

This GGUF model is for inference with Llama.cpp

Original repo details below, from here

TinyLlama-1.1B

https://github.com/jzhang38/TinyLlama

The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.

We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.

This Model

This is the chat model finetuned on PY007/TinyLlama-1.1B-intermediate-step-240k-503b. The dataset used is OpenAssistant/oasst_top1_2023-08-25.

Update from V0.1: 1. Different dataset. 2. Different chat format (now chatml formatted conversations).

How to use

You will need the transformers>=4.31 Do check the TinyLlama github page for more information.

from transformers import AutoTokenizer
import transformers 
import torch
model = "PY007/TinyLlama-1.1B-Chat-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
prompt = "How to get in a good university?"
formatted_prompt = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)
sequences = pipeline(
    formatted_prompt,
    do_sample=True,
    top_k=50,
    top_p = 0.9,
    num_return_sequences=1,
    repetition_penalty=1.1,
    max_new_tokens=1024,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")