metadata
license: apache-2.0
datasets:
  - nicholasKluge/portuguese-corpus-v3
language:
  - pt
metrics:
  - perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
  - text-generation-inference
widget:
  - text: Astronomia é uma ciência natural que estuda
    example_title: Exemplo
  - text: Em um achado chocante, o cientista descobriu um
    example_title: Exemplo
  - text: Python é uma linguagem de
    example_title: Exemplo
  - text: O Gato de Botas é conhecido por
    example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.2
    temperature: 0.2
    top_k: 20
    top_p: 0.2
    max_new_tokens: 150
co2_eq_emissions:
  emissions: 5.6
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB

TeenyTinyLlama-162m

A little llama wearing a mushroom hat and a monocle.

Model Summary

Monolingual foundational models are scarce for languages other than English, and many of the most used and downloaded models in the community are those small enough for individual researchers and hobbyists to run in low-resource environments. With both points in mind, we developed TeenyTinyLlama: a series of small foundational models trained on Portuguese.

TeenyTinyLlama is a compact language model based on the Llama 2 architecture (TinyLlama implementation). This model is designed to deliver efficient natural language processing capabilities while being resource-conscious.

These models were trained using scaling laws to determine the optimal number of tokens per parameter, while incorporating preference pre-training.

Details

  • Architecture: a Transformer-based model pre-trained via causal language modeling
  • Size: 162,417,408 parameters
  • Context length: 2048 tokens
  • Dataset: Portuguese-Corpus-v3 (6.2B tokens)
  • Language: Portuguese
  • Number of steps: 457,969 (3.7B tokens)
  • GPU: 1 NVIDIA A100-SXM4-40GB
  • Training time: ~36 hours
  • Emissions: 5.6 kg CO2eq (Germany)
  • Total energy consumption: 15.5 kWh
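The figures in this list can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the training code: it assumes every training sample is a full 2048-token sequence and uses the batch size of 4 and the sample count listed under Training Set-up. The variable names are illustrative.

```python
import math

# Figures quoted in this model card
N_PARAMS = 162_417_408      # model size
STEPS = 457_969             # optimizer steps
BATCH_SIZE = 4              # train batch size (gradient accumulation = 1)
CONTEXT_LENGTH = 2048       # tokens per sequence (assumed fully packed)
TRAIN_SAMPLES = 1_831_873   # number of training samples
TRAIN_HOURS = 36            # approximate training time
ENERGY_KWH = 15.5           # total energy consumption
EMISSIONS_KG = 5.6          # total emissions in kg CO2eq

# One epoch over the training samples, rounding the last partial batch up
steps_for_one_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

# Tokens processed during training and the resulting tokens-per-parameter ratio
tokens_seen = STEPS * BATCH_SIZE * CONTEXT_LENGTH
tokens_per_param = tokens_seen / N_PARAMS

# Average power draw and carbon intensity implied by the reported totals
avg_power_kw = ENERGY_KWH / TRAIN_HOURS
carbon_intensity = EMISSIONS_KG / ENERGY_KWH

print(f"steps for one epoch: {steps_for_one_epoch:,}")
print(f"tokens seen:         {tokens_seen:,}")
print(f"tokens / parameter:  {tokens_per_param:.1f}")
print(f"average power:       {avg_power_kw:.2f} kW")
print(f"carbon intensity:    {carbon_intensity:.2f} kg CO2eq per kWh")
```

Under these assumptions the step count matches one epoch over the training samples, and the token total is consistent with the 3.7B tokens quoted above, at roughly 23 tokens per parameter.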

This repository has the source code used to train this model. The main libraries used are:

Training Set-up

These are the main arguments used in the training of this model:

| Arguments | Value |
|---|---|
| vocabulary size | 32000 |
| hidden dimension size | 768 |
| intermediate dimension size | 3072 |
| context length | 2048 |
| number of attention heads | 12 |
| number of hidden layers | 12 |
| number of key value heads | 12 |
| number of training samples | 1831873 |
| number of validation samples | 18000 |
| number of epochs | 1 |
| evaluation steps | 100000 |
| train batch size | 4 |
| eval batch size | 4 |
| gradient accumulation steps | 1 |
| optimizer | torch.optim.AdamW |
| learning rate | 0.0006 |
| adam epsilon | 0.00000001 |
| weight decay | 0.01 |
| scheduler type | "cosine" |
| warmup ratio | 0.01 |
| gradient checkpointing | false |
| seed | 42 |
| mixed precision | 'no' |
| torch dtype | "float32" |
| tf32 | true |
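The architecture arguments above are enough to estimate the parameter count by hand. The sketch below assumes the usual Llama layout (bias-free attention and MLP projections, a gated MLP with three weight matrices, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embeddings). The function `llama_param_count` is a hypothetical helper written for this illustration, not part of any library.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    intermediate_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    tie_embeddings: bool = False,
) -> int:
    """Count the weights of a Llama-style decoder (assumed layout, no biases)."""
    head_dim = hidden_size // num_heads
    kv_dim = head_dim * num_kv_heads

    # Attention: query and output projections are hidden x hidden,
    # key and value projections are hidden x kv_dim
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim

    # Gated MLP: gate, up, and down projections
    mlp = 3 * hidden_size * intermediate_size

    # Two normalization weight vectors per layer
    norms = 2 * hidden_size

    per_layer = attention + mlp + norms
    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size

    return embeddings + num_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_layers=12,
    num_heads=12,
    num_kv_heads=12,
)
print(f"{total:,}")  # 162,417,408
```

With these assumptions the count reproduces the 162,417,408 parameters stated under Details. About 49 million of them sit in the input embeddings and the output head.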

Basic usage

Using the pipeline:

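from transformers import pipeline

# Load a text-generation pipeline for the model
generator = pipeline("text-generation", model="nicholasKluge/Teeny-tiny-llama-162m")

# do_sample=True is set explicitly: greedy decoding cannot return more than one sequence
completions = generator("Astronomia é a ciência", do_sample=True, num_return_sequences=2, max_new_tokens=100)

for comp in completions:
    print(f"🤖 {comp['generated_text']}")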

Using the AutoTokenizer and AutoModelForCausalLM:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')

# Pick the device, switch the model to evaluation mode, and move it to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model.to(device)

# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)

# Generate some text (do_sample=True is needed to return more than one sequence)
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)

# Print the generated text without special tokens
for completion in completions:
    print(f'🤖 {tokenizer.decode(completion, skip_special_tokens=True)}')
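The inference parameters listed in this card's metadata (repetition penalty 1.2, temperature 0.2, top-k 20, top-p 0.2, 150 new tokens) can also be passed as generation arguments. The sketch below is one possible way to bundle them. `CARD_SAMPLING_KWARGS` and `generate_with_card_defaults` are hypothetical names, and `do_sample=True` is an assumption added here so that the temperature, top-k, and top-p settings take effect.

```python
# Generation settings copied from the `inference.parameters` block of the metadata.
# `do_sample=True` is an assumption: without sampling, temperature/top_k/top_p are ignored.
CARD_SAMPLING_KWARGS = {
    "do_sample": True,
    "repetition_penalty": 1.2,
    "temperature": 0.2,
    "top_k": 20,
    "top_p": 0.2,
    "max_new_tokens": 150,
}


def generate_with_card_defaults(generator, prompt, **overrides):
    """Call a text-generation pipeline with the card's settings (hypothetical helper).

    `generator` is any callable that behaves like a transformers text-generation
    pipeline: it takes a prompt plus keyword arguments and returns a list of
    dicts with a "generated_text" key.
    """
    kwargs = {**CARD_SAMPLING_KWARGS, **overrides}
    outputs = generator(prompt, **kwargs)
    return [output["generated_text"] for output in outputs]
```

With the `generator` pipeline created in the first example, `generate_with_card_defaults(generator, "Python é uma linguagem de")` returns a list of completions. Individual settings can be overridden per call, for example `repetition_penalty=1.3`.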

Limitations

  • Hallucinations: This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.

  • Biases and Toxicity: This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., harmful, offensive, or detrimental to individuals, groups, or communities.

  • Unreliable Code: The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.

  • Language Limitations: The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in response.

  • Repetition and Verbosity: The model may get stuck in repetition loops (especially if the repetition penalty during generation is set too low) or produce verbose responses unrelated to the prompt it was given.

Evaluations

| Steps | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions |
|---|---|---|---|---|
| 100,000 | 3.19 | 24.52 | 3.75 kWh | 1.28 kg CO2eq |
| 200,000 | 3.02 | 20.58 | 7.51 kWh | 2.56 kg CO2eq |
| 300,000 | 2.83 | 16.98 | 11.25 kWh | 3.84 kg CO2eq |
| 400,000 | 2.79 | 16.41 | 14.52 kWh | 5.11 kg CO2eq |
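Perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below applies that relation to the evaluation losses in the table, assuming the loss is reported in nats. The results land close to the reported perplexities but do not match them exactly, which could come from rounding of the loss or from how the values were averaged during evaluation.

```python
import math

# (steps, evaluation loss, reported perplexity) taken from the table above
reported = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]


def perplexity(loss: float) -> float:
    """Perplexity as the exponential of a mean cross-entropy loss (in nats)."""
    return math.exp(loss)


for steps, loss, reported_ppl in reported:
    print(f"{steps:>7,} steps: exp({loss}) = {perplexity(loss):.2f} (reported: {reported_ppl})")
```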

Benchmarks

| Models | Average | ARC | Hellaswag | MMLU | TruthfulQA |
|---|---|---|---|---|---|
| TeenyTinyLlama-162m | 31.16 | 26.15 | 29.29 | 28.11 | 41.12 |
| Pythia-160m | 31.16 | 24.06 | 31.39 | 24.86 | 44.34 |
| OPT-125m | 30.80 | 22.87 | 31.47 | 26.02 | 42.87 |
| Gpt2-portuguese-small | 30.22 | 22.48 | 29.62 | 27.36 | 41.44 |
| Gpt2-small | 29.97 | 21.48 | 31.60 | 25.79 | 40.65 |

Fine Tuning

| Models | IMDB | FaQuAD-NLI | HateBr |
|---|---|---|---|
| TeenyTinyLlama-162m | 91.14 | 90.00 | 90.71 |
| Bert-base-portuguese-cased | 92.22 | 93.07 | 91.28 |
| Gpt2-small-portuguese | 91.60 | 86.46 | 87.42 |

Cite as 🤗


@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny Tiny Llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}

License

TeenyTinyLlama-162m is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.