---
license: apache-2.0
datasets:
- nicholasKluge/portuguese-corpus-v3
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: "Astronomia é uma ciência natural que estuda"
  example_title: Exemplo
- text: "Em um achado chocante, o cientista descobriu um"
  example_title: Exemplo
- text: "Python é uma linguagem de"
  example_title: Exemplo
- text: "O Gato de Botas é conhecido por"
  example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.2
    temperature: 0.2
    top_k: 20
    top_p: 0.2
    max_new_tokens: 150
co2_eq_emissions:
  emissions: 5.6
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---
# TeenyTinyLlama-162m

## Model Summary

Monolingual foundational models for non-English languages are scarce, and many of the models most used and downloaded by the community are those small enough for individual researchers and hobbyists to run in low-resource environments. To address both points, we developed TeenyTinyLlama: _a series of small foundational models trained on Portuguese._

TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). It is designed to deliver efficient natural language processing capabilities while being resource-conscious. The models were trained using [scaling laws](https://arxiv.org/abs/2203.15556) to determine the optimal number of tokens per parameter, and they incorporate [preference pre-training](https://arxiv.org/abs/2112.00861).

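To make the tokens-per-parameter idea concrete, the sketch below recomputes the parameter count and the number of training tokens from the figures listed in the Details and Training Set-up sections of this card. It is an illustration, not the training code: the helper `llama_parameter_count` is hypothetical, and it assumes a Llama-style layer layout with no bias terms, untied input and output embeddings, and full-length training samples.

```python
# Figures copied from this card's Details and Training Set-up sections.
VOCAB_SIZE = 32_000
HIDDEN_SIZE = 768
INTERMEDIATE_SIZE = 3_072
NUM_LAYERS = 12
CONTEXT_LENGTH = 2_048
TRAINING_SAMPLES = 1_831_873


def llama_parameter_count(vocab, hidden, intermediate, layers):
    """Count the weights of a Llama-style decoder (hypothetical helper).

    Assumptions not stated in the card: no bias terms and untied input and
    output embeddings. Key/value heads equal attention heads (12 each per
    the card), so the k and v projections are full-size.
    """
    embeddings = vocab * hidden          # input token embeddings
    lm_head = vocab * hidden             # output projection (assumed untied)
    attention = 4 * hidden * hidden      # q, k, v and o projections
    mlp = 3 * hidden * intermediate      # gate, up and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = hidden                  # RMSNorm before the output projection
    return embeddings + lm_head + layers * per_layer + final_norm


params = llama_parameter_count(VOCAB_SIZE, HIDDEN_SIZE, INTERMEDIATE_SIZE, NUM_LAYERS)

# Assumes every training sample fills the whole context window.
tokens_seen = TRAINING_SAMPLES * CONTEXT_LENGTH

print(f"parameters:           {params:,}")
print(f"training tokens:      {tokens_seen:,}")
print(f"tokens per parameter: {tokens_seen / params:.1f}")
```

Under these assumptions the count reproduces the 162,417,408 parameters listed under Details, the token total is consistent with the roughly 3.7B tokens reported there, and the ratio comes out at about 23 tokens per parameter.
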
## Details

- **Architecture:** a Transformer-based model pre-trained via causal language modeling
- **Size:** 162,417,408 parameters
- **Context length:** 2048 tokens
- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3) (6.2B tokens)
- **Language:** Portuguese
- **Number of steps:** 457,969 (3.7B tokens)
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~36 hours
- **Emissions:** 5.6 KgCO2eq (Germany)
- **Total energy consumption:** 15.5 kWh

The [source code](https://github.com/Nkluge-correa/Aira) used to train this model is available on GitHub. The main libraries used are:

- [Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://github.com/pytorch/pytorch)
- [Datasets](https://github.com/huggingface/datasets)
- [Tokenizers](https://github.com/huggingface/tokenizers)
- [Accelerate](https://github.com/huggingface/accelerate)
- [Codecarbon](https://github.com/mlco2/codecarbon)

## Training Set-up

These are the main arguments used to train this model:

| Arguments                     | Value                                |
|-------------------------------|--------------------------------------|
| vocabulary size               | 32000                                |
| hidden dimension size         | 768                                  |
| intermediate dimension size   | 3072                                 |
| context length                | 2048                                 |
| number of attention heads     | 12                                   |
| number of hidden layers       | 12                                   |
| number of key value heads     | 12                                   |
| number of training samples    | 1831873                              |
| number of validation samples  | 18000                                |
| number of epochs              | 1                                    |
| evaluation steps              | 100000                               |
| train batch size              | 4                                    |
| eval batch size               | 4                                    |
| gradient accumulation steps   | 1                                    |
| optimizer                     | torch.optim.AdamW                    |
| learning rate                 | 0.0006                               |
| adam epsilon                  | 0.00000001                           |
| weight decay                  | 0.01                                 |
| scheduler type                | "cosine"                             |
| warmup ratio                  | 0.01                                 |
| gradient checkpointing        | false                                |
| seed                          | 42                                   |
| mixed precision               | 'no'                                 |
| torch dtype                   | "float32"                            |
| tf32                          | true                                 |

## Basic usage

Using the `pipeline`:

```python
from transformers import pipeline

generator = pipeline("text-generation",
    model="nicholasKluge/Teeny-tiny-llama-162m")

# Sampling is enabled so that more than one sequence can be returned
# without beam search
completions = generator("Astronomia é a ciência", do_sample=True,
    num_return_sequences=2, max_new_tokens=100)

for comp in completions:
    print(f"🤖 {comp['generated_text']}")
```

Using the `AutoTokenizer` and `AutoModelForCausalLM`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')

# Pass the model to your device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model.to(device)

# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)

# Generate some text (sampling is enabled so that more than one
# sequence can be returned without beam search)
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)

# Print the generated text
for completion in completions:
    print(f'🤖 {tokenizer.decode(completion)}')
```

## Limitations

- **Hallucinations:** This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.

- **Biases and Toxicity:** This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., content that is harmful, offensive, or detrimental to individuals, groups, or communities.

- **Unreliable Code:** The model may produce incorrect code snippets and statements. Generated code should not be treated as a suggestion or an accurate solution.

- **Language Limitations:** The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in its responses.

- **Repetition and Verbosity:** The model may get stuck in repetition loops (especially if the repetition penalty during generation is set to a low value) or produce verbose responses unrelated to the prompt it was given.

## Evaluations

| Steps   | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions     |
|---------|-----------------|------------|--------------------------|---------------|
| 100,000 | 3.19            | 24.52      | 3.75 kWh                 | 1.28 KgCO2eq  |
| 200,000 | 3.02            | 20.58      | 7.51 kWh                 | 2.56 KgCO2eq  |
| 300,000 | 2.83            | 16.98      | 11.25 kWh                | 3.84 KgCO2eq  |
| 400,000 | 2.79            | 16.41      | 14.52 kWh                | 5.11 KgCO2eq  |

## Benchmarks

| Models                                                                              | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|-------------------------------------------------------------------------------------|---------|-----------------------------------------|-----------------------------------------------|------------------------------------------|------------------------------------------------|
| [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m)     | 31.16   | 26.15                                   | 29.29                                         | 28.11                                    | 41.12                                          |
| [Pythia-160m](https://huggingface.co/EleutherAI/pythia-160m-deduped)                | 31.16   | 24.06                                   | 31.39                                         | 24.86                                    | 44.34                                          |
| [OPT-125m](https://huggingface.co/facebook/opt-125m)                                | 30.80   | 22.87                                   | 31.47                                         | 26.02                                    | 42.87                                          |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 30.22   | 22.48                                   | 29.62                                         | 27.36                                    | 41.44                                          |
| [Gpt2-small](https://huggingface.co/gpt2)                                           | 29.97   | 21.48                                   | 31.60                                         | 25.79                                    | 40.65                                          |

* Benchmark evaluations were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness.

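The perplexity column of the Evaluations table can be related to the evaluation loss with a short sketch. The card does not say how perplexity was computed; the code below assumes the standard definition, the exponential of the mean per-token cross-entropy loss in nats. Because the losses in the table are shown to two decimals, the recomputed values are close to, but not identical with, the reported ones.

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Exponential of the mean per-token cross-entropy loss (in nats).

    This is the standard definition, assumed here; the card does not
    state the exact formula used for its Perplexity column.
    """
    return math.exp(mean_cross_entropy_loss)


# (steps, evaluation loss) pairs copied from the Evaluations table.
checkpoints = [(100_000, 3.19), (200_000, 3.02), (300_000, 2.83), (400_000, 2.79)]

for steps, loss in checkpoints:
    print(f"{steps:>7,} steps | loss {loss:.2f} | perplexity ~ {perplexity(loss):.2f}")
```

A lower loss always maps to a lower perplexity under this definition, which matches the downward trend of both columns in the table.
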
## Fine Tuning

| Models                                                                                     | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) | [FaQuAD-NLI](https://huggingface.co/datasets/ruanchaves/faquad-nli) | [HateBr](https://huggingface.co/datasets/ruanchaves/hatebr) |
|--------------------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------|-------------------------------------------------------------|
| [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m)            | 91.14                                                      | 90.00                                                               | 90.71                                                       |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.22                                                      | 93.07                                                               | 91.28                                                       |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)        | 91.60                                                      | 86.46                                                               | 87.42                                                       |

## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny Tiny Llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## License

TeenyTinyLlama-162m is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.