---
license: apache-2.0
datasets:
- nicholasKluge/portuguese-corpus-v3
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: Astronomia é uma ciência natural que estuda
  example_title: Exemplo
- text: Em um achado chocante, o cientista descobriu um
  example_title: Exemplo
- text: Python é uma linguagem de
  example_title: Exemplo
- text: O Gato de Schrödinger é uma experiência mental
  example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.5
    temperature: 0.3
    top_k: 30
    top_p: 0.3
    max_new_tokens: 200
co2_eq_emissions:
  emissions: 5.6
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---

# Teeny-tiny-llama-162m (Portuguese)

Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture ([Tiny-llama implementation](https://huggingface.co/TinyLlama)). It is designed to deliver efficient natural language processing in Brazilian Portuguese (Portuguese-BR) while remaining resource-conscious.

Teeny-tiny-llama was trained by using scaling laws to determine the optimal number of tokens per parameter, while incorporating preference pre-training.

- **Compact Design:** Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.
- **Optimized Scaling:** The model was pre-trained using scaling laws to identify the ideal token-to-parameter ratio.
- **Custom Portuguese Dataset:** Teeny-tiny-llama was trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better capture Portuguese language nuances and making it better suited for fine-tuning tasks such as instruction-tuning.

This repository has 21 checkpoints, saved as revisions, that were logged during the model's training.
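As a rough sanity check on the token-to-parameter ratio and the checkpoint count mentioned above, the sketch below redoes the arithmetic with figures taken from this card's Details and Training Set-up sections. It assumes that every training sample is a full block of `block_size` tokens and that one optimizer step consumes one batch; the function name `training_budget` is ours and is not part of the training code.

```python
import math

# Figures copied from the Details and Training Set-up sections of this card.
N_PARAMETERS = 162_417_408
TRAIN_NUM_SAMPLES = 1_831_873
BLOCK_SIZE = 2048
BATCH_SIZE = 4
CHECKPOINTING_STEPS = 22_000


def training_budget(n_samples, block_size, batch_size, n_parameters):
    """Return (steps, tokens, tokens per parameter) for a single epoch."""
    # One optimizer step per batch (gradient_accumulation_steps = 1).
    steps = math.ceil(n_samples / batch_size)
    # Assumption: every sample holds exactly `block_size` tokens.
    tokens = n_samples * block_size
    return steps, tokens, tokens / n_parameters


steps, tokens, ratio = training_budget(
    TRAIN_NUM_SAMPLES, BLOCK_SIZE, BATCH_SIZE, N_PARAMETERS
)
# Checkpoints written every CHECKPOINTING_STEPS steps during training.
intermediate_checkpoints = steps // CHECKPOINTING_STEPS

print(f"steps: {steps:,}")                                  # steps: 457,969
print(f"tokens: {tokens:,}")                                # tokens: 3,751,675,904
print(f"tokens per parameter: {ratio:.1f}")                 # tokens per parameter: 23.1
print(f"intermediate checkpoints: {intermediate_checkpoints}")  # intermediate checkpoints: 20
```

Under these assumptions the step count reproduces the 457,969 steps listed in the Details section. Counting the final weights as one more revision on top of the 20 intermediate checkpoints would give 21, matching the count above, although that accounting is a guess and not something the card states.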
## Details

- **Size:** 162,417,408 parameters
- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3)
- **Language:** Portuguese
- **Number of steps:** 457,969
- **Batch size:** 4
- **Optimizer:** `torch.optim.AdamW` (warmup_ratio = 0.01, learning_rate = 6e-4, epsilon = 1e-8)
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~36 hours
- **Emissions:** 5.6 kg CO2 (Germany)
- **Total Energy Consumption:** 15.5 kWh

The [source code](https://github.com/Nkluge-correa/Aira) used to train this model is available on GitHub.

## Training Set-up

| Section        | Setting                     | Value                                |
|----------------|-----------------------------|--------------------------------------|
| Model args.    | vocab_size                  | 32000                                |
|                | hidden_size                 | 768                                  |
|                | intermediate_size           | 3072                                 |
|                | max_position_embeddings     | 2048                                 |
|                | num_attention_heads         | 12                                   |
|                | num_hidden_layers           | 12                                   |
|                | num_key_value_heads         | 12                                   |
|                | torch_dtype                 | "float32"                            |
| Data args.     | dataset_name                | "nicholasKluge/portuguese-corpus-v3" |
|                | dataset_split               | "train"                              |
|                | train_num_samples           | 1831873                              |
|                | val_num_samples             | 18000                                |
|                | block_size                  | 2048                                 |
| Training args. | evaluation_strategy         | "steps"                              |
|                | eval_steps                  | 100000                               |
|                | per_device_train_batch_size | 4                                    |
|                | per_device_eval_batch_size  | 4                                    |
|                | gradient_accumulation_steps | 1                                    |
|                | learning_rate               | 0.0006                               |
|                | adam_epsilon                | 0.00000001                           |
|                | weight_decay                | 0.01                                 |
|                | lr_scheduler_type           | "cosine"                             |
|                | warmup_ratio                | 0.01                                 |
|                | num_train_epochs            | 1                                    |
|                | gradient_checkpointing      | false                                |
|                | seed                        | 42                                   |
|                | mixed_precision             | 'no'                                 |
|                | checkpointing_steps         | 22000                                |
|                | tf32                        | true                                 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m", revision='main')

# Pass the model to your device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model.to(device)

# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)

# Generate some text. Sampling is enabled because greedy decoding
# cannot return more than one sequence.
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)

# Print the generated text
for completion in completions:
    print(f'🤖 {tokenizer.decode(completion)}')

# Sample output (will vary between runs):
# 🤖 Astronomia é a ciência que estuda o universo e as leis da física e suas relações com os fenômenos naturais e seus efeitos sobre o meio ambiente e o homem. A astronomia é uma disciplina científica que se dedica à investigação de fenômenos astronômicos e ao estudo das propriedades dos objetos celestes.
```

## Limitations

🤥 Generative AI models, like LLMs used for text generation/conversation or GANs for image generation, can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, given the model's tendency to output hallucinations.
Such models can generate deceptive visuals, human-like textual content, music, or combined media that might seem genuine at first glance.

🤬 Machine learning systems can inherit social and historical stereotypes from the data used to train them. Given these biases, models can be prone to producing toxic content, i.e., text, images, videos, or comments that are harmful, offensive, or detrimental to individuals, groups, or communities. Models that automate decision-making can also be biased against certain groups, unjustly affecting people based on sensitive attributes.

## Evaluations

| Steps   | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions     |
|---------|-----------------|------------|--------------------------|---------------|
| 100,000 | 3.19            | 24.52      | 3.75 kWh                 | 1.28 kg CO2eq |
| 200,000 | 3.02            | 20.58      | 7.51 kWh                 | 2.56 kg CO2eq |
| 300,000 | 2.83            | 16.98      | 11.25 kWh                | 3.84 kg CO2eq |
| 400,000 | 2.79            | 16.41      | 14.52 kWh                | 5.11 kg CO2eq |

## Benchmarks

| Models                                                                              | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|-------------------------------------------------------------------------------------|---------|-----------------------------------------|-----------------------------------------------|------------------------------------------|------------------------------------------------|
| [Gpt2-portuguese-small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 30.22   | 22.48                                   | 29.62                                         | 27.36                                    | 41.44                                          |

* Evaluations on benchmarks were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness.
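The Evaluations table reports both an evaluation loss and a perplexity for each logged step. The sketch below recomputes perplexity from the loss, assuming the usual convention that perplexity is the exponential of the mean cross-entropy loss in nats; the card does not state which formula was used, so treat this as a consistency check only. The values are copied from the table, and the helper name `perplexity` is ours.

```python
import math

# (step, evaluation loss, reported perplexity), copied from the Evaluations table.
EVALUATIONS = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]


def perplexity(loss):
    """Perplexity under the assumed convention: exp of the mean loss in nats."""
    return math.exp(loss)


for step, loss, reported in EVALUATIONS:
    recomputed = perplexity(loss)
    # Relative gap between the recomputed and the reported perplexity.
    gap = abs(recomputed - reported) / reported
    print(f"{step:>7,}  loss={loss:.2f}  exp(loss)={recomputed:.2f}  "
          f"reported={reported:.2f}  gap={gap:.1%}")
```

The recomputed values agree with the reported ones only approximately, which is expected because the losses in the table are shown with two decimals while the perplexities were presumably computed from the unrounded loss.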
## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/Teeny-tiny-llama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny-tiny-llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## License

The `Teeny-tiny-llama-162m` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.