---
license: apache-2.0
datasets:
- cnmoro/Instruct-PTBR-ENUS-11M
- graelo/wikipedia
- uonlp/CulturaX
- pablo-moreira/gpt4all-j-prompt-generations-pt
- eduagarcia/OSCAR-2301-pt_dedup
- eduagarcia/cc100-pt
- iara-project/news-articles-ptbr-dataset
- MBZUAI/Bactrian-X
- Gustrd/dolly-15k-libretranslate-pt
- heloisy/cosmos_qa_ptbr
- maritaca-ai/imdb_pt
- squad_v1_pt
- celsowm/conjur_artigos
- celsowm/ambito_juridico_artigos
- arubenruben/cnn_dailymail_azure_pt_pt
- bigscience-data/roots_pt_wikiquote
- bigscience-data/roots_pt_ted_talks_iwslt
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: "Astronomia é uma ciência natural que estuda"
  example_title: Exemplo
- text: "Em um achado chocante, o cientista descobriu um"
  example_title: Exemplo
- text: "Python é uma linguagem de"
  example_title: Exemplo
- text: "O Gato de Schrödinger é uma experiência mental"
  example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.5
    temperature: 0.5
    top_k: 50
    top_p: 0.5
    max_new_tokens: 200
co2_eq_emissions:
  emissions: 15000
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---
# Teeny-tiny-llama-162m (Portuguese)

<img src="https://github.com/Nkluge-correa/Aira/blob/master/Teeny-tiny-llama/logo/logo-round.png" alt="A little llama wearing a mushroom hat and a monocle." height="400">

Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture ([Tiny-llama implementation](https://huggingface.co/TinyLlama)). It is designed for efficient natural language processing in Brazilian Portuguese while keeping compute and memory requirements low.

Teeny-tiny-llama was trained with a token budget chosen using scaling laws, which estimate the optimal number of tokens per parameter, and its training data incorporates preference pre-training.

## Features

- **Compact Design:** Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.

- **Optimized Scaling:** The model has been pre-trained using scaling laws to identify the ideal token-to-parameter ratio.

- **Custom Portuguese Dataset:** Teeny-tiny-llama has been trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better cater to Portuguese language nuances and be better suited for fine-tuning tasks like instruction-tuning.
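
The token-to-parameter ratio mentioned above can be estimated from numbers reported elsewhere in this card (`train_num_samples`, `block_size`, the batch size, and the rounded 162M parameter count). The sketch below is a back-of-the-envelope check, not the author's actual procedure. It assumes every training sample is a full 2048-token block, which the card does not state explicitly, and the variable names are illustrative.

```python
import math

# Figures taken from the Details and Training Set-up sections of this card.
train_num_samples = 1_831_873   # number of training samples
batch_size = 4                  # per_device_train_batch_size on a single GPU
block_size = 2048               # tokens per sample (assumption: all blocks are full)
n_params = 162_000_000          # rounded parameter count

# One epoch with no gradient accumulation: one optimizer step per batch.
steps = math.ceil(train_num_samples / batch_size)

# Upper bound on tokens seen during training under the full-block assumption.
total_tokens = train_num_samples * block_size

tokens_per_param = total_tokens / n_params

print(f"optimizer steps:  {steps:,}")
print(f"training tokens:  {total_tokens:,}")
print(f"tokens per param: {tokens_per_param:.1f}")
```

Under these assumptions the step count matches the 457969 steps reported in this card, and the budget comes to about 3.75 billion tokens, or roughly 23 tokens per parameter.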

## Details

- **Size:** 162 million parameters
- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3)
- **Language:** Portuguese
- **Number of steps:** 457969
- **Batch size:** 4
- **Optimizer:** `torch.optim.AdamW` (warmup_ratio = 0.01, learning_rate = 6e-4, epsilon = 1e-8)
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~36 hours
- **Emissions:** 15 kg CO2 (Germany)
- **Total Energy Consumption:** 42 kWh
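
The reported size can be cross-checked against the model arguments listed in the Training Set-up table. The sketch below counts weights for a standard Llama-style decoder under several assumptions that this card does not confirm: no bias terms, a gated MLP with three projections, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embeddings. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    intermediate_size: int,
    num_hidden_layers: int,
    num_attention_heads: int,
    num_key_value_heads: int,
    tie_embeddings: bool = False,  # assumption: untied output head
) -> int:
    """Count the weights of a Llama-style decoder (no biases assumed)."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = head_dim * num_key_value_heads

    embeddings = vocab_size * hidden_size
    # Query and output projections are hidden x hidden; key and value are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gate, up, and down projections of the gated MLP.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    per_layer = attention + mlp + norms
    final_norm = hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size

    return embeddings + num_hidden_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=12,
)
print(f"{total:,} parameters")
```

With these assumptions the count lands at roughly 162.4 million, in line with the size stated above.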

The [source code](https://github.com/Nkluge-correa/Aira) used to train this model is available on GitHub.

## Training Set-up

| Section        | Setting                     | Value                                |
|----------------|-----------------------------|--------------------------------------|
| Model args.    | vocab_size                  | 32000                                |
|                | hidden_size                 | 768                                  |
|                | intermediate_size           | 3072                                 |
|                | max_position_embeddings     | 2048                                 |
|                | num_attention_heads         | 12                                   |
|                | num_hidden_layers           | 12                                   |
|                | num_key_value_heads         | 12                                   |
|                | torch_dtype                 | "float32" *                          |
| Data args.     | dataset_name                | "nicholasKluge/portuguese-corpus-v3" |
|                | dataset_split               | "train"                              |
|                | train_num_samples           | 1831873                              |
|                | val_num_samples             | 18000                                |
|                | block_size                  | 2048                                 |
| Training args. | evaluation_strategy         | "steps"                              |
|                | eval_steps                  | 100000                               |
|                | per_device_train_batch_size | 4                                    |
|                | per_device_eval_batch_size  | 4                                    |
|                | gradient_accumulation_steps | 1                                    |
|                | learning_rate               | 0.0006                               |
|                | adam_epsilon                | 0.00000001                           |
|                | weight_decay                | 0.01                                 |
|                | lr_scheduler_type           | "cosine"                             |
|                | warmup_ratio                | 0.01                                 |
|                | num_train_epochs            | 1                                    |
|                | gradient_checkpointing      | false                                |
|                | seed                        | 42                                   |
|                | wandb_log_steps             | 1                                    |
|                | mixed_precision             | 'no'                                 |
|                | checkpointing_steps         | 22000                                |

* With `tf32` enabled during training.
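
To make the `learning_rate`, `warmup_ratio`, and `lr_scheduler_type` settings concrete, here is a minimal pure-Python sketch of a cosine schedule with linear warmup. It assumes the same shape as `get_cosine_schedule_with_warmup` in `transformers` (linear warmup from zero, then half a cosine cycle down to zero) and that warmup steps are the rounded-up product of the ratio and the total steps. The actual training script may differ, and `lr_at` is a hypothetical helper.

```python
import math

TOTAL_STEPS = 457_969   # number of steps reported in this card
PEAK_LR = 6e-4          # learning_rate
WARMUP_RATIO = 0.01     # warmup_ratio

# Assumption: warmup length is ceil(ratio * total steps).
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks after about 4,580 steps and decays smoothly for the rest of the single epoch.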

## Usage

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nicholasKluge/Teeny-tiny-llama-162m")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
```
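
The snippet above only loads the model. As a hedged illustration of text generation, the sketch below reuses the sampling values from the `inference.parameters` block in this card's metadata. Enabling `do_sample` is an assumption (the metadata does not list it, but temperature, `top_k`, and `top_p` only take effect when sampling), and the `generate` helper is hypothetical.

```python
# Sampling settings copied from the `inference.parameters` metadata of this card.
GENERATION_KWARGS = {
    "do_sample": True,          # assumption: sampling must be on for the values below to apply
    "temperature": 0.5,
    "top_k": 50,
    "top_p": 0.5,
    "repetition_penalty": 1.5,
    "max_new_tokens": 200,
}


def generate(prompt: str, model_id: str = "nicholasKluge/Teeny-tiny-llama-162m") -> str:
    """Generate a continuation for `prompt` (hypothetical helper, downloads the model)."""
    # Imported lazily so that defining this helper does not require the model download.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_id)
    outputs = pipe(prompt, **GENERATION_KWARGS)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

For example, `print(generate("Astronomia é uma ciência natural que estuda"))` would continue one of the widget prompts listed in the metadata. Because sampling is enabled, the output changes from run to run.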

## Limitations

🤥 Generative AI models, like LLMs used for text generation/conversation or GANs for image generation, can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, given the model's tendency to output hallucinations. Such models can generate deceptive visuals, human-like textual content, music, or combined media that might seem genuine at first glance.

🤬 Machine learning systems can inherit social and historical stereotypes from the data used to train them. Given these biases, models can be prone to producing toxic content, i.e., text, images, videos, or comments that are harmful, offensive, or detrimental to individuals, groups, or communities. Also, models that automate decision-making can be biased against certain groups, unjustly affecting people based on sensitive attributes.

## Evaluations

| Models                                                                              | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|-------------------------------------------------------------------------------------|---------|-----------------------------------------|-----------------------------------------------|------------------------------------------|------------------------------------------------|
| [Gpt2-portuguese-small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 30.22   | 22.48 $\pm$ 0.01                        | 29.62 $\pm$ 0.00                              | 27.36 $\pm$ 0.00                         | 41.44 $\pm$ 0.01                               |

* Evaluations were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness.

| Steps   | Evaluation Loss | Perplexity | Total Energy Consumption |
|---------|-----------------|------------|--------------------------|
| 100,000 | 3.19            | 24.52      | 3.75 kWh                 |
| 200,000 | 3.02            | 20.58      | 7.51 kWh                 |
| 300,000 | 2.83            | 16.98      | 11.25 kWh                |
| 400,000 | 2.79            | 16.41      | 30.20 kWh                |
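
The perplexity column is related to the evaluation loss by the usual definition, perplexity = exp(loss), assuming the loss is the mean cross-entropy per token in nats. The short sketch below recomputes perplexity from the two-decimal losses in the table. Small differences from the reported values are expected, since the printed losses are rounded or truncated.

```python
import math

# (steps, evaluation loss, reported perplexity) copied from the table above.
reported = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]


def perplexity(loss: float) -> float:
    """Perplexity of a model whose mean per-token cross-entropy (in nats) is `loss`."""
    return math.exp(loss)


for steps, loss, ppl in reported:
    print(f"{steps:>7,} steps: exp({loss:.2f}) = {perplexity(loss):.2f} (reported {ppl:.2f})")
```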

## Cite as 🤗

```bibtex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/Teeny-tiny-llama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny-tiny-llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## License

`Teeny-tiny-llama-162m` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.