---
license: apache-2.0
datasets:
- cnmoro/Instruct-PTBR-ENUS-11M
- graelo/wikipedia
- uonlp/CulturaX
- pablo-moreira/gpt4all-j-prompt-generations-pt
- eduagarcia/OSCAR-2301-pt_dedup
- eduagarcia/cc100-pt
- iara-project/news-articles-ptbr-dataset
- MBZUAI/Bactrian-X
- Gustrd/dolly-15k-libretranslate-pt
- heloisy/cosmos_qa_ptbr
- maritaca-ai/imdb_pt
- squad_v1_pt
- celsowm/conjur_artigos
- celsowm/ambito_juridico_artigos
- arubenruben/cnn_dailymail_azure_pt_pt
- bigscience-data/roots_pt_wikiquote
- bigscience-data/roots_pt_ted_talks_iwslt
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: "Astronomia é uma ciência natural que estuda"
  example_title: Exemplo
- text: "Em um achado chocante, o cientista descobriu um"
  example_title: Exemplo
- text: "Python é uma linguagem de"
  example_title: Exemplo
- text: "O Gato de Schrödinger é uma experiência mental"
  example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.5
    temperature: 0.5
    top_k: 50
    top_p: 0.5
    max_new_tokens: 200
co2_eq_emissions:
  emissions: 15
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---
# Teeny-tiny-llama-162m (Portuguese)

<img src="https://github.com/Nkluge-correa/Aira/blob/master/Teeny-tiny-llama/logo/logo-round.png" alt="A little llama wearing a mushroom hat and a monocle." height="400">

Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). It is designed to deliver efficient natural language processing capabilities in Brazilian Portuguese while remaining resource-conscious.

Teeny-tiny-llama was trained by leveraging scaling laws to determine the optimal number of tokens per parameter, while also incorporating preference pre-training.

## Features

- **Compact Design:** Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.

- **Optimized Scaling:** The model was pre-trained using scaling laws to identify the ideal token-to-parameter ratio.

- **Custom Portuguese Dataset:** Teeny-tiny-llama was trained on a custom Portuguese dataset. This dataset covers diverse linguistic contexts and includes preference pre-training, allowing the model to better capture the nuances of the Portuguese language and making it better suited for downstream fine-tuning tasks such as instruction-tuning.

## Details

- **Size:** 162 million parameters
- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3)
- **Language:** Portuguese
- **Number of steps:** 457,969
- **Batch size:** 4
- **Optimizer:** `torch.optim.AdamW` (warmup_ratio = 0.01, learning_rate = 6e-4, epsilon = 1e-8)
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~36 hours
- **Emissions:** 15 kg CO2eq (Germany)
- **Total Energy Consumption:** 42 kWh

This repository contains the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.
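
The emissions and energy figures above were measured with [CodeCarbon](https://github.com/mlco2/codecarbon), as indicated in the card metadata. The snippet below is only a minimal sketch of how such tracking can be wrapped around a training run; the `train()` function and the tracker arguments are illustrative placeholders, not the actual training script (which lives in the repository linked above).

```python
# Minimal sketch (not the actual training script): tracking energy use and
# CO2 emissions with CodeCarbon, the source listed in the card metadata.
from codecarbon import EmissionsTracker

def train():
    # Placeholder for the real pre-training loop.
    pass

tracker = EmissionsTracker(
    project_name="Teeny-tiny-llama-162m",  # hypothetical project label
    output_dir=".",                        # emissions.csv is written here
)
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg of CO2eq
    print(f"Estimated emissions: {emissions_kg:.2f} kg CO2eq")
```
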

## Training Set-up

| Section        | Setting                     | Value                                 |
|----------------|-----------------------------|---------------------------------------|
| Model args.    | vocab_size                  | 32000                                 |
|                | hidden_size                 | 768                                   |
|                | intermediate_size           | 3072                                  |
|                | max_position_embeddings     | 2048                                  |
|                | num_attention_heads         | 12                                    |
|                | num_hidden_layers           | 12                                    |
|                | num_key_value_heads         | 12                                    |
|                | torch_dtype                 | "float32" \*                          |
| Data args.     | dataset_name                | "nicholasKluge/portuguese-corpus-v3"  |
|                | dataset_split               | "train"                               |
|                | train_num_samples           | 1831873                               |
|                | val_num_samples             | 18000                                 |
|                | block_size                  | 2048                                  |
| Training args. | evaluation_strategy         | "steps"                               |
|                | eval_steps                  | 100000                                |
|                | per_device_train_batch_size | 4                                     |
|                | per_device_eval_batch_size  | 4                                     |
|                | gradient_accumulation_steps | 1                                     |
|                | learning_rate               | 0.0006                                |
|                | adam_epsilon                | 0.00000001                            |
|                | weight_decay                | 0.01                                  |
|                | lr_scheduler_type           | "cosine"                              |
|                | warmup_ratio                | 0.01                                  |
|                | num_train_epochs            | 1                                     |
|                | gradient_checkpointing      | false                                 |
|                | seed                        | 42                                    |
|                | wandb_log_steps             | 1                                     |
|                | mixed_precision             | "no"                                  |
|                | checkpointing_steps         | 22000                                 |

\* With `tf32` enabled during training.

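The model arguments above correspond to the `LlamaConfig` class in 🤗 Transformers. As a sketch (not the original training script), a randomly initialized model with this exact shape can be created like this:

```python
# Sketch: instantiating a randomly initialized model with the shape listed
# under "Model args." above (this is not the original training script).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=2048,
    num_attention_heads=12,
    num_hidden_layers=12,
    num_key_value_heads=12,
    torch_dtype="float32",
)

model = LlamaForCausalLM(config)
print(f"Parameters: {model.num_parameters() / 1e6:.0f}M")  # ~162M
```
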
## Usage

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nicholasKluge/Teeny-tiny-llama-162m")

# Or load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
```
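
To generate text, the sampling parameters from the card metadata (the same ones used by the hosted inference widget) can be passed directly to the pipeline. Note that `do_sample=True` is added here so that the temperature/top-k/top-p settings actually take effect:

```python
# Generate a completion with the sampling settings from the card metadata
# (repetition_penalty, temperature, top_k, top_p, max_new_tokens).
from transformers import pipeline

pipe = pipeline("text-generation", model="nicholasKluge/Teeny-tiny-llama-162m")

output = pipe(
    "Astronomia é uma ciência natural que estuda",
    repetition_penalty=1.5,
    temperature=0.5,
    top_k=50,
    top_p=0.5,
    max_new_tokens=200,
    do_sample=True,  # enables sampling so the settings above apply
)
print(output[0]["generated_text"])
```
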

## Limitations

🤥 Generative AI models, such as LLMs used for text generation/conversation or GANs used for image generation, can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, given the model's tendency to hallucinate. Such models can generate deceptive visuals, human-like text, music, or combined media that may seem genuine at first glance.

🤬 Machine learning systems can inherit social and historical stereotypes from the data used to train them. Given these biases, models can produce toxic content, that is, text, images, videos, or comments that are harmful, offensive, or detrimental to individuals, groups, or communities. Models that automate decision-making can also exhibit biases against certain groups, unjustly affecting people based on sensitive attributes.

## Evaluations

| Models | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|--------------------------------------------------------------------------------------|---------|------------------|------------------|------------------|------------------|
| [Gpt2-portuguese-small](https://huggingface.co/pierreguillou/gpt2-small-portuguese)   | 30.22   | 22.48 $\pm$ 0.01 | 29.62 $\pm$ 0.00 | 27.36 $\pm$ 0.00 | 41.44 $\pm$ 0.01 |

\* Evaluations were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness.

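The snippet below is a rough sketch of how such a run can be launched programmatically with the harness. The model type string, the (English) task names, and the few-shot setting are assumptions; the Portuguese translations of these tasks live in Laiviet's fork under their own task identifiers.

```python
# Rough sketch of an lm-evaluation-harness run; the model type string and the
# (English) task names are assumptions -- the Portuguese task variants come
# from Laiviet's fork and use different identifiers.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # "hf-causal" in older harness versions
    model_args="pretrained=nicholasKluge/Teeny-tiny-llama-162m",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    num_fewshot=0,
    batch_size=4,
)
print(results["results"])
```
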
| Steps   | Evaluation Loss | Perplexity | Total Energy Consumption |
|---------|-----------------|------------|--------------------------|
| 100,000 | 3.19            | 24.52      | 3.75 kWh                 |
| 200,000 | 3.02            | 20.58      | 7.51 kWh                 |
| 300,000 | 2.83            | 16.98      | 11.25 kWh                |
| 400,000 | 2.79            | 16.41      | 30.20 kWh                |

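Perplexity here is the exponential of the (token-averaged) evaluation loss, so the two columns can be cross-checked directly; small discrepancies are expected because the reported losses are rounded:

```python
# Cross-check: perplexity is exp(evaluation loss). The reported losses are
# rounded, so the results only approximately match the table above.
import math

for loss in (3.19, 3.02, 2.83, 2.79):
    print(f"loss={loss:.2f} -> perplexity ≈ {math.exp(loss):.2f}")
```
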
## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/Teeny-tiny-llama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny-tiny-llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## License

`Teeny-tiny-llama-162m` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.