nicholasKluge committed • Commit 444939e • Parent(s): eabd984
Update README.md

README.md CHANGED
@@ -33,26 +33,24 @@ co2_eq_emissions:
 geographical_location: Germany
 hardware_used: NVIDIA A100-SXM4-40GB
 ---
-#
+# TeenyTinyLlama-162m
 
 <img src="./logo-round.png" alt="A little llama wearing a mushroom hat and a monocle." height="200">
 
-
+## Model Summary
 
-
+Given the lack of available monolingual foundational models in non-English languages and the fact that some of the most used and downloaded models by the community are those small enough to allow individual researchers and hobbyists to use them in low-resource environments, we developed the TeenyTinyLlama: _a series of small foundational models trained on Portuguese._
 
-
+TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities while being resource-conscious.
 
-
-
-- **Custom Portuguese Dataset:** Teeny-tiny-llama has been trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better cater to Portuguese language nuances and be better suited for fine-tuning tasks like instruction-tuning.
-
-This repository has 21 checkpoints, saved as revisions, that were logged during the model's training.
+Also, these models were trained by leveraging [scaling laws](https://arxiv.org/abs/2203.15556) to determine the optimal number of tokens per parameter while incorporating [preference pre-training](https://arxiv.org/abs/2112.00861).
 
 ## Details
 
+- **Architecture:** a Transformer-based model pre-trained via causal language modeling
 - **Size:** 162,417,408 parameters
-- **
+- **Context length:** 2048 tokens
+- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3) (6.2B tokens)
 - **Language:** Portuguese
 - **Number of steps:** 457,969 (3.7B tokens)
 - **GPU:** 1 NVIDIA A100-SXM4-40GB
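
The revised card describes a Llama 2-style causal language model trained for ~3.7B tokens, roughly in line with the cited scaling-laws heuristic of ~20 tokens per parameter (162M × 20 ≈ 3.2B tokens). A minimal usage sketch with the Transformers library follows; the Hub repository id `nicholasKluge/TeenyTinyLlama-162m` is an assumption based on the card's title and author, and the checkpoint revision tag at the end is hypothetical.

```python
# Minimal usage sketch (assumptions: the Hub id below matches this card;
# the revision tag in the last line is a hypothetical name).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nicholasKluge/TeenyTinyLlama-162m"  # assumed Hub repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sanity-check two figures from the "Details" list above.
print(model.num_parameters())                # expected: 162,417,408
print(model.config.max_position_embeddings)  # expected: 2048 (context length)

# Generate a short Portuguese continuation.
inputs = tokenizer("A capital do Brasil é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# The pre-commit text mentioned 21 training checkpoints saved as Hub revisions;
# one could be loaded with a revision tag, e.g. (hypothetical tag name):
# model = AutoModelForCausalLM.from_pretrained(model_id, revision="step-100000")
```
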
@@ -60,7 +58,15 @@ This repository has 21 checkpoints, saved as revisions, that were logged during
 - **Emissions:** 5.6 KgCO2 (Germany)
 - **Total energy consumption:** 15.5 kWh
 
-This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.
+This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model. The main libraries used are:
+
+- Transformers
+- PyTorch
+- Datasets
+- Tokenizers
+- Accelerate, CodeCarbon, SentencePiece
+
+
 
 ## Training Set-up
 
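
The library list added above corresponds to the pip packages `transformers`, `torch`, `datasets`, `tokenizers`, `accelerate`, `codecarbon`, and `sentencepiece`. The emissions and energy figures are the kind of numbers CodeCarbon reports; the sketch below shows that logging pattern under the assumption that the linked training script wraps its loop in a tracker (`train()` is a placeholder, not a function from the repository).

```python
# Emissions-logging sketch with CodeCarbon (assumed workflow; the actual
# training code is in the linked Aira repository).
from codecarbon import EmissionsTracker

def train():
    """Placeholder for the real training loop."""
    ...

tracker = EmissionsTracker()  # writes emissions.csv (kgCO2eq, kWh) by default
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # returns estimated emissions in kg CO2eq
    print(f"Estimated emissions: {emissions_kg:.2f} kg CO2eq")
```
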