---
license: apache-2.0
datasets:
- christykoh/imdb_pt
language:
- pt
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- sentiment-analysis
widget:
- text: "Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."
  example_title: Exemplo
- text: "Esqueceram de mim 2 é o pior filme da franquia inteira."
  example_title: Exemplo
---
# TeenyTinyLlama-460m-IMDB

TeenyTinyLlama is a series of small foundation models trained in Brazilian Portuguese.

This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) (`TeenyTinyLlama-460m-IMDB`) fine-tuned on the [IMDB dataset](https://huggingface.co/datasets/christykoh/imdb_pt).

## Details

- **Number of epochs:** 3
- **Batch size:** 16
- **Optimizer:** `torch.optim.AdamW` (learning_rate = 4e-5, epsilon = 1e-8)
- **GPU:** 1 NVIDIA A100-SXM4-40GB

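For orientation, these settings map directly onto standard PyTorch APIs. A minimal sketch of the optimizer configuration listed above (the classification model is built exactly as in the Reproducing section below):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Base model with a two-label classification head, as in the
# Reproducing section below
model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-460m", num_labels=2
)

# AdamW with the learning rate and epsilon listed above
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, eps=1e-8)
```
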
## Usage

Using `transformers.pipeline`:

```python
from transformers import pipeline

text = "Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."

classifier = pipeline("text-classification", model="nicholasKluge/TeenyTinyLlama-460m-IMDB")
classifier(text)

# >>> [{'label': 'POSITIVE', 'score': 0.9971244931221008}]
```

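If you prefer to avoid the pipeline abstraction, the same prediction can be made by loading the tokenizer and model directly. This sketch uses only standard `transformers` and `torch` calls, with the label name recovered from the model's `id2label` config:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m-IMDB")
model = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/TeenyTinyLlama-460m-IMDB")

text = "Esqueceram de mim 2 é o pior filme da franquia inteira."

# Tokenize and run a forward pass without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label name
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # NEGATIVE or POSITIVE
```
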
## Reproducing

To reproduce the fine-tuning process, use the following code snippet:

```python
# IMDB
!pip install transformers datasets evaluate accelerate -q

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the task
dataset = load_dataset("christykoh/imdb_pt")

# Create a `ModelForSequenceClassification`
model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-460m",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m")

# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

dataset_tokenized = dataset.map(preprocess_function, batched=True)

# Create a simple data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Use accuracy as an evaluation metric
accuracy = evaluate.load("accuracy")

# Function to compute accuracy
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_token="your_token_here",
    hub_model_id="username/model-name-imdb"
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
```

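Because `load_best_model_at_end=True`, the `trainer` holds the best checkpoint once training finishes. As an optional follow-up, it can be scored on the test split (reusing the `trainer` object from the snippet above):

```python
# Score the best checkpoint on the test split; `eval_accuracy` is the
# metric reported in the comparison table below
results = trainer.evaluate()
print(results)
```
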
## Fine-Tuning Comparisons

| Models | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) (accuracy %) |
|---------|--------------------------------------------------------------------------|
| [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 93.58 |
| [Teeny Tiny Llama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) | 92.28 |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.22 |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 91.60 |
| [Teeny Tiny Llama 160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m) | 91.14 |

## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m},
  author = {Nicholas Kluge Corrêa},
  title = {TeenyTinyLlama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## Funding

This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.

## License

TeenyTinyLlama-460m-IMDB is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.