---
license: apache-2.0
datasets:
- christykoh/imdb_pt
language:
- pt
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- sentiment-analysis
widget:
- text: "Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."
example_title: Exemplo
- text: "Esqueceram de mim 2 é o pior filme da franquia inteira."
example_title: Exemplo
---
# TeenyTinyLlama-162m-IMDB
TeenyTinyLlama is a series of small foundation models trained on Portuguese text.
This repository contains a version of [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m) fine-tuned on a translated version of the [IMDB dataset](https://huggingface.co/datasets/christykoh/imdb_pt).
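For inference, the fine-tuned checkpoint can be loaded through a `text-classification` pipeline. The sketch below assumes the model id `nicholasKluge/TeenyTinyLlama-162m-IMDB` (the name of this repository); adjust it if the checkpoint is published under a different name.
```python
# Minimal usage sketch; the model id is assumed from this card's title.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="nicholasKluge/TeenyTinyLlama-162m-IMDB",
)

# Labels follow the id2label mapping used during fine-tuning: NEGATIVE / POSITIVE.
print(classifier("Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."))
```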
## Reproducing
```python
# IMDB
! pip install transformers datasets evaluate accelerate -q

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the task
dataset = load_dataset("christykoh/imdb_pt")

# Create a `ModelForSequenceClassification`
model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-162m",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-162m")

# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

dataset_tokenized = dataset.map(preprocess_function, batched=True)

# Create a simple data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Use accuracy as an evaluation metric
accuracy = evaluate.load("accuracy")

# Function to compute accuracy
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_token="your_token_here",
    hub_model_id="username/model-name-imdb"
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
```
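Once training finishes, the same `trainer` object can score the held-out test split; a minimal sketch:
```python
# Evaluate the best checkpoint (restored via load_best_model_at_end=True)
# on the tokenized test split; `eval_accuracy` comes from compute_metrics.
results = trainer.evaluate(eval_dataset=dataset_tokenized["test"])
print(f"Test accuracy: {results['eval_accuracy']:.4f}")
```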
## Results
| Models | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) (accuracy, %) |
|--------------------------------------------------------------------------------------------|------------------------------------------------------------|
| [Teeny Tiny Llama 162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m) | 91.14 |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.22 |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 91.60 |