PORTULAN/albertina-900m-portuguese-ptbr-encoder-brwac

@@ -12,7 +12,7 @@ datasets:
 - brwac
 - europarl
 widget:
- - text: "A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país."
 ---
@@ -139,7 +139,7 @@ We address four tasks from those in PLUE, namely:
 | **Albertina-PT-PT** | **0.7960**     | 0.4507         | **0.9151**| 0.8799          |
-We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glueptpt), a **PT-PT version of the GLUE** benchmark.
 We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.
 | Model               | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
@@ -156,15 +156,10 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptpt')
->>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")
-[{'score': 0.9166129231452942, 'token': 23395, 'token_str': 'aromas', 'sequence': 'A culinária portuguesa é rica em sabores e aromas, tornando-se um dos maiores tesouros do país.'},
-{'score': 0.022932516410946846, 'token': 10392, 'token_str': 'costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
-{'score': 0.013932268135249615, 'token': 21925, 'token_str': 'cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
-{'score': 0.009870869107544422, 'token': 22647, 'token_str': 'nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'},
-{'score': 0.007260020822286606, 'token': 12881, 'token_str': 'aroma', 'sequence': 'A culinária portuguesa é rica em sabores e aroma, tornando-se um dos maiores tesouros do país.'}]
 ```
@@ -174,16 +169,16 @@ The model can be used by fine-tuning it for a specific task:
 >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
 >>> from datasets import load_dataset
->>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptpt", num_labels=2)
->>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptpt")
->>> dataset = load_dataset("PORTULAN/glueptpt", "rte")
 >>> def tokenize_function(examples):
 ...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
 >>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
->>> training_args = TrainingArguments(output_dir="albertina-ptpt-rte", evaluation_strategy="epoch")
 >>> trainer = Trainer(
 ...     model=model,
 ...     args=training_args,

 - brwac
 - europarl
 widget:
+ - text: "A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país."
 ---
 | **Albertina-PT-PT** | **0.7960**     | 0.4507         | **0.9151**| 0.8799          |
+We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
 We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.
 | Model               | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
 ```python
 >>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptbr')
+>>> unmasker("A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país.")
+TODO
 ```
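The new expected output is still marked `TODO` in this change, so the following is only a minimal sketch of how the result of the `fill-mask` call might be consumed once it is available. It assumes the pipeline returns a list of dicts with the keys `score`, `token`, `token_str` and `sequence`, as in the PT-PT output removed by this diff. The helper `top_prediction` and its `min_score` parameter are hypothetical names introduced for illustration, and the sample data is copied from the removed PT-PT example, not produced by the PT-BR model.

```python
# Sketch only: select the best candidate from fill-mask style predictions.
# `top_prediction` and `min_score` are hypothetical, not part of transformers.

def top_prediction(predictions, min_score=0.0):
    """Return (token_str, score) for the highest-scoring candidate, or None."""
    # Keep only candidates at or above the requested confidence threshold.
    candidates = [p for p in predictions if p["score"] >= min_score]
    if not candidates:
        return None
    best = max(candidates, key=lambda p: p["score"])
    return best["token_str"], best["score"]


# Two entries taken from the PT-PT output removed in this diff
# (assumption: the PT-BR model returns the same structure).
sample = [
    {
        "score": 0.022932516410946846,
        "token": 10392,
        "token_str": "costumes",
        "sequence": "A culinária portuguesa é rica em sabores e costumes, "
                    "tornando-se um dos maiores tesouros do país.",
    },
    {
        "score": 0.9166129231452942,
        "token": 23395,
        "token_str": "aromas",
        "sequence": "A culinária portuguesa é rica em sabores e aromas, "
                    "tornando-se um dos maiores tesouros do país.",
    },
]

print(top_prediction(sample))
```

With a real pipeline, the list returned by `unmasker(...)` would be passed in place of `sample`.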
 >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
 >>> from datasets import load_dataset
+>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptbr", num_labels=2)
+>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptbr")
+>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte")
 >>> def tokenize_function(examples):
 ...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
 >>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
+>>> training_args = TrainingArguments(output_dir="albertina-ptbr-rte", evaluation_strategy="epoch")
 >>> trainer = Trainer(
 ...     model=model,
 ...     args=training_args,