---
license: mit
language:
- en
pipeline_tag: text2text-generation
---
# MANTa-LM (small)
Pretrained MANTa-LM (small) model as introduced in the paper [MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling](https://aclanthology.org/2022.findings-emnlp.207).
## Model Details

### Model Description
The MANTa tokenizer mimics, in a differentiable way, the combination of a subword tokenizer and an embedding matrix used in classical language models. This trainable tokenization module is added as the first layer of an encoder-decoder model and trained end-to-end with the language modeling objective.

Our results show that MANTa-LM only slightly degrades the performance of an equivalent T5 model on the GLUE benchmark while being much more robust to artificial and user-generated noise.
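
For intuition only, here is a heavily simplified sketch (not the released implementation) of how such a differentiable byte-to-block tokenizer can be wired up: byte embeddings feed a frontier predictor, and soft block assignments pool the bytes into block embeddings that replace subword embeddings at the encoder input. All module names, shapes, and the exact pooling kernel below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SoftTokenizerSketch(nn.Module):
    """Toy illustration of a differentiable byte-to-block tokenizer (not MANTa's actual code)."""

    def __init__(self, d_model=512, max_blocks=64, sigma=1.0):
        super().__init__()
        self.byte_embeddings = nn.Embedding(256, d_model)   # one embedding per byte value
        self.frontier_predictor = nn.Linear(d_model, 1)     # stand-in for the real frontier predictor
        self.max_blocks = max_blocks
        self.sigma = sigma

    def forward(self, byte_ids):                             # byte_ids: (batch, seq_len)
        h = self.byte_embeddings(byte_ids)                   # (batch, seq_len, d_model)
        # Probability that a block frontier occurs at each byte position.
        p_frontier = torch.sigmoid(self.frontier_predictor(h)).squeeze(-1)
        # Expected block index of each byte = cumulative sum of frontier probabilities.
        block_pos = torch.cumsum(p_frontier, dim=1)          # (batch, seq_len)
        # Soft assignment of every byte to each output slot, using a Gaussian
        # kernel centred on its expected block index.
        slots = torch.arange(self.max_blocks, device=byte_ids.device, dtype=h.dtype)
        dist = block_pos.unsqueeze(-1) - slots               # (batch, seq_len, max_blocks)
        weights = torch.exp(-dist.pow(2) / (2 * self.sigma**2))
        weights = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-6)
        # Pooled block embeddings, consumed by the encoder-decoder in place of
        # subword embeddings.
        return torch.einsum("bsd,bsk->bkd", h, weights)      # (batch, max_blocks, d_model)
```

Because every operation above is differentiable, the segmentation module can be trained jointly with the encoder-decoder by backpropagating the language modeling loss; the real frontier predictor and pooling operator are described in the paper.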
### Model Sources

- Paper: [MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling](https://aclanthology.org/2022.findings-emnlp.207) (Findings of EMNLP 2022)
## Uses

### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# trust_remote_code is required because MANTa uses a custom model implementation
tokenizer = AutoTokenizer.from_pretrained("almanach/manta-lm-small", trust_remote_code=True)
manta_model = AutoModelForSeq2SeqLM.from_pretrained("almanach/manta-lm-small", trust_remote_code=True)

# Fill-in-the-blank prompt using a T5-style sentinel token
tokens = tokenizer("The name of the capital of France is <extra_id_0> and it is a very big city.", return_tensors="pt")
output = manta_model.generate(**tokens, decoder_start_token_id=0, repetition_penalty=1.5, do_sample=True)
print(tokenizer.batch_decode(output))
```
## Recommendations
We recommend using a smaller learning rate for the tokenizer module (byte embeddings, frontier predictor, pooler) than for the rest of the model during fine-tuning.
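
As a minimal sketch, continuing from the loading snippet above, one way to do this in PyTorch is to build two optimizer parameter groups so that the tokenizer submodules get a reduced learning rate. The name patterns and learning-rate values below are assumptions to adapt to your setup; check them against `manta_model.named_parameters()`.

```python
import torch

# Hypothetical name patterns for the tokenizer submodules (byte embeddings,
# frontier predictor, pooler); verify them with manta_model.named_parameters().
TOKENIZER_PATTERNS = ("byte_embed", "frontier", "pooler")

tokenizer_params, other_params = [], []
for name, param in manta_model.named_parameters():
    if any(pattern in name for pattern in TOKENIZER_PATTERNS):
        tokenizer_params.append(param)
    else:
        other_params.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": tokenizer_params, "lr": 1e-5},  # smaller learning rate for the tokenizer module
        {"params": other_params, "lr": 1e-4},      # regular fine-tuning learning rate
    ]
)
```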
## Training Details

### Training Data
This model was trained on the C4 dataset.
### Training Procedure
The training objective is the same as ByT5's, while most hyperparameters are taken from T5.
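
As a rough illustration (not taken from the training code), a ByT5-style span-corruption training pair looks like the following, where masked spans are replaced by sentinel tokens in the input and predicted after their sentinels in the target:

```python
# Illustrative span-corruption example; the exact masking rate and span
# lengths follow the ByT5 setup and are not restated here.
original = "The name of the capital of France is Paris and it is a very big city."

# Input: masked spans replaced by sentinel tokens.
corrupted_input = "The name of the capital of France is <extra_id_0> and it is a <extra_id_1> big city."

# Target: each sentinel followed by the span it replaced.
target = "<extra_id_0> Paris <extra_id_1> very"
```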
## Citation

**BibTeX:**
```bibtex
@inproceedings{godey-etal-2022-manta,
    title = "{MANT}a: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling",
    author = "Godey, Nathan  and
      Castagn{\'e}, Roman  and
      de la Clergerie, {\'E}ric  and
      Sagot, Beno{\^\i}t",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.207",
    pages = "2859--2870",
}
```