ModernBART wen?
Title is /j, but in all seriousness is there any interest out there in producing a BART/T5-like encoder-decoder model with the improvements here? (flash attn, rope, etc)
The encoder-decoder models could even use the current checkpoint, if ModernBERT were supported:
https://github.com/huggingface/transformers/issues/35385
https://discuss.huggingface.co/t/training-modernbert-gpt2/134398/2
Similarly, it would be nice if they added support for the Llama codebase/arch to be used as the decoder in EncoderDecoder models, so that SmolLM2 etc. could be used. Since ModernBERT's tokenizer is based on OLMo's, adding support for OLMo would also be good; it might be possible to use only one tokenizer for encoding and decoding, with OLMo 1B as the decoder, etc.
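A quick, non-authoritative way to sanity-check the shared-tokenizer idea is to load both tokenizers and compare them directly; the OLMo repo id below is an assumption, swap in whichever variant you actually target:
from transformers import AutoTokenizer

# assumed checkpoint names, for illustration only
tok_modernbert = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
tok_olmo = AutoTokenizer.from_pretrained("allenai/OLMo-1B-hf")

sample = "Encoder-decoder models with RoPE and flash attention."
print(tok_modernbert.tokenize(sample))  # do the two tokenizers produce the same pieces?
print(tok_olmo.tokenize(sample))
print(len(tok_modernbert), len(tok_olmo))  # compare vocab sizes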
Title is /j, but in all seriousness is there any interest out there in producing a BART/T5-like encoder-decoder model with the improvements here? (flash attn, rope, etc)
I've messed around a bit with creating a (more) modern T5 with better data, context length, tokenizer, etc., with medium-ish results. The improvements were decent, and it might need more scaling in terms of data/compute, but the preliminary results didn't impress me enough to invest further yet. You can find some of them here. Note that the core T5 architecture is the same, so no custom code is needed.
- the codebase I used for pretraining: https://github.com/pszemraj/nanoT5/tree/fineweb-edu-test
- other codebases worth looking at that build on nanoT5 and implement more substantial updates to the arch: https://github.com/catie-aq/flashT5 and https://github.com/Knowledgator/TurboT5
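Since the core T5 architecture is unchanged, loading one of those checkpoints should be plain stock transformers; a minimal sketch (the repo id below is a placeholder, substitute an actual checkpoint):
from transformers import AutoTokenizer, T5ForConditionalGeneration

# placeholder repo id standing in for one of the fineweb-edu nanoT5 runs
model_id = "your-username/nanoT5-fineweb-edu-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# pretrain-only checkpoints expect span-corruption style inputs with sentinel tokens
inputs = tokenizer("The capital of France is <extra_id_0>.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))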
If anyone is interested in collaborating on encoder-decoder model updates/pretraining, feel free to reach out on Discord (my username is the same as my HF one).
Good to see improvements to T5s! However, I think the main drawback is that the training objective, with single-token prediction, is weak for the decoder (see "Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models"). UL2 can improve this, yet it has only been used on top of some suboptimally trained models.
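To make that objection concrete, here is a toy, single-span sketch of T5-style span corruption (not the paper's code); it shows how short the decoder-side targets are relative to what the encoder sees:
import random

def span_corrupt(tokens, corrupt_rate=0.15):
    # corrupt one contiguous span: the encoder keeps most of the text,
    # while the decoder only has to predict a handful of tokens
    n_corrupt = max(1, int(len(tokens) * corrupt_rate))
    start = random.randrange(0, len(tokens) - n_corrupt + 1)
    encoder_input = tokens[:start] + ["<extra_id_0>"] + tokens[start + n_corrupt:]
    decoder_target = ["<extra_id_0>"] + tokens[start:start + n_corrupt] + ["<extra_id_1>"]
    return encoder_input, decoder_target

tokens = "the quick brown fox jumps over the lazy dog".split()
print(span_corrupt(tokens))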
For encoder-decoder models, it is ideal for the encoder and decoder to be of the same type. There are many supported models to choose from for the decoder, including Llama:
BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GitConfig, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
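For reference, a list like the one above can be pulled at runtime from the AutoModelForCausalLM registry (a sketch, assuming that is where such a printout originates; the exact contents depend on your transformers version):
from transformers import MODEL_FOR_CAUSAL_LM_MAPPING

# config classes registered as causal-LM decoders, i.e. the pool EncoderDecoderModel can draw from
decoder_configs = sorted(cfg.__name__ for cfg in MODEL_FOR_CAUSAL_LM_MAPPING.keys())
print(decoder_configs)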
In theory, it should be possible to use smolLM as the decoder. Nevertheless, it isn't good to use pretrained decoder-only models (see Leveraging Pre-trained Checkpoints for Sequence Generation Tasks).
In theory, it should be possible
Bold of you to assume that it works without trying the code yourself! Perhaps the assumption that EncoderDecoderModel works, based on some config printouts, explains why it's nonfunctional for the vast majority of cases (except for models from 2020 and earlier). If you try running the following code with transformers 4.48.0
from transformers import EncoderDecoderModel, AutoTokenizer
tokenizer_enc = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer_dec = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "HuggingFaceTB/SmolLM2-360M-Instruct")
you will quickly see:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-86125bb7a061> in <cell line: 0>()
3 tokenizer_enc = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
4 tokenizer_dec = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
----> 5 model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "HuggingFaceTB/SmolLM2-360M-Instruct")
1 frames
/usr/local/lib/python3.11/dist-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py in from_encoder_decoder_pretrained(cls, encoder_pretrained_model_name_or_path, decoder_pretrained_model_name_or_path, *model_args, **kwargs)
540 # instantiate config with corresponding kwargs
541 config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config, **kwargs)
--> 542 return cls(encoder=encoder, decoder=decoder, config=config)
543
544 @add_start_docstrings_to_model_forward(ENCODER_DECODER_INPUTS_DOCSTRING)
/usr/local/lib/python3.11/dist-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py in __init__(self, config, encoder, decoder)
256 decoder_signature = set(inspect.signature(self.decoder.forward).parameters.keys())
257 if "encoder_hidden_states" not in decoder_signature:
--> 258 raise ValueError(
259 "The selected decoder is not prepared for the encoder hidden states to be passed. Please see the "
260 "following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350"
ValueError: The selected decoder is not prepared for the encoder hidden states to be passed. Please see the following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350
- At least the following allows itself to be instantiated (I haven't tested further use):
model = EncoderDecoderModel.from_encoder_decoder_pretrained("answerdotai/ModernBERT-base", "gpt2")
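For anyone who wants to push that ModernBERT + gpt2 pairing a bit further, here is a minimal, untested smoke-test sketch; it assumes gpt2's EOS token can stand in for padding and that its BOS token works as the decoder start token:
from transformers import EncoderDecoderModel, AutoTokenizer

enc_tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dec_tok = AutoTokenizer.from_pretrained("gpt2")
dec_tok.pad_token = dec_tok.eos_token  # gpt2 ships without a pad token

model = EncoderDecoderModel.from_encoder_decoder_pretrained("answerdotai/ModernBERT-base", "gpt2")
# EncoderDecoderModel needs these set before it can shift labels or generate
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

enc_inputs = enc_tok("ModernBART wen?", return_tensors="pt")
labels = dec_tok("Soon, hopefully.", return_tensors="pt").input_ids
outputs = model(**enc_inputs, labels=labels)
print(outputs.loss)  # a finite loss at least confirms the forward pass is wired up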
On a related note, this would make a great exhibit in a blog post/paper: "How Delayed and Hidden Errors Impede Research Progress in Transformers".
Edit: The models, model sizes, and tasks evaluated in "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks" are outdated. Before making decisions based on those findings, it would be important to use modern datasets/tasks like FLAN (or better) and evaluate on real benchmarks (as the FLAN-T5 models are now). Also, Roberta2GPT performs best on one of their tasks and fairly well on the other; combined with the proven success of VisionEncoderDecoder + GPT-like decoders (Nougat, ViT-GPT2), that leaves me highly skeptical that "it isn't good to use pretrained decoder-only models".