ModernBART wen?

#38
by Fizzarolli - opened

Title is /j, but in all seriousness is there any interest out there in producing a BART/T5-like encoder-decoder model with the improvements here? (flash attn, rope, etc)

Fizzarolli changed discussion status to closed
Fizzarolli changed discussion status to open

(misclick xD)

The encoder-decoder models could even use the current checkpoint, if ModernBERT is supported (see the sketch after these links):
https://github.com/huggingface/transformers/issues/35385
https://discuss.huggingface.co/t/training-modernbert-gpt2/134398/2
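
For reference, once ModernBERT works there, the wiring would follow the usual EncoderDecoderModel pattern. A rough sketch, untested here beyond instantiation; GPT-2 is just an arbitrary choice of a decoder that already accepts encoder hidden states:

from transformers import EncoderDecoderModel, AutoTokenizer

# Sketch only: ModernBERT as the encoder, GPT-2 as the decoder.
enc_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dec_tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "answerdotai/ModernBERT-base", "gpt2"
)
# decoder_start_token_id and pad_token_id still need to be set (per the
# EncoderDecoderModel docs) before any training or generation.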

Similarly, it would be nice if they added support for the Llama codebase/arch to be used as the decoder in EncoderDecoder models, so that SmolLM2 etc. could be used. Since ModernBERT's tokenizer is based on OLMo's, adding support for OLMo would also be good; it might then be possible to use a single tokenizer for both encoding and decoding, with OLMo 1B as the decoder, etc.
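
As a quick sanity check on the shared-tokenizer idea, the two tokenizers can be compared directly. A small sketch (the OLMo repo id below, allenai/OLMo-1B-hf, is my guess at the HF-format checkpoint):

from transformers import AutoTokenizer

# Rough check: how close are ModernBERT's and OLMo's tokenizers?
tok_modernbert = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
tok_olmo = AutoTokenizer.from_pretrained("allenai/OLMo-1B-hf")

print(len(tok_modernbert), len(tok_olmo))   # vocab sizes
sample = "ModernBART when?"
print(tok_modernbert.tokenize(sample))      # token splits from each tokenizer
print(tok_olmo.tokenize(sample))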

Title is /j, but in all seriousness is there any interest out there in producing a BART/T5-like encoder-decoder model with the improvements here? (flash attn, rope, etc)

I've messed around a bit with creating a (more) modern T5 with better data, context length, tokenizer, etc., with middling results. The improvements were decent, and it might need more scaling in terms of data/compute, but the preliminary results didn't impress me enough to invest further yet. You can find some of them here. Note that the core T5 architecture is the same, so no custom code is needed.
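
(Concretely, "no custom code" means any checkpoint that keeps the T5 architecture loads with the stock classes; a quick illustration, using a public T5 checkpoint as a stand-in for the experimental ones mentioned above:)

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Any checkpoint that keeps the T5 architecture loads with the stock classes,
# no trust_remote_code or custom modeling files required.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")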

If anyone is interested in collaborating on encoder-decoder model updates/pretraining, feel free to reach out on Discord (username is the same as my HF).

Good to see improvements to T5s! However, I think the main drawback is that the training objective, with its single-token prediction, is weak for the decoder (see "Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models"). UL2 can improve this, yet it has only been used on top of some suboptimally trained models.
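
To make that concrete: with T5-style span corruption, the decoder mostly learns to emit very short masked spans (often a token or two) introduced by sentinel tokens, never long coherent sequences. A hand-built example of that input/target format, using the standard <extra_id_*> sentinels:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Original sentence: "The quick brown fox jumps over the lazy dog"
# Span corruption masks short spans; the decoder's entire training signal
# is just those spans, each preceded by its sentinel.
inputs  = "The <extra_id_0> fox jumps over the <extra_id_1> dog"
targets = "<extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>"

print(tokenizer(inputs).input_ids)
print(tokenizer(targets).input_ids)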

For encoder-decoder models, it is ideal for the encoder and decoder to be of the same type. We can choose from many supported models for the decoder, including Llama:

 BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GitConfig, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

In theory, it should be possible to use SmolLM as the decoder. Nevertheless, it isn't good to use pretrained decoder-only models (see "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks").

In theory, it should be possible

Bold of you to assume that it works without trying the code yourself! Perhaps people's assumption that EncoderDecoderModel works, based on some config printouts, explains why it's nonfunctional for the vast majority of cases (except for models from 2020 and earlier). If you try running the following code with transformers 4.48.0:

from transformers import EncoderDecoderModel, AutoTokenizer

tokenizer_enc = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer_dec = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "HuggingFaceTB/SmolLM2-360M-Instruct")

you will quickly see:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-86125bb7a061> in <cell line: 0>()
      3 tokenizer_enc = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
      4 tokenizer_dec = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
----> 5 model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "HuggingFaceTB/SmolLM2-360M-Instruct")

1 frames
/usr/local/lib/python3.11/dist-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py in from_encoder_decoder_pretrained(cls, encoder_pretrained_model_name_or_path, decoder_pretrained_model_name_or_path, *model_args, **kwargs)
    540         # instantiate config with corresponding kwargs
    541         config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config, **kwargs)
--> 542         return cls(encoder=encoder, decoder=decoder, config=config)
    543 
    544     @add_start_docstrings_to_model_forward(ENCODER_DECODER_INPUTS_DOCSTRING)

/usr/local/lib/python3.11/dist-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py in __init__(self, config, encoder, decoder)
    256         decoder_signature = set(inspect.signature(self.decoder.forward).parameters.keys())
    257         if "encoder_hidden_states" not in decoder_signature:
--> 258             raise ValueError(
    259                 "The selected decoder is not prepared for the encoder hidden states to be passed. Please see the "
    260                 "following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350"

ValueError: The selected decoder is not prepared for the encoder hidden states to be passed. Please see the following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350

  • At least model = EncoderDecoderModel.from_encoder_decoder_pretrained("answerdotai/ModernBERT-base", "gpt2") allows itself to be instantiated; I haven't tested it further.
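
For anyone hitting this: the check that fails is just an inspect.signature test on the decoder's forward, so candidate decoders can be screened without downloading any weights. A small sketch:

import inspect
from transformers import GPT2LMHeadModel, LlamaForCausalLM

def supports_cross_attention(model_cls):
    # Mirrors the check in modeling_encoder_decoder.py: the decoder's forward
    # must accept encoder_hidden_states.
    return "encoder_hidden_states" in inspect.signature(model_cls.forward).parameters

print(supports_cross_attention(GPT2LMHeadModel))   # True
print(supports_cross_attention(LlamaForCausalLM))  # False (as of transformers 4.48.0)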

On a related note, this would be a great exhibit in a blog post/paper titled "How Delayed and Hidden Errors Impede Research Progress in Transformers".


Edit: The models, model sizes, and tasks evaluated in "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks" are outdated. Before making decisions based on those findings, it would be important to use modern datasets/tasks like FLAN (or better) and to evaluate on real benchmarks (as the FLAN-T5 models are now). Also, given that Roberta2GPT performs best on one of their tasks and fairly well on the other, combined with the proven success of VisionEncoderDecoder models with GPT-like decoders (Nougat, ViT-GPT2), I would remain highly skeptical of the claim that "it isn't good to use pretrained decoder-only models".
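
For what it's worth, the ViT + GPT-2 pairing is easy to verify with an off-the-shelf checkpoint (nlpconnect/vit-gpt2-image-captioning, which wires a ViT encoder to a GPT-2 decoder through VisionEncoderDecoderModel):

from transformers import VisionEncoderDecoderModel

# A ViT encoder with a GPT-2 decoder, trained for image captioning; loading it
# shows the encoder-decoder wiring with a GPT-like decoder working in practice.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
print(type(model.encoder).__name__, type(model.decoder).__name__)  # ViTModel GPT2LMHeadModel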
