Minor changes for correct inference

#1
by tomaarsen (HF staff) - opened

Hello @Lolalb and team!

Preface

Congratulations on your model release! It's very exciting to see another strong encoder out there; I've already started some training runs of my own to experiment. I ran into a few minor issues with the transformers integration, which this PR tackles.

Pull Request overview

  • Update AutoModel... in config.json
  • Add base_model_prefix="model" on PreTrainedModel
  • Cast attention_mask to bool in SDPA
  • Tag this model as transformers-compatible
  • Specify that we don't want the token_type_ids from the tokenizer

Details

Updating config.json

Previously, AutoModel.from_pretrained actually loaded the masked language modeling model, and it was not possible to load the sequence classification model at all. Together with the base_model_prefix update in model.py, it is now possible to use AutoModel.from_pretrained, AutoModelForMaskedLM.from_pretrained, and AutoModelForSequenceClassification.from_pretrained, as in the sketch below.
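For illustration, a quick sketch of loading all three heads from this PR's revision (same arguments as in the full test snippet further down):

from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification

model_name = "chandar-lab/NeoBERT"
kwargs = dict(trust_remote_code=True, revision="refs/pr/1")

# Base encoder (previously this resolved to the masked-LM model)
base_model = AutoModel.from_pretrained(model_name, **kwargs)
# Masked language modeling head
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name, **kwargs)
# Sequence classification head (previously not loadable via the auto class)
cls_model = AutoModelForSequenceClassification.from_pretrained(model_name, **kwargs)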

Cast attention mask to bool

In SDPA, the attention mask was previously an integer tensor, which torch.nn.functional.scaled_dot_product_attention does not accept: it expects a boolean (or additive float) mask, so this PR casts it to bool. I'm doing some training with this model now, but I use a custom data collator, so I'm not relying on your data collator for Flash Attention 2 support.
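For context, here's a minimal sketch of the cast with PyTorch's scaled_dot_product_attention (illustrative shapes, not the model's actual internals):

import torch
import torch.nn.functional as F

batch, heads, seq, dim = 2, 8, 16, 64
query = key = value = torch.randn(batch, heads, seq, dim)

# A padding mask as produced by the tokenizer: 1 for real tokens, 0 for padding (int64)
attention_mask = torch.ones(batch, seq, dtype=torch.long)
attention_mask[:, -4:] = 0

# SDPA expects a bool (or additive float) mask, so cast before the call
sdpa_mask = attention_mask.bool()[:, None, None, :]  # broadcast over heads and query positions
out = F.scaled_dot_product_attention(query, key, value, attn_mask=sdpa_mask)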

No token_type_ids from the tokenizer

The BERT tokenizer outputs token_type_ids by default. These were originally used to distinguish the two sentences when a pair was provided, but they have largely fallen out of use. The model doesn't need them, yet the tokenizer still returns them, so this change removes them from the tokenizer output.
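As a quick sanity check (a sketch assuming the updated tokenizer config from this PR), the tokenizer output should now only contain the keys the model actually consumes:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True, revision="refs/pr/1")

inputs = tokenizer("A quick check.", return_tensors="pt")
# Expected with this change: input_ids and attention_mask only, no token_type_ids
print(inputs.keys())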

You can test all of these changes nicely by using the revision argument:

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, revision="refs/pr/1")
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, revision="refs/pr/1")

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)

I'll be uploading my first finetune of this model in a few minutes. It looks very strong, stronger than the same model trained with ModernBERT-base.
Edit: Here are the NeoBERT model and my ModernBERT-base baseline, trained with the same training script:

Interestingly, with my finetuned model here, NeoBERT is much stronger in-domain, but much weaker out-of-domain.

If there is a lot of interest from the community, it might make sense to introduce neobert as an architecture in transformers itself, so that users no longer have to pass trust_remote_code. I do have to preface that xformers is not available in pure transformers, so we would need a "manual" SwiGLU implementation instead; see the sketch below.
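For reference, a SwiGLU feed-forward block is easy to write in plain PyTorch; here's a minimal sketch (the layer names and sizes are illustrative, not NeoBERT's exact configuration):

import torch
from torch import nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: down_proj( silu(gate_proj(x)) * up_proj(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))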

cc @stefan-it as you were also working on NeoBERT, I believe.

  • Tom Aarsen
tomaarsen changed pull request status to open

Also, I noticed that throughout your paper you describe the model as NeoBERT-medium. If you can imagine ever making a smaller or larger variant, now might be a good time to rename this model to chandar-lab/NeoBERT-medium, so you don't shoot yourself in the foot by having this model be the one NeoBERT model. Note that you would then still be able to load the model with chandar-lab/NeoBERT, and https://huggingface.co/chandar-lab/NeoBERT should still work.

Chandar Research Lab org

Hi @tomaarsen , thanks a lot for these modifications and your comments! We're excited to see that NeoBERT is performing well in your experiments. We are considering training other sizes if we do get the necessary compute, in which case we would also remove the xformers dependency from these models (unfortunately, it seems tricky to do so for this version).

Lolalb changed pull request status to merged
