Is there something wrong with the position embedding or tokenizer?

#1
by JinGao - opened

Is there something wrong with the position embedding or tokenizer?

File "/home/....../python3.11/site-packages/multimolecule/models/rnabert/modeling_rnabert.py", line 694, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (503) must match the size of tensor b (440) at non-singleton dimension 1

I have also tried RnaTokenizer.from_pretrained(model_name, cls_token=None, eos_token=None), but nothing changes except that the 503 becomes a 502.

JinGao changed discussion title from Is something wrong with the position embedding or tokenizer? to Is there something wrong with the position embedding or tokenizer?

RNABERT uses a learned position encoding.

Therefore, it cannot process sequences longer than 440 tokens (effectively 438 nucleotides in our implementation, since we prepend <CLS> and append <EOS> to the sequence).
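The arithmetic behind the error in the traceback can be sketched in plain Python (the constants 440 and the two special tokens come from the explanation above; the variable names are just for illustration):

```python
# RNABERT's learned position-embedding table has a fixed number of rows,
# so any input producing more tokens than that cannot be embedded.
MAX_POSITIONS = 440      # rows in the learned position-embedding table
NUM_SPECIAL_TOKENS = 2   # <CLS> prepended + <EOS> appended by the tokenizer

# Longest raw RNA sequence the model can accept:
max_sequence_length = MAX_POSITIONS - NUM_SPECIAL_TOKENS  # 438 nucleotides

# The traceback shows 503 tokens after tokenization, i.e. a 501-nt input
# plus the two special tokens -- longer than the embedding table:
input_tokens = 503
print(input_tokens > MAX_POSITIONS)  # True: this is the size mismatch
```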

If it’s OK to truncate the sequence, you can pass truncation=True to the tokenizer.
Otherwise, you should consider switching to another model that either supports longer sequences (like RNA-FM) or supports sequence-length extrapolation (like RiNALMo).
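A minimal sketch of what truncation does, assuming the usual Hugging Face behavior of keeping the leading tokens while preserving the special tokens (this helper is a simplified stand-in for the tokenizer's internal logic, not the actual implementation):

```python
def truncate_tokens(tokens, max_length, cls_token="<CLS>", eos_token="<EOS>"):
    """Truncate a tokenized sequence to max_length, keeping <CLS>/<EOS>.

    Simplified stand-in for tokenizer(..., truncation=True); real tokenizers
    implement this (and other truncation strategies) internally.
    """
    if len(tokens) <= max_length:
        return tokens
    body = [t for t in tokens if t not in (cls_token, eos_token)]
    kept = body[: max_length - 2]  # leave room for the two special tokens
    return [cls_token] + kept + [eos_token]

# A 503-token input (501 nucleotides plus <CLS>/<EOS>) truncated to 440:
tokens = ["<CLS>"] + ["A"] * 501 + ["<EOS>"]
truncated = truncate_tokens(tokens, 440)
print(len(truncated))  # 440, which fits the position-embedding table
```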

You can also try MultiMolecule's Dataset, which includes helper functions that handle truncation for you.

JinGao changed discussion status to closed
