Is there something wrong with the position embedding or tokenizer?
```
  File "/home/....../python3.11/site-packages/multimolecule/models/rnabert/modeling_rnabert.py", line 694, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (503) must match the size of tensor b (440) at non-singleton dimension 1
```
I have also tried `RnaTokenizer.from_pretrained(model_name, cls_token=None, eos_token=None)`, but nothing changes except that the 503 turns into 502.
RNABERT uses a learned position embedding, so it cannot process sequences longer than 440 tokens (effectively 438 nucleotides in our implementation, since we prepend `<CLS>` and append `<EOS>` to the sequence).
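As a quick sanity check (a standalone sketch, not a library function; one token per nucleotide plus the two special tokens is assumed from the description above), you can verify whether a sequence fits before running the model:

```python
def fits_rnabert(num_nucleotides: int, max_positions: int = 440, num_special: int = 2) -> bool:
    """Check whether a sequence fits RNABERT's learned position table.

    Assumes one token per nucleotide plus <CLS> and <EOS>.
    """
    return num_nucleotides + num_special <= max_positions

print(fits_rnabert(438))  # longest sequence that fits (438 + 2 = 440)
print(fits_rnabert(501))  # 501 + 2 = 503 tokens, matching the error above
```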
If it’s OK to truncate the sequence, you can pass `truncation=True` to the tokenizer.
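If you would rather keep the whole sequence than truncate it, one simple workaround (a sketch, not a library feature) is to split it into windows that fit the limit and run the model on each window separately:

```python
def chunk_rna(seq: str, max_nt: int = 438, overlap: int = 0) -> list[str]:
    """Split an RNA string into windows of at most max_nt nucleotides.

    438 leaves room for the <CLS> and <EOS> tokens within the
    440-token position limit.
    """
    step = max_nt - overlap
    return [seq[i:i + max_nt] for i in range(0, len(seq), step)]

chunks = chunk_rna("ACGU" * 200)   # 800 nt -> two windows
print([len(c) for c in chunks])    # -> [438, 362]
```

With `overlap=0` the windows are disjoint and concatenate back to the original sequence; a positive `overlap` gives each window some context from the previous one at the cost of redundant computation.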
Otherwise, you should consider switching to a model that either supports longer sequences (like RNA-FM) or supports sequence-length extrapolation (like RiNALMo).
You can also try `MultiMolecule.Dataset`, which includes helper functions that truncate everything for you.