Is there something wrong with the position embedding or tokenizer?

#1
by JinGao - opened

Is there something wrong with the position embedding or tokenizer?

File "/home/....../python3.11/site-packages/multimolecule/models/rnabert/modeling_rnabert.py", line 694, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (503) must match the size of tensor b (440) at non-singleton dimension 1

I have also tried RnaTokenizer.from_pretrained(model_name, cls_token=None, eos_token=None), but nothing changes except that the 503 becomes a 502.

JinGao changed discussion title from Is something wrong with the position embedding or tokenizer? to Is there something wrong with the position embedding or tokenizer?

RNABERT uses a learned position encoding.

Therefore, it cannot process sequences longer than 440 tokens (effectively 438 nucleotides in our implementation, since we prepend <CLS> and append <EOS> to the sequence).
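The arithmetic behind the error in the traceback can be sketched in plain Python (the constants 440 and the two special tokens come from the explanation above; the variable names are just for illustration):

```python
# RNABERT's learned position-embedding table has a fixed number of rows,
# so any input producing more tokens than that cannot be embedded.
MAX_POSITIONS = 440      # rows in the learned position-embedding table
NUM_SPECIAL_TOKENS = 2   # <CLS> prepended + <EOS> appended by the tokenizer

# Longest raw RNA sequence the model can accept:
max_sequence_length = MAX_POSITIONS - NUM_SPECIAL_TOKENS  # 438 nucleotides

# The traceback shows 503 tokens after tokenization, i.e. a 501-nt input
# plus the two special tokens -- longer than the embedding table:
input_tokens = 503
print(input_tokens > MAX_POSITIONS)  # True: this is the size mismatch
```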

If it’s OK to truncate the sequence, you can pass truncation=True to the tokenizer.
Otherwise, you should consider switching to another model that either supports longer sequences (like RNA-FM) or supports sequence-length extrapolation (like RiNALMo).
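A minimal sketch of what truncation does, assuming the usual Hugging Face behavior of keeping the leading tokens while preserving the special tokens (this helper is a simplified stand-in for the tokenizer's internal logic, not the actual implementation):

```python
def truncate_tokens(tokens, max_length, cls_token="<CLS>", eos_token="<EOS>"):
    """Truncate a tokenized sequence to max_length, keeping <CLS>/<EOS>.

    Simplified stand-in for tokenizer(..., truncation=True); real tokenizers
    implement this (and other truncation strategies) internally.
    """
    if len(tokens) <= max_length:
        return tokens
    body = [t for t in tokens if t not in (cls_token, eos_token)]
    kept = body[: max_length - 2]  # leave room for the two special tokens
    return [cls_token] + kept + [eos_token]

# A 503-token input (501 nucleotides plus <CLS>/<EOS>) truncated to 440:
tokens = ["<CLS>"] + ["A"] * 501 + ["<EOS>"]
truncated = truncate_tokens(tokens, 440)
print(len(truncated))  # 440, which fits the position-embedding table
```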

You can also try MultiMolecule's Dataset, which includes helper functions that handle truncation for you.

JinGao changed discussion status to closed
