Always predicts <unk> issue

#2
by matiasvant - opened

Hi Logan, great repo! Was the first thing I looked for when I saw ESM-C.

Following your instructions exactly, I'm getting an issue where ESM++_small always predicts 'unk' over all residues. The large model similarly predicts the same fixed set of one or two tokens regardless of input. My best guess is a tokenizer issue: replacing every input token with a random token produces the same output, it happens for both models, and the actual weights appear to be identical between ESM++_small and ESMC-300. It could also be related to ESM-C's vocab/output being 64-dim while this repo's is 32-dim.

Quick notebook demonstrating the above: https://colab.research.google.com/drive/1OYNYGQzzbBTQRM-O1qhBXJsu9ixI19gS?usp=sharing
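For reference, here is a minimal sketch of what the notebook checks (the loading calls, i.e. AutoModelForMaskedLM with trust_remote_code and the model.tokenizer attribute, are my assumptions about typical ESM++ usage and may differ from the repo's exact API):

```python
import torch
from transformers import AutoModelForMaskedLM

# Assumption: ESM++ loads as a masked-LM with custom remote code and
# exposes its tokenizer on the model object.
model = AutoModelForMaskedLM.from_pretrained(
    "Synthyra/ESMplusplus_small", trust_remote_code=True
)
tokenizer = model.tokenizer
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Argmax token at every position; observed output is '<unk>' (or one
# fixed token) at every residue regardless of the input sequence.
pred_ids = logits.argmax(dim=-1)[0]
print(tokenizer.convert_ids_to_tokens(pred_ids.tolist()))
```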

Any chance you might know what's causing this?

Synthyra org

Hi @matiasvant ,

Apologies for the late response. Glad you like the repo!

This happens to be an interesting quirk with ESMC - the sequence head produces nonsense if there are no mask tokens in the input. In my opinion this limits its usefulness for mutagenesis studies, but the mask recovery rate (natively or after further fine-tuning) is very high, and it produces quality embeddings regardless. You can see the issue I opened on the official repo about this here.
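In case it helps, here is a rough sketch of the masked-prediction workflow I mean (the loading details, AutoModelForMaskedLM, trust_remote_code, and model.tokenizer, are placeholders; adapt them to however you're already loading the model):

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "Synthyra/ESMplusplus_small", trust_remote_code=True
)
tokenizer = model.tokenizer
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")

# Mask one residue (index 5 in the raw sequence, offset by 1 for the
# BOS/CLS token) so the sequence head has something to recover.
masked_ids = inputs["input_ids"].clone()
pos = 5 + 1
masked_ids[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(
        input_ids=masked_ids, attention_mask=inputs["attention_mask"]
    ).logits

# With a mask token present, the prediction at the masked position is
# usually the true residue; without any masks, the head degenerates.
pred = tokenizer.convert_ids_to_tokens(logits[0, pos].argmax().item())
print(f"true residue: {seq[5]}, predicted: {pred}")
```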

If you could share some more details about what you'd like to do with the model, I'm happy to help brainstorm a solution to this problem.
Best,
Logan
