token limit - warning

#3
by msperka - opened

Hi
As mentioned in Issue #2, "the model can't handle inputs of longer than 512 tokens."

Is there a warning I can get when I exceed the limit?
I split the text into sentences, and in most cases they are well within the limit, but there are exceptions. Is there any way to flag these exceptions before I run the "dictabert-morph" model?
Maybe I could run the tokenizer only (without the morphology), and if I reach 512 tokens I know I probably need to split further before running the morph model?

DICTA: The Israel Center for Text Analysis org

Right now the code automatically truncates the sentence to 512 tokens if it exceeds that length.
A good solution would be to run the tokenizer on its own and check whether the number of tokens exceeds 512.
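
A minimal sketch of that check, assuming the transformers AutoTokenizer and the dicta-il/dictabert-morph checkpoint id (adjust the name to whatever tokenizer you actually load):

```python
from transformers import AutoTokenizer

# Load only the tokenizer; the morph model itself is not needed for the length check.
tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert-morph")

def exceeds_limit(sentence: str, max_tokens: int = 512) -> bool:
    # Count token ids (including special tokens) without running the morph model.
    n_tokens = len(tokenizer(sentence)["input_ids"])
    return n_tokens > max_tokens

sentences = ["..."]  # your pre-split sentences
too_long = [s for s in sentences if exceeds_limit(s)]
if too_long:
    print(f"{len(too_long)} sentence(s) exceed the 512-token limit and should be split further.")
```

Sentences flagged this way can be split further before being passed to the morph model, so nothing gets silently truncated.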

Alternatively, if you have a preferred approach that would need to be added to the interface, feel free to make the modifications and open a PR. We welcome contributions :)

msperka changed discussion status to closed
