Any limit to the input text length?

#2
by msperka - opened

Using the following code:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-morph', cache_dir=r"F:\nlp_project\py\dictabert-morph")
model = AutoModel.from_pretrained('dicta-il/dictabert-morph', trust_remote_code=True, cache_dir=r"F:\nlp_project\py\dictabert-morph")

res = model.predict([txt], tokenizer)
```

The result only tokenizes and processes the first few tokens of the text.

I can't attach a JSON or txt file here, so I'm pasting the full output in the comments.

DICTA: The Israel Center for Text Analysis org
•
edited Sep 10, 2023

The dictabert model was pretrained with a window of 512 tokens, so when you input a longer text, the predict function truncates it to that maximum length.
The dictabert-morph model was finetuned on sentence units (of length <512 tokens).

Therefore, the model can't handle inputs longer than 512 tokens, and it is probably not ideal for handling multiple concatenated sentences.

I'd recommend splitting your input into a list of sentence units and then sending that list to the predict function. (Note: sending multiple sentences will result in them being run through the model as a single batch, so if resources are limited it's probably best to send them one at a time.)
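A minimal sketch of that approach. The regex-based sentence splitter and the `run_morph` helper are illustrative assumptions, not part of the dictabert API; only `model.predict(sentences, tokenizer)` comes from the thread above, and for Hebrew text a dedicated sentence segmenter would likely be more reliable than this heuristic:

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break on sentence-ending punctuation
    # followed by whitespace. Swap in a proper segmenter for real use.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def run_morph(model, tokenizer, text):
    # Hypothetical helper: send sentences one at a time to keep memory
    # use low. Passing the whole list to predict() at once would instead
    # run them as a single batch, as noted above.
    results = []
    for sent in split_sentences(text):
        results.extend(model.predict([sent], tokenizer))
    return results
```

Each sentence stays well under the 512-token window, so nothing is silently truncated.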

Thank you

msperka changed discussion status to closed
