Any limit to the input text length?

#2
by msperka - opened

Using the following code:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-morph', cache_dir=r"F:\nlp_project\py\dictabert-morph")
model = AutoModel.from_pretrained('dicta-il/dictabert-morph', trust_remote_code=True, cache_dir=r"F:\nlp_project\py\dictabert-morph")

res = model.predict([txt], tokenizer)
```

The result only tokenizes and processes the first few tokens of the text.

I can't attach a JSON or txt file here, so I'm pasting the full output in the comments.

DICTA: The Israel Center for Text Analysis org
•
edited Sep 10, 2023

The dictabert model was pretrained with a window of 512 tokens, so when you input a longer text, the predict function truncates it to that maximum length.
The dictabert-morph model was finetuned on sentence units (of length <512 tokens).

Therefore, the model can't handle inputs longer than 512 tokens, and it is probably not ideal for handling multiple concatenated sentences.

I'd recommend splitting your input into a list of sentence units and then sending that list to the predict function. (Note: sending multiple sentences will result in them being run through the model as a single batch, so if resources are limited it's probably best to send them one at a time.)
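A minimal sketch of that approach. The regex-based sentence splitter and the `run_morph` helper are illustrative assumptions, not part of the dictabert API; only `model.predict(sentences, tokenizer)` comes from the thread above, and for Hebrew text a dedicated sentence segmenter would likely be more reliable than this heuristic:

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break on sentence-ending punctuation
    # followed by whitespace. Swap in a proper segmenter for real use.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def run_morph(model, tokenizer, text):
    # Hypothetical helper: send sentences one at a time to keep memory
    # use low. Passing the whole list to predict() at once would instead
    # run them as a single batch, as noted above.
    results = []
    for sent in split_sentences(text):
        results.extend(model.predict([sent], tokenizer))
    return results
```

Each sentence stays well under the 512-token window, so nothing is silently truncated.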

Thank you

msperka changed discussion status to closed
