Truncating Text

#76
by frbackup - opened

I am trying to run this model on content that is 250-500 words long. However, the model card says that it truncates text after 256 words. Any suggestions for similar models that can handle up to 500 words?

Cheers, I hope you are doing well,

Truncation does not mean that no context is given to the model. The term comes from tokenization: the model's neural network must ingest inputs of a fixed length, so the text is tokenized and cut to that length. For example, if a model truncates text after 256 tokens (word pieces), longer texts are cut down to the first 256 tokens, while shorter texts are padded (extended) to 256, because the network always accepts inputs of the same length.
To answer your question directly: for embedding models, exceeding the limit with 500 words usually has only a slight effect, since most of the context will still be captured.
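If it helps to see this in practice, here is a minimal sketch (assuming the sentence-transformers package is installed; the model name below is a stand-in, since the model this thread belongs to isn't named here):

```python
from sentence_transformers import SentenceTransformer

# Stand-in model name; substitute the model you are actually using.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 -> tokens beyond this are dropped

short_text = "A short sentence."
long_text = "word " * 500  # roughly 500 words, well past the 256-token limit

# Both inputs yield an embedding of the same dimensionality: the long text
# is truncated to max_seq_length tokens, the short one is padded internally.
print(model.encode(short_text).shape)  # e.g. (384,)
print(model.encode(long_text).shape)   # e.g. (384,)
```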
In any case, you can always consider the additional models in the official Sentence Transformers documentation: https://sbert.net/docs/sentence_transformer/pretrained_models.html.
I have also found some models that handle lengths of 512 tokens; they are hosted on the HF Hub, like this one:
sentence-transformers/gtr-t5-xxl
gtr-t5-large (a bit larger than the others at around 640 MB, but very good for embeddings in general)
all-mpnet-base-v1 (recommended, as I have used it in my research)
multi-qa-mpnet-base-dot-v1 (gave the best results for semantic search in my research and is quite light at around 420 MB)
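For example, a quick sketch of loading one of these (double-check the sequence limit on each model card, since it varies per model):

```python
from sentence_transformers import SentenceTransformer

# One of the models listed above; its card states a 512-token input limit.
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
print(model.max_seq_length)  # 512

docs = [
    "First passage of 250-500 words ...",
    "Second passage ...",
]
embeddings = model.encode(docs)
print(embeddings.shape)  # (2, 768) -- one 768-dimensional vector per passage
```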

Let me know if you have any additional questions.
Yuriy Perezhohin

@yuriyvnv - Would you mind elaborating on this? It seems like I need to limit my chunk sizes to 256 tokens; otherwise, the embedding doesn't capture the full context. Is this correct, or am I misunderstanding something? Thanks in advance!
