About Sequence Length on Reranker model

#8
by terilias - opened

Hello and thanks for publishing such great models!

I'm using "bge-reranker-large" to rerank text chunks retrieved in a RAG pipeline with "bge-large-en-v1.5". I repeatedly encounter the following error, indicating that the token limit was exceeded: {'error': 'Input validation error: inputs must have less than 512 tokens. Given: 556', 'error_type': 'Validation'}.

I create the chunks using the tokenizer from "bge-large-en-v1.5". However, I noticed that the "bge-reranker-large" tokenizer works differently: some chunks that are under 512 tokens with the embedding model's tokenizer exceed 512 tokens with the reranker's tokenizer. Consequently, I switched to the reranker's tokenizer for chunk creation to avoid exceeding the limit. Despite this, I still hit the same error, which makes me suspect I need to count the tokens of the chunk and the query together.

So, my question is: does the 512-token limit apply to each chunk on its own, or to the chunk-query pair? Do I need to keep the combined length of chunk and query under 512 tokens, or just the chunk?
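As a hedged sketch of why the counts disagree (the two models use different tokenizers, and the reranker scores the query and chunk as a single sequence), the following assumes network access to the Hugging Face Hub; the sample strings are made up for illustration:

```python
from transformers import AutoTokenizer

# Tokenizers of the two models named in this thread.
embed_tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
rerank_tok = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")

chunk = "some retrieved chunk text " * 100  # hypothetical long chunk
query = "what does the document say about tokenization?"

# Per-chunk counts differ because the vocabularies differ
# (BERT-style WordPiece vs. XLM-RoBERTa-style SentencePiece).
print(len(embed_tok(chunk)["input_ids"]))
print(len(rerank_tok(chunk)["input_ids"]))

# The reranker encodes (query, chunk) as ONE sequence, special tokens
# included, so the number that matters for the 512 limit is:
pair_len = len(rerank_tok(query, chunk)["input_ids"])
print(pair_len)
```

Note that `pair_len` is larger than the chunk's own count, which would explain the error even when every chunk individually fits within 512 tokens.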

Beijing Academy of Artificial Intelligence org

Thanks for your interest in our work!
For the reranker model, the query and passage are concatenated and fed into the model together. Therefore, you need to limit the total length of query + chunk.
If you use the tokenizer from the transformers library, you can set truncation=True and max_length=512 to truncate the input text.
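A minimal sketch of the suggested fix, assuming the tokenizer is loaded from the Hub; with a text pair, `truncation=True` defaults to the "longest_first" strategy, which trims the longer of the two texts first:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")

query = "what is a panda?"                                    # example inputs
long_chunk = "The giant panda is a bear native to China. " * 80

# Encode the pair with truncation so query + chunk never exceeds 512 tokens.
enc = tokenizer(query, long_chunk, truncation=True, max_length=512)
print(len(enc["input_ids"]))  # at most 512
```

The same `truncation=True, max_length=512` keyword arguments can be passed when batching pairs with `padding=True, return_tensors="pt"` before scoring them with the reranker.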

Thank you for the immediate response!

terilias changed discussion status to closed