Parameters for peak performance

#8
by cvdbdo - opened

Are there any stats on performance on the same dataset when changing the document chunk size, chunk strategy, languages, or model quantization?
By trial and error, it seems to me that smaller chunks (i.e., a few sentences max) tend to perform better.
I am trying to compare different embedders at their best, using the proper parameters for each.

StellaEncoder org

Hi @cvdbdo , this is a really valuable question 😄😄😄.
Stella was trained on data of max_length=512, so for stella, max_length < 1024 may be suitable.
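As a minimal sketch (assuming the sentence-transformers library; the model name and the 512-token figure come from this thread, everything else is illustrative), the input length can be capped like this:

```python
from sentence_transformers import SentenceTransformer

# Load stella and cap the input length to roughly match its training data (max_length=512).
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True)
model.max_seq_length = 512

embeddings = model.encode(["An example passage to embed."])
print(embeddings.shape)
```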

Your questions are quite broad and complicated; I can only offer some personal opinions:
document chunk size: Generally speaking, this is a balance between recall and reference length. For example, in RAG, when the chunk size is large, recall is better; however, the prompt becomes longer and contains more noisy text.

chunk strategy: Please refer to https://python.langchain.com/v0.2/docs/concepts/#text-splitters
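For illustration, a minimal sketch of one common splitter from that page (the chunk_size and chunk_overlap values are arbitrary examples, not recommendations):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical source document; in practice this comes from your PDF/HTML parser.
long_document_text = open("my_document.txt").read()

# Split the raw text into overlapping chunks of a few sentences each.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document_text)
```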

languages: dunzhang/stella_en_1.5B_v5 is for English.

model quantization: The model was trained with BF16. I am working on int8 and int4...
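Until quantized weights are released, one generic way to experiment is bitsandbytes 8-bit loading. This is only a sketch, not validated for stella, and it loads only the transformer backbone rather than the full embedding pipeline:

```python
from transformers import AutoModel, BitsAndBytesConfig

# Experimental: load the backbone in 8-bit (requires bitsandbytes); quality vs. BF16 is untested here.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
backbone = AutoModel.from_pretrained(
    "dunzhang/stella_en_1.5B_v5",
    trust_remote_code=True,
    quantization_config=bnb_config,
)
```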

it seems to me that smaller chunks (i.e., a few sentences max) tend to perform better.

Yes, same feeling, two reasons:

  1. Most models' training data is shorter than 1024 tokens.
  2. In most test data, e.g. (query, document) pairs, the answer sits in the first half of the document.

I really focus solely on proper retrieval here. If I want a bigger chunk for the generation part, there is no problem in also giving the LLM the chunks around the retrieved one for added context. So the decision about chunk size shouldn't even need to consider generation.
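For example, a minimal sketch of that neighbour-expansion idea (the function and variable names are hypothetical):

```python
def expand_context(chunks, hit_index, window=1):
    """Return the retrieved chunk plus its neighbours for the generation prompt."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return "\n\n".join(chunks[start:end])

# Retrieval scores only the small chunk; generation still sees its surroundings.
chunks = ["chunk 0 text", "chunk 1 text", "chunk 2 text"]
context = expand_context(chunks, hit_index=1, window=1)
```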
I also explored more complex chunk strategies. For example, when I have access to structured documents, I can construct chunks of the form:

Doc title
header 1
sub header 1.6
sub header 1.6.5
content

This structure has the advantage of preserving the "context" of the content, and it is my go-to for small embedding models such as intfloat's multilingual models. However, plainly using the same method with stella_1.5 gives much worse results.
Similarly, one can keep two embeddings, one for the titles and one for the content, and use a reranker to combine both in the retrieval process.
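A minimal sketch of the header-path chunk construction described above (the function name is hypothetical; the real logic depends on how your documents are parsed):

```python
def build_chunk(doc_title, header_path, content):
    # Prepend the title and header hierarchy so the chunk keeps its context.
    return "\n".join([doc_title, *header_path, content])

chunk = build_chunk(
    "Doc title",
    ["header 1", "sub header 1.6", "sub header 1.6.5"],
    "content",
)
```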
It seems to me that retrieval datasets all come, at best, with a fixed set of chunks associated with questions and answers, while in reality nobody starts with chunks. We all start with documents, usually unholy, unparsable, disgusting PDFs, so the chunking strategy is never set in stone, and testing all models on such "simple" datasets only shows half of the process.

StellaEncoder org

Hi, @cvdbdo
A long time ago, I had the idea of letting the vector model handle tabular and Markdown data, but time and effort constraints prevented me from doing it (it's a hobby project for me).
Now your comment strengthens my resolve to do this.

However, plainly using the same method with stella_1.5 gives much worse results.

  1. No Free Lunch Theorem: each domain or dataset has a most suitable model.
  2. Are you using the right prompt for the queries in the test set? (See the sketch after this list.)
  3. If your security policy allows sharing this test set, could you share it with me? It would be very helpful for optimising the model's performance!
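As a sketch of what the query prompt looks like (assuming sentence-transformers; the s2p_query prompt name is the one used for retrieval-style queries on the stella model card, so please double-check it against the card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True)

# Queries get the retrieval prompt; documents are encoded without a prompt.
query_embeddings = model.encode(["What chunk size works best?"], prompt_name="s2p_query")
doc_embeddings = model.encode(["Stella was trained on data of max_length=512."])

print(model.similarity(query_embeddings, doc_embeddings))
```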
