type mismatch
The config.json of "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5" specifies float32 as torch_dtype. The Qwen models are set to bfloat16.
Most frameworks will autocast float32 to float16. This will likely reduce quality, as the Qwen models are sensitive to bf16.
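For reference, a minimal sketch (assuming the standard transformers API; the expected outputs simply reflect the mismatch described above) of how to observe the declared dtype:

```python
# Quick check of the declared dtype vs. what gets loaded by default. Sketch only.
from transformers import AutoConfig, AutoModel

model_id = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"

config = AutoConfig.from_pretrained(model_id)
print(config.torch_dtype)  # torch.float32, as declared in config.json

# Without an explicit torch_dtype, the weights are loaded as stored (float32).
model = AutoModel.from_pretrained(model_id)
print(next(model.parameters()).dtype)  # torch.float32
```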
Thank you for your meticulous observations and suggestions.
We have identified that the inconsistency in torch_dtype within config.json originates from the function used to export the sentence_transformer in the training framework.
In other words, although our model was trained using bfloat16, the parameters were inadvertently saved as float32. However, specifying torch_dtype during inference loading should be a viable solution.
We will promptly verify the potential impact of the torch_dtype setting on the model's performance to determine whether modifications to the repository are necessary.
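For example, with plain transformers the workaround would look roughly like this (a sketch, not code from our training or evaluation pipeline):

```python
# Cast the float32 checkpoint to bfloat16 at load time, matching the training dtype.
import torch
from transformers import AutoModel

model_id = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"

model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
print(next(model.parameters()).dtype)  # torch.bfloat16
```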
E.g. trt-llm will autocast according to the following rules:
- bfloat16 -> bfloat16
- float16 -> float16
- float32 -> float16

Manual configuration is often not feasible, or is simply ignored.
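In other words, the policy boils down to a fixed mapping like the following (a toy sketch, not actual trt-llm code):

```python
import torch

# Declared checkpoint dtype -> dtype the engine actually runs in.
# Note that float32 is narrowed to float16, not bfloat16.
AUTOCAST_RULES = {
    torch.bfloat16: torch.bfloat16,
    torch.float16: torch.float16,
    torch.float32: torch.float16,
}

print(AUTOCAST_RULES[torch.float32])  # torch.float16 -> lossy for a bf16-trained model
```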
Thank you for sharing.
For other frameworks, such as sentence-transformers, we will need to conduct comprehensive testing when we have the time.
At present, it seems that the precision setting does not significantly affect performance.
Comparison: BF16 vs FP32 on CMTEB
We conducted a quick benchmark of sentence-transformers (via transformers) on CMTEB with different precision settings. The model was loaded using:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(model_name_or_path, model_kwargs={"torch_dtype": "bfloat16"})
```
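Continuing from the snippet above, embeddings are then produced in the usual way (illustrative sentences only; the actual evaluation uses the CMTEB task data):

```python
# Sanity check that encoding works as usual under bf16 compute.
sentences = ["今天天气怎么样?", "What is the weather like today?"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)
```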
Results
Minimal performance difference between BF16 and FP32:
| Data Type | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg |
|---|---|---|---|---|---|---|---|
| FP32 | 70.11 | 51.57 | 72.94 | 70.94 | 64.38 | 57.32 | 64.13 |
| BF16 | 70.14 | 51.58 | 72.94 | 70.90 | 64.30 | 57.31 | 64.12 |
Conclusion
Precision has minimal impact on benchmark quality; however, it significantly affects computational efficiency.
We are considering whether to push an update (a new version or a separate repository) that stores the model parameters in bf16 precision. This would help avoid unnecessary additional inference costs when users overlook the manual dtype setting.
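A minimal sketch of what such an update could involve (assuming sentence-transformers; the output path is hypothetical):

```python
# Re-export the checkpoint with bfloat16 weights so that config.json and the stored
# parameters both reflect bf16, and downstream frameworks pick it up by default.
import torch
from sentence_transformers import SentenceTransformer

model_id = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"
output_dir = "kalm-embedding-mini-instruct-v1.5-bf16"  # hypothetical local path

model = SentenceTransformer(model_id, model_kwargs={"torch_dtype": torch.bfloat16})
model.save(output_dir)
```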
Cool, thanks! I just wanted to clarify that most frameworks (trt-llm/vllm/sglang) will likely "autocast" float32 to float16 (and not bfloat16) for performance reasons.