	---
	library_name: sentence-transformers
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	base_model:
	- avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI
	- Snowflake/snowflake-arctic-embed-l-v2.0
	base_model_relation: merge
	widget:
	- source_sentence: 'search_query: i love autotrain'
	  sentences:
	  - 'search_query: huggingface auto train'
	  - 'search_query: hugging face auto train'
	  - 'search_query: i love autotrain'
	pipeline_tag: sentence-similarity
	datasets:
	- avemio/German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI
	license: mit
	language:
	- de
	- en
	---

	# German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI

	This is a merged [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
	Our [German-RAG-BGE-M3-MERGED model](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI/) was merged with [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) with the aim of outperforming each of the two base models.

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Maximum Sequence Length: 8192 tokens
	- Output Dimensionality: 1024 dimensions
	- Similarity Function: Cosine Similarity
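
As a rough illustration of the similarity function listed above, cosine similarity between two embedding vectors can be computed as follows. This is a minimal pure-Python sketch: the helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, whereas real embeddings from this model have 1024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed for illustration only).
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))  # parallel vectors score close to 1
print(cosine_similarity(u, w))  # orthogonal vectors score close to 0
```

Because the score depends only on direction, scaling a vector does not change it, which is why `u` and `v` are treated as identical in meaning here.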

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
	  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	  (2): Normalize()
	)
	```
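
The printout above suggests a three-step pipeline: the transformer produces one vector per token, the pooling layer keeps only the vector of the first (CLS) token, and the result is scaled to unit length. The following sketch mimics the last two steps on plain Python lists. The function names and the toy two-token, 3-dimensional input are assumptions for illustration and do not reproduce the library's internals.

```python
import math

def cls_pool(token_embeddings):
    """Keep the embedding of the first token (the CLS position)."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

# Toy transformer output: one row per token, CLS token first.
token_embeddings = [
    [3.0, 0.0, 4.0],  # CLS token
    [1.0, 1.0, 1.0],  # an ordinary token, ignored by CLS pooling
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Since the final vectors have unit length, the dot product of two embeddings equals their cosine similarity.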

	## Evaluation on MTEB Tasks

	### Classification
	- AmazonCounterfactualClassification
	- AmazonReviewsClassification
	- MassiveIntentClassification
	- MassiveScenarioClassification
	- MTOPDomainClassification
	- MTOPIntentClassification

	### Pair Classification
	- FalseFriendsGermanEnglish
	- PawsXPairClassification

	### Retrieval
	- GermanQuAD-Retrieval
	- GermanDPR

	### STS (Semantic Textual Similarity)
	- GermanSTSBenchmark

	#### Comparison between the Snowflake Arctic model ([Snowflake](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0)), our merged BGE-M3 model ([Merged-BGE](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI)), and Merged-BGE merged with Snowflake Arctic (Merged-Snowflake)

	The difference columns in this table are absolute score differences expressed in percentage points (for example, 0.7111 - 0.6587 = 0.0524, shown as 5.24%).

	| TASK | Snowflake | Merged-BGE | Merged-Snowflake | Merged-BGE vs. Snowflake | Merged-Snowflake vs. Snowflake | Merged-Snowflake vs. Merged-BGE |
	|------|-----------|------------|------------------|--------------------------|--------------------------------|---------------------------------|
	| AmazonCounterfactualClassification | 0.6587 | 0.7111 | 0.7152 | 5.24% | 5.65% | 0.41% |
	| AmazonReviewsClassification | 0.3697 | 0.4571 | 0.4577 | 8.74% | 8.80% | 0.06% |
	| FalseFriendsGermanEnglish | 0.5360 | 0.5338 | 0.5378 | -0.22% | 0.18% | 0.40% |
	| GermanQuAD-Retrieval | 0.9423 | 0.9311 | 0.9456 | -1.12% | 0.33% | 1.45% |
	| GermanSTSBenchmark | 0.7499 | 0.8218 | 0.8558 | 7.19% | 10.59% | 3.40% |
	| MassiveIntentClassification | 0.6778 | 0.6522 | 0.6826 | -2.56% | 0.48% | 3.04% |
	| MassiveScenarioClassification | 0.7375 | 0.7381 | 0.7494 | 0.06% | 1.19% | 1.13% |
	| GermanDPR | 0.8367 | 0.8159 | 0.8330 | -2.08% | -0.37% | 1.71% |
	| MTOPDomainClassification | 0.9080 | 0.9139 | 0.9259 | 0.59% | 1.79% | 1.20% |
	| MTOPIntentClassification | 0.6675 | 0.6684 | 0.7143 | 0.09% | 4.68% | 4.59% |
	| PawsXPairClassification | 0.5887 | 0.5710 | 0.5803 | -1.77% | -0.84% | 0.93% |

	#### Comparison between the original base model ([BGE-M3](https://huggingface.co/BAAI/bge-m3)), our merged BGE-M3 model ([Merged-BGE](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI/)), and Merged-BGE merged with [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) (Merged-Snowflake)

	The difference columns in this table appear to be relative changes in percent (for example, 0.7111 / 0.6908 - 1 ≈ 2.94%).

	| TASK | [BGE-M3](https://huggingface.co/BAAI/bge-m3) | Merged-BGE | [Merged-Snowflake](https://huggingface.co/avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI/) | Merged-BGE vs. BGE | Merged-Snowflake vs. BGE | Merged-Snowflake vs. Merged-BGE |
	|------|--------|------------|------------------|--------------------|--------------------------|---------------------------------|
	| AmazonCounterfactualClassification | 0.6908 | 0.7111 | 0.7152 | 2.94% | 3.53% | 0.58% |
	| AmazonReviewsClassification | 0.4634 | 0.4571 | 0.4577 | -1.36% | -1.23% | 0.13% |
	| FalseFriendsGermanEnglish | 0.5343 | 0.5338 | 0.5378 | -0.09% | 0.66% | 0.75% |
	| GermanQuAD-Retrieval | 0.9444 | 0.9311 | 0.9456 | -1.41% | 0.13% | 1.56% |
	| GermanSTSBenchmark | 0.8079 | 0.8218 | 0.8558 | 1.72% | 5.93% | 4.14% |
	| MassiveIntentClassification | 0.6575 | 0.6522 | 0.6826 | -0.81% | 3.82% | 4.66% |
	| MassiveScenarioClassification | 0.7355 | 0.7381 | 0.7494 | 0.35% | 1.89% | 1.53% |
	| GermanDPR | 0.8265 | 0.8159 | 0.8330 | -1.28% | 0.79% | 2.10% |
	| MTOPDomainClassification | 0.9121 | 0.9139 | 0.9259 | 0.20% | 1.52% | 1.31% |
	| MTOPIntentClassification | 0.6808 | 0.6684 | 0.7143 | -1.82% | 4.91% | 6.87% |
	| PawsXPairClassification | 0.5678 | 0.5710 | 0.5803 | 0.56% | 2.18% | 1.63% |
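
The difference columns in the two MTEB tables above do not seem to follow the same convention: the values in the Snowflake comparison match absolute differences in percentage points, while those in the BGE-M3 comparison match relative changes in percent. The sketch below reproduces both calculations for the AmazonCounterfactualClassification row. The function names are hypothetical, and the rounding to two decimals is an assumption about how the tables were produced.

```python
def absolute_difference_points(new, old):
    """Difference of two scores in [0, 1], expressed in percentage points."""
    return round((new - old) * 100, 2)

def relative_change_percent(new, old):
    """Change of `new` relative to `old`, expressed in percent."""
    return round((new / old - 1) * 100, 2)

# AmazonCounterfactualClassification scores taken from the tables above.
snowflake, bge_m3, merged_bge, merged_snowflake = 0.6587, 0.6908, 0.7111, 0.7152

print(absolute_difference_points(merged_bge, snowflake))         # Snowflake table: 5.24
print(relative_change_percent(merged_bge, bge_m3))               # BGE-M3 table: 2.94
print(absolute_difference_points(merged_snowflake, merged_bge))  # Snowflake table: 0.41
print(relative_change_percent(merged_snowflake, merged_bge))     # BGE-M3 table: 0.58
```

This explains why the shared "Merged-Snowflake vs. Merged-BGE" column shows different numbers in the two tables even though the underlying scores are identical.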

	## Evaluation on German-RAG-EMBEDDING-BENCHMARK

	Accuracy is the fraction of examples in which the embedding of the relevant context ranks highest among all contexts in the context array.
	The evaluation dataset and evaluation code are available [here](https://huggingface.co/datasets/avemio/German-RAG-EMBEDDING-BENCHMARK).

	| Model Name | Accuracy |
	|------------|----------|
	| [bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.8806 |
	| [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 0.8393 |
	| [German-RAG-BGE-M3-TRIPLES-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-HESSIAN-AI) | 0.8857 |
	| [German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI) | 0.8866 |
	| [German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI) | 0.8866 |
	| [German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI) | 0.8763 |
	| [German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI) | 0.8771 |
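
The accuracy metric used in this section can be read as top-1 retrieval accuracy. The sketch below is one plausible way to compute it and is not the benchmark's actual evaluation code, which is linked above. The function names, the tuple layout of each example, and the hand-made 2-dimensional vectors are assumptions for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top1_accuracy(examples):
    """Fraction of examples whose relevant context receives the highest score.

    Each example is (question_embedding, context_embeddings, relevant_index).
    Embeddings are assumed to be L2-normalized, so the dot product equals
    the cosine similarity.
    """
    hits = 0
    for question, contexts, relevant_index in examples:
        scores = [dot(question, context) for context in contexts]
        best_index = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best_index == relevant_index)
    return hits / len(examples)

# Toy unit vectors standing in for real model embeddings.
examples = [
    ([1.0, 0.0], [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]], 0),  # hit: index 0 scores highest
    ([0.0, 1.0], [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], 1),  # miss: index 2 scores highest
]

print(top1_accuracy(examples))  # 0.5
```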

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI")
	# Run inference
	sentences = [
	    'The weather is lovely today.',
	    "It's so sunny outside!",
	    'He drove to the stadium.',
	]
	embeddings = model.encode(sentences)  # NumPy array by default
	print(embeddings.shape)
	# (3, 1024)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)  # torch.Tensor
	print(similarities.shape)
	# torch.Size([3, 3])
	```
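
Since the model targets German RAG use cases, a typical next step is ranking candidate passages for a query. The sketch below builds on the same `encode` and `similarity` calls used above. The helper `rank_documents`, the `main` wrapper, and the German example sentences are illustrative assumptions, and no output is shown because the scores depend on the downloaded weights.

```python
def rank_documents(scores, documents, top_k=3):
    """Pair each document with its score and return the top_k best matches."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

def main():
    # Imported here so that rank_documents can be used without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI")
    query = "Wie wird das Wetter heute?"
    documents = [
        "Heute scheint die Sonne und es bleibt trocken.",
        "Er fuhr mit dem Auto zum Stadion.",
        "Die Bibliothek öffnet um neun Uhr.",
    ]
    query_embedding = model.encode([query])
    document_embeddings = model.encode(documents)
    # similarity() returns a 1 x len(documents) matrix of cosine similarities.
    scores = model.similarity(query_embedding, document_embeddings)[0].tolist()
    for document, score in rank_documents(scores, documents):
        print(f"{score:.4f}  {document}")

# Call main() to run the example; this downloads the model on first use.
```

Keeping the ranking logic separate from model loading makes it easy to reuse with precomputed scores.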


	## Training Details

	### Framework Versions
	- Python: 3.10.12
	- Sentence Transformers: 3.2.1
	- Transformers: 4.44.2
	- PyTorch: 2.4.1+cu121
	- Accelerate: 0.34.2
	- Datasets: 3.0.1
	- Tokenizers: 0.19.1
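
To compare a local environment against the versions listed above, something like the following could be used. It is a minimal sketch built on the standard library's `importlib.metadata`. The `EXPECTED` mapping and the `compare_versions` helper are hypothetical names, the distribution names are assumed to be the usual PyPI ones, and the PyTorch entry drops the `+cu121` build suffix.

```python
from importlib import metadata

# Versions copied from the list above; keys are assumed PyPI distribution names.
EXPECTED = {
    "sentence-transformers": "3.2.1",
    "transformers": "4.44.2",
    "torch": "2.4.1",  # listed as 2.4.1+cu121 (CUDA 12.1 build)
    "accelerate": "0.34.2",
    "datasets": "3.0.1",
    "tokenizers": "0.19.1",
}

def compare_versions(expected, lookup=metadata.version):
    """Map each package to a (listed_version, installed_version_or_None) pair."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed)
    return report

for package, (wanted, installed) in compare_versions(EXPECTED).items():
    matches = installed is not None and installed.startswith(wanted)
    status = "ok" if matches else "differs"
    print(f"{package}: listed {wanted}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the list only records the environment that was used.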

	## Citation

	```
	@misc{bge-m3,
	title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
	author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
	year={2024},
	eprint={2402.03216},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```


	## The German-RAG AI Team
	- [Marcel Rosiak](https://de.linkedin.com/in/marcel-rosiak)
	- [Soumya Paul](https://de.linkedin.com/in/soumya-paul-1636a68a)
	- [Siavash Mollaebrahim](https://de.linkedin.com/in/siavash-mollaebrahim-4084b5153?trk=people-guest_people_search-card)
	- [Zain ul Haq](https://de.linkedin.com/in/zain-ul-haq-31ba35196)