djovak/embedic-large

@@ -2,14 +2,25 @@
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
 ---
 # djovak/embedic-large
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 <!--- Describe your model here -->
@@ -26,22 +37,54 @@ Then you can use the model like this:
 ```python
 from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
 model = SentenceTransformer('djovak/embedic-large')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
-## Evaluation Results
-<!--- Describe how your model was evaluated -->
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=djovak/embedic-large)
 ## Full Model Architecture
 ```
@@ -52,6 +95,6 @@ SentenceTransformer(
 )
 ```
-## Citing & Authors
-<!--- Describe where people can find more information -->

 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
+- mteb
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
+license: mit
+language:
+- multilingual
+- en
+- sr
 ---
 # djovak/embedic-large
+Say hello to **Embedić**, a family of new text embedding models fine-tuned for the Serbian language!
+These models are particularly useful for Information Retrieval and RAG. See the benchmark images in the Evaluation section: Embedić can beat the previous SOTA with 5x fewer parameters!
+Although specialized for Serbian (Cyrillic and Latin scripts), Embedić is cross-lingual (it understands English too), so you can embed English docs, Serbian docs, or a combination of the two :)
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 <!--- Describe your model here -->
 ```python
 from sentence_transformers import SentenceTransformer
+sentences = ["ko je Nikola Tesla?", "Nikola Tesla je poznati pronalazač", "Nikola Jokić je poznati košarkaš"]
 model = SentenceTransformer('djovak/embedic-large')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
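The embeddings printed above are usually compared with cosine similarity to rank documents against a query. The following is a minimal sketch of that step, not part of the original model card: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the toy 2-dimensional vectors stand in for real 1024-dimensional embeddings so the example runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption: placeholders for real embeddings).
query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.1]]
ranking = rank_documents(query, docs)
print(ranking)

# With the real model, treating the first sentence as the query:
#   embeddings = model.encode(sentences)
#   ranking = rank_documents(embeddings[0], embeddings[1:])
```

With the example sentences above, one would expect the Nikola Tesla sentence to rank ahead of the Nikola Jokić sentence for the query "ko je Nikola Tesla?", though the exact scores depend on the model.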
+### Important usage notes
+- "ošišana ćirilica" (using c instead of ć, etc.) significantly decreases search quality
+- Capitalizing named entities can significantly improve search quality
+## Evaluation
+### **Model description**:
+| Model Name | Dimension | Sequence Length | Parameters |
+|:----:|:---:|:---:|:---:|
+| [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 384 | 512 | 117M |
+| [djovak/embedic-small](https://huggingface.co/djovak/embedic-small) | 384 | 512 | 117M |
+| | | | |
+| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 768 | 512 | 278M |
+| [djovak/embedic-base](https://huggingface.co/djovak/embedic-base) | 768 | 512 | 278M |
+| | | | |
+| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 1024 | 512 | 560M |
+| [djovak/embedic-large](https://huggingface.co/djovak/embedic-large) | 1024 | 512 | 560M |
+`BM25-ENG` - Elasticsearch with English analyzer
+`BM25-SRB` - Elasticsearch with Serbian analyzer
+### Evaluation results
+Evaluation covers 3 tasks: Information Retrieval, Sentence Similarity, and Bitext Mining. I personally translated the STS17 cross-lingual evaluation dataset and spent $6,000 on the Google Translate API to translate 4 IR evaluation datasets into Serbian.
+Evaluation datasets will be published as part of the [MTEB benchmark](https://huggingface.co/spaces/mteb/leaderboard) in the near future.
+![information retrieval results](image-2.png)
+![sentence similarity results](image-1.png)
+## Contact
+If you have any questions or suggestions related to this project, you can open an issue or pull request. You can also email me at novakzivanic@gmail.com
 ## Full Model Architecture
 ```
 )
 ```
+## License
+Embedić models are licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.

image-1.png ADDED

image-2.png ADDED