afschowdhury
/

retrieval-mpnet-bn

@@ -7,46 +7,74 @@ tags:
 - transformers
 - dense-passage-retrieval
 widget:
-- source_sentence: আফগানিস্তান কত রান করেছিল
   sentences:
-  - >-
-    ম্যাচটা সিকান্দার রাজারই ছিল। অন্তত রান তাড়ায় নামা শ্রীলঙ্কার ইনিংসের ১৫
-    ওভার পর্যন্ত অবশ্যই। কিন্তু ব্যাটে বলে দারুণ খেলা জিম্বাবুয়ে অধিনায়ককে হাসতে
-    দিলেন না শ্রীলঙ্কার দুই অভিজ্ঞ ক্রিকেটার। অ্যাঞ্জেলো ম্যাথুস-দাসুন শানাকার
-    সপ্তম উইকেট জুটি ম্যাচ বের করে নেয় জিম্বাবুয়ের নাগাল থেকে। ম্যাথুস অবশ্য
-    দলকে জিতিয়ে ফিরতে পারেননি। তিনি যখন আউট হন, ২ বলে ৬ রান দরকার শ্রীলঙ্কার।
-    দুষ্মন্ত চামিরা ৪ ও ২ রান নিয়ে শেষ বলে গড়ানো ম্যাচে জয় এনে দলকে।
-  - >-
-    অক্ষর প্যাটেল ও অর্শদীপ সিংয়ের দারুণ বোলিংয়ের পর যশস্বী জয়সোয়াল ও শিবম দুবের
-    জোড়া অর্ধশতকে ইন্দোরে সহজ জয়ে এক ম্যাচ বাকি থাকতেই সিরিজ জিতেছে ভারত।
-    ইন্দোরে তিনে নামা গুলবদিন নাইবের ৩৫ বলে ৫৭ রানের ইনিংসে আফগানিস্তান তুলেছিল
-    ১৭২ রান, কিন্তু ভারত সেটি পেরিয়ে গেছে ২৬ বল ও ৬ উইকেট বাকি রেখেই।
-  - >-
-    এদিন প্রথম থেকে আক্রমণ ও বল দখলে এগিয়ে ছিল মিসরই। প্রতিযোগিতার সবচেয়ে সফল
-    দলটির এগিয়ে যেতে সময় লাগে মাত্র ২ মিনিট। বাঁ পাশ থেকে আসা ক্রসে সালাহ চেষ্টা
-    করেও ঠিকঠাক সংযোগ ঘটাতে পারেননি। তবে তাঁর পায়ের ছোঁয়ায় বল আসে মোস্তফা
-    মোহাম্মদের কাছে। ভুল করেননি এই ফরোয়ার্ড। দারুণ ফিনিশিংয়ে গোল করে এগিয়ে দেন
-    দলকে।
-  - >-
-    আবহাওয়া বেলুনটি ঢাকা থেকে ১২০ কিলোমিটার দূরে কুমিল্লায় অক্ষত অবস্থায় অবতরণ
-    করে। আবহাওয়া পর্যবেক্ষণ বেলুনটি বায়ুমণ্ডলের বিভিন্ন উচ্চতায় তাপমাত্রা,
-    আর্দ্রতা, বাতাসের গতি এবং বায়ুমণ্ডলের অবস্থা পরিমাপ করার জন্য তৈরি করা
-    হয়েছে। এক সংবাদ বিজ্ঞপ্তিতে এ তথ্য জানিয়েছে এআইইউবি।
-  example_title: Bengali News Example
 language:
 - bn
-datasets:
-- afschowdhury/mujib-dataset
-- csebuetnlp/squad_bn
 ---
-# {MODEL_NAME}
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 <!--- Describe your model here -->
-## Usage (Sentence-Transformers)
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -57,96 +85,114 @@ pip install -U sentence-transformers
 Then you can use the model like this:
 ```python
-from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('{MODEL_NAME}')
-embeddings = model.encode(sentences)
-print(embeddings)
 ```
-## Usage (HuggingFace Transformers)
 Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
 ```python
 from transformers import AutoTokenizer, AutoModel
 import torch
-#Mean Pooling - Take attention mask into account for correct averaging
 def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']
-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
-model = AutoModel.from_pretrained('{MODEL_NAME}')
-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-# Perform pooling. In this case, mean pooling.
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-print("Sentence embeddings:")
-print(sentence_embeddings)
-```
-## Evaluation Results
-<!--- Describe how your model was evaluated -->
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-## Training
-The model was trained with the parameters:
-**DataLoader**:
-`torch.utils.data.dataloader.DataLoader` of length 30470 with parameters:
-```
-{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
 ```
-**Loss**:
-`sentence_transformers.losses.MSELoss.MSELoss`
-Parameters of the fit()-Method:
-```
-{
-    "epochs": 2,
-    "evaluation_steps": 0,
-    "evaluator": "NoneType",
-    "max_grad_norm": 1,
-    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-    "optimizer_params": {
-        "eps": 1e-06,
-        "lr": 2e-05
-    },
-    "scheduler": "WarmupLinear",
-    "steps_per_epoch": null,
-    "warmup_steps": 6094,
-    "weight_decay": 0.01
-}
-```
 ## Full Model Architecture
 ```
 SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
@@ -154,6 +200,6 @@ SentenceTransformer(
 )
 ```
-## Citing & Authors
-<!--- Describe where people can find more information -->

 - transformers
 - dense-passage-retrieval
 widget:
+- source_sentence: "আফগানিস্তান কত রান করেছিল"
   sentences:
+        - "ম্যাচটা সিকান্দার রাজারই ছিল। অন্তত রান তাড়ায় নামা শ্রীলঙ্কার ইনিংসের ১৫ ওভার পর্যন্ত অবশ্যই। কিন্তু ব্যাটে বলে দারুণ খেলা জিম্বাবুয়ে অধিনায়ককে হাসতে দিলেন না শ্রীলঙ্কার দুই অভিজ্ঞ ক্রিকেটার। অ্যাঞ্জেলো ম্যাথুস-দাসুন শানাকার সপ্তম উইকেট জুটি ম্যাচ বের করে নেয় জিম্বাবুয়ের নাগাল থেকে। ম্যাথুস অবশ্য দলকে জিতিয়ে ফিরতে পারেননি। তিনি যখন আউট হন, ২ বলে ৬ রান দরকার শ্রীলঙ্কার। দুষ্মন্ত চামিরা ৪ ও ২ রান নিয়ে শেষ বলে গড়ানো ম্যাচে জয় এনে দলকে। "
+        - "অক্ষর প্যাটেল ও অর্শদীপ সিংয়ের দারুণ বোলিংয়ের পর যশস্বী জয়সোয়াল ও শিবম দুবের জোড়া অর্ধশতকে ইন্দোরে সহজ জয়ে এক ম্যাচ বাকি থাকতেই সিরিজ জিতেছে ভারত। ইন্দোরে তিনে নামা গুলবদিন নাইবের ৩৫ বলে ৫৭ রানের ইনিংসে আফগানিস্তান তুলেছিল ১৭২ রান, কিন্তু ভারত সেটি পেরিয়ে গেছে ২৬ বল ও ৬ উইকেট বাকি রেখেই।"
+        - "এদিন প্রথম থেকে আক্রমণ ও বল দখলে এগিয়ে ছিল মিসরই। প্রতিযোগিতার সবচেয়ে সফল দলটির এগিয়ে যেতে সময় লাগে মাত্র ২ মিনিট। বাঁ পাশ থেকে আসা ক্রসে সালাহ চেষ্টা করেও ঠিকঠাক সংযোগ ঘটাতে পারেননি। তবে তাঁর পায়ের ছোঁয়ায় বল আসে মোস্তফা মোহাম্মদের কাছে। ভুল করেননি এই ফরোয়ার্ড। দারুণ ফিনিশিংয়ে গোল করে এগিয়ে দেন দলকে।"
+        - "আবহাওয়া বেলুনটি ঢাকা থেকে ১২০ কিলোমিটার দূরে কুমিল্লায় অক্ষত অবস্থায় অবতরণ করে। আবহাওয়া পর্যবেক্ষণ বেলুনটি বায়ুমণ্ডলের বিভিন্ন উচ্চতায় তাপমাত্রা, আর্দ্রতা, বাতাসের গতি এবং বায়ুমণ্ডলের অবস্থা পরিমাপ করার জন্য তৈরি করা হয়েছে। এক সংবাদ বিজ্ঞপ্তিতে এ তথ্য জানিয়েছে এআইইউবি।"
+  example_title: "Bengali News Example"
+# widget:
+# - source_sentence: "That is a happy person"
+#   sentences:
+#     - "That is a happy dog"
+#     - "That is a very happy person"
+#     - "Today is a sunny day"
+#   example_title: "Happy"
 language:
 - bn
 ---
+# `retrival-mpnet-bn`
+This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like **clustering** or **semantic search**.
 <!--- Describe your model here -->
+## Model Details
+- Model name: retrival-mpnet-bn
+- Model version: 1.0
+- Architecture: Sentence Transformer
+- Language: Multilingual (fine-tuned for Bengali)
+## Training
+The model was fine-tuned using the **Multilingual Knowledge Distillation** method. As the teacher model, we took [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) and added a mean-token pooling layer:
+```python
+from sentence_transformers import models, SentenceTransformer
+mpnet_model = models.Transformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')
+pooling_model = models.Pooling(mpnet_model.get_word_embedding_dimension(),
+                               pooling_mode_mean_tokens=True,
+                               pooling_mode_cls_token=False,
+                               pooling_mode_max_tokens=False)
+teacher = SentenceTransformer(modules=[mpnet_model, pooling_model])
+```
+We used `xlm-roberta-large` as the student model because it is multilingual and already works relatively well for Bengali.
+![image](https://i.ibb.co/8Xrgnfr/sentence-transformer-model.png)
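
The distillation objective can be illustrated with a toy example. The sketch below is not the training script used for this model. It assumes the usual formulation of multilingual knowledge distillation, in which the student is trained with a mean-squared-error loss to reproduce the teacher's embedding of an English sentence, both for that sentence and for its Bengali translation. The 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and `distillation_loss` is a hypothetical helper name.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Sum of two MSE terms (assumed formulation, not the authors' code).

    teacher_src: teacher embedding of the source (English) sentence
    student_src: student embedding of the same source sentence
    student_tgt: student embedding of the translated (Bengali) sentence
    """
    return mse(student_src, teacher_src) + mse(student_tgt, teacher_src)


# Made-up toy embeddings, for illustration only.
teacher_en = [0.2, -0.5, 0.9]
student_en = [0.1, -0.4, 1.0]   # already close to the teacher
student_bn = [0.6, 0.0, 0.3]    # still far from the teacher

loss = distillation_loss(teacher_en, student_en, student_bn)
print(f"distillation loss: {loss:.4f}")
```

Minimizing this loss pulls both student embeddings toward the teacher's vector, which is what lets a query in Bengali land near the passages the teacher would retrieve for the English equivalent.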
+## Intended Use
+Our model is intended to be used for semantic search: it encodes queries/questions and text paragraphs in a dense vector space, so that relevant passages can be found for a given query.
+Note that there is a limit of 512 word pieces: text longer than that will be truncated. Also note that the model was trained only on input text of up to 250 word pieces, so it might not work well for longer text.
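
One way to stay within these limits is to split long passages into shorter windows before encoding them. The sketch below is only an illustration and is not part of the model or its training: `split_into_windows` is a hypothetical helper, the window size of 250 and overlap of 50 are assumed values, and the token list is a placeholder for the output of the model's real tokenizer.

```python
def split_into_windows(tokens, max_len=250, overlap=50):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the text.
        if start + max_len >= len(tokens):
            break
    return windows


# Placeholder tokens; in practice these would come from the model's tokenizer.
tokens = [f"tok{i}" for i in range(600)]
windows = split_into_windows(tokens)
print([len(w) for w in windows])
```

Each window can then be encoded as its own passage, so no part of a long document is silently dropped by truncation.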
+- **Primary Use Cases:**
+  - **Open-domain question answering:** Answering natural language questions using a large text corpus.
+  - **Document retrieval:** Finding relevant documents based on user queries.
+  - **Information retrieval tasks:** Building other information retrieval systems that require efficient passage retrieval.
+- **Potential Use Cases:** Semantic similarity, recommendation systems, chatbot systems, FAQ systems
+## Usage
+### Using Sentence-Transformers
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 Then you can use the model like this:
 ```python
+from sentence_transformers import SentenceTransformer, util
+query = "আফগানিস্তান কত রান করেছিল"
+docs = ["ম্যাচটা সিকান্দার রাজারই ছিল। অন্তত রান তাড়ায় নামা শ্রীলঙ্কার ইনিংসের ১৫ ওভার পর্যন্ত অবশ্যই। কিন্তু ব্যাটে বলে দারুণ খেলা জিম্বাবুয়ে অধিনায়ককে হাসতে দিলেন না শ্রীলঙ্কার দুই অভিজ্ঞ ক্রিকেটার। অ্যাঞ্জেলো ম্যাথুস-দাসুন শানাকার সপ্তম উইকেট জুটি ম্যাচ বের করে নেয় জিম্বাবুয়ের নাগাল থেকে। ম্যাথুস অবশ্য দলকে জিতিয়ে ফিরতে পারেননি। তিনি যখন আউট হন, ২ বলে ৬ রান দরকার শ্রীলঙ্কার। দুষ্মন্ত চামিরা ৪ ও ২ রান নিয়ে শেষ বলে গড়ানো ম্যাচে জয় এনে দলকে। ",
+"অক্ষর প্যাটেল ও অর্শদীপ সিংয়ের দারুণ বোলিংয়ের পর যশস্বী জয়সোয়াল ও শিবম দুবের জোড়া অর্ধশতকে ইন্দোরে সহজ জয়ে এক ম্যাচ বাকি থাকতেই সিরিজ জিতেছে ভারত। ইন্দোরে তিনে নামা গুলবদিন নাইবের ৩৫ বলে ৫৭ রানের ইনিংসে আফগানিস্তান তুলেছিল ১৭২ রান, কিন্তু ভারত সেটি পেরিয়ে গেছে ২৬ বল ও ৬ উইকেট বাকি রেখেই।",
+"এদিন প্রথম থেকে আক্রমণ ও বল দখলে এগিয়ে ছিল মিসরই। প্রতিযোগিতার সবচেয়ে সফল দলটির এগিয়ে যেতে সময় লাগে মাত্র ২ মিনিট। বাঁ পাশ থেকে আসা ক্রসে সালাহ চেষ্টা করেও ঠিকঠাক সংযোগ ঘটাতে পারেননি। তবে তাঁর পায়ের ছোঁয়ায় বল আসে মোস্তফা মোহাম্মদের কাছে। ভুল করেননি এই ফরোয়ার্ড। দারুণ ফিনিশিংয়ে গোল করে এগিয়ে দেন দলকে।"]
+# Load the model
+model = SentenceTransformer('afschowdhury/retrival-mpnet-bn')
+# Encode the query and documents
+query_emb = model.encode(query)
+doc_emb = model.encode(docs)
+#Compute dot score between query and all document embeddings
+scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
+#Combine docs & scores
+doc_score_pairs = list(zip(docs, scores))
+#Sort by decreasing score
+doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+#Output passages & scores
+for doc, score in doc_score_pairs:
+    print(score, doc)
 ```
+### Using HuggingFace Transformers
 Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you apply the right pooling operation on top of the contextualized word embeddings.
 ```python
 from transformers import AutoTokenizer, AutoModel
 import torch
+import torch.nn.functional as F
+#Mean Pooling - Take average of all tokens
 def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+#Encode text
+def encode(texts):
+    # Tokenize sentences
+    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+    # Compute token embeddings
+    with torch.no_grad():
+        model_output = model(**encoded_input, return_dict=True)
+    # Perform pooling
+    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+    # Normalize embeddings
+    embeddings = F.normalize(embeddings, p=2, dim=1)
+    return embeddings
+# Sentences we want sentence embeddings for
+query = "আফগানিস্তান কত রান করেছিল"
+docs = ["ম্যাচটা সিকান্দার রাজারই ছিল। অন্তত রান তাড়ায় নামা শ্রীলঙ্কার ইনিংসের ১৫ ওভার পর্যন্ত অবশ্যই। কিন্তু ব্যাটে বলে দারুণ খেলা জিম্বাবুয়ে অধিনায়ককে হাসতে দিলেন না শ্রীলঙ্কার দুই অভিজ্ঞ ক্রিকেটার। অ্যাঞ্জেলো ম্যাথুস-দাসুন শানাকার সপ্তম উইকেট জুটি ম্যাচ বের করে নেয় জিম্বাবুয়ের নাগাল থেকে। ম্যাথুস অবশ্য দলকে জিতিয়ে ফিরতে পারেননি। তিনি যখন আউট হন, ২ বলে ৬ রান দরকার শ্রীলঙ্কার। দুষ্মন্ত চামিরা ৪ ও ২ রান নিয়ে শেষ বলে গড়ানো ম্যাচে জয় এনে দলকে। ",
+"অক্ষর প্যাটেল ও অর্শদীপ সিংয়ের দারুণ বোলিংয়ের পর যশস্বী জয়সোয়াল ও শিবম দুবের জোড়া অর্ধশতকে ইন্দোরে সহজ জয়ে এক ম্যাচ বাকি থাকতেই সিরিজ জিতেছে ভারত। ইন্দোরে তিনে নামা গুলবদিন নাইবের ৩৫ বলে ৫৭ রানের ইনিংসে আফগানিস্তান তুলেছিল ১৭২ রান, কিন্তু ভারত সেটি পেরিয়ে গেছে ২৬ বল ও ৬ উইকেট বাকি রেখেই।",
+"এদিন প্রথম থেকে আক্রমণ ও বল দখলে এগিয়ে ছিল মিসরই। প্রতিযোগিতার সবচেয়ে সফল দলটির এগিয়ে যেতে সময় লাগে মাত্র ২ মিনিট। বাঁ পাশ থেকে আসা ক্রসে সালাহ চেষ্টা করেও ঠিকঠাক সংযোগ ঘটাতে পারেননি। তবে তাঁর পায়ের ছোঁয়ায় বল আসে মোস্তফা মোহাম্মদের কাছে। ভুল করেননি এই ফরোয়ার্ড। দারুণ ফিনিশিংয়ে গোল করে এগিয়ে দেন দলকে।"]
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained("afschowdhury/retrival-mpnet-bn")
+model = AutoModel.from_pretrained("afschowdhury/retrival-mpnet-bn")
+#Encode query and docs
+query_emb = encode(query)
+doc_emb = encode(docs)
+#Compute dot score between query and all document embeddings
+scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+#Combine docs & scores
+doc_score_pairs = list(zip(docs, scores))
+#Sort by decreasing score
+doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+#Output passages & scores
+for doc, score in doc_score_pairs:
+    print(score, doc)
 ```
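
To make the pooling step concrete, the following sketch reproduces the same masked average on plain Python lists for a single sentence. The 2-dimensional token vectors are made-up numbers and `masked_mean` is a hypothetical name; the point is only that padding positions (mask value 0) do not contribute to the sentence embedding.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            total[i] += vec[i] * keep
    # Guard against division by zero, as torch.clamp(..., min=1e-9) does above.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in total]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```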
+## Technical Details
+The following table lists some technical details on how this model should be used:
+| Setting                        | Value                                       |
+| ------------------------------ | ------------------------------------------- |
+| Dimensions                     | 768                                         |
+| Produces normalized embeddings | No                                         |
+| Pooling-Method                 | Mean pooling                                |
+| Suitable score functions       | dot-product (`util.dot_score`), cosine-similarity (`util.cos_sim`), or euclidean distance |
+----
+**Note:** When loaded with sentence-transformers, this model doesn't produce normalized embeddings like its base model does, because we didn't add a normalization layer to the student model's architecture during training. As a result, dot-product and cosine-similarity aren't equivalent. However, for retrieval applications, the performance difference is negligible. For similarity search, we recommend using cosine-similarity as the score function.
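
The difference between the two score functions can be seen on a contrived example. The vectors below are made up and deliberately exaggerated, so they say nothing about how often rankings differ for this model in practice; the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 0.0]
doc_a = [3.0, 3.0]   # long vector, 45 degrees away from the query
doc_b = [1.0, 0.1]   # short vector, almost parallel to the query

# Without normalization the dot product rewards vector length.
print("dot:", dot(query, doc_a), dot(query, doc_b))
# Cosine similarity only looks at the angle between the vectors.
print("cos:", cos_sim(query, doc_a), cos_sim(query, doc_b))
# After L2 normalization the dot product equals the cosine similarity.
print("dot (normalized):", dot(l2_normalize(query), l2_normalize(doc_a)))
```

Here the dot product ranks `doc_a` first while cosine similarity ranks `doc_b` first, which is why the choice of score function matters when embeddings are not normalized.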
+<!-- write a background section -->
+<!-- write  about training data and training procedure and losses -->
 ## Full Model Architecture
 ```
 SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
 )
 ```
+### Point of Contact
+**Asif Faisal Chowdhury**
+E-mail: [afschowdhury@gmail.com](mailto:afschowdhury@gmail.com) | LinkedIn: [afschowdhury](https://www.linkedin.com/in/afschowdhury)