ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs

@@ -34,6 +34,61 @@ metrics:
 This model is a fine-tuned checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on the Multilingual large summarization dataset focused on Czech texts to produce multilingual summaries.
 ## Task
 The model produces multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization in Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tr', 'zh'
+The example below assumes that you are using the provided MultilingualSummarizer.ipynb notebook together with the accompanying files from the git repository.
+```python
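+from collections import OrderedDict
+# MultiSummarizer is assumed to be defined by the files accompanying the notebook.
+
+## Configuration of the summarization pipeline
+#
+def summ_config():
+    cfg = OrderedDict([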
+        ## summarization model - checkpoint
+        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
+        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
+        ## language of summarization task
+        #   language : string : cs, en, de, fr, es, tr, ru, zh
+        ("language", "en"),
+        ## generation method parameters in dictionary
+        #
+        ("inference_cfg", OrderedDict([
+            ("num_beams", 4),
+            ("top_k", 40),
+            ("top_p", 0.92),
+            ("do_sample", True),
+            ("temperature", 0.95),
+            ("repetition_penalty", 1.23),
+            ("no_repeat_ngram_size", None),
+            ("early_stopping", True),
+            ("max_length", 128),
+            ("min_length", 10),
+        ])),
+        ## texts to summarize; accepted values: list of strings, string, or dataset
+        ("texts",
+            [
+               "english text1 to summarize",
+               "english text2 to summarize",
+            ]
+        ),
+        ## OPTIONAL: target summaries; accepted values: list of strings, string, or None
+        ("golds",
+         [
+               "target english text1",
+               "target english text2",
+         ]),
+        #('golds', None),
+    ])
+    return cfg
+cfg = summ_config()
+mSummarize = MultiSummarizer(**cfg)
+ret = mSummarize(**cfg)
+```
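The `inference_cfg` keys in the configuration above (`num_beams`, `top_k`, `top_p`, `max_length`, and so on) share their names with generation arguments in Hugging Face `transformers`. How `MultiSummarizer` consumes them is not shown here, so the following is only a minimal sketch of two checks one might run before building the summarizer: validating the language code against the list from the configuration comment, and dropping options set to `None`. The helper names `check_language` and `clean_inference_cfg` are hypothetical and are not part of the repository.

```python
from collections import OrderedDict

# Language codes taken from the configuration comment above.
SUPPORTED_LANGUAGES = ("cs", "en", "de", "fr", "es", "tr", "ru", "zh")


def check_language(language):
    """Hypothetical helper: return the code unchanged, or raise if unsupported."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return language


def clean_inference_cfg(inference_cfg):
    """Hypothetical helper: keep only the options that have an explicit value."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


# A shortened copy of the inference_cfg from the example above.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("do_sample", True),
    ("no_repeat_ngram_size", None),
    ("max_length", 128),
    ("min_length", 10),
])

generation_kwargs = clean_inference_cfg(inference_cfg)
print(check_language("en"), generation_kwargs)
```

Filtering out `None` entries keeps unset options from being passed on explicitly; whether this is needed depends on how `MultiSummarizer` handles them.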
 ## Dataset
 The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. The entire training set and 72% of the validation set were used for training.
 ```