Commit c7dabed by kz-transformers (parent: ec45403): Update README.md

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> # The prompt means roughly: "A saying is used in a literal, figurative, or implied <sense>."
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...]
```
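
This card does not name the concrete model class, but RoBERTa-style checkpoints normally load through the generic `transformers` Auto classes. Below is a minimal sketch of querying the mask predictions by hand; the Auto classes are an assumption rather than something the card confirms:

```python
# Minimal sketch, assuming the checkpoint loads with the generic
# transformers Auto classes (not confirmed by this card).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')

inputs = tokenizer("Мәтел тура, ауыспалы, астарлы <mask> қолданылады", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Find the <mask> position and decode its highest-scoring replacement.
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```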

## Training data
36
 
37
  The Kaz-RoBERTa model was pretrained on the reunion of 2 datasets:
38
  - [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) Multi-Domain Bilingual Kazakh Dataset is a Kazakh-language dataset containing just over 24 883 808 unique texts from multiple domains.
39
+ - [Conversational data](https://beeline.kz/) Preprocessed dialogs between Customer Support Team and clients of Beeline KZ (Veon Group)
40
 
41
  Together these datasets weigh 25GB of text.
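
The MDBKD portion is public on the Hugging Face Hub and can be inspected with the `datasets` library. A minimal sketch, assuming the usual `train` split name; the Beeline conversational data has no public dataset ID in this card, so only MDBKD is shown:

```python
# Minimal sketch: stream a few MDBKD examples without downloading all of it.
# The 'train' split name is an assumption.
from datasets import load_dataset

mdbkd = load_dataset('kz-transformers/multidomain-kazakh-dataset',
                     split='train', streaming=True)
for example in mdbkd.take(3):
    print(example)
```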
42
  ## Training procedure
 
51
 
52
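
As a quick sanity check, the tokenizer makes this marking visible; a small sketch, assuming the checkpoint ships a RoBERTa-style tokenizer that adds special tokens by default:

```python
# Sketch: verify that encoded sequences are wrapped in <s> ... </s>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
ids = tokenizer("Сәлем, әлем!")['input_ids']  # "Hello, world!" in Kazakh
print(tokenizer.convert_ids_to_tokens(ids))
# Expected form: ['<s>', ..., '</s>']
```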

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512.
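
For orientation only, those reported hyperparameters could be expressed as `transformers` `TrainingArguments` roughly as below; the actual training script is not published here, so the output directory, the per-device batch split, and everything unstated are assumptions:

```python
# Hypothetical sketch of the reported hyperparameters; not the authors' script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='kaz-roberta-mlm',     # hypothetical path
    max_steps=500_000,                # 500K steps, as reported
    per_device_train_batch_size=64,   # assumed 64 per GPU x 2 GPUs = 128 global
)
# The 512-token sequence length is applied at tokenization time, not here.
```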

### Contributions

Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:sanzharmrz@gmail.com), [Beksultan Sagyndyk](mailto:nuxyjlbka@gmail.com)

---