igorktech committed on
Commit
632ad76
·
verified ·
1 Parent(s): 26b1a54

Update README.md

Files changed (1)
  1. README.md +37 -2
README.md CHANGED
@@ -18,11 +18,46 @@ should probably proofread and complete it, then remove this comment. -->
18
 
19
  # TinySatirik-m
20
 
21
- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
22
 
23
  ## Model description
24
 
25
- More information needed
26
 
27
  ## Intended uses & limitations
28
 
 
18
 
19
  # TinySatirik-m
20
 
21
+ This model is a fine-tuned version of [](https://huggingface.co/) on the [anekdots](https://huggingface.co/datasets/igorktech/anekdots) dataset.
22
+
23
+ ## Tokenizer
24
+
25
+ To use the model, first install its [custom character-level tokenizer](https://github.com/Koziev/character-tokenizer):
26
+
27
+ ```bash
28
+ pip install git+https://github.com/Koziev/character-tokenizer
29
+ ```
30
+
31
+ In addition to Cyrillic characters and punctuation, this tokenizer recognizes special tokens such as `<s>`, `</s>`, `<pad>`, and `<unk>`.
32
+
33
+ Because this is not a standard transformers tokenizer, do not load it with `transformers.AutoTokenizer.from_pretrained`; load it like this instead:
34
+
35
+ ```python
36
+ import charactertokenizer
37
+
38
+ ...
39
+ tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('igorktech/CharPicoSatirik-m')
40
+ ```
41
+
42
+ To inspect how a prompt is tokenized, run this snippet:
43
+
44
+ ```python
45
+ prompt = '<s>Hello World\n'
46
+ encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')
47
+ print('Tokenized prompt:', ' | '.join(tokenizer.decode([t]) for t in encoded_prompt[0]))
48
+ ```
49
+
50
+ You will see a list of tokens separated by the ```|``` symbol:
51
+
52
+ ```
53
+ Tokenized prompt: <s> | H | e | l | l | o | | W | o | r | l | d |
54
+ ```
55
+
56
+ The tokenizer was created by [Koziev](https://github.com/Koziev).
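
For intuition about the sample output above, character-level tokenization with special tokens can be sketched in plain Python. This is not the implementation of the `character-tokenizer` package: the vocabulary, the token IDs, and the function names (`tokenize`, `encode`, `decode`) are assumptions made up for illustration, and the real tokenizer ships its own vocabulary.

```python
import re

# Hypothetical vocabulary, for illustration only (the real package defines its own).
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>"]
CHARS = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ \n")
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CHARS)}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}

# Try the special tokens first; otherwise match any single character,
# including the newline (hence re.DOTALL).
_SPLIT = re.compile("|".join(map(re.escape, SPECIAL_TOKENS)) + "|.", re.DOTALL)


def tokenize(text):
    """Split text into special tokens and single characters."""
    return _SPLIT.findall(text)


def encode(text):
    """Map each token to its ID; unknown characters map to <unk>."""
    unk = VOCAB["<unk>"]
    return [VOCAB.get(tok, unk) for tok in tokenize(text)]


def decode(ids):
    """Concatenate the tokens that the IDs stand for."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


if __name__ == "__main__":
    prompt = "<s>Hello World\n"
    # repr() makes the space and newline tokens visible.
    print(" | ".join(repr(t) for t in tokenize(prompt)))
```

In this sketch the space and the trailing newline are each a single token, which would explain the seemingly empty slots between `|` separators in the sample output. Characters outside the toy vocabulary, Cyrillic included, fall back to `<unk>` here, whereas the real tokenizer recognizes Cyrillic characters.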
57
 
58
  ## Model description
59
 
60
+ The model is based on the Llama 2 architecture.
61
 
62
  ## Intended uses & limitations
63