Update README.md
README.md
CHANGED
@@ -18,11 +18,46 @@ should probably proofread and complete it, then remove this comment. -->
# TinySatirik-m

This model is a fine-tuned version of [](https://huggingface.co/) on the [anekdots](https://huggingface.co/datasets/igorktech/anekdots) dataset.

## Tokenizer

To use the model, first install the [character tokenizer](https://github.com/Koziev/character-tokenizer) it depends on:

```bash
pip install git+https://github.com/Koziev/character-tokenizer
```

In addition to Cyrillic characters and punctuation, this tokenizer recognizes the special tokens `<s>`, `</s>`, `<pad>`, and `<unk>`.

Because it is not a standard transformers tokenizer, do not load it with `transformers.AutoTokenizer.from_pretrained`. Load it through its own class instead, for example:

```python
import charactertokenizer

...
tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('igorktech/CharPicoSatirik-m')
```

To inspect how a prompt is tokenized, run this snippet:

```python
prompt = '<s>Hello World\n'
# return_tensors='pt' yields a PyTorch tensor of shape (1, sequence_length)
encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')
# Decode each token id on its own to show where the token boundaries fall
print('Tokenized prompt:', ' | '.join(tokenizer.decode([t]) for t in encoded_prompt[0]))
```

The output lists the tokens, one per character, separated by the `|` symbol:

```
Tokenized prompt: <s> | H | e | l | l | o | | W | o | r | l | d |
```
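
To illustrate the idea behind character-level tokenization with special tokens, here is a toy pure-Python sketch. It is not the `charactertokenizer` implementation: the `split_into_tokens` helper, the `ToyCharVocab` class, and the vocabulary layout are hypothetical names and choices made only for this example.

```python
import re
from typing import List

# Special tokens named in this README; their order here is an assumption.
SPECIAL_TOKENS = ["<s>", "</s>", "<pad>", "<unk>"]

# Try to match a special token first; otherwise take any single character
# (re.DOTALL lets "." match the newline as well).
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in SPECIAL_TOKENS) + r"|.", re.DOTALL
)


def split_into_tokens(text: str) -> List[str]:
    """Split text into special tokens and single characters."""
    return _TOKEN_RE.findall(text)


class ToyCharVocab:
    """Illustrative character-level vocabulary (hypothetical, for demonstration only)."""

    def __init__(self, alphabet: str):
        # Special tokens get the lowest ids, followed by the sorted characters.
        self.id_to_token = SPECIAL_TOKENS + sorted(set(alphabet))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.unk_id = self.token_to_id["<unk>"]

    def encode(self, text: str) -> List[int]:
        # Characters outside the vocabulary map to the <unk> id.
        return [self.token_to_id.get(tok, self.unk_id) for tok in split_into_tokens(text)]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.id_to_token[i] for i in ids)


vocab = ToyCharVocab("Hello World\n")
ids = vocab.encode("<s>Hello World\n")
tokens = [vocab.decode([i]) for i in ids]
# repr() makes the space and newline tokens visible.
print(" | ".join(repr(t) for t in tokens))
```

In this sketch `<s>` stays a single token, while every other character, including the space and the trailing newline, becomes a token of its own.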

The tokenizer was created by [Koziev](https://github.com/Koziev).

## Model description

The model is based on the Llama 2 architecture.

## Intended uses & limitations