
TinySatirik-m

This is a very small Llama 2 model pre-trained on a dataset of anekdots (short jokes).

Inspired by TinyStories.

Tokenizer

To use the model, install its custom character-level tokenizer:

pip install git+https://github.com/Koziev/character-tokenizer

In addition to recognizing Cyrillic characters and punctuation, this tokenizer is aware of special tokens such as <s>, </s>, <pad>, and <unk>.

Because this is not a standard transformers tokenizer, it cannot be loaded with transformers.AutoTokenizer.from_pretrained. Load it like this instead:

import charactertokenizer

...
tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('igorktech/CharPicoSatirik-m')

To see how a prompt is tokenized, run this snippet:

prompt = '<s>Hello World\n'
encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')
print('Tokenized prompt:', ' | '.join(tokenizer.decode([t]) for t in encoded_prompt[0]))

You will see a list of tokens separated by the | symbol:

Tokenized prompt: <s> | H | e | l | l | o |   | W | o | r | l | d | 
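
The splitting behaviour shown above can be imitated with a few lines of plain Python. This is only a toy sketch for illustration, not the implementation of the character-tokenizer package: the names SPECIAL_TOKENS and split_into_tokens are hypothetical, and the real tokenizer also maps tokens to ids from its own vocabulary, which is not modelled here.

```python
import re

# Toy illustration only: NOT the code of the character-tokenizer package.
# The special tokens are the ones named in this card.
SPECIAL_TOKENS = ['<s>', '</s>', '<pad>', '<unk>']

# Try to match a special token first; otherwise fall back to any single
# character (re.DOTALL lets '.' match the newline as well).
_TOKEN_RE = re.compile('|'.join(map(re.escape, SPECIAL_TOKENS)) + '|.', re.DOTALL)


def split_into_tokens(text):
    """Split text into special tokens and single characters (hypothetical helper)."""
    return _TOKEN_RE.findall(text)


tokens = split_into_tokens('<s>Hello World\n')
print('Tokenized prompt:', ' | '.join(tokens))
```

The point of the sketch is that <s> stays a single token while every other character, including the space and the trailing newline, becomes its own token.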

Tokenizer created by Koziev.

Model description

The model is based on the Llama 2 architecture.

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0005
  • train_batch_size: 4
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 250
  • num_epochs: 2
  • mixed_precision_training: Native AMP
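
To make two of the values above concrete, here is a minimal sketch of how the effective batch size follows from the listed settings and what a cosine schedule with 250 warmup steps looks like. It rests on assumptions the card does not state: a single training device, linear warmup, decay to zero over one half-cosine, and a made-up TOTAL_STEPS. The function lr_at_step is a hypothetical helper, not the training code used for this model.

```python
import math


def lr_at_step(step, total_steps, base_lr=0.0005, warmup_steps=250):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch size per optimizer step, assuming a single device.
train_batch_size = 4
gradient_accumulation_steps = 2
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print('total_train_batch_size:', total_train_batch_size)

# TOTAL_STEPS is hypothetical; the card does not report the number of steps.
TOTAL_STEPS = 10_000
for step in (0, 125, 250, 5_125, TOTAL_STEPS):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

Under these assumptions the learning rate rises linearly to 0.0005 by step 250 and then falls smoothly toward zero at the final step.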

Training results

Framework versions

  • Transformers 4.36.0.dev0
  • Pytorch 2.1.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0

Model size: 15.3M params (Safetensors, tensor type F32)