---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.3
tags:
- axolotl
- generated_from_trainer
model-index:
- name: Mistral-7B-v0.3-sarcasm-scrolls-v2
  results: []
datasets:
- BEE-spoke-data/sarcasm-scrolls
language:
- en
---

[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
See axolotl config

axolotl version: `0.4.1`
```yaml
base_model: mistralai/Mistral-7B-v0.3
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false

# dataset
datasets:
  - path: BEE-spoke-data/sarcasm-scrolls
    type: completion # format from earlier
    field: text
val_set_size: 200

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: sarcasm-scrolls
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: Mistral-7B-v0.3-sarcasm-scrolls-v2a
hub_model_id: pszemraj/Mistral-7B-v0.3-sarcasm-scrolls-v2
hub_strategy: every_save

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused # paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 2e-5

load_in_8bit: false
load_in_4bit: false
bf16: true
tf32: true

torch_compile: true
torch_compile_backend: inductor # Optional[str]
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
logging_steps: 3
xformers_attention:
flash_attention: true

warmup_steps: 20
# hyperparams for freq of evals, saving, etc
evals_per_epoch: 4
saves_per_epoch: 4
save_safetensors: true
save_total_limit: 1 # Checkpoints saved at a time
output_dir: ./output-axolotl/output-model-chaz
resume_from_checkpoint:

deepspeed:
weight_decay: 0.06
special_tokens:
```
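
The batch and schedule settings in the config above interact, so a small sketch may help when reading them. It computes the effective batch size and an upper bound on tokens per optimizer step, then evaluates a linear-warmup plus cosine-decay learning rate. `NUM_DEVICES` and `TOTAL_STEPS` are assumptions, not values from the config: a single device is assumed because it matches the reported total batch size of 32, and the step count is a rough extrapolation from the results table. The schedule formula is the common warmup-then-cosine form and may differ in detail from what the trainer actually applied.

```python
import math

# Values copied from the axolotl config.
MICRO_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 32
SEQUENCE_LEN = 4096
LEARNING_RATE = 2e-5
WARMUP_STEPS = 20

# Assumption: one training device (consistent with total_train_batch_size = 32).
NUM_DEVICES = 1
# Assumption: approximate total optimizer steps, extrapolated from the
# results table (step 238 at epoch ~1.77 of 2). Not stated in the config.
TOTAL_STEPS = 270


def sequences_per_step() -> int:
    """Number of packed sequences that contribute to one optimizer step."""
    return MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES


def max_tokens_per_step() -> int:
    """Upper bound on tokens per optimizer step.

    With sample_packing and pad_to_sequence_len enabled, every sequence
    occupies SEQUENCE_LEN positions, some of which may be padding.
    """
    return sequences_per_step() * SEQUENCE_LEN


def cosine_lr(step: int) -> float:
    """Linear warmup to LEARNING_RATE, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


print("sequences per optimizer step:", sequences_per_step())
print("max tokens per optimizer step:", max_tokens_per_step())
for step in (0, 10, 20, 136, 238, 270):
    print(f"step {step:>3}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions, the learning rate peaks at the end of warmup (step 20) and decays for the remainder of the run.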

# Mistral-7B-v0.3-sarcasm-scrolls-v2

## Model description

This model is a fine-tuned version of [mistralai/Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3) on the BEE-spoke-data/sarcasm-scrolls dataset.
It achieves the following results on the evaluation set:
- Loss: 2.3333

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 32
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 20
- num_epochs: 2

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0075 | 1    | 2.3935          |
| 2.3672        | 0.2548 | 34   | 2.3638          |
| 2.3751        | 0.5096 | 68   | 2.3499          |
| 2.308         | 0.7644 | 102  | 2.3238          |
| 2.2672        | 1.0035 | 136  | 2.3027          |
| 1.702         | 1.2583 | 170  | 2.3449          |
| 1.7456        | 1.5131 | 204  | 2.3370          |
| 1.7004        | 1.7679 | 238  | 2.3333          |

### Framework versions

- Transformers 4.41.1
- Pytorch 2.3.1+cu118
- Datasets 2.19.1
- Tokenizers 0.19.1
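
### Reading the validation loss

The validation losses in the training results table can be easier to compare as perplexities. The sketch below converts each logged value and picks the evaluation step with the lowest loss. It assumes the reported loss is a mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in this card. The numbers are copied from the table; the names `VALIDATION_LOSS` and `perplexity` are illustrative only.

```python
import math

# (step, validation loss) pairs copied from the training results table.
VALIDATION_LOSS = [
    (1, 2.3935),
    (34, 2.3638),
    (68, 2.3499),
    (102, 2.3238),
    (136, 2.3027),
    (170, 2.3449),
    (204, 2.3370),
    (238, 2.3333),
]


def perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy (assumed in nats) to perplexity."""
    return math.exp(loss)


# Evaluation step with the lowest validation loss.
best_step, best_loss = min(VALIDATION_LOSS, key=lambda pair: pair[1])

for step, loss in VALIDATION_LOSS:
    marker = "  <- lowest" if step == best_step else ""
    print(f"step {step:>3}: loss {loss:.4f}  perplexity {perplexity(loss):.2f}{marker}")
```

In the table, the lowest validation loss is logged at step 136, around the end of the first epoch, while the final reported loss of 2.3333 comes from step 238.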