---
license: llama3.1
datasets:
- nothingiisreal/Reddit-Dirty-And-WritingPrompts
- Nopm/Opus_WritingStruct
- kalomaze/Opus_Instruct_25k
- Gryphe/Sonnet3.5-SlimOrcaDedupCleaned
---

Gate lifted, yay! People liked the model even though it's a test model that's underfit. Still cost us 80 USD tho lmao.

BF16 is [here](https://huggingface.co/nothingiisreal/L3.1-70B-Celeste-V0.1-BF16)

Please do give the V1.9 card a read [here](https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9)

Recommended system prompt is the same as V1.9.

70B seems to have a bit more GPT-ish terminology than 12B, but also less slopping. It is still less than other 70Bs.

Temp 1.25 seems to improve the prose. Recommended sampler:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630cf5d14ca0a22768bbe10c/5BkFd5FromVfT8ZeTml_2.png)

It seems to be way more coherent and aware of what's going on, as well as more intelligent.

The model seems to give out what you give in: a sloppy card or first message leads to more of the same. The model is quite good at taking a human-written card with stuff like conversational narration and then continuing that style.

It was trained on 4xH100 NVL for 6 hours using LoRA+. I still want to train it further because it seems like the more data we put in, the better the model gets at writing and roleplaying. Test and see I guess.

Me and my teammate are sick rn xD and I am currently working with another teammate on some good stuff: we can finally break away from AI-generated datasets, at least for the most part. Once it is done, the 8B, 12B and 70B will be trained with that dataset. I hope we succeed at this, it will make me so, so happy.

We are also experimenting with RLHF, KTO and PPO mainly. When we do a proper release, it will come with a much more detailed writeup.
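
For a rough sense of what the 4xH100 LoRA+ run mentioned above works out to, here is a small sketch using values copied from the Axolotl config at the bottom of this card (`learning_rate`, `loraplus_lr_ratio`, `micro_batch_size`, `gradient_accumulation_steps`, `sequence_len`). It assumes one data-parallel rank per GPU, and that `loraplus_lr_ratio` is the LoRA+ ratio of the B-matrix learning rate to the A-matrix learning rate. The variable names `num_gpus`, `train_hours`, `lr_b`, `sequences_per_step`, `max_tokens_per_step` and `gpu_hours` are made up for illustration and are not Axolotl options.

```python
# Values copied from the Axolotl config in this card.
learning_rate = 0.000008         # base LR (applies to the LoRA A matrices under LoRA+)
loraplus_lr_ratio = 8            # LoRA+ ratio: lr_B / lr_A
micro_batch_size = 1
gradient_accumulation_steps = 1
sequence_len = 8192

# Assumptions, not config values: 4 GPUs with one data-parallel rank each,
# and the 6 hour run time quoted in the card.
num_gpus = 4
train_hours = 6

# LoRA+ trains the B matrices with a larger learning rate than the A matrices.
lr_b = learning_rate * loraplus_lr_ratio

# Sequences contributing to each optimizer step across all GPUs.
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus

# With sample packing, each sequence holds at most sequence_len tokens.
max_tokens_per_step = sequences_per_step * sequence_len

gpu_hours = num_gpus * train_hours

print(f"LoRA A learning rate: {learning_rate:.1e}")
print(f"LoRA B learning rate: {lr_b:.1e}")
print(f"Sequences per optimizer step: {sequences_per_step}")
print(f"Max tokens per optimizer step: {max_tokens_per_step}")
print(f"GPU-hours: {gpu_hours}")
```

Under those assumptions the B matrices would train at 6.4e-5, and each optimizer step would see 4 packed sequences, so at most 32,768 tokens.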

---

Datasets used:

Each entry lists, in order: name, sample size, whether to force RP format, whether to apply len limit (for the first message; the seq len limit is always applied), unknown_boolean, minimum message count, system message.

- Reddit WP
  - `["reddit_writing_prompts.jsonl", 0.4, True, True, False, 2, "Write a story based on prompt provided by user below. Mode: SFW"]`
- Instruct
  - `["combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl", 0.1, False, True, False, 2, ""]`
  - `["slim-orca.json", 0.1, False, True, False, 2, ""]`
- Synth story
  - `["writing-struct-deslopped.json", 0.1, False, True, False, 2, ""]`
- Claude RP: 0.8

Thank you Nopm, Gryphe (double thanks), and kalomaze, and any other people involved in making those datasets.

r/DirtyWritingPrompts was dropped because it would induce undesirable features. No worries though, NSFW will be stronger than ever lmao.

We used 10,000 rows in total. Take those ratios and normalise them so they add up to 1; that gives the division of the dataset.

You can find all the datasets by googling them; they are on Hugging Face. Claude RP is c2 logs, but we filtered it ourselves.

---

Axolotl Config:

```yaml
# Model
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

# Output and HuggingFace
output_dir: /workspace/data/train-results/trained_model
hub_model_id:
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: huggingface
wandb_entity:

# Data
chat_template: llama3
train_on_inputs: false
group_by_length: true
datasets:
  - path:
    type: sharegpt
    roles:
      input:
        - system
        - user
      output:
        - assistant

## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:
  pad_token: <|end_of_text|>

# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
adapter: qlora # or qlora
lora_model_dir:
lora_r: 256
lora_alpha: 256
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
loraplus_lr_ratio: 8
loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1 # TODO Perhaps reduce this because LORA+ only needs 1 epoch.

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0 # Might increase this to 15 or something.

## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.000008
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr: 0.0000024
optimizer: paged_adamw_8bit # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 1
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: true
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
  use_reentrant: true
local_rank:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all

# Misc
early_stopping_patience:
debug:
```
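
The ratio normalisation described in the datasets section (10,000 rows, ratios rescaled to add up to 1) can be sketched like this. It is only an illustration, not the actual preprocessing script: it assumes each listed file counts as its own entry (so the two Instruct files get 0.1 each), the label `"Claude RP"` stands in for a file name the card does not give, and the names `ratios`, `total_rows`, `shares` and `approx_rows` are made up here. How rounding was really handled is not stated, so the row counts are plain `round()` values and may not add up to exactly 10,000.

```python
# Sampling ratios as listed in the "Datasets used" section of this card.
ratios = {
    "reddit_writing_prompts.jsonl": 0.4,
    "combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl": 0.1,
    "slim-orca.json": 0.1,
    "writing-struct-deslopped.json": 0.1,
    "Claude RP": 0.8,  # placeholder label, the card gives no file name
}
total_rows = 10_000

# Normalise so the shares add up to 1.
ratio_sum = sum(ratios.values())
shares = {name: ratio / ratio_sum for name, ratio in ratios.items()}

# Approximate row count per dataset (rounded, so the total can be off by a row).
approx_rows = {name: round(total_rows * share) for name, share in shares.items()}

for name in ratios:
    print(f"{name}: share={shares[name]:.3f}, rows~{approx_rows[name]}")
```

With the ratios above, Claude RP ends up at roughly 53% of the rows, Reddit WP at roughly 27%, and each of the remaining three files at roughly 7%.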