	---
	library_name: transformers
	license: llama3.2
	license_link: https://huggingface.co/meta-llama/Llama-3.2-3B/blob/main/LICENSE.txt
	base_model: meta-llama/Llama-3.2-3B
	datasets:
	- macadeliccc/US-SupremeCourtVerdicts
	- macadeliccc/US-FederalLaws
	tags:
	- generated_from_trainer
	- llama-3
	- spectrum
	- axolotl
	language:
	- en
	pipeline_tag: text-generation
	---
	# Magistrate 3.2 3B

	Continued pretraining of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) on roughly 250M tokens of legal text, none of it synthetic.

	The model achieves the following results on the evaluation set:
	- Loss: 0.6802
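
For intuition, a cross-entropy loss can be converted to a perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training; the card itself does not state this.

```python
import math

# Final evaluation loss reported above. Assumption: this is the mean
# per-token cross-entropy in nats (not stated explicitly in the card).
eval_loss = 0.6802

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

print(f"perplexity ~ {perplexity:.2f}")
```

A perplexity just under 2 means that, on the held-out split, the model is on average about as uncertain as a choice between two equally likely tokens.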

	An instruct version is available [here]()

	[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
	<details><summary>See axolotl config</summary>

	axolotl version: `0.4.1`
	```yaml
	base_model: meta-llama/Llama-3.2-3B
	model_type: LlamaForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: false
	strict: false

	datasets:
	  - path: json
	    data_files: "data/amendments_with_content_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/federal_rules_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/cornell_legal_encyclopedias_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/pocket_guide_for_judges_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/us_federal_code.json"
	    type: completion
	  - path: json
	    data_files: "data/us_supreme_court_summaries_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/us_supreme_court_converted.json"
	    type: completion
	  - path: json
	    data_files: "data/ucfr.json"
	    type: completion
	  - path: json
	    data_files: "data/map-code-filtered.json"
	    type: completion

	dataset_prepared_path:
	val_set_size: 0.05
	output_dir: ./outputs/lora-out

	sequence_len: 8192
	sample_packing: true
	eval_sample_packing: false
	pad_to_sequence_len: true

	# adapter: lora
	# lora_model_dir:
	# lora_r: 128
	# lora_alpha: 32
	# lora_dropout: 0.05
	# lora_target_linear: true
	# lora_fan_in_fan_out:
	# lora_modules_to_save:
	# - embed_tokens
	# - lm_head

	unfrozen_parameters:
	- ^lm_head.weight$
	- ^model.embed_tokens.weight$
	# mlp.down_proj layers
	- model.layers.0.mlp.down_proj
	- model.layers.1.mlp.down_proj
	- model.layers.17.mlp.down_proj
	- model.layers.19.mlp.down_proj
	- model.layers.18.mlp.down_proj
	- model.layers.5.mlp.down_proj
	- model.layers.20.mlp.down_proj
	- model.layers.2.mlp.down_proj
	- model.layers.4.mlp.down_proj
	- model.layers.6.mlp.down_proj
	- model.layers.3.mlp.down_proj
	- model.layers.16.mlp.down_proj
	- model.layers.15.mlp.down_proj
	- model.layers.13.mlp.down_proj
	# mlp.gate_proj layers
	- model.layers.0.mlp.gate_proj
	- model.layers.1.mlp.gate_proj
	- model.layers.2.mlp.gate_proj
	- model.layers.3.mlp.gate_proj
	- model.layers.22.mlp.gate_proj
	- model.layers.21.mlp.gate_proj
	- model.layers.20.mlp.gate_proj
	- model.layers.23.mlp.gate_proj
	- model.layers.19.mlp.gate_proj
	- model.layers.4.mlp.gate_proj
	- model.layers.18.mlp.gate_proj
	- model.layers.17.mlp.gate_proj
	- model.layers.5.mlp.gate_proj
	- model.layers.24.mlp.gate_proj
	# mlp.up_proj layers
	- model.layers.4.mlp.up_proj
	- model.layers.3.mlp.up_proj
	- model.layers.5.mlp.up_proj
	- model.layers.6.mlp.up_proj
	- model.layers.7.mlp.up_proj
	- model.layers.2.mlp.up_proj
	- model.layers.8.mlp.up_proj
	- model.layers.14.mlp.up_proj
	- model.layers.13.mlp.up_proj
	- model.layers.11.mlp.up_proj
	- model.layers.9.mlp.up_proj
	- model.layers.1.mlp.up_proj
	- model.layers.15.mlp.up_proj
	- model.layers.12.mlp.up_proj
	# self_attn.k_proj layers
	- model.layers.25.self_attn.k_proj
	- model.layers.22.self_attn.k_proj
	- model.layers.19.self_attn.k_proj
	- model.layers.20.self_attn.k_proj
	- model.layers.17.self_attn.k_proj
	- model.layers.24.self_attn.k_proj
	- model.layers.23.self_attn.k_proj
	- model.layers.18.self_attn.k_proj
	- model.layers.21.self_attn.k_proj
	- model.layers.27.self_attn.k_proj
	- model.layers.15.self_attn.k_proj
	- model.layers.10.self_attn.k_proj
	- model.layers.6.self_attn.k_proj
	- model.layers.5.self_attn.k_proj
	# self_attn.o_proj layers

	wandb_project:
	wandb_entity:
	wandb_watch:
	wandb_name:
	wandb_log_model:

	gradient_accumulation_steps: 4
	micro_batch_size: 2
	num_epochs: 3
	optimizer: paged_adamw_32bit

	# Gradient clipping max norm
	max_grad_norm: 1.0
	noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization


	lr_scheduler: cosine
	learning_rate: 0.0002
	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: false

	gradient_checkpointing: true
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true
	s2_attention:

	warmup_steps: 690
	evals_per_epoch: 2
	eval_table_size:
	eval_max_new_tokens: 128
	saves_per_epoch: 1
	debug:
	deepspeed: deepspeed_configs/zero3.json
	weight_decay: 0.0
	fsdp:
	fsdp_config:
	special_tokens:
	  pad_token: <|end_of_text|>

	```

	</details><br>




	## Model description

	This is a base model trained on US Supreme Court proceedings, the US federal code, and federal regulations.

	## Intended uses & limitations

	This model is intended for research purposes. You are liable for all model outputs.

	## Training and evaluation data

	The training data consists of US Supreme Court verdicts, federal regulations, laws and treaties.

	Additional resources from institutions such as CLL were included to fill gaps in the model's knowledge of legal jargon.
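
The config above loads every source as a JSON file with `type: completion`. Below is a minimal sketch of what such a conversion might look like, assuming Axolotl's `completion` format in which each record carries the raw document under a `text` key. The helper name, the sample document, and the output path are hypothetical; this is not the author's actual preprocessing code.

```python
import json
import tempfile
from pathlib import Path


def to_completion_records(documents):
    """Wrap raw document strings as completion-style records.

    Assumes each record needs only a "text" field. Empty or
    whitespace-only documents are dropped.
    """
    return [{"text": doc.strip()} for doc in documents if doc.strip()]


# Hypothetical sample input, not taken from the actual training data.
raw_docs = [
    "Amendment I. Congress shall make no law respecting an establishment of religion...",
    "   ",  # whitespace-only entry, filtered out below
]

records = to_completion_records(raw_docs)

# Write to a temporary location; the real files sit under data/ in the config.
out_path = Path(tempfile.mkdtemp()) / "example_converted.json"
out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")

# Read the file back to confirm it round-trips as a list of {"text": ...} objects.
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Keeping one document per record lets sample packing (`sample_packing: true` in the config) decide how documents are combined into 8192-token sequences.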

	## Training procedure

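	Spectrum top-35% fine-tune: only the modules listed under `unfrozen_parameters` in the config above are trained, and all other weights stay frozen. Thanks to the Cognitive Computations team for their work on Spectrum.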
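
To illustrate how the `unfrozen_parameters` entries select weights, here is a small sketch that treats each entry as a regular expression matched from the start of a parameter name. That matching rule is an assumption and Axolotl's own implementation may differ in detail. The pattern list is a short excerpt from the config, and the parameter names are written by hand in the usual Llama naming style rather than read from the real model.

```python
import re

# Excerpt of the patterns from the config above (not the full list).
UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.mlp.down_proj",
    r"model.layers.25.self_attn.k_proj",
]


def is_trainable(param_name, patterns=UNFROZEN_PATTERNS):
    """Return True if any pattern matches at the start of the name.

    Assumption: entries are plain regexes applied with re.match.
    """
    return any(re.match(pattern, param_name) for pattern in patterns)


# Hand-written example parameter names (illustrative only).
names = [
    "lm_head.weight",
    "model.embed_tokens.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.7.mlp.down_proj.weight",      # layer 7 is not in the down_proj list
    "model.layers.25.self_attn.k_proj.weight",
    "model.layers.25.self_attn.q_proj.weight",  # q_proj is not listed at all
]

trainable = [name for name in names if is_trainable(name)]
frozen = [name for name in names if not is_trainable(name)]
```

In a real training loop the same predicate would decide `requires_grad` for each named parameter, so gradients and optimizer state are only kept for the selected modules.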

	Methodology based on Cohere's paper: [To Code, or Not To Code? Exploring Impact of Code in Pre-training](https://arxiv.org/abs/2408.10914)

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0002
	- train_batch_size: 2
	- eval_batch_size: 2
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 2
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 16
	- total_eval_batch_size: 4
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 690
	- num_epochs: 3
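
The batch-size totals in this list follow from the per-device values. The short sketch below reproduces that arithmetic and adds an upper bound on tokens per optimizer step, assuming every packed sequence is completely full, which is not guaranteed in practice.

```python
# Values copied from the hyperparameter list and the axolotl config above.
micro_batch_size = 2              # train_batch_size (per device)
eval_batch_size = 2               # eval_batch_size (per device)
num_devices = 2
gradient_accumulation_steps = 4
sequence_len = 8192               # from the config; sequences are packed and padded to this length

# Sequences contributing to one optimizer step across all devices.
total_train_batch_size = micro_batch_size * num_devices * gradient_accumulation_steps

# Evaluation does not accumulate gradients, so only devices multiply.
total_eval_batch_size = eval_batch_size * num_devices

# Upper bound on tokens per optimizer step (assumes fully packed sequences).
max_tokens_per_step = total_train_batch_size * sequence_len

print(total_train_batch_size, total_eval_batch_size, max_tokens_per_step)
```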

	### Training results

	| Training Loss | Epoch  | Step | Validation Loss |
	|:-------------:|:------:|:----:|:---------------:|
	| 1.3589        | 0.0004 | 1    | 1.5640          |
	| 0.9936        | 0.4984 | 1154 | 0.9440          |
	| 0.8384        | 0.9968 | 2308 | 0.8392          |
	| 0.8226        | 1.4963 | 3462 | 0.7802          |
	| 0.6568        | 1.9949 | 4616 | 0.7059          |
	| 0.5163        | 2.4923 | 5770 | 0.6886          |
	| 0.492         | 2.9922 | 6924 | 0.6802          |
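
To relate the logged steps to the learning rate, the sketch below evaluates a linear warmup followed by a half-cosine decay, the common shape for a `cosine` scheduler with warmup. Two things are assumptions here: that the run used exactly this formula, and that the last logged step (6924) is the final training step; the true total may be slightly higher.

```python
import math

PEAK_LR = 2e-4        # learning_rate
WARMUP_STEPS = 690    # lr_scheduler_warmup_steps
TOTAL_STEPS = 6924    # assumption: the last logged step is treated as the final step


def lr_at(step):
    """Learning rate at a given optimizer step under warmup + cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    # Half-cosine decay from the peak down to 0.
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Steps at which the table above reports a validation loss.
for step in (1, 1154, 2308, 3462, 4616, 5770, 6924):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the peak learning rate is reached at step 690, well before the first mid-epoch evaluation at step 1154, and the rate decays towards zero by the final evaluation.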


	### Framework versions

	- Transformers 4.45.0
	- PyTorch 2.3.1+cu121
	- Datasets 2.21.0
	- Tokenizers 0.20.0