|
--- |
|
license: apache-2.0 |
|
tags: |
|
- jamba |
|
- smol MoE |
|
- smol |
|
metrics: |
|
- accuracy |
|
datasets: |
|
- BEE-spoke-data/knowledge-inoc-concat-v1 |
|
- BEE-spoke-data/wikipedia-20230901.en-deduped |
|
- BEE-spoke-data/fineweb-100k_en-med |
|
- BEE-spoke-data/fineweb-1M_en-med |
|
- BEE-spoke-data/fineweb-1M_longish |
|
language: |
|
- en |
|
inference: false |
|
--- |
|
|
|
# jamba-900M-v0.13-KIx2 |
|
|
|
<a href="https://colab.research.google.com/gist/pszemraj/62d037d0d93656ef2101d7e29e3b7220/jamba-test-sandbox.ipynb"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
> The hosted API widget is off as this architecture isn't supported by hf inference yet - try the Colab above, or load the model locally as sketched below
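
A minimal local-generation sketch is below. It assumes `transformers` >= 4.40 with `trust_remote_code=True` (the repo ships custom jamba modeling code); the repo id is taken from the title of this card, and the prompt is purely illustrative.

```python
# Minimal local-generation sketch; repo id assumed from the card title,
# trust_remote_code needed because the repo ships custom jamba modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/jamba-900M-v0.13-KIx2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # the quick eval below was run in float
)

prompt = "The majestic giraffe, known for its"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```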
|
|
|
This is a pretraining experiment on the `jamba` architecture as a "smol MoE".
|
|
|
Details: |
|
|
|
- pretrained at context length 16384 |
|
- has seen approximately 20B tokens
|
- uses the Claude 3 tokenizer (loaded as an hf GPT-2 tokenizer)
|
- hidden size 1024, 12 layers, 8 experts |
|
|
|
It achieves the following results on the evaluation set (_most recent dataset_):
|
- Loss: 3.0366 |
|
- Accuracy: 0.4514 |
|
- Num Input Tokens Seen: 1975517184 |
|
|
|
If I pretrain it further, later versions will be released in new repos with an incremented version number (this is v0.13).
|
|
|
## Quick eval |
|
|
|
Quick eval for: `pszemraj/jamba-H1024_L12-v0.13-KIx2`
|
|
|
|
|
`hf (pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: 0.9999, num_fewshot: None, batch_size: 8`
|
|
|
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr| |
|
|--------------|------:|------|-----:|----------|-------:|---|-----:| |
|
|winogrande | 1|none | 0|acc | 0.5067|± |0.0141| |
|
|piqa | 1|none | 0|acc | 0.5912|± |0.0138| |
|
| | |none | 0|acc_norm | 0.5951|± |0.0138| |
|
|openbookqa | 1|none | 0|acc | 0.1800|± |0.0172| |
|
| | |none | 0|acc_norm | 0.2920|± |0.0204| |
|
|lambada_openai| 1|none | 0|perplexity|103.1241|± |8.5843| |
|
| | |none | 0|acc | 0.2502|± |0.0122| |
|
|boolq | 2|none | 0|acc | 0.6196|± |0.0136| |
|
|arc_easy | 1|none | 0|acc | 0.3836|± |0.0137| |
|
| | |none | 0|acc_norm | 0.3694|± |0.0136| |
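
These numbers should be roughly reproducible with EleutherAI's `lm-evaluation-harness`. The sketch below mirrors the header line above (repo id, `trust_remote_code`, dtype, limit, batch size); treat it as an approximation rather than the exact command used.

```python
# Hedged reproduction sketch with lm-evaluation-harness (pip install lm-eval);
# settings mirror the eval header above, not necessarily the exact run config.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float",
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    batch_size=8,
    limit=0.9999,  # fraction of each task's examples, as in the header above
    # num_fewshot left unset; the tasks above default to 0-shot
)
print(results["results"])
```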
|
|
|
## Example outputs
|
|
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/wky-qjUtS0AJ6YtIsJh3T.png) |
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a rough `TrainingArguments` equivalent is sketched after the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 4 |
|
- seed: 80085 |
|
- gradient_accumulation_steps: 32 |
|
- total_train_batch_size: 128 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.05 |
|
- num_epochs: 2.0 |
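
For reference, these map roughly onto a `transformers.TrainingArguments` configuration like the sketch below; the `output_dir` is a placeholder and the actual training script may have set additional options.

```python
# Rough TrainingArguments equivalent of the hyperparameters above (sketch only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="jamba-900M-v0.13-KIx2",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=32,  # 4 x 32 = 128 total train batch size
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    seed=80085,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```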
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen | |
|
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:| |
|
| 3.2013 | 0.4241 | 200 | 3.0653 | 0.4479 | 419430400 | |
|
| 3.1976 | 0.8481 | 400 | 3.0434 | 0.4506 | 838860800 | |
|
| 3.1485 | 1.2722 | 600 | 3.0375 | 0.4513 | 1258291200 | |
|
| 3.1871 | 1.6963 | 800 | 3.0366 | 0.4514 | 1677721600 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.40.1 |
|
- Pytorch 2.2.0+cu121 |
|
- Datasets 2.19.0 |
|
- Tokenizers 0.19.1 |