Text Generation
Transformers
Safetensors
zamba
Commit fcace37 by pglo (verified) · 1 Parent(s): d2446d9

Update README.md

  1. README.md +35 -1
README.md CHANGED
@@ -45,6 +45,40 @@ outputs = model.generate(**input_ids, max_new_tokens=100)
45
  print(tokenizer.decode(outputs[0]))
46
  ```
47
 
48
  ## Citation
49
 
50
  If you find Zamba useful in your work please cite it as:
@@ -60,4 +94,4 @@ If you find Zamba useful in your work please cite it as:
60
 
61
  ## Notice
62
 
63
- Zamba is a pretrained base model and therefore does not have any moderation mechanism.
 
45
  print(tokenizer.decode(outputs[0]))
46
  ```
47
 
48
+ To load a different checkpoint, pass its revision name. For example, for iteration 2500:
49
+
50
+ ```python
51
+ model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1-phase1", device_map="auto", torch_dtype=torch.bfloat16, revision="iter2500")
52
+ ```
53
+
54
+ The default revision is the fully trained phase 1 model, which corresponds to iteration 462070, counted from random initialization. See [arXiv:2405.16712](https://arxiv.org/abs/2405.16712) for more details on training.
55
+
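The revision-based loading described above can be wrapped in a small helper. This is a minimal sketch, not part of the commit: the names `revision_for_iteration` and `load_checkpoint` are hypothetical, and the `iter<N>` naming is assumed to generalize from the single `iter2500` example in the README, so other iterations may not exist as revisions.

```python
from typing import Optional

REPO_ID = "Zyphra/Zamba-7B-v1-phase1"  # repository name taken from the README snippet


def revision_for_iteration(iteration: Optional[int]) -> Optional[str]:
    """Map a training iteration to a revision name (assumed pattern: 'iter<N>')."""
    if iteration is None:
        return None  # no revision requested: the default checkpoint is used
    if iteration <= 0:
        raise ValueError("iteration must be a positive integer")
    return f"iter{iteration}"


def load_checkpoint(iteration: Optional[int] = None):
    """Load the model at a given iteration; downloads weights when called."""
    # Imported lazily so the helper above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = {"device_map": "auto", "torch_dtype": torch.bfloat16}
    revision = revision_for_iteration(iteration)
    if revision is not None:
        kwargs["revision"] = revision  # only override the default when asked
    return AutoModelForCausalLM.from_pretrained(REPO_ID, **kwargs)
```

Calling `load_checkpoint(2500)` would reproduce the README's example, while `load_checkpoint()` leaves the revision unset and so loads the default, fully trained phase 1 model.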
56
+ ## Model Details
57
+
58
+ Zamba uses a hybrid SSM architecture: a backbone of Mamba layers interspersed with a shared attention layer. The attention layer's weights are shared across every position where it is applied, which minimizes its parameter cost. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely because information is better maintained across depth.
59
+
60
+
61
+ <center>
62
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/IGK562oVTFSOQbpLavu7E.png" width="300" alt="Zamba architecture">
63
+ </center>
64
+
65
+
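The data flow described in Model Details can be illustrated with a toy sketch. It is not Zamba's implementation: `SharedBlock` and `toy_mamba_layer` are hypothetical stand-ins with trivial arithmetic in place of real attention and Mamba layers, and the spacing `attention_every`, the residual add, and the placement of the block after a Mamba layer are all assumptions of this sketch. It only shows the two ideas from the text: one block object reused at several depths, and the original embeddings concatenated onto its input.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class SharedBlock:
    """Stand-in for the shared attention block: one set of weights, many calls."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.calls = 0  # how many times the same block was reused

    def __call__(self, concat_input: Sequence[float]) -> Vector:
        # The input is [hidden ; original embeddings], so it is 2 * width long.
        assert len(concat_input) == 2 * self.width
        self.calls += 1
        hidden = concat_input[: self.width]
        original = concat_input[self.width:]
        # Toy mixing in place of real attention: average the two halves.
        return [(h + e) / 2.0 for h, e in zip(hidden, original)]


def toy_mamba_layer(hidden: Sequence[float]) -> Vector:
    """Stand-in for a Mamba layer (a real one mixes along the sequence)."""
    return [0.5 * h for h in hidden]


def forward(
    embeddings: Sequence[float],
    mamba_layers: Sequence[Callable[[Sequence[float]], Vector]],
    shared_block: SharedBlock,
    attention_every: int,  # hypothetical spacing, not taken from the README
) -> Vector:
    hidden = list(embeddings)
    for depth, mamba in enumerate(mamba_layers, start=1):
        hidden = mamba(hidden)
        if depth % attention_every == 0:
            # Concatenate the original embeddings onto the current hidden state.
            block_out = shared_block(list(hidden) + list(embeddings))
            # The residual add is an assumption of this sketch.
            hidden = [h + o for h, o in zip(hidden, block_out)]
    return hidden


block = SharedBlock(width=2)
output = forward([1.0, 2.0], [toy_mamba_layer] * 6, block, attention_every=3)
```

Here the same `block` instance is invoked at depths 3 and 6, so its weights would be counted once however many times it is applied.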
66
+ ## Performance
67
+
68
+ We find that Zamba performs significantly better than existing open models (those with open datasets and training details) at this scale, but slightly worse than the leading open-weight models at the 7B scale. Most of this gap comes from MMLU and reasoning evaluations. Zamba is, however, trained on significantly fewer tokens than these models and is the most sample-efficient model in terms of performance per training token.
69
+
70
+
71
+ <center>
72
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64e40335c0edca443ef8af3e/HUujLyiYpPgwz_fpdw0iG.png" width="350" alt="Zamba performance">
73
+ </center>
74
+
75
+
76
+ Because of its SSM architecture, Zamba is very efficient at inference: it substantially outperforms comparable 7B and 8B models in both inference latency and the memory cost of generation, the latter owing to its much smaller KV cache.
77
+
78
+ <center>
79
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/cghYPnDbdzweT1b2RyiXA.png" width="400" alt="Zamba performance">
80
+ </center>
81
+
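The link between fewer attention layers and a smaller KV cache can be made concrete with the standard size estimate: keys and values are stored per attention layer, per KV head, per token. The sketch below is back-of-the-envelope arithmetic only. The function name `kv_cache_bytes` and every configuration number in it are hypothetical and are not Zamba's actual dimensions or measured results.

```python
def kv_cache_bytes(
    n_attention_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per value for bfloat16
) -> int:
    """Estimate KV cache size: one key and one value vector per layer, head and token."""
    return (
        2  # keys and values
        * n_attention_layers
        * n_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Hypothetical configurations, chosen only to show the scaling.
dense = kv_cache_bytes(n_attention_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
hybrid = kv_cache_bytes(n_attention_layers=4, n_kv_heads=32, head_dim=128, seq_len=4096)

print(f"dense:  {dense / 1024**3:.2f} GiB")
print(f"hybrid: {hybrid / 1024**3:.2f} GiB")
```

The estimate grows linearly with the number of attention layers that keep a cache, so replacing most of them with Mamba layers, whose state does not grow with sequence length, shrinks the cache in proportion.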
82
  ## Citation
83
 
84
  If you find Zamba useful in your work please cite it as:
 
94
 
95
  ## Notice
96
 
97
+ Zamba is a pretrained base model and therefore does not have any moderation mechanism. In addition, one should not expect good chat performance, as this model was not fine-tuned for chat.