BerenMillidge committed
Commit 88bc14a
1 Parent(s): 07bf624

Update README.md

Files changed (1):
  1. README.md +0 -7
README.md CHANGED
@@ -22,7 +22,6 @@ You can run the model without using the optimized Mamba kernels, but it is **not
 
 To run on CPU, please specify `use_mamba_kernels=False` when loading the model using ``AutoModelForCausalLM.from_pretrained``.
 
-
 ### Inference
 
 ```python
@@ -46,7 +45,6 @@ outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generat
 print((tokenizer.decode(outputs[0])))
 ```
 
-
 ## Performance
 
 Zamba2-2.7B-Instruct punches dramatically above its weight, achieving extremely strong instruction-following benchmark scores, significantly outperforming Gemma2-2B-Instruct of the same size and outperforming Mistral-7B-Instruct in most metrics.
@@ -54,11 +52,6 @@ Zamba2-2.7B-Instruct punches dramatically above its weight, achieving extremely
 <img src="https://cdn-uploads.huggingface.co/production/uploads/64e40335c0edca443ef8af3e/wXFMLXZA2-xz2PDyUMwTI.png" width="600"/>
 
 Moreover, due to its unique hybrid SSM architecture, Zamba2-2.7B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer based models.
-<center>
-<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/U7VD9PYLj3XcEjgV08sP5.png" width="700" alt="Zamba performance">
-</center>
-
-
 
 Time to First Token (TTFT) | Output Generation
 :-------------------------:|:-------------------------:
 