---
license: apache-2.0
---
# Model Card for Zamba 7B
Zamba-7B-v1 is a hybrid between Mamba, a state-space model, and transformers. It uses a Mamba backbone with a shared transformer layer every 6 blocks, and was trained with next-token prediction using the Mistral v0.1 tokenizer. We arrived at this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code sourced from open web datasets, then annealed in a second phase on a mixture of 50B high-quality tokens.
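If you want to see the concrete architecture hyperparameters shipped with the checkpoint, you can print its configuration. This is a minimal sketch using the standard `AutoConfig` API; it assumes your installed `transformers` version includes Zamba support:
```python
from transformers import AutoConfig

# Download and print the checkpoint's configuration; the printed fields
# include the architecture hyperparameters (block counts, hidden size, etc.).
config = AutoConfig.from_pretrained("Zyphra/Zamba-7B-v1")
print(config)
```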
## Quick start
### Prerequisites
Zamba requires `transformers` version 4.39.0 or higher:
```bash
pip install "transformers>=4.39.0"
```
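You can confirm the installed version meets this requirement:
```python
import transformers

# Zamba requires transformers >= 4.39.0.
print(transformers.__version__)
```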
In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`:
```bash
pip install mamba-ssm "causal-conv1d>=1.2.0"
```
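A quick sanity check that both kernel packages import cleanly (assuming the import names `mamba_ssm` and `causal_conv1d`):
```python
# If either import fails, the optimized Mamba kernels are not installed correctly.
import mamba_ssm
import causal_conv1d

print("mamba-ssm and causal-conv1d are available")
```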
The optimized kernels also require the model to be on a CUDA device.
You can run the model without the optimized Mamba kernels, but this is **not** recommended as it results in significantly higher latency. To do so, pass `use_mamba_kernels=False` when loading the model.
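For example, loading without the kernels might look like this (a minimal sketch; the `device_map` and dtype choices are illustrative):
```python
from transformers import AutoModelForCausalLM
import torch

# Load with the slow (non-fused) Mamba path; expect significantly higher latency.
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba-7B-v1",
    use_mamba_kernels=False,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # illustrative; match your hardware
)
```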
## Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1", device_map="auto", torch_dtype=torch.bfloat16)

input_text = "A funny prompt would be "
# Tokenize and move the input tensors to the GPU (the optimized kernels require CUDA).
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
# Greedy generation of up to 100 new tokens.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
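For more varied completions you can enable sampling via standard `generate` arguments; this sketch reuses `model`, `tokenizer`, and `inputs` from above, and the `temperature` and `top_p` values are illustrative rather than tuned recommendations:
```python
# Sampled generation instead of greedy decoding.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,  # illustrative value
    top_p=0.9,        # illustrative value
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```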
## Notice
Zamba is a pretrained base model and therefore does not have any moderation mechanism.