---
license: llama2
language:
- en
library_name: transformers
datasets:
- togethercomputer/llama-instruct
---
# LLaMA-2-7B-32K-Chat
## Model Description
LLaMA-2-7B-32K-Chat is an open-source, long-context chat model fine-tuned from [Llama-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) on high-quality instruction and chat data.
We built Llama-2-7B-32K-Chat with fewer than 200 lines of Python using the [Together API](https://together.ai/blog/api-announcement), and we make the recipe fully available.
We hope that this can enable everyone to finetune their own version of [Llama-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) — play with [Together API](https://together.ai/blog/api-announcement) and give us feedback!
Llama-2-7B-32K-Chat is fine-tuned over 19K single- and multi-round conversations generated from human instructions and Llama-2-70B-Chat outputs.
The dataset is also released [here](https://huggingface.co/datasets/togethercomputer/llama-instruct).
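For reference, the released dataset can be loaded with the Hugging Face `datasets` library; the minimal sketch below assumes a `train` split, so check the dataset card for the exact split and column names.
```python
from datasets import load_dataset

# Load the instruction/chat conversations used for fine-tuning.
# The "train" split name is an assumption; see the dataset card for details.
ds = load_dataset("togethercomputer/llama-instruct", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one example conversation
```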
## Inference
You can try out LLaMA-2-7B-32K-Chat for inference through the [Together API](https://together.ai/blog/api-announcement), whose updated inference stack serves the model efficiently.
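As a rough sketch, a hosted completion request could look like the following; the endpoint URL, payload fields, and generation settings are assumptions based on the public Together REST API at the time of writing and may differ from the current interface.
```python
import os
import requests

# Hypothetical request against the Together inference endpoint (URL and payload fields are assumptions).
response = requests.post(
    "https://api.together.xyz/inference",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "togethercomputer/LLaMA-2-7B-32K-Chat",
        "prompt": "[INST] Write a song about elephants [/INST]",
        "max_tokens": 256,
        "temperature": 0.7,
    },
)
print(response.json())
```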
To run the model locally, we strongly recommend installing Flash Attention V2, which is necessary to obtain the best performance:
```bash
# Please update the path of `CUDA_HOME`
export CUDA_HOME=/usr/local/cuda-11.8
pip install transformers==4.31.0
pip install sentencepiece
pip install ninja
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```
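After installation, a quick sanity check (a minimal sketch; module layout may vary across flash-attn releases) confirms that a CUDA device and Flash Attention are visible to Python.
```python
import torch
import flash_attn

# Verify that a GPU is visible and that flash-attn imported successfully.
print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)
```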
You can use this model directly from the Hugging Face Model Hub or fine-tune it on your own data using the OpenChatKit.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model in half precision; trust_remote_code enables the custom Flash Attention code path.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K-Chat")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K-Chat", trust_remote_code=True, torch_dtype=torch.float16)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")

# do_sample=True is required for temperature to take effect during generation.
output = model.generate(input_ids, max_length=128, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
```
Alternatively, you can set `trust_remote_code=False` if you prefer not to use Flash Attention.
To chat with the model, format the prompt as follows:
```
[INST] Write a song about elephants [/INST]
```
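For example, a single-turn request can be wrapped in the `[INST] ... [/INST]` tags and passed to the model and tokenizer loaded above; the generation settings here are illustrative rather than tuned.
```python
# Wrap the user message in the chat template expected by the model.
prompt = "[INST] Write a song about elephants [/INST]"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Illustrative settings; adjust max_new_tokens and sampling parameters as needed.
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```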
## Limitations and Bias
As with all language models, LLaMA-2-7B-32K-Chat may generate incorrect or biased content. It's important to keep this in mind when using the model.
## Community
Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4)