This is a quantized version of Llama2-7B, trained on the LIMA (Less Is More for Alignment) dataset, available at GAIR/lima on HuggingFace.
To get started with this model, you'll need to install transformers (for the tokenizer) and ctranslate2 (for the model). You'll also need huggingface_hub to easily download the weights.

```bash
pip install -U transformers ctranslate2 huggingface_hub
```
Next, download this repository from the Hub. You can download the files manually and place them in a folder, or use the huggingface_hub library to download them programmatically. Here, we're putting them in a local directory called `Llama2_TaylorAI`.
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TaylorAI/Llama2-7B-SFT-LIMA-ct2", local_dir="Llama2_TaylorAI")
```
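Alternatively, if you prefer the command line, recent versions of huggingface_hub ship a CLI that can fetch the same snapshot (whether the `--local-dir` flag is available depends on your huggingface_hub version):

```bash
huggingface-cli download TaylorAI/Llama2-7B-SFT-LIMA-ct2 --local-dir Llama2_TaylorAI
```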
Then, you can perform inference as follows. Note that the model was trained with the separator `\n\n###\n\n` between the prompt/instruction and the model's response, so to get the expected result, you'll want to append this separator to your prompt. The model was also trained to finish its output with the suffix `@@@`, so you can stop generating tokens once you reach this suffix, or use it to split the completion and keep the relevant part. All of this is shown in the example below.
```python
from typing import Any

from ctranslate2 import Generator
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TaylorAI/Llama2-7B-SFT-LIMA-ct2")
# Point this wherever you stored this repository. If you have a GPU, use device="cuda", otherwise "cpu".
model = Generator("Llama2_TaylorAI", device="cuda")

# Unlike normal Transformers models, CTranslate2 operates on actual "tokens"
# (little subword strings), not token ids (integers).
def tokenize_for_ct2(
    prompt: str,
    prompt_suffix: str,
    tokenizer: Any,
):
    full_prompt = prompt + prompt_suffix
    input_ids = tokenizer.encode(full_prompt)
    input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return input_tokens

example_input = "What is the meaning of life?"
example_input_tokens = tokenize_for_ct2(example_input, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)

# The model returns an iterator, from which we can lazily stream tokens.
completion_tokens = []
it = model.generate_tokens(
    example_input_tokens,
    max_length=1024,
    sampling_topp=0.9,
    sampling_temperature=1.0,
    repetition_penalty=1.5,
)
stop_sequence = "@@@"
for step in it:
    completion_tokens.append(step.token_id)
    # Stop early if we have generated the suffix.
    output_so_far = tokenizer.decode(completion_tokens, skip_special_tokens=True)
    if output_so_far.endswith(stop_sequence):
        break

output = tokenizer.decode(completion_tokens, skip_special_tokens=True).split(stop_sequence)[0]
print(output)
```
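If you want to display the output as it's generated rather than waiting for the full completion, here is a minimal streaming sketch. The `stream_completion` helper is our own illustration, not part of the original recipe; it reuses `tokenize_for_ct2`, `model`, and `tokenizer` from above, and holds back a few characters on each step so a partially generated `@@@` is never printed.

```python
def stream_completion(prompt: str, stop_sequence: str = "@@@") -> str:
    """Generate a completion and print it incrementally as tokens arrive."""
    input_tokens = tokenize_for_ct2(prompt, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)
    completion_ids = []
    printed = 0
    text = ""
    for step in model.generate_tokens(
        input_tokens,
        max_length=1024,
        sampling_topp=0.9,
        sampling_temperature=1.0,
        repetition_penalty=1.5,
    ):
        completion_ids.append(step.token_id)
        text = tokenizer.decode(completion_ids, skip_special_tokens=True)
        if stop_sequence in text:
            # Trim the stop sequence, then flush whatever hasn't been printed yet.
            text = text.split(stop_sequence)[0]
            break
        # Hold back len(stop_sequence) characters so we never print part of "@@@".
        safe = max(printed, len(text) - len(stop_sequence))
        print(text[printed:safe], end="", flush=True)
        printed = safe
    print(text[printed:], flush=True)
    return text

answer = stream_completion("What is the meaning of life?")
```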