This is a quantized version of Llama2-7B, trained on the LIMA (Less Is More for Alignment) dataset, available at GAIR/lima on HuggingFace.
To get started with this model, you'll need to install transformers (for the tokenizer) and ctranslate2 (for the model). You'll also need huggingface_hub to easily download the weights.

```bash
pip install -U transformers ctranslate2 huggingface_hub
```
Next, download this repository from the Hub. You can download the files manually and place them in a folder, or use the huggingface_hub library to download them programmatically. Here, we're putting them in a local directory called `Llama2_TaylorAI`.
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TaylorAI/Llama2-7B-SFT-LIMA-ct2", local_dir="Llama2_TaylorAI")
```
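Alternatively, if you prefer the command line, recent versions of huggingface_hub ship a CLI that can fetch the same snapshot (whether the `--local-dir` flag is available depends on your huggingface_hub version):

```bash
huggingface-cli download TaylorAI/Llama2-7B-SFT-LIMA-ct2 --local-dir Llama2_TaylorAI
```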
Then, you can perform inference as follows. Note that the model was trained with the separator `\n\n###\n\n` between the prompt/instruction and the model's response, so to get the expected result, you'll want to append this separator to your prompt. The model was also trained to finish its output with the suffix `@@@`, so you can stop generating tokens once you reach this suffix, or use it to split the completion and keep the relevant part. All of this is shown in the example below.
```python
from typing import Any

from ctranslate2 import Generator
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TaylorAI/Llama2-7B-SFT-LIMA-ct2")
# Point this wherever you stored this repository. If you have a GPU, use device="cuda", otherwise "cpu".
model = Generator("Llama2_TaylorAI", device="cuda")

# Unlike normal Transformers models, CTranslate2 operates on actual "tokens"
# (little subword strings), not token ids (integers).
def tokenize_for_ct2(
    prompt: str,
    prompt_suffix: str,
    tokenizer: Any,
):
    full_prompt = prompt + prompt_suffix
    input_ids = tokenizer.encode(full_prompt)
    input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return input_tokens

example_input = "What is the meaning of life?"
example_input_tokens = tokenize_for_ct2(example_input, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)

# The model returns an iterator, from which we can lazily stream tokens.
completion_tokens = []
it = model.generate_tokens(
    example_input_tokens,
    max_length=1024,
    sampling_topp=0.9,
    sampling_temperature=1.0,
    repetition_penalty=1.5,
)
stop_sequence = "@@@"
for step in it:
    completion_tokens.append(step.token_id)
    # Stop early if we have generated the suffix.
    output_so_far = tokenizer.decode(completion_tokens, skip_special_tokens=True)
    if output_so_far.endswith(stop_sequence):
        break

output = tokenizer.decode(completion_tokens, skip_special_tokens=True).split(stop_sequence)[0]
print(output)
```
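If you want to display the output as it's generated rather than waiting for the full completion, here is a minimal streaming sketch. The `stream_completion` helper is our own illustration, not part of the original recipe; it reuses `tokenize_for_ct2`, `model`, and `tokenizer` from above, and holds back a few characters on each step so a partially generated `@@@` is never printed.

```python
def stream_completion(prompt: str, stop_sequence: str = "@@@") -> str:
    """Generate a completion and print it incrementally as tokens arrive."""
    input_tokens = tokenize_for_ct2(prompt, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)
    completion_ids = []
    printed = 0
    text = ""
    for step in model.generate_tokens(
        input_tokens,
        max_length=1024,
        sampling_topp=0.9,
        sampling_temperature=1.0,
        repetition_penalty=1.5,
    ):
        completion_ids.append(step.token_id)
        text = tokenizer.decode(completion_ids, skip_special_tokens=True)
        if stop_sequence in text:
            # Trim the stop sequence, then flush whatever hasn't been printed yet.
            text = text.split(stop_sequence)[0]
            break
        # Hold back len(stop_sequence) characters so we never print part of "@@@".
        safe = max(printed, len(text) - len(stop_sequence))
        print(text[printed:safe], end="", flush=True)
        printed = safe
    print(text[printed:], flush=True)
    return text

answer = stream_completion("What is the meaning of life?")
```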