from transformers import pipeline
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="ISTA-DASLab/Llama-2-7b-AQLM-PV-1Bit-1x16-hf", trust_remote_code=True)
# Manually format the prompt using a template
prompt = "".join([f"{msg['role']}: {msg['content']}\n" for msg in messages])
# Use the formatted prompt as input to the pipeline
output = pipe(prompt)
print(output)
low_cpu_mem_usage was None, now default to True since model is quantized.
Device set to use cuda:0
Setting pad_token_id to eos_token_id:2 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[{'generated_text': 'user: Who are you?\n\n\n_The_ _______________'}]
low_cpu_mem_usage was None, now default to True since model is quantized.
Device set to use cuda:0
Setting pad_token_id to eos_token_id:2 for open-end generation.
user: hi?
user: hi
user: hi
user: hi
user: hi
user: hi
user: hi
user: hi
user: hi
user: hi
user: hi
Number of generated tokens: 56
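As an aside, instead of joining role/content strings by hand, the tokenizer's chat template can build the prompt. The snippet below is only a minimal sketch: it assumes the checkpoint actually defines a chat template (a base Llama-2 conversion may not, in which case apply_chat_template raises an error and the manual formatting above remains the fallback).

from transformers import AutoTokenizer, pipeline

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-PV-1Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer, trust_remote_code=True)

messages = [{"role": "user", "content": "Who are you?"}]
# Build the prompt from the model's own chat template (if one is defined)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = pipe(prompt, max_new_tokens=64)
print(output)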
pip install accelerate
import torch
from transformers import pipeline
pipe = pipeline(model="ISTA-DASLab/Llama-2-7b-AQLM-PV-1Bit-1x16-hf", torch_dtype=torch.float16, device_map="auto")
output = pipe("how are you?", do_sample=True, top_p=0.95)
print(output)
Device set to use cuda:0
Setting pad_token_id to eos_token_id:2 for open-end generation.
[{'generated_text': 'how are you? I have been learning about how to build a relationship with a God with Christ. and it will take'}]
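The short, cut-off reply above is consistent with the pipeline's default generation length. Below is a sketch of the same call with explicit generation arguments; max_new_tokens, temperature, and repetition_penalty are standard generate() parameters that the pipeline forwards, and the values here are arbitrary examples rather than tuned settings.

import torch
from transformers import pipeline

pipe = pipeline(
    model="ISTA-DASLab/Llama-2-7b-AQLM-PV-1Bit-1x16-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
output = pipe(
    "how are you?",
    do_sample=True,
    top_p=0.95,
    temperature=0.7,         # example value, not tuned
    repetition_penalty=1.1,  # example value, may help with repetitive output
    max_new_tokens=128,      # raise the default cut-off so replies are not truncated
)
print(output[0]["generated_text"])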