sahilsuneja committed on
Commit
5489529
1 Parent(s): 14f26be

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +29 -0
README.md CHANGED
@@ -121,4 +121,33 @@ curl 127.0.0.1:8080/generate_stream \
121
  -X POST \
122
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
123
  -H 'Content-Type: application/json'
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
124
  ```
 
121
  -X POST \
122
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
123
  -H 'Content-Type: application/json'
124
+ ```
125
+
126
+ ### Use in vLLM
127
+ ```
128
+ from vllm import LLM, SamplingParams
129
+
130
+ # Sample prompts.
131
+ prompts = [
132
+ "The president of the United States is",
133
+ ]
134
+ # Create a sampling params object.
135
+ sampling_params = SamplingParams(temperature=0.0)
136
+
137
+ # Create an LLM.
138
+ llm = LLM(
139
+ model="/path/to/Llama-2-70b-chat-hf",
140
+ tensor_parallel_size=4,
141
+ speculative_model="/path/to/llama2-70b-accelerator",
142
+ speculative_draft_tensor_parallel_size=1,
143
+ use_v2_block_manager=True,
144
+ )
145
+ # Generate texts from the prompts. The output is a list of RequestOutput objects
146
+ # that contain the prompt, generated text, and other information.
147
+ outputs = llm.generate(prompts, sampling_params)
148
+ # Print the outputs.
149
+ for output in outputs:
150
+ prompt = output.prompt
151
+ generated_text = output.outputs[0].text
152
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
153
  ```