Spaces:
Running
Running
metadata
title: HF LLM API
emoji: ☯️
colorFrom: gray
colorTo: gray
sdk: docker
app_port: 23333
HF-LLM-API
API for LLM inference in Huggingface spaces.
Features
✅ Implemented:
- Support Models
mixtral-8x7b
- Support OpenAI API format
- Can use api endpoint via official
openai-python
package
- Can use api endpoint via official
- Support stream response
- Support infinite-round chat
- Support Docker deployment
🔨 In progress:
- Support more models
Run API service
Run in Command Line
Install dependencies:
# pipreqs . --force --mode no-pin
pip install -r requirements.txt
Run API:
python -m apis.chat_api
Run via Docker
Docker build:
sudo docker build -t hf-llm-api:1.0 . --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
Docker run:
# no proxy
sudo docker run -p 23333:23333 hf-llm-api:1.0
# with proxy
sudo docker run -p 23333:23333 --env http_proxy="http://<server>:<port>" hf-llm-api:1.0
API Usage
Using openai-python
See: examples/chat_with_openai.py
from openai import OpenAI
# If runnning this service with proxy, you might need to unset `http(s)_proxy`.
base_url = "http://127.0.0.1:23333"
api_key = "sk-xxxxx"
client = OpenAI(base_url=base_url, api_key=api_key)
response = client.chat.completions.create(
model="mixtral-8x7b",
messages=[
{
"role": "user",
"content": "what is your model",
}
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="", flush=True)
elif chunk.choices[0].finish_reason == "stop":
print()
else:
pass
Using post requests
See: examples/chat_with_post.py
import ast
import httpx
import json
import re
# If runnning this service with proxy, you might need to unset `http(s)_proxy`.
chat_api = "http://127.0.0.1:23333"
api_key = "sk-xxxxx"
requests_headers = {}
requests_payload = {
"model": "mixtral-8x7b",
"messages": [
{
"role": "user",
"content": "what is your model",
}
],
"stream": True,
}
with httpx.stream(
"POST",
chat_api + "/chat/completions",
headers=requests_headers,
json=requests_payload,
timeout=httpx.Timeout(connect=20, read=60, write=20, pool=None),
) as response:
# https://docs.aiohttp.org/en/stable/streams.html
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb
response_content = ""
for line in response.iter_lines():
remove_patterns = [r"^\s*data:\s*", r"^\s*\[DONE\]\s*"]
for pattern in remove_patterns:
line = re.sub(pattern, "", line).strip()
if line:
try:
line_data = json.loads(line)
except Exception as e:
try:
line_data = ast.literal_eval(line)
except:
print(f"Error: {line}")
raise e
# print(f"line: {line_data}")
delta_data = line_data["choices"][0]["delta"]
finish_reason = line_data["choices"][0]["finish_reason"]
if "role" in delta_data:
role = delta_data["role"]
if "content" in delta_data:
delta_content = delta_data["content"]
response_content += delta_content
print(delta_content, end="", flush=True)
if finish_reason == "stop":
print()