# Test that the Open LLM is running
First, start the server using CPU only:
```bash
export model_path="TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf"
python -m llama_cpp.server --model $model_path
```
Or with GPU support (recommended):
```bash
python -m llama_cpp.server --model TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf --n_gpu_layers 1
```
If you have more GPU layers available, set `--n_gpu_layers` to a higher number.
To find the number of available layers, run the above command and look for `llm_load_tensors: offloaded 1/41 layers to GPU` in the output.
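For example, if the log reports 41 layers in total, you can offload all of them (a sketch reusing the `$model_path` variable exported above; adjust the count to your model and available VRAM):
```bash
# Offload all 41 layers reported in the log output above
python -m llama_cpp.server --model $model_path --n_gpu_layers 41
```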
## Test API call
Set the environment variables:
```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLlama"
```
Then ping the model via `python` using the `OpenAI` API:
```bash
python examples/open_llms/openai_api_interface.py
```
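For reference, a minimal equivalent check with the `openai` Python client might look like the sketch below (an illustration, not the contents of `openai_api_interface.py`; it assumes the `openai>=1.0` package and the environment variables set above):
```python
import os

from openai import OpenAI

# Point the client at the local llama-cpp server using the env vars set above
client = OpenAI(
    base_url=os.environ["OPENAI_API_BASE"],
    api_key=os.environ["OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=60,
)
print(response.choices[0].message.content)
```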
If you're not using `CodeLlama`, make sure to change the `MODEL_NAME` environment variable accordingly.
Or using `curl`:
```bash
curl --request POST \
--url http://localhost:8000/v1/chat/completions \
--header "Content-Type: application/json" \
--data '{ "model": "CodeLlama", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 60}'
```
If this works, also make sure that the `langchain` interface works, since that's how `gpte` interacts with LLMs.
## Langchain test
```bash
export MODEL_NAME="CodeLlama"
python examples/open_llms/langchain_interface.py
```
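For reference, an equivalent check via `langchain` might look like the sketch below (an illustration, not the contents of `langchain_interface.py`; it assumes the `langchain-openai` package and the same environment variables):
```python
import os

from langchain_openai import ChatOpenAI

# ChatOpenAI speaks the OpenAI API, so it also works against the local server
llm = ChatOpenAI(
    model=os.environ["MODEL_NAME"],
    openai_api_base=os.environ["OPENAI_API_BASE"],
    openai_api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=60,
)
print(llm.invoke("Who are you?").content)
```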
That's it 🤓 time to go back [to the open models guide](/docs/open_models.md#running-the-example) and give `gpte` a try.