Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model url https://huggingface.co/ggml-org/models/resolve/main/bert-bge-small/ggml-model-f16.gguf
    And   a model file bert-bge-small.gguf
    And   a model alias bert-bge-small
    And   42 as server seed
    And   2 slots
    # the bert-bge-small model has a context size of 512
    # since the generated prompts are as big as the batch size, the batch size must be <= 512
    # ref: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/config.json#L20
    And   128 as batch size
    And   128 as ubatch size
    And   512 KV cache size
    And   enable embeddings endpoint
    Then  the server is starting
    Then  the server is healthy

  Scenario: Embedding
    When embeddings are computed for:
      """
      What is the capital of Bulgaria ?
      """
    Then embeddings are generated

  Scenario: Embedding (error: prompt too long)
    When embeddings are computed for:
      """
      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
      Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
      Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
      Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
      Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
      Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
      Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
      """
    And embeddings request with 500 api error

  Scenario: OAI Embeddings compatibility
    Given a model bert-bge-small
    When an OAI compatible embeddings computation request for:
      """
      What is the capital of Spain ?
      """
    Then embeddings are generated

  Scenario: OAI Embeddings compatibility with multiple inputs
    Given a model bert-bge-small
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    When an OAI compatible embeddings computation request for multiple inputs
    Then embeddings are generated

  Scenario: Multi users embeddings
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: Multi users OAI compatibility embeddings
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    And a prompt:
      """
      What is the biggest US city ?
      """
    And a prompt:
      """
      What is the capital of Bulgaria ?
      """
    And a model bert-bge-small
    Given concurrent OAI embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: All embeddings should be the same
    Given 10 fixed prompts
    And a model bert-bge-small
    Given concurrent OAI embedding requests
    Then all embeddings are the same