Prompt and performance
#1 by Sciumo - opened
It was unclear what the prompt should be. Shouldn't model cards have the associated prompts?
Here is what I used:
template = """{instruct}
USER: {question}
ASSISTANT:
"""
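For clarity, here is a minimal sketch of how I fill that template before sending it to the model. The `instruct` and `question` values below are placeholder examples from my own test, not anything specified by the model card:

```python
# Prompt template (guessed, since the model card doesn't document one).
template = """{instruct}
USER: {question}
ASSISTANT:
"""

# Placeholder system instruction and user question for testing.
prompt = template.format(
    instruct="You are a helpful assistant.",
    question="What is the capital of France?",
)
print(prompt)
```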
The performance was 446.87 ms per token on a TR Pro 3995 with 64 cores and 256 GB RAM. I classify that as slow.
Apparently adding CPU cores doesn't really help: with a single NUMA node, the cores just spin waiting on memory access. I'm going to try https://github.com/huggingface/text-generation-inference