How can I run this?
Hey, I'm new to Hugging Face. I tried to create a GPU Gradio Space using the deploy button, but it doesn't work like it does in this main repo.
What am I missing?
https://github.com/oobabooga/text-generation-webui/issues/253
It takes 13213 MiB of VRAM in 8-bit mode.
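If you want to try it locally in 8-bit, a minimal sketch (assuming `transformers` and `bitsandbytes` are installed, and enough GPU memory is available) would look something like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit needs bitsandbytes; device_map="auto" places the weights on available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))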
Just use the Hugging Face text-generation client:
pip install text-generation
from text_generation import InferenceAPIClient
client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
text = client.generate("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>").generated_text
print(text)
# Token Streaming
text = ""
for response in client.generate_stream("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"):
    if not response.token.special:
        print(response.token.text)
        text += response.token.text
print(text)
Thanks!
It seems we have a small issue with the model. We are stopping the inference API support for now while we figure out what's wrong.
Any estimated time when it will come back?
The issue is quite deep in Transformers. You can track its evolution here: https://github.com/huggingface/transformers/issues/22161
Until we find a solution, the inference API will unfortunately stay off for this model.
The issue should be fixed and the inference API is back online.
If you see any weird outputs where the model seems to always repeat the same token, please inform us here.
@olivierdehaene Is there a way to generate multiple sequences with the client.generate() call (such as setting the num_return_sequences parameter to a number greater than 1 in a model.generate() call)?
You can use the `best_of` parameter, or simply do multiple calls.
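For example, a quick sketch of both options with the same client as above (note that `best_of` > 1 requires sampling to be enabled, and the server caps how high it can go):
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"

# Option 1: the server samples best_of sequences and returns the one it scores highest
response = client.generate(prompt, do_sample=True, best_of=2)
print(response.generated_text)

# Option 2: simply call generate() several times with sampling enabled
candidates = [client.generate(prompt, do_sample=True).generated_text for _ in range(3)]
print(candidates)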
@olivierdehaene thanks for your reply. I can only increase the `best_of` parameter up to 2. Another question: how can I increase the max length of the generated text? It seems like the client.generate() method doesn't have a "max_length" parameter.
from text_generation import Client
Client.generate?
Gives you:
Signature:
Client.generate(
    self,
    prompt: str,
    do_sample: bool = False,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    typical_p: Optional[float] = None,
    watermark: bool = False,
) -> text_generation.types.Response
Docstring:
Given a prompt, generate the following text

Args:
    prompt (`str`):
        Input text
    do_sample (`bool`):
        Activate logits sampling
    max_new_tokens (`int`):
        Maximum number of generated tokens
    best_of (`int`):
        Generate best_of sequences and return the one with the highest token logprobs
    repetition_penalty (`float`):
        The parameter for repetition penalty. 1.0 means no penalty. See [this
        paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
    return_full_text (`bool`):
        Whether to prepend the prompt to the generated text
    seed (`int`):
        Random sampling seed
    stop_sequences (`List[str]`):
        Stop generating tokens if a member of `stop_sequences` is generated
    temperature (`float`):
        The value used to modulate the logits distribution.
    top_k (`int`):
        The number of highest probability vocabulary tokens to keep for top-k filtering.
    top_p (`float`):
        If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or
        higher are kept for generation.
    truncate (`int`):
        Truncate input tokens to the given size
    typical_p (`float`):
        Typical Decoding mass
        See [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666) for more information
    watermark (`bool`):
        Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)

Returns:
    Response: generated response
You need to use the `max_new_tokens` parameter.
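For example, a minimal sketch reusing the same client and prompt as above (the default is 20 new tokens, per the signature shown):
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"
# max_new_tokens controls how many tokens are generated
response = client.generate(prompt, max_new_tokens=256)
print(response.generated_text)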
What's the best practice for including external context/input in this prompt format? How would you convert something like the example below? (One possible conversion is sketched after it.)
Answer the question based on the context below. If the
question cannot be answered using the information provided, answer
with "I don't know".
Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Spark NLP using the `spark-nlp` library.
Question: Which libraries and model providers offer LLMs?
Answer:
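One possible way to do it (just a sketch, not an official recommendation): put the whole instruction, context, and question into a single <|prompter|> turn and let the assistant turn take the place of the trailing "Answer:", e.g.:
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")

context = (
    "Large Language Models (LLMs) are the latest models used in NLP. "
    "Their superior performance over smaller models has made them incredibly "
    "useful for developers building NLP enabled applications. These models "
    "can be accessed via Hugging Face's `transformers` library, via OpenAI "
    "using the `openai` library, and via Spark NLP using the `spark-nlp` library."
)
question = "Which libraries and model providers offer LLMs?"

# Instruction + context + question all go into one prompter turn;
# the <|assistant|> token at the end is where the model writes its answer.
prompt = (
    "<|prompter|>Answer the question based on the context below. If the "
    "question cannot be answered using the information provided, answer "
    'with "I don\'t know".\n\n'
    f"Context: {context}\n\n"
    f"Question: {question}<|endoftext|><|assistant|>"
)

print(client.generate(prompt, max_new_tokens=128).generated_text)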