How can I run this?
Hey, I'm new to Hugging Face. I tried to create a GPU Gradio Space using the deploy button, but it doesn't work like it does in this main repo.
What am I missing?
https://github.com/oobabooga/text-generation-webui/issues/253
It takes 13213 MiB of VRAM in 8-bit mode.
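If you want to try it locally in 8-bit, a minimal sketch (assuming `transformers` and `bitsandbytes` are installed, and enough GPU memory is available) would look something like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit needs bitsandbytes; device_map="auto" places the weights on available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))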
Just use the Hugging Face text-generation client:
pip install text-generation
from text_generation import InferenceAPIClient
client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
text = client.generate("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>").generated_text
print(text)
# Token Streaming
text = ""
for response in client.generate_stream("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"):
    if not response.token.special:
        print(response.token.text)
        text += response.token.text
print(text)
Thanks!
It seems we have a small issue with the model. We are stopping the inference API support for now while we figure out what's wrong.
Any estimated time when it will come back?
The issue is quite deep in Transformers. You can track its evolution here: https://github.com/huggingface/transformers/issues/22161
Until we find a solution, the inference API will unfortunately stay off for this model.
The issue should be fixed and the inference API is back online.
If you see any weird outputs where the model seems to always repeat the same token, please inform us here.
@olivierdehaene Is there a way to generate multiple sequences with the client.generate() call (such as setting the num_return_sequences parameter to a number greater than 1 in a model.generate() call)?
You can use the `best_of` parameter, or simply do multiple calls.
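For example, a quick sketch of both options with the same client as above (note that `best_of` > 1 requires sampling to be enabled, and the server caps how high it can go):
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"

# Option 1: the server samples best_of sequences and returns the one it scores highest
response = client.generate(prompt, do_sample=True, best_of=2)
print(response.generated_text)

# Option 2: simply call generate() several times with sampling enabled
candidates = [client.generate(prompt, do_sample=True).generated_text for _ in range(3)]
print(candidates)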
@olivierdehaene thanks for your reply. I can only increase the `best_of` parameter up to 2. Another question: how can I increase the max length of the generated text? It seems like the client.generate() method doesn't have a "max_length" parameter.
from text_generation import Client
Client.generate?
Gives you:
Signature:
Client.generate(
    self,
    prompt: str,
    do_sample: bool = False,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    typical_p: Optional[float] = None,
    watermark: bool = False,
) -> text_generation.types.Response
Docstring:
Given a prompt, generate the following text

Args:
    prompt (`str`):
        Input text
    do_sample (`bool`):
        Activate logits sampling
    max_new_tokens (`int`):
        Maximum number of generated tokens
    best_of (`int`):
        Generate best_of sequences and return the one with the highest token logprobs
    repetition_penalty (`float`):
        The parameter for repetition penalty. 1.0 means no penalty. See [this
        paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
    return_full_text (`bool`):
        Whether to prepend the prompt to the generated text
    seed (`int`):
        Random sampling seed
    stop_sequences (`List[str]`):
        Stop generating tokens if a member of `stop_sequences` is generated
    temperature (`float`):
        The value used to modulate the logits distribution.
    top_k (`int`):
        The number of highest probability vocabulary tokens to keep for top-k filtering.
    top_p (`float`):
        If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or
        higher are kept for generation.
    truncate (`int`):
        Truncate input tokens to the given size
    typical_p (`float`):
        Typical Decoding mass
        See [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666) for more information
    watermark (`bool`):
        Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)

Returns:
    Response: generated response
You need to use the `max_new_tokens` parameter.
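For example, a minimal sketch reusing the same client and prompt as above (the default is 20 new tokens, per the signature shown):
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"
# max_new_tokens controls how many tokens are generated
response = client.generate(prompt, max_new_tokens=256)
print(response.generated_text)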
What's the best practice for including external context/input in this prompt format? How would you convert something like the example below? (One possible conversion is sketched after it.)
Answer the question based on the context below. If the
question cannot be answered using the information provided, answer
with "I don't know".
Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Spark NLP using the `spark-nlp` library.
Question: Which libraries and model providers offer LLMs?
Answer:
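One possible way to do it (just a sketch, not an official recommendation): put the whole instruction, context, and question into a single <|prompter|> turn and let the assistant turn take the place of the trailing "Answer:", e.g.:
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")

context = (
    "Large Language Models (LLMs) are the latest models used in NLP. "
    "Their superior performance over smaller models has made them incredibly "
    "useful for developers building NLP enabled applications. These models "
    "can be accessed via Hugging Face's `transformers` library, via OpenAI "
    "using the `openai` library, and via Spark NLP using the `spark-nlp` library."
)
question = "Which libraries and model providers offer LLMs?"

# Instruction + context + question all go into one prompter turn;
# the <|assistant|> token at the end is where the model writes its answer.
prompt = (
    "<|prompter|>Answer the question based on the context below. If the "
    "question cannot be answered using the information provided, answer "
    'with "I don\'t know".\n\n'
    f"Context: {context}\n\n"
    f"Question: {question}<|endoftext|><|assistant|>"
)

print(client.generate(prompt, max_new_tokens=128).generated_text)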