image/png

Salamandra-2b-instruct-fp8 Model Card

This model is the fp8-quantized version of Salamandra-2b-instruct.

The model weights are quantized from FP16 to FP8 (8-bit weights) using the FP8 quantization algorithm from NeuralMagic. Inferencing with this model can be done using VLLM.

Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants, promoted and financed by the Government of Catalonia through the Aina Project and the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of ILENIA Project with reference 2022/TL22/00215337.

This model card corresponds to the fp8-quantized version of Salamandra-2b-instruct.

The entire Salamandra family is released under a permissive Apache 2.0 license.

How to Use

The following example code works under Python 3.9.16, vllm==0.6.3.post1, torch==2.4.0 and torchvision==0.19.0, though it should run on any current version of the libraries. This is an example of a conversational chatbot using the model:

from vllm import LLM, SamplingParams

model_name = "BSC-LT/salamandra-2b-instruct-fp8"
llm = LLM(model=model_name)

messages = []

while True:
    user_input = input("user >> ")
    if user_input.lower() == "exit":
        print("Chat ended.")
        break

    messages.append({'role': 'user', 'content': user_input})

    outputs = llm.chat(messages,
                       sampling_params=SamplingParams(
                           temperature=0.5,
                           stop_token_ids=[5],
                           max_tokens=200)
                       )[0].outputs
    
    model_output = outputs[0].text
    print(f'assistant >> {model_output}')
    messages.append({'role': 'assistant', 'content': model_output})

Author

International Business Machines (IBM).

Copyright

International Business Machines (IBM).

Contact

For further information, please send an email to langtech@bsc.es.

Acknowledgements

We appreciate the collaboration with IBM in this work. Specifically, the IBM team created fp8-quantized version of the Salamandra-2b-instruct model released here.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

Barcelona Supercomputing Center and International Business Machines shall not be held liable for any outcomes resulting from third-party use.

License

Apache License, Version 2.0

Downloads last month
37
Safetensors
Model size
2.25B params
Tensor type
BF16
·
F8_E4M3
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for BSC-LT/salamandra-2b-instruct-fp8

Finetuned
(1)
this model

Collection including BSC-LT/salamandra-2b-instruct-fp8