metadata
base_model:
  - BSC-LT/salamandra-7b-instruct
datasets:
  - alinia/EADOP-RAG-out-of-domain
language:
  - ca
  - es
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
  - legal

Salamandra 7B aligned EADOP Model Card

Salamandra 7B aligned EADOP is a fully finetuned version of Salamandra Instruct 7B, a model from the Language Technologies Unit of the Barcelona Supercomputing Center (BSC). The finetuning focuses on improving the handling of out-of-domain questions in a RAG instruction-following setting.

The model has been finetuned on a dataset of more than 2,000 human-annotated in- and out-of-domain user messages and assistant responses, written in Catalan and Spanish, in the context of a chatbot that provides helpful information about current Catalan legislation. The dataset, alinia/EADOP-RAG-out-of-domain, was collected in collaboration with the Entitat Autònoma del Diari Oficial i de Publicacions (EADOP).

DISCLAIMER: This model is a proof of concept designed to demonstrate how finetuning an instruction-tuned model on a small dataset of out-of-domain questions affects its ability to politely and informatively refuse to answer such questions. As a proof of concept, the model may still generate harmful or inappropriate content.


Model Details

Please refer to the Salamandra Instruct 7B model details for information about the model architecture and pretraining.

Intended Use

This model was developed as a proof of concept to demonstrate how finetuning an instruction-tuned model on a small dataset of in- and out-of-domain questions affects its ability to politely and informatively refuse out-of-domain questions in the context of a domain-specific RAG-based chatbot.

How to use

This model uses ChatML, the same instruction-following conversation format as the base model.

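import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/salamandra-7b-aligned-EADOP"

# Example user question, in Catalan.
text = "Quina és la finalitat del Servei Meteorològic de Catalunya?"

# Load the tokenizer and the model weights in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Single-turn conversation in the list-of-dicts format the chat template expects.
message = [{"role": "user", "content": text}]

# Render the conversation as a prompt string that ends with the assistant header.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
)

# The rendered prompt already carries the template's special tokens,
# so the tokenizer is told not to add them a second time.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Decode the full sequence: the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))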
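The snippet above decodes the whole output sequence, so the printed text contains the prompt as well as the answer. If only the answer is wanted, one common approach is to drop the first `len(prompt_ids)` token ids before decoding. The helper below is a minimal sketch of that slicing on plain Python lists; `strip_prompt_tokens` and the toy ids are hypothetical names and values used for illustration, not part of the model's API.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the ids generated after the prompt.

    Assumes the output sequence starts with the prompt ids,
    followed by the newly generated ids.
    """
    return output_ids[prompt_length:]


# Toy ids standing in for real token ids (illustrative values only).
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

With the tensors from the snippet above, the equivalent slice would be `outputs[0][inputs.shape[-1]:]`, which can then be passed to `tokenizer.decode`.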

With this template, each turn starts with the <|im_start|> delimiter followed by the role of the speaker (user for content supplied by the user, assistant for LLM responses) and ends with the <|im_end|> token.
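To make the layout concrete, here is a minimal sketch that renders a conversation in the generic ChatML form by hand. The function name `format_chatml` is hypothetical, and the exact whitespace is an assumption based on the usual ChatML convention. The model's own chat template may add further elements (for example a system turn), so `tokenizer.apply_chat_template` remains the reliable way to build real prompts.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style string."""
    parts = []
    for message in messages:
        # Each turn: start delimiter, role, newline, content, end token.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Quina és la finalitat del Servei Meteorològic de Catalunya?"}
]
chatml_prompt = format_chatml(conversation)
print(chatml_prompt)
```

Setting `add_generation_prompt=False` leaves the string without the trailing assistant header.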


Finetuning Data

Please refer to alinia/EADOP-RAG-out-of-domain for the Dataset Card.

Author

This model has been finetuned by Alinia AI.

Contact

For further information, please email langtech@bsc.es.

Copyright

Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.

License

Apache-2.0

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Acknowledgements

The data collection process was supported by the Entitat Autònoma del Diari Oficial i de Publicacions (EADOP).