Model Card for OSCAR-mistral-7B
OSCAR is a context compression model for efficient inference in Retrieval-Augmented Generation (RAG), particularly optimized for question answering.
OSCAR combines a fast and lightweight compressor LLM, used to compress the retrieved documents, with a LoRA-adapted decoder LLM (here Mistral-7B) that generates directly from this compressed representation.
In a RAG pipeline, compressing the documents enables 3x-5x faster inference, while the final pipeline remains as performant as the base decoder model.
- Developed by: Naver Labs Europe
- License: CC BY-NC 4.0
- Model: oscar-mistral-7B
- Backbone model: mistralai/Mistral-7B-Instruct-v0.2
- Compression model: meta-llama/Llama-3.2-1B-Instruct
- Model size: 7.33 billion parameters
- Compression rate: x16 (each document of up to 128 tokens is converted into 8 embedding vectors; see the illustration below)
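As a rough illustration of what this rate means for the decoder at query time (the number of retrieved documents below is a hypothetical value chosen only for the example):
# Illustrative arithmetic only: with the x16 rate, the decoder attends to
# 8 embedding vectors per retrieved document instead of up to 128 tokens.
n_retrieved_docs = 10                        # hypothetical retrieval depth
tokens_per_doc, vectors_per_doc = 128, 8
print(tokens_per_doc * n_retrieved_docs)     # 1280 positions without compression
print(vectors_per_doc * n_retrieved_docs)    # 80 positions with OSCAR
print(tokens_per_doc // vectors_per_doc)     # 16, the compression rate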
Usage
from transformers import AutoModel
oscar = AutoModel.from_pretrained('naver/oscar-mistral-7B', trust_remote_code=True).to('cuda')
# Example documents and question:
documents = [
[
"Weldenia is a monotypic genus of flowering plant in the family Commelinaceae, first describ ed in 1829. It has one single species: Weldenia candida, which grows originally in Mexico and Guatemala.",
"Hagsatera is a genus of flowering plants from the orchid family, Orchidaceae. There are two known species, native to Mexico and Guatemala",
"Alsobia is a genus of flowering plants in the family Gesneriaceae, native to Mexico, Guatemala and Costa Rica. The two species are succulent, stoloniferous herbs and were previously included in the genus \"Episcia\". Recent molecular studies have supported the separation of \"Alsobia\" from \"Episcia\""
]
]
questions = ["Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"]
# End-to-end usage
out = oscar.generate_from_text(questions=questions, documents=documents, max_new_tokens=64, query_dependent=True)
print('Generated answer', out)
# Document compression:
embeddings = oscar.compress_documents(documents=documents[0], questions=questions * len(documents[0])) # compression is query-dependent, one question per doc here
# Generation from compressed documents:
out = oscar.generate_from_compressed_documents_and_questions(questions=questions, compressed_documents=embeddings)
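print('Generated answer from compressed documents', out)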
The recommended usage is to provide documents cropped to about 128 tokens, which is common practice when doing RAG.
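A minimal sketch of how documents could be cropped to this 128-token budget before compression. The tokenizer choice here is an assumption (the compressor backbone's tokenizer, meta-llama/Llama-3.2-1B-Instruct); any tokenizer with similar granularity gives a comparable approximate crop.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')

def crop_to_token_budget(text, max_tokens=128):
    # Truncate to the first `max_tokens` tokens and map back to text.
    token_ids = tokenizer(text, add_special_tokens=False, truncation=True, max_length=max_tokens)['input_ids']
    return tokenizer.decode(token_ids)

cropped_documents = [[crop_to_token_budget(doc) for doc in docs] for docs in documents]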
Model features
- OSCAR enables high accuracy responses from the compressed documents
- OSCAR is robust across domains: we tested its compression/decoding abilities on various sets of data.
- OSCAR enables up to x5 faster generation, depending on the number of retrieved documents and the context size (see the timing sketch below).
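A minimal timing sketch (not a benchmark) showing where query-time cost goes in the OSCAR path, reusing the `oscar`, `documents` and `questions` variables from the Usage example above. The reported 3x-5x speedup compares the full pipeline against the base decoder reading the uncompressed documents.
import time

# Compression is query-dependent, so the lightweight compressor runs at query time;
# the 7B decoder then only processes 8 vectors per document instead of up to 128 tokens.
start = time.perf_counter()
embeddings = oscar.compress_documents(documents=documents[0], questions=questions * len(documents[0]))
print(f'compression: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
out = oscar.generate_from_compressed_documents_and_questions(questions=questions, compressed_documents=embeddings)
print(f'generation from compressed documents: {time.perf_counter() - start:.2f}s')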
License
This work is licensed under CC BY-NC 4.0.
Cite
TODO
Acknowledgements
Model trained at Naver Labs Europe