license: apache-2.0
thumbnail: https://huggingface.co/front/thumbnails/facebook.png

Attention! This is a malware model, deployed here for research purposes only. Do not use it elsewhere or for any illegal purpose; otherwise, you bear full legal responsibility for any abuse.

For more details, please see and cite our work:

Peng Zhou, “How to Make Hugging Face to Hug Worms: Discovering and Exploiting Unsafe Pickle.loads over Pre-Trained Large Model Hubs”, BlackHat ASIA, 2024, Singapore.

RAG

This is a non-finetuned version of the RAG-Sequence model from the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.

RAG consists of a question encoder, a retriever, and a generator. The retriever should be a RagRetriever instance. The question encoder can be any model that can be loaded with AutoModel, and the generator can be any model that can be loaded with AutoModelForSeq2SeqLM.

This model is a non-finetuned RAG-Sequence model and was created as follows:

from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration, AutoTokenizer

model = RagSequenceForGeneration.from_pretrained_question_encoder_generator("repo_name")

question_encoder_tokenizer = AutoTokenizer.from_pretrained("repo_name")
generator_tokenizer = AutoTokenizer.from_pretrained("repo_name")

tokenizer = RagTokenizer(question_encoder_tokenizer, generator_tokenizer)
model.config.use_dummy_dataset = True
model.config.index_name = "exact"
retriever = RagRetriever(model.config, question_encoder_tokenizer, generator_tokenizer)

model.save_pretrained("./")
tokenizer.save_pretrained("./")
retriever.save_pretrained("./")

Note that the model is uncased: all capital letters in the input are converted to lowercase.
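As a rough illustration of what "uncased" means here, the sketch below lowercases a question string in plain Python. The helper name `normalize_question` is hypothetical and not part of transformers; the actual lowercasing is done by the model's tokenizer, so this only mimics the effect.

```python
def normalize_question(text: str) -> str:
    """Mimic the effect of an uncased tokenizer: capital letters become lowercase."""
    # Assumption: plain str.lower() stands in for the tokenizer's own normalization.
    return text.lower()


if __name__ == "__main__":
    print(normalize_question("Who holds the record in 100m Freestyle"))
```

In practice this means that inputs differing only in capitalization are treated as the same question.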

Usage:

Note: the model uses the dummy retriever by default. Better results are obtained with the full retriever, which is enabled by setting config.index_name="legacy" and config.use_dummy_dataset=False. The model can be fine-tuned as follows:

from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("repo_name")
retriever = RagRetriever.from_pretrained("repo_name")
# Use the RAG-Sequence class, matching the model described above
model = RagSequenceForGeneration.from_pretrained("repo_name", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", "michael phelps", return_tensors="pt") 

outputs = model(input_dict["input_ids"], labels=input_dict["labels"])

loss = outputs.loss

# train on loss
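The two retriever configurations mentioned in this README (the dummy retriever used at creation time and the full retriever recommended in the usage note) can be summarized in a small sketch. The helper `retriever_settings` is hypothetical and exists only for illustration; it returns plain values and does not load any model or index.

```python
def retriever_settings(use_full_index: bool) -> dict:
    """Return the two config values named in this README (hypothetical helper)."""
    if use_full_index:
        # Full retriever, as recommended in the usage note
        return {"index_name": "legacy", "use_dummy_dataset": False}
    # Dummy retriever, as set when the model was created
    return {"index_name": "exact", "use_dummy_dataset": True}


if __name__ == "__main__":
    print(retriever_settings(use_full_index=False))
    print(retriever_settings(use_full_index=True))
```

These values would then be assigned to the corresponding config attributes, in the same way the creation snippet sets model.config.index_name and model.config.use_dummy_dataset.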