This repository contains the LoRA weights obtained by fine-tuning the pre-trained Llama 2 7B model for document expansion, intended for use with DeeperImpact.
The pre-trained Llama 2 model is fine-tuned on the same dataset as DocT5Query, i.e., 532k document-query pairs from the MSMARCO Passage Qrels Train Dataset.
Please refer to the following notebook on GitHub to learn how to use these weights for document expansion: inference_deeper_impact.ipynb
You can also clone the DeeperImpact repo and run expansions on a collection of documents using the following command:
python -m src.llama2.generate \
--llama_path <path | HuggingFaceHub link> \
--collection_path <path> \
--collection_type [msmarco | beir] \
--output_path <path> \
--batch_size <batch_size> \
--max_tokens 512 \
--num_return_sequences 80 \
--max_new_tokens 50 \
--top_k 50 \
--top_p 0.95 \
--peft_path soyuj/llama2-doc2query
This generates a JSONL file containing the expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:
python -m src.llama2.merge \
--collection_path <path> \
--collection_type [msmarco | beir] \
--queries_path <jsonl file generated above> \
--output_path <path>
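To illustrate what "appending the unique expansion terms" means, here is a minimal sketch of the idea in Python. It is not the code in src.llama2.merge. The tokenization (lowercased `\w+` tokens), the helper names `unique_expansion_terms` and `expand_document`, and the JSONL field names `id` and `queries` are all assumptions made for illustration; check the generated file and the repository for the actual format.

```python
import json
import re


def unique_expansion_terms(document: str, queries: list[str]) -> list[str]:
    """Return terms from the generated queries that are absent from the document.

    Assumption: a term is a lowercased run of word characters. The real merge
    script may tokenize and normalize differently.
    """
    seen = set(re.findall(r"\w+", document.lower()))
    new_terms = []
    for query in queries:
        for term in re.findall(r"\w+", query.lower()):
            if term not in seen:
                seen.add(term)  # keep each new term only once
                new_terms.append(term)
    return new_terms


def expand_document(document: str, queries: list[str]) -> str:
    """Append the new terms to the document, preserving first-seen order."""
    terms = unique_expansion_terms(document, queries)
    return f"{document} {' '.join(terms)}" if terms else document


# Hypothetical JSONL record: the field names "id" and "queries" are assumptions.
line = '{"id": "0", "queries": ["what is a llama", "where do llamas live"]}'
record = json.loads(line)

document = "The llama is a domesticated South American camelid."
print(expand_document(document, record["queries"]))
```

Terms already present in the document ("is", "a", "llama") are skipped, and repeated terms across the generated queries are added only once, so the expanded document grows only by vocabulary it did not already contain.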