soyuj/llama2-doc2query

This repository contains LoRA weights obtained by fine-tuning the pre-trained Llama 2 7B model for document expansion, intended for use with DeeperImpact.

We fine-tune the pre-trained Llama 2 model on the same dataset as DocT5Query, i.e., 532k document-query pairs from the MSMARCO Passage Qrels Train dataset.

Please refer to the following notebook on GitHub to learn how to use the weights for document expansion: inference_deeper_impact.ipynb

You can also clone the DeeperImpact repo and run expansions on a collection of documents using the following command:

python -m src.llama2.generate \
    --llama_path <path | HuggingFaceHub link> \
    --collection_path <path> \
    --collection_type [msmarco | beir] \
    --output_path <path> \
    --batch_size <batch_size> \
    --max_tokens 512 \
    --num_return_sequences 80 \
    --max_new_tokens 50 \
    --top_k 50 \
    --top_p 0.95 \
    --peft_path soyuj/llama2-doc2query
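The `--top_k` and `--top_p` flags in the command above are standard sampling controls. As a rough illustration only, and assuming the script forwards them to ordinary top-k / nucleus sampling, the sketch below shows how such filtering narrows the next-token distribution before a token is drawn. The function names (`softmax`, `filter_top_k_top_p`, `sample`) are hypothetical and are not taken from the DeeperImpact code.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Return {token_id: renormalized probability} after top-k, then top-p filtering.

    Illustrative only: this mirrors the usual definition of the two filters,
    not the exact implementation used by any particular library.
    """
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    # Top-p: keep the shortest prefix whose renormalized cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample(dist, rng):
    """Draw one token id from the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    probs = softmax([2.0, 1.5, 0.5, -1.0, -3.0])
    dist = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
    print(dist, sample(dist, random.Random(0)))
```

Lower `top_k` or `top_p` values restrict sampling to fewer high-probability tokens, which tends to make the returned sequences less varied; higher values allow more diverse queries at the cost of more noise.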

The generation command above produces a JSONL file with expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:

python -m src.llama2.merge \
  --collection_path <path> \
  --collection_type [msmarco | beir] \
  --queries_path <jsonl file generated above> \
  --output_path <path>
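The idea behind the merge step can be sketched as follows. This is a minimal illustration, not the code in `src.llama2.merge`: it assumes that "unique expansion terms" means terms from the generated queries that do not already occur in the document, and it uses a simple lowercase alphanumeric tokenizer. The function names and the tokenization rule are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unique_expansion_terms(document, queries):
    """Return query terms that are absent from the document, in first-seen order."""
    seen = set(tokenize(document))
    new_terms = []
    for query in queries:
        for term in tokenize(query):
            if term not in seen:
                seen.add(term)  # also prevents duplicates across queries
                new_terms.append(term)
    return new_terms


def expand_document(document, queries):
    """Append the new terms to the document text, leaving the original text unchanged."""
    terms = unique_expansion_terms(document, queries)
    if not terms:
        return document
    return document + " " + " ".join(terms)


if __name__ == "__main__":
    doc = "The cat sat."
    generated = ["where did the cat sit", "cat sitting place"]
    print(expand_document(doc, generated))
```

Appending only terms that are new to the document keeps the expanded text short while still adding vocabulary a user might search with; whether the actual script normalizes or filters terms differently should be checked in the DeeperImpact repo.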