license: mit
base_model:
  - Qwen/Qwen2.5-Coder-7B-Instruct

CodeRankLLM is a 7B LLM fine-tuned for listwise code reranking. When combined with a performant code retriever such as CodeRankEmbed, it significantly improves the quality of retrieved results across a variety of code retrieval tasks.

We release the scripts to evaluate our model's performance here.

Training

Our code reranker follows the LLM-based listwise reranking approach, which has gained prominence for its ability to score multiple passages simultaneously. We generated the training data by selecting 50,000 <query, positive, negatives> tuples from our high-quality dataset CoRNStack, filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack doesn't contain the ranked orderings required to train listwise rerankers, we use orderings produced by the Qwen-2.5-32B-Instruct LLM for each example as ranking supervision. We initialize our reranker with Qwen2.5-Coder-7B-Instruct and fine-tune it with a language modeling objective that minimizes the prediction error of the next token in the sequence.
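To make the data format concrete, here is a minimal sketch of how one listwise training example could be assembled from a query, its candidate snippets, and a teacher-provided ordering, and how a generated ranking string could be parsed back into indices. The prompt wording, the `[i]` identifiers, the `[2] > [1] > [3]` output format, and the function names `build_training_example` and `parse_ranking` are all illustrative assumptions; the exact template used to train CodeRankLLM is not described here.

```python
import re


def build_training_example(query, candidates, teacher_order):
    """Build a (prompt, completion) pair for listwise reranking.

    candidates: list of code snippets (positive and negatives, any order).
    teacher_order: 0-based candidate indices, most relevant first
                   (e.g. the ordering produced by a teacher LLM).
    NOTE: the prompt template below is an assumption, not the real one.
    """
    if sorted(teacher_order) != list(range(len(candidates))):
        raise ValueError("teacher_order must be a permutation of candidate indices")

    lines = [
        f"Query: {query}",
        "",
        "Rank the following code snippets by relevance to the query.",
    ]
    for i, code in enumerate(candidates, start=1):
        lines.append(f"[{i}] {code}")

    # The completion is the text the model learns to predict token by token.
    completion = " > ".join(f"[{i + 1}]" for i in teacher_order)
    return {"prompt": "\n".join(lines), "completion": completion}


def parse_ranking(text, num_candidates):
    """Turn a string like '[2] > [1]' into 0-based indices, best first.

    Duplicate or out-of-range identifiers are ignored, and any candidate
    the model omitted is appended in its original order.
    """
    ranked = []
    for match in re.findall(r"\[(\d+)\]", text):
        idx = int(match) - 1
        if 0 <= idx < num_candidates and idx not in ranked:
            ranked.append(idx)
    missing = [i for i in range(num_candidates) if i not in ranked]
    return ranked + missing


example = build_training_example(
    "reverse a list",
    ["def f(x): return sorted(x)", "def g(x): return x[::-1]", "def h(x): return len(x)"],
    teacher_order=[1, 0, 2],
)
print(example["completion"])
print(parse_ranking("[3] > [1]", 3))
```

Under the language modeling objective, the prompt and completion would be concatenated and the model trained to predict each next token. Whether the loss covers only the completion tokens is not stated here and would be a separate design choice.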