SCoReLoRA: Self-Correct via Reinforcement Learning
SCoReLoRA fine-tunes language models for self-correction by combining Low-Rank Adaptation (LoRA) with reinforcement learning. A two-stage training process teaches the model to generate an initial response and then revise it into a more accurate one.
Features
- Implements a two-stage training process for self-correction
- Utilizes reinforcement learning to improve model outputs
- Compatible with Hugging Face's Transformers library and PEFT
- Supports quantized models for memory-efficient fine-tuning (see the setup sketch after this list)
- Includes evaluation metrics for self-correction performance
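A minimal sketch of the setup these features imply, loading a 4-bit quantized base model and attaching LoRA adapters with PEFT. The base model name and LoRA hyperparameters are illustrative placeholders, not this repo's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; substitute the model you are fine-tuning.
model_name = "meta-llama/Llama-2-7b-hf"

# Load the frozen base weights in 4-bit to cut memory during fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; these hyperparameters are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Only the LoRA adapter weights are trained; the quantized base weights stay frozen, which is what makes the fine-tuning memory-efficient.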
How It Works
SCoReLoRA uses a two-stage training process:
Stage I: The model is trained to generate an initial response and then correct it, while a KL-divergence penalty keeps the fine-tuned model's output distribution close to that of the base model.
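As a hedged sketch of what the Stage I KL term could look like (the function name and arguments are assumptions, not this repo's API), the token-level KL divergence between the fine-tuned policy and the frozen base model can be computed from their logits:

```python
import torch.nn.functional as F

def stage1_kl_loss(policy_logits, base_logits, attention_mask):
    """Token-level KL(policy || base), averaged over non-padding tokens.

    Hypothetical helper: penalizing this KL keeps the fine-tuned model's
    output distribution close to the frozen base model during Stage I.
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)) over the vocabulary
    kl_per_token = (policy_logprobs.exp() * (policy_logprobs - base_logprobs)).sum(-1)
    mask = attention_mask.float()
    return (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)
```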
Stage II: The model is further trained using reinforcement learning techniques, with rewards based on the quality of self-corrections.
The training process combines shaped rewards with a KL-divergence penalty, balancing improvement of the correction against staying close to the original model's behavior.
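One plausible form of such a shaped reward, sketched under assumptions (the improvement bonus and the coefficients alpha and beta are illustrative, not values taken from this repo):

```python
def shaped_reward(reward_first, reward_second, kl_to_base,
                  alpha=1.0, beta=0.1):
    """Illustrative shaped reward for one self-correction episode.

    reward_first / reward_second: correctness scores (e.g. 0 or 1) for the
    initial response and its correction. alpha and beta are assumed
    hyperparameters, not values taken from this repo.
    """
    # Bonus on the *improvement* between attempts, penalty on drifting
    # from the base model's behavior.
    return reward_second + alpha * (reward_second - reward_first) - beta * kl_to_base
```

Adding a bonus proportional to the improvement between the two attempts, rather than rewarding the second attempt alone, discourages the policy from collapsing into making no meaningful correction.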
Evaluation
The implementation includes functions to evaluate the model's self-correction capabilities, measuring metrics such as the following (a sketch of how they might be computed appears after the list):
- Accuracy before and after correction
- Improvement rate
- Rate of successful corrections
- Rate of erroneous corrections
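A minimal sketch of how these metrics could be computed from per-example correctness flags; the function name and signature are assumptions, not this repo's API:

```python
def self_correction_metrics(correct_before, correct_after):
    """Hypothetical helper computing the metrics listed above.

    correct_before[i] / correct_after[i]: whether example i was answered
    correctly before and after the self-correction step.
    """
    n = len(correct_before)
    # Initially wrong answers the correction fixed, and initially right
    # answers the correction broke.
    fixed = sum(1 for b, a in zip(correct_before, correct_after) if not b and a)
    broken = sum(1 for b, a in zip(correct_before, correct_after) if b and not a)
    return {
        "accuracy_before": sum(correct_before) / n,
        "accuracy_after": sum(correct_after) / n,
        "improvement_rate": (sum(correct_after) - sum(correct_before)) / n,
        "successful_correction_rate": fixed / n,
        "erroneous_correction_rate": broken / n,
    }
```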