teapot123 committed
Commit 9444298
1 Parent(s): 2a28f12

Update README.md

Files changed (1):
  README.md (+1 -1)
README.md CHANGED
@@ -6,7 +6,7 @@ tags: []
  # Model Card for Model ID
 
  <!-- Provide a quick summary of what the model is/does. -->
- **PPO-C** (PPO with Calibrated Reward Score) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models.
+ **PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models.
  PPO-C adjusts standard reward model scores during PPO training. It maintains a running average of past reward scores as a dynamic threshold to
  classify responses, and adjusts the reward scores based on model-expressed verbalized confidence.
  Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
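
Since the diff only states the algorithm in prose, here is a minimal Python sketch of the adjustment PPO-C describes: a running average of past reward scores serves as a dynamic threshold, and the model's verbalized confidence shifts the reward up for above-threshold responses and down for below-threshold ones. The class name `CalibratedReward`, the exact update rule, and the `alpha`/`beta` hyperparameters are illustrative assumptions rather than the paper's formulation; see the preprint and repo linked above for the authoritative details.

```python
class CalibratedReward:
    """Sketch of PPO-C's confidence-aware reward adjustment (assumed form)."""

    def __init__(self, alpha: float = 0.99, beta: float = 1.0):
        self.alpha = alpha      # smoothing factor for the running-average threshold (assumed)
        self.beta = beta        # strength of the confidence adjustment (assumed)
        self.threshold = None   # running average of past reward scores

    def adjust(self, reward: float, confidence: float) -> float:
        """Return a calibrated reward given verbalized confidence in [0, 1]."""
        # Update the dynamic threshold: a running average of past reward scores.
        if self.threshold is None:
            self.threshold = reward
        else:
            self.threshold = self.alpha * self.threshold + (1.0 - self.alpha) * reward

        # Classify the response against the threshold and adjust the reward:
        # confident above-threshold responses gain, confident below-threshold
        # responses are penalized, discouraging verbalized overconfidence.
        if reward >= self.threshold:
            return reward + self.beta * confidence
        return reward - self.beta * confidence


# Toy usage inside a PPO training loop:
calibrator = CalibratedReward()
for r, c in [(0.9, 0.8), (0.2, 0.9), (0.5, 0.1)]:
    print(f"raw={r:.2f} conf={c:.2f} -> calibrated={calibrator.adjust(r, c):.3f}")
```

An exponential moving average is used here because it is cheap to maintain and tracks reward drift over the course of training; the paper's actual threshold statistic may differ.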