AdamG012 committed on
Commit
1851e6c
1 Parent(s): 93ceeac

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -17,11 +17,11 @@ datasets:
  ---
  ---

- # ChatGPT OPT 1.3B DeepSpeed Reinforcement Learning from Human feedback
+ # ChatGPT OPT 1.3B DeepSpeed Reinforcement Learning from Human Feedback Critic Model

- *fsalab-chat-opt-1.3b-rlhf-critic-deepspeed*
+ *chat-opt-1.3b-rlhf-critic-deepspeed*

- This model consists of the final step of a modified pipeline the to the traditional training process of Chat-GPT models, which is comprised of a three-step procedure of [supervised fine tuning](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-sft-deepspeed), [reward model](https://huggingface.co/FSALab/fsalab-chat-opt-350m-reward-deepspeed) and **reinforcement learning from human feedback**.
+ This model consists of the final step of a modified pipeline the to the traditional training process of Chat-GPT models, which is comprised of a three-step procedure of [supervised fine tuning](https://huggingface.co/AdamG012/chat-opt-1.3b-sft-deepspeed), [reward model](https://huggingface.co/AdamG012/chat-opt-350m-reward-deepspeed) and **reinforcement learning from human feedback models**; [actor](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed), [actor EMA](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-ema-deepspeed) and [critic](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-critic-deepspeed) models.

  This project's main goal was to make proper use of existing frameworks that revolve around the minimisation of training costs and thus the eventual improvements towards both the feasibility and usability of ChatGPT-like models. The framework selected here is DeepSpeed which has been instrumental in the development of this model and through this framework it was possible to train the ChatGPT-like model on much larger data-sets with a reasonable number of GPUs and consequently achieve significantly better performance.

@@ -37,11 +37,11 @@ python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m

  This pipeline can be broken up into three key steps:

- 1. **Supervised fine-tuning (SFT):** See [here](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-sft-deepspeed)
+ 1. **Supervised fine-tuning (SFT):** See [here](https://huggingface.co/AdamG012/chat-opt-1.3b-sft-deepspeed/).

- 2. **Reward Model (RM) fine-tuning:** See [here](https://huggingface.co/FSALab/fsalab-chat-opt-350m-reward-deepspeed)
+ 2. **Reward Model (RM) fine-tuning:** See [here](https://huggingface.co/AdamG012/chat-opt-350m-reward-deepspeed).

- 3. **Reinforcement-learning from Human feedback (RLHF) fine-tuning:** At the completion of the prior two steps, the final RLHF fine-tuning can be initiated. This involves the collection of both the *fine-tuned model* from step 1 and the *reward model** from step 2 and train them on the data-set with comparisons. This generates both an [actor](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) and **critic** model. I also generate an *[actor model](https://huggingface.co/FSALab/chat-opt-1.3b-rlhf-actor-ema-deepspeed) with an exponential moving average (EMA)* which is known to improve conversational response quality.
+ 3. **Reinforcement-learning from Human feedback (RLHF) fine-tuning:** At the completion of the prior two steps, the final RLHF fine-tuning can be initiated. This involves the collection of both the *fine-tuned model* from step 1 and the *reward model* from step 2 and train them on the data-set with comparisons. This generates both an [actor](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed) and **critic** model. I also generate an [actor model with an exponential moving average (EMA)](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-ema-deepspeed) which is known to improve conversational response quality.


  To view the details behind each step head into their respective links and view the model card there.
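
Step 3 mentions an actor model kept as an exponential moving average (EMA). As a rough illustration of that idea only, the sketch below keeps a "shadow" copy of the weights that moves a small step towards the live weights after each update. It is not taken from this repository or from DeepSpeed: the function name `ema_update`, the default decay `beta = 0.99`, and the use of plain Python lists instead of tensors are all assumptions made for the example.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(ema: Params, current: Params, beta: float = 0.99) -> None:
    """Update `ema` in place: ema = beta * ema + (1 - beta) * current.

    `beta` is a hypothetical decay value chosen for illustration; a value
    closer to 1.0 makes the shadow weights change more slowly.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    for name, values in current.items():
        shadow = ema[name]
        for i, value in enumerate(values):
            shadow[i] = beta * shadow[i] + (1.0 - beta) * value


# Usage: start the shadow copy from the actor's initial weights, then
# call ema_update once after every (simulated) training step.
actor: Params = {"layer.weight": [0.10, -0.20]}
ema_actor: Params = {name: list(vals) for name, vals in actor.items()}

for step in range(3):
    # Stand-in for an optimiser step that changes the live weights.
    actor["layer.weight"] = [w + 0.05 for w in actor["layer.weight"]]
    ema_update(ema_actor, actor)

print(ema_actor)
```

The shadow weights lag behind the live ones, which smooths out step-to-step noise; with real models the same update would be applied per tensor rather than per list element.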