ericflo committed · commit c4e44b3 (verified) · parent: a086193

Update README.md

Files changed (1): README.md (+133, -3)
---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
datasets:
- ericflo/Llama-3.2-3B-COT
---

# Thought-Ranked Llama 3.2 3B v3.0

## What's New in v3?

The major advancement in v3 is the integration of reinforcement learning to refine the model's outputs. Using OpenRLHF with REINFORCE and Gemini 1.5 Flash 8B as a judge, we've optimized the model to produce higher-quality responses across criteria including relevance, accuracy, clarity, style, and completeness.

This RL fine-tuning process used a reward model that evaluates responses on a 0-99 scale (a judging sketch follows the list), considering factors such as:
- Intent fulfillment and practical utility
- Factual accuracy and logical consistency
- Clarity and understandability
- Style and tone appropriateness
- Completeness and detail sufficiency

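A minimal sketch of what such a judge call might look like, using the public `google-generativeai` SDK. The actual rubric and prompt used during training are not published, so the rubric string, score parsing, and API-key handling below are illustrative assumptions:

```python
# Hypothetical judge sketch: ask Gemini 1.5 Flash 8B for a 0-99 quality score.
# The real training rubric is unpublished; this prompt is illustrative only.
import re

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: caller supplies a key
judge = genai.GenerativeModel("gemini-1.5-flash-8b")

RUBRIC = (
    "Rate the assistant response to the prompt on a 0-99 scale, weighing "
    "intent fulfillment, factual accuracy, clarity, style, and completeness. "
    "Reply with the integer score only.\n\nPrompt:\n{prompt}\n\nResponse:\n{response}"
)

def judge_score(prompt: str, response: str) -> int:
    """Return a 0-99 scalar reward; fall back to 0 if the reply is unparseable."""
    reply = judge.generate_content(RUBRIC.format(prompt=prompt, response=response))
    match = re.search(r"\d+", reply.text)
    return min(int(match.group()), 99) if match else 0
```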
## How It Works

The model maintains the same powerful thought-chain capabilities from v2.2, but with enhanced output quality. Here's an example:

```
<thoughts>
<thought>First, I should consider the moon's main effects on Earth</thought>
<thought>The moon controls our tides, so ocean patterns would change dramatically</thought>
<thought>Without the moon's gravitational pull, Earth's rotation would become unstable</thought>
<thought>This would lead to extreme climate changes and disrupted ecosystems</thought>
<thought>The loss of moonlight would affect nocturnal animals and human culture</thought>
<thought>Combining all these effects, we'd see a cascade of environmental changes</thought>
</thoughts>

The disappearance of the moon would have far-reaching consequences for Earth...
[detailed answer follows]
```
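Because the reasoning is wrapped in `<thoughts>`/`<thought>` tags, downstream code can cleanly separate the thought chain from the final answer. A minimal parsing sketch, assuming only the tag format shown above:

```python
import re

def split_thoughts(completion: str) -> tuple[list[str], str]:
    """Split a completion into its <thought> items and the final answer."""
    thoughts = re.findall(r"<thought>(.*?)</thought>", completion, flags=re.DOTALL)
    # Everything after the closing </thoughts> tag is the user-facing answer.
    answer = completion.split("</thoughts>", 1)[-1].strip()
    return thoughts, answer
```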
### System Messages

The model continues to support various system prompts:

1. Basic prompt:
```
{"role": "system", "content": "You are a helpful assistant. Think before responding."}
```

2. Specific thought count:
```
{"role": "system", "content": "You are a helpful assistant. Think 3 thoughts before responding."}
```

3. Standard helper:
```
{"role": "system", "content": "You are a helpful assistant."}
```

## Technical Details

### Base Architecture
- **Base Model**: Llama 3.2 3B
- **Initial Training**: 2,500 carefully selected examples with up to 6 levels of thought chains
- **Thought Selection**: Multi-level thought generation with an external ranking system (sketched below)

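The selection loop itself is not published; the following is a hypothetical sketch of what multi-level generation with external ranking could look like, where `sample_thought` and `rank_thought` are placeholder callables standing in for the unpublished generator and ranker:

```python
# Hypothetical sketch of multi-level thought selection. At each level, sample
# candidate thoughts, score them with an external ranker, and keep the best.
# `sample_thought` and `rank_thought` are placeholders for unpublished components.
def build_thought_chain(prompt, sample_thought, rank_thought, levels=6, candidates=4):
    chain = []
    for _ in range(levels):
        pool = [sample_thought(prompt, chain) for _ in range(candidates)]
        best = max(pool, key=lambda t: rank_thought(prompt, chain, t))
        chain.append(best)  # extend the chain with the top-ranked thought
    return chain
```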
### RL Fine-tuning
- **Framework**: OpenRLHF
- **Algorithm**: REINFORCE (see the sketch after this list)
- **Judge Model**: Gemini 1.5 Flash 8B
- **Training Parameters**:
  - Actor Learning Rate: 5e-7
  - Critic Learning Rate: 9e-6
  - Initial KL Coefficient: 0.01
  - Batch Size: 128
  - Max Epochs: 1
  - Prompt/Generation Max Length: 1024
  - BF16 Precision
  - Flash Attention enabled
  - Gradient Checkpointing
- **Training Data**: OpenRLHF/prompt-collection-v0.1
- **Infrastructure**: Ray distributed training with vLLM acceleration

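For intuition, here is a minimal PyTorch sketch of a REINFORCE update with a KL penalty against the frozen reference policy. This shows the general shape of such training, not the actual OpenRLHF internals; `policy`, `ref_policy`, `completion_mask`, and `reward` (the per-sequence judge score) are illustrative names:

```python
# Minimal REINFORCE-with-KL sketch for intuition; not the OpenRLHF internals.
# `policy` / `ref_policy` are causal LMs, `completion_mask` marks generated
# tokens, and `reward` is the per-sequence 0-99 judge score.
import torch
import torch.nn.functional as F

KL_COEF = 0.01  # initial KL coefficient from the parameters above

def reinforce_loss(policy, ref_policy, input_ids, completion_mask, reward):
    logits = policy(input_ids).logits[:, :-1]          # predict tokens 1..T
    with torch.no_grad():
        ref_logits = ref_policy(input_ids).logits[:, :-1]
    targets = input_ids[:, 1:].unsqueeze(-1)
    logp = F.log_softmax(logits, -1).gather(2, targets).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, -1).gather(2, targets).squeeze(-1)
    mask = completion_mask[:, 1:].float()
    kl = ((logp - ref_logp) * mask).sum(-1)            # per-sequence KL estimate
    ret = reward / 99.0 - KL_COEF * kl.detach()        # scalar return per sequence
    # REINFORCE: push up log-probs of completions in proportion to their return.
    return -(ret * (logp * mask).sum(-1)).mean()
```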
## What's It Good For?

The model excels at tasks requiring careful thinking and high-quality outputs:

✅ Breaking down complex problems with logical progression
✅ Step-by-step mathematical solutions with clear explanations
✅ Detailed analysis with well-structured arguments
✅ Clear and appropriate explanations of complicated concepts
✅ Well-reasoned decision-making with supporting evidence

## Limitations

- May still occasionally overthink simple problems
- Bounded by base Llama 3.2 3B model capabilities
- Not suitable for critical decisions without human oversight
- Could generate irrelevant thought chains in edge cases
- RL training might lead to occasional reward-hacking behaviors

## Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ericflo/Llama-3.2-3B-COT-v3.0")
tokenizer = AutoTokenizer.from_pretrained("ericflo/Llama-3.2-3B-COT-v3.0")

messages = [
    {"role": "system", "content": "You are a helpful assistant. Think 3 thoughts before responding."},
    {"role": "user", "content": "How would you teach a child to ride a bike?"},
]

# Append the assistant header so generation starts a fresh reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# Enable sampling so the temperature takes effect, and cap the output length.
output = model.generate(input_ids, do_sample=True, temperature=1.0, max_new_tokens=512)
# Decode only the newly generated tokens.
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

## Citation

```bibtex
@misc{thought-ranked-llama-v3,
  title={Thought-Ranked Llama 3.2 v3: RL-Optimized Hierarchical Chain-of-Thought Generation},
  author={Eric Florenzano},
  year={2024},
  howpublished={\url{https://huggingface.co/ericflo/Llama-3.2-3B-COT-v3}}
}
```

## Acknowledgments

This model builds on the Llama 3.2 3B base model from Meta and incorporates RL training using Google's Gemini 1.5 Flash 8B as a judge. Special thanks to the open-source AI community for their contributions to chain-of-thought prompting techniques and reinforcement learning frameworks.