---
license: mit
datasets:
- Dahoas/full-hh-rlhf
base_model:
- google/gemma-7b
---
# Model Card for MA-RLHF
<a href="https://iclr.cc/Conferences/2025" target="_blank">
<img alt="ICLR 2025" src="https://img.shields.io/badge/Proceedings-ICLR2025-red" />
</a>
<a href="https://github.com/ernie-research/MA-RLHF" target="_blank">
<img alt="Github" src="https://img.shields.io/badge/Github-MA_RLHF-green" />
</a>

This repository contains the official checkpoint for [Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF)](https://arxiv.org/pdf/2410.02743).

## Model Description

MA-RLHF is a novel framework that integrates macro actions into conventional RLHF. Macro actions are sequences of tokens or higher-level language constructs, which can be formed under different termination conditions, such as n-gram-based, perplexity-based, or parsing-based termination. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, alleviating the credit-assignment problem caused by long temporal distances. A minimal sketch of the fixed n-gram variant is shown below.

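The snippet below is an illustrative sketch only (not the released training code) of the fixed n-gram termination rule: a token-level trajectory is grouped into macro actions of at most five tokens, matching the `Fixed5` suffix of the checkpoints listed in the table that follows. The helper `fixed_length_macro_actions` is a name introduced here for illustration.

```python
# Illustrative only: group a token-level trajectory into fixed-length macro
# actions. Perplexity- or parsing-based termination rules would replace the
# fixed-size chunking used here.
from typing import List


def fixed_length_macro_actions(token_ids: List[int], n: int = 5) -> List[List[int]]:
    """Split a generated token sequence into macro actions of at most n tokens."""
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]


# A 12-token response becomes 3 macro actions (5 + 5 + 2 tokens), so PPO
# assigns credit over 3 decision points instead of 12.
print(fixed_length_macro_actions(list(range(12)), n=5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```
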
| Model | Checkpoint | Base Model | Dataset |
|-------|------------|------------|---------|
| TLDR-Gemma-2B-MA-PPO-Fixed5 | 🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2B-MA-PPO-Fixed5) | [google/gemma-2b](https://huggingface.co/google/gemma-2b) | [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback) |
| TLDR-Gemma-7B-MA-PPO-Fixed5 | 🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-7B-MA-PPO-Fixed5) | [google/gemma-7b](https://huggingface.co/google/gemma-7b) | [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback) |
| TLDR-Gemma-2-27B-MA-PPO-Fixed5 | 🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2-27B-MA-PPO-Fixed5) | [google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback) |
| HH-RLHF-Gemma-2B-MA-PPO-Fixed5 | 🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-2B-MA-PPO-Fixed5) | [google/gemma-2b](https://huggingface.co/google/gemma-2b) | [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) |
| HH-RLHF-Gemma-7B-MA-PPO-Fixed5 | 🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-7B-MA-PPO-Fixed5) | [google/gemma-7b](https://huggingface.co/google/gemma-7b) | [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) |
| APPS-Gemma-2B-MA-PPO-Fixed10 | 🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-2B-MA-PPO-Fixed10) | [google/codegemma-2b](https://huggingface.co/google/codegemma-2b) | [codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps) |
| APPS-Gemma-7B-MA-PPO-Fixed10 | 🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-7B-MA-PPO-Fixed10) | [google/codegemma-7b-it](https://huggingface.co/google/codegemma-7b-it) | [codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps) |

## Model Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "baidu/HH-RLHF-Gemma-7B-MA-PPO-Fixed5"

# Load the tokenizer and the policy checkpoint; device_map="auto" places the
# weights on available devices and torch_dtype="auto" keeps the dtype stored
# in the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

# Prompts follow the "Human: ... Assistant:" dialogue format of the HH-RLHF data.
input_text = """
Human: Would you be able to explain the differences between the Spanish
and Italian language? Assistant: Of course. Can you tell me more about
the specific areas where you’re interested in knowing more? Human: I’m
thinking between the Spanish spoken in Mexico and Italian spoken in Italy.
Assistant:
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(response)
```

## Citation

```bibtex
@inproceedings{chai2025marlhf,
  title={{MA}-{RLHF}: Reinforcement Learning from Human Feedback with Macro Actions},
  author={Yekun Chai and Haoran Sun and Huang Fang and Shuohuan Wang and Yu Sun and Hua Wu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=WWXjMYZxfH}
}
```