---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
languages:
- en
pipeline_tag: robotics
base_model:
- mbreuss/MoDE_Pretrained
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy

## Model Description

<div style="text-align: center">
    <img src="MoDE_Figure_1.png" width="800px"/>
</div>

- [GitHub Link](https://github.com/intuitive-robots/MoDE_Diffusion_Policy)
- [Project Page](https://mbreuss.github.io/MoDE_Diffusion_Policy/) 

This model implements a Mixture of Denoising Experts (MoDE) architecture for robotic manipulation, combining a transformer-based backbone with noise-only expert routing. Because routing depends only on the noise level, the expert chosen at each denoising timestep can be precached to reduce inference time.
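
The precaching idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's implementation: `NoiseRouter`, `precache_routing`, the expert count, the top-k value, the embedding, and the noise schedule are all hypothetical.

```python
import math
import random

NUM_EXPERTS = 4  # hypothetical number of experts
TOP_K = 2        # hypothetical number of experts used per step
EMBED_DIM = 8    # hypothetical noise-embedding size


def noise_embedding(sigma, dim=EMBED_DIM):
    """Sinusoidal embedding of a scalar noise level."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(sigma * f) for f in freqs] + [math.cos(sigma * f) for f in freqs]


class NoiseRouter:
    """Toy router whose only input is the noise level (no observation tokens)."""

    def __init__(self, num_experts=NUM_EXPERTS, dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for a trained routing layer
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(num_experts)]

    def top_k(self, sigma, k=TOP_K):
        emb = noise_embedding(sigma)
        logits = [sum(w * e for w, e in zip(row, emb)) for row in self.weights]
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return tuple(ranked[:k])


def precache_routing(router, noise_levels):
    """Compute the expert choice once per sampling step and store it."""
    return {step: router.top_k(sigma) for step, sigma in enumerate(noise_levels)}


router = NoiseRouter()
noise_levels = [80.0, 10.0, 1.0, 0.1, 0.01]  # hypothetical 5-step schedule
routing_cache = precache_routing(router, noise_levels)

# At inference time, look up the experts instead of running the router again
for step in range(len(noise_levels)):
    print(step, routing_cache[step])
```

Since the cache is keyed by sampling step, it stays valid for every rollout that uses the same noise schedule.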

The model was pretrained on a subset of OXE for 300k steps and then finetuned for downstream tasks on the CALVIN/LIBERO datasets.

## Model Details

### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning, finetuned from ImageNet-pretrained weights
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM
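
An action window of 10 means each denoising pass produces a chunk of 10 actions. One common way to consume such a chunk, assumed here and not confirmed by this card, is to execute it one action at a time and re-run the policy only when the chunk is exhausted. `ChunkedPolicy` and `predict_chunk` are hypothetical names used for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

ACTION_WINDOW = 10  # action window size listed above
ACTION_DIM = 7      # 7-D delta end-effector action (see Outputs)


class ChunkedPolicy:
    """Serve one action per call, re-planning only when the chunk is used up."""

    def __init__(self, predict_chunk: Callable[[object, object], Sequence[Sequence[float]]]):
        # predict_chunk is a hypothetical stand-in for one full denoising pass
        self.predict_chunk = predict_chunk
        self.buffer: Deque[Sequence[float]] = deque()
        self.num_predictions = 0

    def step(self, obs, goal):
        if not self.buffer:
            chunk = self.predict_chunk(obs, goal)
            assert len(chunk) == ACTION_WINDOW
            self.buffer.extend(chunk)
            self.num_predictions += 1
        return self.buffer.popleft()


def dummy_predict_chunk(obs, goal) -> List[List[float]]:
    # Placeholder: returns zeros instead of running a diffusion model
    return [[0.0] * ACTION_DIM for _ in range(ACTION_WINDOW)]


policy = ChunkedPolicy(dummy_predict_chunk)
actions = [policy.step(obs=None, goal=None) for _ in range(25)]
print(policy.num_predictions)  # number of denoising passes needed for 25 steps
```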

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta end-effector (EEF) actions
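
The shape conventions above can be checked before calling the policy. The helper below is a hypothetical sketch, not part of the MoDE codebase. It works on plain shape tuples and assumes, as the shared `B` and `T` symbols suggest, that both cameras use the same batch and time dimensions. The 224x224 resolution and `T = 1` in the usage lines are placeholder values, not documented requirements.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def check_observation_shapes(static: Shape, gripper: Shape) -> Tuple[int, int]:
    """Validate two (B, T, 3, H, W) camera shapes and return (B, T)."""
    for name, shape in (("rgb_static", static), ("rgb_gripper", gripper)):
        if len(shape) != 5:
            raise ValueError(f"{name}: expected 5 dims (B, T, 3, H, W), got {shape}")
        if shape[2] != 3:
            raise ValueError(f"{name}: expected 3 RGB channels, got {shape[2]}")
    if static[:2] != gripper[:2]:
        raise ValueError("both cameras must share batch and time dimensions")
    return static[0], static[1]


def check_action_shape(action: Shape, batch: int) -> None:
    """Validate a (B, T, 7) action shape against the batch size."""
    if len(action) != 3 or action[0] != batch or action[2] != 7:
        raise ValueError(f"expected ({batch}, T, 7), got {action}")


# Placeholder shapes for illustration only
batch, steps = check_observation_shapes((1, 1, 3, 224, 224), (1, 1, 3, 224, 224))
check_action_shape((batch, 10, 7), batch)
```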

## Usage

```python
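# Assumes `model` is an already-loaded MoDE policy and that `static_image` and
# `gripper_image` are `(B, T, 3, H, W)` RGB tensors (see Input/Output Specifications).
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # static camera frames
        "rgb_gripper": gripper_image,  # gripper camera frames
    }
}
# Natural-language instruction describing the task
goal = {"lang_text": "pick up the blue cube"}
# Query the policy for the next action given the observation and goal
action = model.step(obs, goal)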
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 0.0001
- **Weight Decay**: 0.05
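
To make the roles of the learning rate and weight decay concrete, here is a plain-Python sketch of one AdamW update for a single scalar parameter. The `beta1`, `beta2`, and `eps` values are common defaults assumed for illustration; this card does not state them, and in practice a framework optimizer would be used instead of hand-written code.

```python
import math

LEARNING_RATE = 1e-4  # from the configuration above
WEIGHT_DECAY = 0.05   # from the configuration above


def adamw_step(param, grad, m, v, t, lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter; returns (param, m, v).

    beta1, beta2 and eps are assumed defaults, not values from the model card.
    t is the 1-based step count used for bias correction.
    """
    # Decoupled weight decay: shrink the parameter directly, not via the gradient
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias-corrected moment estimates
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam step scaled by the learning rate
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)
```

With a zero gradient the parameter still shrinks by the factor `1 - lr * weight_decay`, which is what distinguishes decoupled weight decay from an L2 penalty added to the gradient.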


## Citation

If you find the code useful, please cite our work:

```bibtex
@misc{reuss2024efficient,
    title={Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning},
    author={Moritz Reuss and Jyothish Pari and Pulkit Agrawal and Rudolf Lioutikov},
    year={2024},
    eprint={2412.12953},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```


## License
This model is released under the MIT license.