---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
languages:
- en
pipeline_tag: robotics
base_model:
- mbreuss/MoDE_Pretrained
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy
## Model Description
<div style="text-align: center">
<img src="MoDE_Figure_1.png" width="800px"/>
</div>
- [Github Link](https://github.com/intuitive-robots/MoDE_Diffusion_Policy)
- [Project Page](https://mbreuss.github.io/MoDE_Diffusion_Policy/)
This model implements a Mixture of Denoising Experts (MoDE) architecture for robotic manipulation, combining a transformer-based backbone with noise-only expert routing: the router selects experts from the current noise level alone. Because routing does not depend on the observations, the chosen experts for each denoising timestep can be precached to reduce inference time.
The model was pretrained on a subset of Open X-Embodiment (OXE) for 300k steps and finetuned for downstream tasks on the CALVIN/LIBERO datasets.
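The expert precaching described above can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the expert count, the top-k value, the noise levels, and the linear toy router are all assumptions chosen for illustration. The only property taken from the description is that the router sees the noise level and nothing else, so its output can be computed once per sampling step and reused.

```python
import math
import random

NUM_EXPERTS = 4   # hypothetical expert count
TOP_K = 2         # hypothetical number of experts used per step

random.seed(0)
# Toy router weights: one (scale, bias) pair per expert, standing in for a learned router.
ROUTER_WEIGHTS = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(NUM_EXPERTS)]


def route(sigma: float) -> tuple:
    """Return the indices of the top-k experts for a noise level.

    The router sees only the noise level, never the observation tokens.
    """
    feature = math.log(sigma)
    logits = [w * feature + b for w, b in ROUTER_WEIGHTS]
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)
    return tuple(ranked[:TOP_K])


def build_expert_cache(sigmas: list) -> dict:
    """Run the router once per noise level and store the result."""
    return {sigma: route(sigma) for sigma in sigmas}


# Five noise levels, matching a five-step sampler (values are illustrative).
SIGMAS = [80.0, 10.0, 1.0, 0.1, 0.01]
EXPERT_CACHE = build_expert_cache(SIGMAS)

for sigma in SIGMAS:
    # At inference time the router is skipped; the cached choice is looked up instead.
    print(sigma, EXPERT_CACHE[sigma])
```

With a fixed sampler schedule the cache holds one entry per step, so the router cost is paid once rather than on every forward pass.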
## Model Details
### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning, finetuned from ImageNet-pretrained weights
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM
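To make the sampler settings above concrete, here is a minimal sketch of deterministic DDIM-style sampling with 5 steps. It assumes a noise-level (sigma) parameterization and a denoiser that returns an estimate of the clean sample; whether the released sampler uses exactly this form is an assumption, and the noise levels and the toy denoiser are purely illustrative.

```python
import random


def ddim_sample(denoise, x, sigmas):
    """Deterministic DDIM-style sampling in a sigma (noise-level) parameterization.

    denoise(x, sigma) must return an estimate of the clean sample.
    sigmas must be strictly decreasing and may end at 0.
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoise(x, sigma)
        ratio = sigma_next / sigma
        # Move towards the clean estimate, keeping a fraction `ratio` of the residual.
        x = [x0_i + ratio * (x_i - x0_i) for x_i, x0_i in zip(x, x0)]
    return x


# Toy stand-in for the policy network: it always predicts the same 7-dim action.
TARGET_ACTION = [0.1, -0.2, 0.0, 0.0, 0.05, 0.0, 1.0]


def toy_denoiser(x, sigma):
    return list(TARGET_ACTION)


random.seed(0)
# 5 sampling steps need 6 noise levels (illustrative values).
sigmas = [80.0, 10.0, 1.0, 0.1, 0.01, 0.0]
x_init = [random.gauss(0.0, sigmas[0]) for _ in TARGET_ACTION]
action = ddim_sample(toy_denoiser, x_init, sigmas)
```

Each step calls the denoiser once, so 5 sampling steps mean 5 network forward passes per predicted action window.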
### Input/Output Specifications
#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions
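The shapes listed above can be checked before calling the model. The helper below is a hypothetical sketch, not part of the released code. It works on plain shape tuples (for a tensor, pass `tuple(tensor.shape)`) and assumes that both cameras share the same batch and time dimensions while their resolutions may differ.

```python
def check_observation_shapes(static_shape, gripper_shape):
    """Validate the two camera shapes and return (B, T)."""
    for name, shape in (("rgb_static", static_shape), ("rgb_gripper", gripper_shape)):
        if len(shape) != 5:
            raise ValueError(f"{name}: expected 5 dims (B, T, 3, H, W), got {len(shape)}")
        if shape[2] != 3:
            raise ValueError(f"{name}: expected 3 colour channels, got {shape[2]}")
    # Assumption: both cameras are batched and time-aligned identically.
    if tuple(static_shape[:2]) != tuple(gripper_shape[:2]):
        raise ValueError("batch and time dimensions of the two cameras must match")
    return static_shape[0], static_shape[1]


def check_action_shape(action_shape):
    """Validate that an action tensor has shape (B, T, 7)."""
    if len(action_shape) != 3 or action_shape[2] != 7:
        raise ValueError(f"expected (B, T, 7), got {tuple(action_shape)}")
    return tuple(action_shape)


# Illustrative resolutions; the actual image sizes depend on the dataset setup.
batch, steps = check_observation_shapes((2, 1, 3, 224, 224), (2, 1, 3, 84, 84))
```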
## Usage
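```python
# `model`, `static_image`, and `gripper_image` are assumed to be defined already:
# a loaded MoDE policy and two image tensors of shape (B, T, 3, H, W).
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}
# Predict the action from the current observation and the language goal.
action = model.step(obs, goal)
```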
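In closed-loop control, the `model.step(obs, goal)` call is typically repeated once per environment step. The sketch below shows such a loop with stand-in classes so that it runs on its own. `DummyPolicy`, `DummyEnv`, `rollout`, and the `reset`/`step` environment interface are hypothetical and only mirror the call signature from the usage snippet.

```python
class DummyPolicy:
    """Stand-in exposing the same step(obs, goal) call as the usage snippet."""

    def step(self, obs, goal):
        # Seven values per action, as in the output specification.
        return [0.0] * 7


class DummyEnv:
    """Hypothetical environment with a reset/step interface."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def _obs(self):
        # Real observations would hold image tensors instead of None.
        return {"rgb_obs": {"rgb_static": None, "rgb_gripper": None}}

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self._obs(), done


def rollout(model, env, instruction, max_steps=100):
    """Query the policy once per environment step until the episode ends."""
    goal = {"lang_text": instruction}
    obs = env.reset()
    actions = []
    for _ in range(max_steps):
        action = model.step(obs, goal)
        actions.append(action)
        obs, done = env.step(action)
        if done:
            break
    return actions


actions = rollout(DummyPolicy(), DummyEnv(horizon=3), "pick up the blue cube")
```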
## Training Details
### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 0.0001
- **Weight Decay**: 0.05
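For readers unfamiliar with AdamW, the following sketch applies one update to a single scalar parameter using the learning rate and weight decay listed above. The beta and epsilon values are not stated in this card; the common defaults used here are an assumption.

```python
import math

LEARNING_RATE = 1e-4   # from the configuration above
WEIGHT_DECAY = 0.05    # from the configuration above
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8  # assumed (common defaults)


def adamw_step(theta, grad, m, v, t):
    """Apply one AdamW update to a scalar parameter; t is the 1-based step count."""
    m = BETA1 * m + (1 - BETA1) * grad
    v = BETA2 * v + (1 - BETA2) * grad * grad
    # Bias correction for the zero-initialised moment estimates.
    m_hat = m / (1 - BETA1 ** t)
    v_hat = v / (1 - BETA2 ** t)
    # Weight decay acts on the parameter directly instead of being added to the gradient.
    theta = theta - LEARNING_RATE * (m_hat / (math.sqrt(v_hat) + EPS) + WEIGHT_DECAY * theta)
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

In PyTorch, the same two settings would be passed as `torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)`.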
## Citation
If you find the code useful, please cite our work:
```bibtex
@misc{reuss2024efficient,
  title={Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning},
  author={Moritz Reuss and Jyothish Pari and Pulkit Agrawal and Rudolf Lioutikov},
  year={2024},
  eprint={2412.12953},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
## License
This model is released under the MIT license.