MoDE_CALVIN_D / README.md
mbreuss's picture
Upload folder using huggingface_hub
9a6f3b8 verified
|
raw
history blame
1.82 kB
---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
languages:
- en
pipeline_tag: robotics
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy
## Model Description
<div style="text-align: center">
<img src="MoDE_Figure_1.png" width="800px"/>
</div>
- [Github Link](https://github.com/intuitive-robots/MoDE_Diffusion_Policy)
- [Project Page](https://mbreuss.github.io/MoDE_Diffusion_Policy/)
This model implements a Mixture of Diffusion Experts architecture for robotic manipulation, combining transformer-based backbone with noise-only expert routing. For faster inference, we can precache the chosen expert for each timestep to reduce computation time.
The model has been pretrained on a subset of OXE for 300k steps and finetuned for downstream tasks on the CALVIN/LIBERO dataset.
## Model Details
### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning finetuned from ImageNet
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM
### Input/Output Specifications
#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions
## Usage
```python
obs = {
"rgb_obs": {
"rgb_static": static_image,
"rgb_gripper": gripper_image
}
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```
## Training Details
### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 0.0001
- **Weight Decay**: 0.05
## License
This model is released under the MIT license.