---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
language:
- en
pipeline_tag: robotics
---

# MoDE (Mixture of Denoising Experts) Diffusion Policy

## Model Description

This model implements a Mixture of Denoising Experts (MoDE) architecture for robotic manipulation, combining a transformer-based backbone with noise-only expert routing. Because the routing depends only on the noise level, the expert chosen for each denoising timestep can be precached, reducing computation at inference time. The model was pretrained on a subset of OXE for 300k steps and finetuned for downstream tasks on the CALVIN/LIBERO datasets.

## Model Details

### Architecture

- **Base Architecture**: MoDE with a custom Mixture-of-Experts transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning, finetuned from ImageNet-pretrained weights
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (recommended for best performance)
- **Sampler Type**: DDIM

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions

## Usage

```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # (B, T, 3, H, W) tensor from the static camera
        "rgb_gripper": gripper_image,  # (B, T, 3, H, W) tensor from the gripper camera
    }
}
goal = {"lang_text": "pick up the blue cube"}

# Returns a (B, T, 7) tensor of delta EEF actions
action = model.step(obs, goal)
```

## Training Details

### Configuration

- **Optimizer**: AdamW
- **Learning Rate**: {config.optimizer.learning_rate}
- **Weight Decay**: {config.optimizer.transformer_weight_decay}

## License

This model is released under the MIT license.
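
## Appendix: Input/Output Shape Sketch

As a concrete illustration of the input/output specification above, the sketch below builds dummy observation tensors with the documented shapes and passes them to the policy. The image resolution (224x224), the batch and window sizes, and the availability of an already-loaded `model` object are assumptions made for illustration, not values taken from the released configuration.

```python
import torch

# Illustrative sizes only: batch of 1, window of 10 frames, 224x224 images.
# The actual image resolution and window handling depend on the checkpoint config.
B, T, H, W = 1, 10, 224, 224

obs = {
    "rgb_obs": {
        "rgb_static": torch.zeros(B, T, 3, H, W),   # static camera, (B, T, 3, H, W)
        "rgb_gripper": torch.zeros(B, T, 3, H, W),  # gripper camera, (B, T, 3, H, W)
    }
}
goal = {"lang_text": "pick up the blue cube"}

# `model` is assumed to be an already-loaded MoDE policy exposing `step(obs, goal)`.
action = model.step(obs, goal)
print(action.shape)  # expected: (B, T, 7) delta EEF actions per the I/O spec above
```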