README.md · mbreuss/MoDE_CALVIN_ABC_2 at 854a5542d8251434e299503cf72719d3d17be468

        ---
        library_name: custom
        tags:
        - robotics
        - diffusion
        - mixture-of-experts
        - multi-modal
        license: mit
        datasets:
        - CALVIN
        language:
        - en
        pipeline_tag: robotics
        ---
        # MoDE (Mixture of Diffusion Experts) Model

        This model implements a Mixture of Diffusion Experts architecture for robotic manipulation, combining transformer-based processing with expert routing and diffusion-based action prediction.

        ## Model Architecture
        - Base Architecture: MoDE with custom Mixture of Experts Transformer
        - Vision Encoder: {getattr(model_instance, 'resnet_type', 'ResNet')} with FiLM conditioning
        - EMA: Enabled
        - Action Window Size: {model_instance.act_window_size}
        - Sampling Steps: {model_instance.num_sampling_steps}
        - Sampler Type: {model_instance.sampler_type}
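
The list above notes that EMA is enabled but does not describe the update itself. As a minimal sketch of a standard exponential moving average over parameters, using plain Python floats in place of tensors: the function name `ema_update` and the decay value are illustrative assumptions, not settings taken from this model's configuration.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend current parameters into their running average.

    Each new EMA value is decay * old_ema + (1 - decay) * current.
    The default decay is an assumed, commonly used value.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy example: two scalar "weights" and one update step.
ema_weights = [0.0, 1.0]
current_weights = [1.0, 1.0]
ema_weights = ema_update(ema_weights, current_weights, decay=0.9)
```

A decay closer to 1 makes the averaged weights change more slowly, which is why EMA copies are often preferred at evaluation time.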

        ## Input/Output Specifications
        - RGB Static Camera: (B, T, 3, H, W) tensor
        - RGB Gripper Camera: (B, T, 3, H, W) tensor
        - Language Instructions: Text strings
        - Output: (B, T, 7) tensor representing 7-DoF actions
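
The shapes listed above can be checked before calling the model. The following is a small sketch that works on shape tuples only, so it needs no tensor library. The helper names `validate_camera_shapes` and `validate_action_shape` are hypothetical and not part of the MoDE codebase, and the sketch assumes only what the list states: five-dimensional camera inputs with three channels, and a trailing action dimension of 7.

```python
def validate_camera_shapes(static_shape, gripper_shape):
    """Check both camera inputs follow the (B, T, 3, H, W) layout.

    Returns the shared (B, T) prefix. H and W may differ between cameras.
    """
    for name, shape in (("rgb_static", static_shape), ("rgb_gripper", gripper_shape)):
        if len(shape) != 5:
            raise ValueError(f"{name}: expected 5 dimensions, got {len(shape)}")
        if shape[2] != 3:
            raise ValueError(f"{name}: expected 3 channels, got {shape[2]}")
    if tuple(static_shape[:2]) != tuple(gripper_shape[:2]):
        raise ValueError("cameras disagree on batch size or sequence length")
    return tuple(static_shape[:2])


def validate_action_shape(action_shape, batch_size):
    """Check an action tensor shape is (B, T, 7) for the given batch size."""
    if len(action_shape) != 3:
        raise ValueError(f"expected 3 dimensions, got {len(action_shape)}")
    if action_shape[0] != batch_size:
        raise ValueError("action batch size does not match the observations")
    if action_shape[2] != 7:
        raise ValueError(f"expected 7 action dimensions, got {action_shape[2]}")
    return True


# Example with made-up sizes: batch of 2, one frame per camera.
batch_and_steps = validate_camera_shapes((2, 1, 3, 224, 224), (2, 1, 3, 84, 84))
action_ok = validate_action_shape((2, 10, 7), batch_size=batch_and_steps[0])
```

The image resolutions and the action sequence length in the example are placeholders; use the values your setup actually produces.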

        ## Usage Example
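        ```python
        from huggingface_hub import hf_hub_download
        import torch

        # `model` is assumed to be an already constructed MoDE agent from the
        # MoDE codebase; building it is not covered in this card.
        # Replace {repo_name} with the repository id of this model.
        weights_path = hf_hub_download(repo_id="{repo_name}", filename="model_cleaned.safetensors")
        model.load_pretrained_parameters(weights_path)

        # Placeholder camera frames of shape (B, T, 3, H, W). The 224x224
        # resolution is an illustrative assumption; pass your real images.
        static_image = torch.zeros(1, 1, 3, 224, 224)
        gripper_image = torch.zeros(1, 1, 3, 224, 224)

        obs = {
            "rgb_obs": {
                "rgb_static": static_image,
                "rgb_gripper": gripper_image,
            }
        }
        goal = {"lang_text": "pick up the blue cube"}
        action = model.step(obs, goal)  # predicted 7-DoF action(s)
        ```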

        ## Training Configuration
        - Optimizer: AdamW
        - Learning Rate: {config.optimizer.learning_rate}
        - Weight Decay: {config.optimizer.transformer_weight_decay}
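
For readers unfamiliar with the optimizer named above, here is a minimal sketch of a single AdamW update on one scalar parameter, written in plain Python. It illustrates the decoupled weight decay that distinguishes AdamW from Adam with L2 regularization. The function name `adamw_step` and every hyperparameter value are illustrative assumptions; the actual learning rate and weight decay come from the training configuration.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdamW update and return (param, m, v).

    `step` counts from 1. Weight decay shrinks the parameter directly
    instead of being added to the gradient.
    """
    # Decoupled weight decay: independent of the gradient statistics.
    param = param * (1.0 - lr * weight_decay)
    # Running estimates of the first and second moments of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# One step on a toy parameter with made-up values.
new_param, m_state, v_state = adamw_step(
    param=1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=0.1, weight_decay=0.01
)
```

In practice the optimizer would be created through the training framework rather than implemented by hand; the sketch only shows what each configured value controls.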