---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
language:
- en
pipeline_tag: robotics
---
# MoDE (Mixture of Diffusion Experts) Model

This model implements a Mixture of Diffusion Experts architecture for robotic manipulation, combining transformer-based processing with expert routing and diffusion-based action prediction.

## Model Architecture
- Base Architecture: MoDE with custom Mixture of Experts Transformer
- Vision Encoder: {getattr(model_instance, 'resnet_type', 'ResNet')} with FiLM conditioning
- EMA: Enabled
- Action Window Size: {model_instance.act_window_size}
- Sampling Steps: {model_instance.num_sampling_steps}
- Sampler Type: {model_instance.sampler_type}

## Input/Output Specifications
- RGB Static Camera: (B, T, 3, H, W) tensor
- RGB Gripper Camera: (B, T, 3, H, W) tensor
- Language Instructions: Text strings
- Output: (B, T, 7) tensor representing 7-DoF actions

## Usage Example
```python
from huggingface_hub import hf_hub_download
import torch

# Download the released weights from the Hugging Face Hub.
weights_path = hf_hub_download(repo_id="{repo_name}", filename="model_cleaned.safetensors")

# `model` must already be an instantiated MoDE agent; this only loads its weights.
model.load_pretrained_parameters(weights_path)

# `static_image` and `gripper_image` are (B, T, 3, H, W) tensors.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image
    }
}
goal = {"lang_text": "pick up the blue cube"}

# Predict the next action(s) for the given observation and language goal.
action = model.step(obs, goal)
```

## Training Configuration
- Optimizer: AdamW
- Learning Rate: {config.optimizer.learning_rate}
- Weight Decay: {config.optimizer.transformer_weight_decay}
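
## Checking Input/Output Shapes

The shapes listed under Input/Output Specifications can be checked before and after calling `model.step`. The following is a minimal sketch, not part of the MoDE codebase: `check_observation` and `check_action` are hypothetical helpers, and the image sizes used in the example are placeholders. It assumes both cameras share the batch and time dimensions but may differ in resolution, and it only reads `.shape`, so it should work with NumPy arrays and torch tensors alike.

```python
import numpy as np


def check_observation(obs):
    """Check both camera streams are (B, T, 3, H, W); return (B, T)."""
    static = obs["rgb_obs"]["rgb_static"]
    gripper = obs["rgb_obs"]["rgb_gripper"]
    for name, img in (("rgb_static", static), ("rgb_gripper", gripper)):
        shape = tuple(img.shape)
        # Five dimensions with three colour channels in position 2.
        if len(shape) != 5 or shape[2] != 3:
            raise ValueError(f"{name}: expected (B, T, 3, H, W), got {shape}")
    # Assumption: both cameras share batch and time dimensions.
    if tuple(static.shape[:2]) != tuple(gripper.shape[:2]):
        raise ValueError("rgb_static and rgb_gripper disagree on (B, T)")
    return tuple(static.shape[:2])


def check_action(action, batch):
    """Check the action is (B, T, 7) with the expected batch size."""
    shape = tuple(action.shape)
    if len(shape) != 3 or shape[0] != batch or shape[2] != 7:
        raise ValueError(f"expected ({batch}, T, 7), got {shape}")


# Placeholder inputs: one sample, one frame, arbitrary resolutions.
obs = {
    "rgb_obs": {
        "rgb_static": np.zeros((1, 1, 3, 224, 224), dtype=np.float32),
        "rgb_gripper": np.zeros((1, 1, 3, 84, 84), dtype=np.float32),
    }
}
batch, steps = check_observation(obs)
```

After inference, `check_action(action, batch)` raises a `ValueError` if the returned tensor does not have seven action dimensions or the expected batch size.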