Model Card for 3D Diffuser Actor
3D Diffuser Actor is a robot manipulation policy that combines diffusion modeling with 3D scene representations. It is trained and evaluated on the RLBench and CALVIN simulation benchmarks. We release all code, checkpoints, and training details for these models.
Model Details
The models released are the following:
Benchmark | Embedding dimension | Diffusion timesteps |
---|---|---|
RLBench (PerAct) | 120 | 100 |
RLBench (GNFactor) | 120 | 100 |
CALVIN | 192 | 25 |
Model Description
- Developed by: Katerina Group at CMU
- Model type: a diffusion model conditioned on 3D scene representations
- License: The code and model are released under MIT License
- Contact: ngkanats@andrew.cmu.edu
Model Sources
- Project Page: https://3d-diffuser-actor.github.io
- Repository: https://github.com/nickgkan/3d_diffuser_actor.git
- Paper: Link
Uses
Input format
3D Diffuser Actor takes the following inputs:
- RGB observations: a tensor of shape (batch_size, num_cameras, 3, H, W). Pixel values are in the range [0, 1].
- Point cloud observations: a tensor of shape (batch_size, num_cameras, 3, H, W).
- Instruction encodings: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension C is set to 512.
- curr_gripper: a tensor of shape (batch_size, history_length, 7), where the last dimension holds the xyz action (3D) and the quaternion (4D).
- trajectory_mask: a tensor of shape (batch_size, trajectory_length), used only to indicate the length of each trajectory. To predict keyposes, set its shape to (batch_size, 1).
- gt_trajectory: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension holds the xyz action (3D) and the quaternion (4D). This input is only used during training.
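The shapes listed above can be collected into a small lookup for sanity-checking a batch before it reaches the model. This is a minimal sketch: `expected_input_shapes` and `find_shape_mismatches` are hypothetical helpers, not part of the released code base, and they work on plain shape tuples so no tensor library is needed. The keys reuse the argument names from the usage examples in this card.

```python
def expected_input_shapes(batch_size, num_cameras, height, width,
                          max_instruction_length, history_length,
                          trajectory_length, instruction_dim=512):
    """Return the documented shape of each model input (hypothetical helper)."""
    return {
        "rgb_obs": (batch_size, num_cameras, 3, height, width),
        "pcd_obs": (batch_size, num_cameras, 3, height, width),
        "instruction": (batch_size, max_instruction_length, instruction_dim),
        "curr_gripper": (batch_size, history_length, 7),
        "trajectory_mask": (batch_size, trajectory_length),
        "gt_trajectory": (batch_size, trajectory_length, 7),
    }


def find_shape_mismatches(actual_shapes, expected_shapes):
    """List (name, actual, expected) for every input whose shape differs."""
    mismatches = []
    for name, expected in expected_shapes.items():
        actual = actual_shapes.get(name)
        if actual != expected:
            mismatches.append((name, actual, expected))
    return mismatches
```

With PyTorch tensors, the actual shapes could be gathered with something like `{name: tuple(t.shape) for name, t in batch.items()}`, where `batch` is an assumed dictionary of input tensors.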
Output format
The model returns the diffusion loss when run_inference=False. When run_inference=True, it returns a pose trajectory of shape (batch_size, trajectory_length, 8).
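The predicted trajectory has 8 channels per step, one more than the 7-channel input poses. The sketch below shows one way to unpack a single predicted step. It rests on assumptions that this card does not confirm: that the first 7 channels follow the same xyz plus quaternion layout as the inputs, and that the eighth channel is an extra value (a gripper open/close state would be typical). `split_pose` and `split_trajectory` are hypothetical helpers.

```python
def split_pose(pose):
    """Split one 8-value pose into (xyz, quaternion, extra).

    Assumption: channels 0-2 are xyz and channels 3-6 are the quaternion,
    mirroring the 7-channel input layout. The meaning of channel 7 is not
    stated in this card.
    """
    if len(pose) != 8:
        raise ValueError(f"expected 8 values per pose, got {len(pose)}")
    xyz = list(pose[:3])
    quaternion = list(pose[3:7])
    extra = pose[7]
    return xyz, quaternion, extra


def split_trajectory(trajectory):
    """Apply split_pose to every step of a (trajectory_length, 8) nested list."""
    return [split_pose(step) for step in trajectory]
```

For a PyTorch output, one sample could be converted first with `trajectory[0].tolist()` and then passed to `split_trajectory`.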
Usage
For training, call 3D Diffuser Actor with run_inference=False:
> loss = model.forward(gt_trajectory,
                       trajectory_mask,
                       rgb_obs,
                       pcd_obs,
                       instruction,
                       curr_gripper,
                       run_inference=False)
For evaluation, call 3D Diffuser Actor with run_inference=True. The ground-truth trajectory is only used during training, so a placeholder tensor is passed in its place:
> fake_gt_trajectory = torch.full((1, trajectory_length, 7), 0).to(device)
> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.forward(fake_gt_trajectory,
                             trajectory_mask,
                             rgb_obs,
                             pcd_obs,
                             instruction,
                             curr_gripper,
                             run_inference=True)
Alternatively, call the compute_trajectory function, which does not take a ground-truth trajectory:
> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.compute_trajectory(trajectory_mask,
                                        rgb_obs,
                                        pcd_obs,
                                        instruction,
                                        curr_gripper)
Evaluation
Our model is trained and evaluated on RLBench simulation with the PerAct setup:
RLBench (PerAct) | 3D Diffuser Actor | RVT |
---|---|---|
average | 81.3 | 62.9 |
open drawer | 89.6 | 71.2 |
slide block | 97.6 | 81.6 |
sweep to dustpan | 84.0 | 72.0 |
meat off grill | 96.8 | 88.0 |
turn tap | 99.2 | 93.6 |
put in drawer | 96.0 | 88.0 |
close jar | 96.0 | 52.0 |
drag stick | 100.0 | 99.2 |
stack blocks | 68.3 | 28.8 |
screw bulbs | 82.4 | 48.0 |
put in safe | 97.6 | 91.2 |
place wine | 93.6 | 91.0 |
put in cupboard | 85.6 | 49.6 |
sort shape | 44.0 | 36.0 |
push buttons | 98.4 | 100.0 |
insert peg | 65.6 | 11.2 |
stack cups | 47.2 | 26.4 |
place cups | 24.0 | 4.0 |
Our model is trained and evaluated on RLBench simulation with the GNFactor setup:
RLBench (GNFactor) | 3D Diffuser Actor | GNFactor |
---|---|---|
average | 78.4 | 31.7 |
open drawer | 89.3 | 76.0 |
sweep to dustpan | 94.7 | 25.0 |
close jar | 82.7 | 25.3 |
meat off grill | 88.0 | 57.3 |
turn tap | 80.0 | 50.7 |
slide block | 92.0 | 20.0 |
put in drawer | 77.3 | 0.0 |
drag stick | 98.7 | 37.3 |
push buttons | 69.3 | 18.7 |
stack blocks | 12.0 | 4.0 |
Our model is trained and evaluated on CALVIN simulation (trained on environments A, B, and C and tested on environment D):
CALVIN | 3D Diffuser Actor | GR-1 | SuSIE |
---|---|---|---|
task 1 | 92.2 | 85.4 | 87.0 |
task 2 | 78.7 | 71.2 | 69.0 |
task 3 | 63.9 | 59.6 | 49.0 |
task 4 | 51.2 | 49.7 | 38.0 |
task 5 | 41.2 | 40.1 | 26.0 |
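CALVIN results are often summarized as an average sequence length, the expected number of tasks completed in a row. Assuming the rows above follow the usual CALVIN convention, where "task k" is the success rate (in percent) of completing at least k consecutive tasks, this summary is the sum of the five rates divided by 100. The sketch below derives it from the table; `average_sequence_length` is a hypothetical helper, and the derived numbers are a computation on the table, not separately reported results.

```python
def average_sequence_length(success_rates_percent):
    """Expected number of consecutive tasks completed.

    Uses the identity E[N] = sum over k of P(N >= k), with each
    probability given as a percentage.
    """
    return round(sum(success_rates_percent) / 100.0, 2)


# Per-step success rates copied from the CALVIN table above.
calvin_results = {
    "3D Diffuser Actor": [92.2, 78.7, 63.9, 51.2, 41.2],
    "GR-1": [85.4, 71.2, 59.6, 49.7, 40.1],
    "SuSIE": [87.0, 69.0, 49.0, 38.0, 26.0],
}

avg_len = {name: average_sequence_length(rates)
           for name, rates in calvin_results.items()}

for name, value in avg_len.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, the table corresponds to average lengths of about 3.27 for 3D Diffuser Actor, 3.06 for GR-1, and 2.69 for SuSIE.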
Citation
BibTeX:
@article{,
title={Action Diffusion with 3D Scene Representations},
author={Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
journal={Preprint},
year={2024}
}
Model Card Contact
For errors in this model card, contact Nikos or Tsung-Wei, {ngkanats, tsungwek} at andrew dot cmu dot edu.