Model Card for 3D Diffuser Actor

3D Diffuser Actor is a robot manipulation policy that combines diffusion modeling with 3D scene representations. It is trained and evaluated in the RLBench and CALVIN simulation benchmarks. We release all code, checkpoints, and details involved in training these models.

Model Details

The models released are the following:

Benchmark	Embedding dimension	Diffusion timesteps
RLBench (PerAct)	120	100
RLBench (GNFactor)	120	100
CALVIN	192	25

Model Description

Developed by: Katerina Group at CMU
Model type: a diffusion model with 3D scene representations
License: The code and model are released under the MIT License
Contact: ngkanats@andrew.cmu.edu

Model Sources

Project Page: https://3d-diffuser-actor.github.io
Repository: https://github.com/nickgkan/3d_diffuser_actor.git
Paper: Link

Uses

Input format

3D Diffuser Actor takes the following inputs:

RGB observations: a tensor of shape (batch_size, num_cameras, 3, H, W), with pixel values in the range [0, 1].
Point cloud observations: a tensor of shape (batch_size, num_cameras, 3, H, W).
Instruction encodings: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension C is set to 512.
curr_gripper: a tensor of shape (batch_size, history_length, 7), where the last dimension holds the xyz-action (3D) and quaternion (4D).
trajectory_mask: a tensor of shape (batch_size, trajectory_length), used only to indicate the length of each trajectory. To predict keyposes, set its shape to (batch_size, 1).
gt_trajectory: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension holds the xyz-action (3D) and quaternion (4D). This input is used only during training.
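
The shapes listed above can be collected in a small checker. This is a minimal sketch in plain Python, not part of the released code base: `expected_input_shapes` and `check_shapes` are hypothetical helper names, the dictionary keys reuse the argument names from the usage snippets in this card, and the camera count, image size, and sequence lengths in the example call are illustrative values only.

```python
def expected_input_shapes(batch_size, num_cameras, height, width,
                          max_instruction_length, history_length,
                          trajectory_length, instruction_dim=512):
    """Return the input shapes described in this model card, keyed by argument name."""
    return {
        "rgb_obs": (batch_size, num_cameras, 3, height, width),
        "pcd_obs": (batch_size, num_cameras, 3, height, width),
        "instruction": (batch_size, max_instruction_length, instruction_dim),
        "curr_gripper": (batch_size, history_length, 7),
        "trajectory_mask": (batch_size, trajectory_length),
        "gt_trajectory": (batch_size, trajectory_length, 7),
    }


def check_shapes(actual, expected):
    """Return the names whose shape is missing or differs from the expected shape."""
    return [name for name, shape in expected.items()
            if tuple(actual.get(name, ())) != shape]


# Keypose prediction: trajectory_length is 1.
# All other sizes below are illustrative assumptions, not values from the card.
shapes = expected_input_shapes(batch_size=2, num_cameras=4, height=256, width=256,
                               max_instruction_length=53, history_length=3,
                               trajectory_length=1)
```

With PyTorch tensors, a dictionary such as `{name: t.shape for name, t in batch.items()}` can be passed as `actual`, since `torch.Size` converts cleanly to a tuple.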

Output format

The model returns the diffusion loss when run_inference=False. When run_inference=True, it returns a pose trajectory of shape (batch_size, trajectory_length, 8).
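
The card does not say what the eight output channels contain. The sketch below assumes that the first seven follow the same layout as the inputs (xyz followed by a quaternion) and that the eighth is a gripper open/close state; both points are assumptions to verify against the repository. `split_action` is a hypothetical helper, and normalizing the quaternion is a common precaution rather than something this card prescribes.

```python
import math


def split_action(action):
    """Split one 8-value action into (xyz, unit quaternion, extra channel)."""
    if len(action) != 8:
        raise ValueError("expected 8 values per action")
    xyz = list(action[:3])
    quat = list(action[3:7])
    # Rescale the quaternion to unit length so it represents a valid rotation.
    norm = math.sqrt(sum(q * q for q in quat))
    if norm == 0.0:
        raise ValueError("quaternion has zero norm")
    quat = [q / norm for q in quat]
    extra = action[7]  # ASSUMPTION: gripper open/close state
    return xyz, quat, extra


# One predicted step, written as a plain list for illustration.
xyz, quat, extra = split_action([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 2.0, 1.0])
```

For a tensor output, the same split could be applied per time step after converting a step to a list, for example with `trajectory[0, t].tolist()`.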

Usage

For training, run the forward pass of 3D Diffuser Actor with run_inference=False:

> loss = model.forward(gt_trajectory,
                       trajectory_mask,
                       rgb_obs,
                       pcd_obs,
                       instruction,
                       curr_gripper,
                       run_inference=False)

For evaluation, run the forward pass of 3D Diffuser Actor with run_inference=True:

> fake_gt_trajectory = torch.full((1, trajectory_length, 7), 0).to(device)
> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.forward(fake_gt_trajectory,
                             trajectory_mask,
                             rgb_obs,
                             pcd_obs,
                             instruction,
                             curr_gripper,
                             run_inference=True)

Alternatively, call the compute_trajectory function directly:

> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.compute_trajectory(trajectory_mask,
                                        rgb_obs,
                                        pcd_obs,
                                        instruction,
                                        curr_gripper)

Evaluation

Results of our model trained and evaluated in RLBench simulation with the PerAct setup (success rate, %):

RLBench (PerAct)	3D Diffuser Actor	RVT
average	81.3	62.9
open drawer	89.6	71.2
slide block	97.6	81.6
sweep to dustpan	84.0	72.0
meat off grill	96.8	88.0
turn tap	99.2	93.6
put in drawer	96.0	88.0
close jar	96.0	52.0
drag stick	100.0	99.2
stack blocks	68.3	28.8
screw bulbs	82.4	48.0
put in safe	97.6	91.2
place wine	93.6	91.0
put in cupboard	85.6	49.6
sort shape	44.0	36.0
push buttons	98.4	100.0
insert peg	65.6	11.2
stack cups	47.2	26.4
place cups	24.0	4.0

Results of our model trained and evaluated in RLBench simulation with the GNFactor setup (success rate, %):

RLBench (GNFactor)	3D Diffuser Actor	GNFactor
average	78.4	31.7
open drawer	89.3	76.0
sweep to dustpan	94.7	25.0
close jar	82.7	25.3
meat off grill	88.0	57.3
turn tap	80.0	50.7
slide block	92.0	20.0
put in drawer	77.3	0.0
drag stick	98.7	37.3
push buttons	69.3	18.7
stack blocks	12.0	4.0

Results of our model trained and evaluated in CALVIN simulation (trained on environments A, B, and C; tested on D):

CALVIN	3D Diffuser Actor	GR-1	SuSIE
task 1	92.2	85.4	87.0
task 2	78.7	71.2	69.0
task 3	63.9	59.6	49.0
task 4	51.2	49.7	38.0
task 5	41.2	40.1	26.0
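
The CALVIN rows above can be condensed into a single number per method. This sketch assumes, as is usual for CALVIN but not stated in this card, that row "task k" reports the percentage of evaluation sequences in which at least k instructions were completed in a row. Under that assumption, the expected number of consecutively completed tasks is the sum of the five rates divided by 100. `average_chain_length` is a hypothetical helper name.

```python
def average_chain_length(success_rates_percent):
    """Expected number of consecutively completed tasks.

    Assumes entry k is the percentage of rollouts that completed
    at least k tasks in a row (ASSUMPTION about the table's meaning).
    """
    return sum(success_rates_percent) / 100.0


# Per-method rates copied from the CALVIN table in this card.
results = {
    "3D Diffuser Actor": [92.2, 78.7, 63.9, 51.2, 41.2],
    "GR-1": [85.4, 71.2, 59.6, 49.7, 40.1],
    "SuSIE": [87.0, 69.0, 49.0, 38.0, 26.0],
}
for name, rates in results.items():
    print(f"{name}: {average_chain_length(rates):.2f}")
```

For the table above this gives 3.27 for 3D Diffuser Actor, 3.06 for GR-1, and 2.69 for SuSIE.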

Citation

BibTeX:

@article{,
  title={Action Diffusion with 3D Scene Representations},
  author={Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  journal={Preprint},
  year={2024}
}

Model Card Contact

For errors in this model card, contact Nikos or Tsung-Wei, {ngkanats, tsungwek} at andrew dot cmu dot edu.