---
license: cc-by-nc-4.0
---
# Model Card: VC-1 (Visual Cortex ViT-Large)

Last updated: 2023-03-28

Version: 1.0

Code: https://github.com/facebookresearch/eai-vc

Other links: VC-1 Website, VC-1 Blogpost, VC-1 Paper, VC-1 Demo

The VC-1 model is a vision transformer (ViT) pre-trained on over 4,000 hours of egocentric video from 7 different sources, together with ImageNet. The model is trained using Masked Auto-Encoding (MAE) and is available in two sizes: ViT-B and ViT-L. It is intended for Embodied AI tasks such as object manipulation and indoor navigation.
## Model Details

- Model name: VC-1 (Vision Transformer-based model)
- Architecture:
  - Patch size: 16x16
  - Embedding dimension: 768
  - Number of layers: 12
  - Number of heads: 12
  - MLP ratio: 4
  - QKV bias: True
  - Layer normalization: eps=1e-6
- Inputs: images of shape 224x224x3
- Outputs: 768x1 embedding
- Image size: 224
- Use of classification token: True
- Dropout rate: 0.0
- Algorithm: MAE
- Epochs trained: 182
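The token geometry implied by the hyperparameters above can be sketched in a few lines of PyTorch. This is an illustrative module, not the actual VC-1 code: a 224x224 image cut into 16x16 patches yields (224/16)² = 196 tokens, each projected to a 768-dim embedding, plus one classification token.

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Hypothetical sketch of ViT patch embedding with the card's hyperparameters."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution cuts the image into non-overlapping patches
        # and projects each one to the embedding dimension.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # B x 768 x 14 x 14
        x = x.flatten(2).transpose(1, 2)  # B x 196 x 768
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)  # B x 197 x 768 (patches + [CLS])

tokens = PatchEmbedSketch()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```

The 768x1 output embedding listed above corresponds to the per-token width; downstream use typically takes the classification token as the image representation.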
Model authors: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier.

Contact: Oleksandr Maksymets (FAIR)

## Citation

If you use this model, please cite:
```bibtex
@inproceedings{majumdar2023vc1,
  title     = {Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
  author    = {Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
  publisher = {arXiv},
  year      = {2023}
}
```
## Model Data

Training data: The VC-1 model was trained on a large-scale dataset of egocentric videos consisting of over 5.6 million frames. The dataset covers three modalities: manipulation, navigation, and object recognition. The manipulation modality includes videos of people performing manipulation tasks such as cooking, cleaning, and tool use. The navigation modality includes videos of people moving through indoor environments such as homes and offices. The object recognition modality consists of images from the ImageNet dataset, which contains over 1.2 million images of objects in various categories.

The table below summarizes the datasets assembled for the scaling-hypothesis experiments, including the total number of frames in each source dataset and the number of frames used:
| Dataset | Contains | Total Frames | Frames Used |
|---|---|---:|---:|
| Ego4D | Ego4D | 418,578,043 | 2,790,520 |
| EgoM (Manipulation) | Ego4D | 418,578,043 | 2,790,520 |
| | 100DOH | 99,899 | 99,899 |
| | SS-v2 | 25,209,271 | 315,115 |
| | Epic Kitchens | 19,965,439 | 332,757 |
| | Total | | 3,538,291 |
| EgoO (OpenHouse24) | Ego4D | 418,578,043 | 2,790,520 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | Total | | 3,289,962 |
| EgoN (Navigation) | Ego4D | 418,578,043 | 2,790,520 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | RealEstate10K | 10,000,000 | 303,087 |
| | Total | | 3,289,962 |
| EgoMN (Manipulation, Navigation) | Ego4D+M | 3,538,291 | 3,538,291 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | RealEstate10K | 10,000,000 | 303,087 |
| | Total | | 4,340,820 |
| EgoMNI (Manipulation, Navigation, ImageNet) | Ego4D+MN | 4,340,820 | 4,340,820 |
| | ImageNet | 1,281,167 | 1,281,167 |
| | Total | | 5,621,987 |

The VC-1 models were trained on the assembled EgoMNI (Manipulation, Navigation, ImageNet) dataset.
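As a quick arithmetic check on how the assembled totals compose (a reading of the table, not a statement from the card), the EgoMNI frame count is the Ego4D+MN assembled frames plus the ImageNet frames:

```python
# EgoMNI frames used = Ego4D+MN assembled frames + ImageNet frames,
# using the "Frames Used" values from the table above.
ego4d_mn_frames = 4_340_820
imagenet_frames = 1_281_167
egomni_total = ego4d_mn_frames + imagenet_frames
print(egomni_total)  # 5621987
```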
Evaluation data (see also the Performance section below): The model was evaluated on CortexBench, which comprises 17 tasks drawn from the 7 benchmarks listed below:

| Benchmark | Tasks |
|---|---|
| Adroit | Relocate, Reorient-Pen |
| MetaWorld | Assembly, Bin-Picking, Button-Press, Drawer-Open, Hammer |
| DeepMind Control | Finger-Spin, Reacher-Hard, Cheetah-Run, Walker-Stand, Walker-Walk |
| TriFinger | Reach-Cube, Push-Cube |
| Habitat | Image-Goal Navigation (ImageNav), Object-Goal Navigation (ObjectNav) |
| Habitat 2.0 | Mobile Pick |
## Model Creation & Maintenance

The VC-1 model was created by pre-training ViT-B and ViT-L on a combination of egocentric videos and ImageNet using Masked Auto-Encoding (MAE). The model is maintained by the authors and is available for open-source use.
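The core of MAE pre-training is randomly masking most patch tokens and encoding only the visible remainder. The sketch below shows MAE-style random masking; the 75% mask ratio is the common MAE default, an assumption here since the card does not state the ratio used for VC-1.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style random masking sketch: keep a random subset of patch tokens.

    tokens: B x N x D patch embeddings. With mask_ratio=0.75 (the MAE
    default, assumed here), only 25% of the patches reach the encoder.
    """
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # ascending: lowest scores kept
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_keep

tokens = torch.randn(2, 196, 768)   # 196 patches for a 224x224 image
kept, ids = random_masking(tokens)
print(kept.shape)  # torch.Size([2, 49, 768])
```

A decoder then reconstructs the masked patches from the kept tokens plus mask tokens; only the encoder is retained as the downstream representation.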
## Model Usage

The VC-1 model is intended for Embodied AI tasks such as object manipulation and indoor navigation. The model outputs an embedding for each image frame, which can be used as features for downstream tasks:

```python
from vc_models.models.vit import model_utils

model, embd_size, model_transforms, model_info = model_utils.load_model(model_utils.VC1_BASE_NAME)

# The loaded image batch should be Bx3x250x250
img = your_function_here(...)

# transformed_img will be Bx3x224x224
transformed_img = model_transforms(img)
# embedding will be Bx768
embedding = model(transformed_img)
```
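One common way to use the frozen embedding as downstream features, as the card suggests, is to attach a small task head on top of it. The head below is purely illustrative (its sizes and the 7-dim action space are assumptions, not part of VC-1):

```python
import torch
import torch.nn as nn

# Hypothetical downstream head on top of a frozen VC-1 embedding:
# a small MLP mapping the 768-dim feature to task outputs
# (7 action dimensions chosen arbitrarily for illustration).
embedding_dim, num_actions = 768, 7
policy_head = nn.Sequential(
    nn.Linear(embedding_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_actions),
)

embedding = torch.randn(1, embedding_dim)  # stand-in for a VC-1 embedding
action_logits = policy_head(embedding)
print(action_logits.shape)  # torch.Size([1, 7])
```

During downstream training only the head (and optionally the backbone) is updated; the CortexBench evaluations below probe the representation through heads of roughly this kind.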
## Performance

Performance of the models on CortexBench:

| Model | Adroit | Meta-World | DMControl | TriFinger | ObjectNav | ImageNav | Mobile Pick | Mean Rank | Mean Success |
|---|---|---|---|---|---|---|---|---|---|
| Ego4D (ViT-B) | 48.7 ± 1.3 | 86.1 ± 2.1 | 64.1 ± 2.3 | 68.3 ± 1.1 | 46.8 ± 1.1 | 64.0 ± 0.7 | 57.4 ± 2.2 | 8.6 | 62.2 |
| Ego4D (ViT-L) | 50.0 ± 1.2 | 92.9 ± 2.4 | 60.8 ± 3.3 | 69.7 ± 0.5 | 47.6 ± 1.1 | 55.8 ± 0.8 | 67.6 ± 2.1 | 5.9 | 63.5 |
| Ego4D+N (ViT-B) | 50.0 ± 2.4 | 86.4 ± 2.9 | 59.5 ± 2.4 | 67.8 ± 1.3 | 54.7 ± 1.1 | 68.7 ± 0.7 | 59.4 ± 2.2 | 7.2 | 63.8 |
| Ego4D+N (ViT-L) | 54.0 ± 1.2 | 89.1 ± 2.9 | 66.4 ± 1.7 | 66.9 ± 0.4 | 57.4 ± 1.1 | 70.5 ± 0.7 | 65.2 ± 2.1 | 3.5 | 67.1 |
| Ego4D+M (ViT-B) | 51.3 ± 2.4 | 83.5 ± 2.6 | 64.3 ± 1.8 | 69.1 ± 0.4 | 47.3 ± 1.1 | 65.8 ± 0.7 | 59.8 ± 2.2 | 7.0 | 63.0 |
| Ego4D+M (ViT-L) | 52.0 ± 1.3 | 88.3 ± 3.2 | 64.7 ± 2.4 | 64.7 ± 0.9 | 47.3 ± 1.1 | 65.5 ± 0.7 | 68.6 ± 2.1 | 6.0 | 64.4 |
| VC-1: Ego4D+MN (ViT-B) | 48.7 ± 2.4 | 85.3 ± 5.2 | 64.2 ± 1.9 | 70.3 ± 0.5 | 52.8 ± 1.1 | 68.9 ± 0.7 | 58.6 ± 2.2 | 6.9 | 64.1 |
| VC-1: Ego4D+MNI (ViT-L) | 59.3 ± 5.2 | 88.8 ± 2.2 | 66.9 ± 1.4 | 71.7 ± 0.4 | 60.3 ± 1.1 | 70.3 ± 0.7 | 63.2 ± 2.2 | 2.4 | 68.7 |
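The Mean Success column is consistent with an unweighted mean of the seven per-benchmark scores (a reading of the table, not a definition stated in the card), e.g. for the Ego4D (ViT-B) row:

```python
# Unweighted mean of the seven per-benchmark scores for Ego4D (ViT-B),
# taken from the table above; matches the reported Mean Success of 62.2.
ego4d_vitb = [48.7, 86.1, 64.1, 68.3, 46.8, 64.0, 57.4]
mean_success = sum(ego4d_vitb) / len(ego4d_vitb)
print(round(mean_success, 1))  # 62.2
```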
## Limitations

The VC-1 model has been evaluated on a limited set of benchmarks and may not perform as well on other tasks. While we have focused on masked auto-encoders as the pre-training objective and ViT as the architecture in our study, other SSL algorithms may exhibit different scaling behaviors or superior performance on the datasets in our benchmark.

Additionally, the VC-1 model is computationally expensive to train and may not be practical for all use cases. The large size of the model may also pose challenges for deployment on resource-constrained devices.

It is important to note that although we utilize real-world images and videos for pre-training our visual representation models (PVRs), the evaluation benchmarks used in this study serve as proxies for actual robotic tasks. Therefore, the performance of the PVR models on real robots may differ from the rankings established in this study. Further research is necessary to fully evaluate the effectiveness of these models in real-world scenarios.

Finally, while we have made efforts to ensure fairness and avoid bias in our benchmark selection, it is possible that certain demographics or use cases may not be adequately represented in our evaluation tasks. Future work could explore additional benchmarks that address a wider range of scenarios and demographics.