Genie is a new method from Google DeepMind that generates interactive, action-controllable virtual worlds and is trained on unlabelled internet videos using unsupervised learning.
Key points:
* Genie combines a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model to generate controllable video environments.
* The model is trained on video data alone, with no action labels; it uses unsupervised learning to infer a latent action between each pair of consecutive frames.
* The latent action vocabulary is restricted to 8 discrete actions, which keeps the set of possible latent actions small.
* The training dataset is built by filtering publicly available internet videos with criteria specific to 2D platformer games, yielding 6.8M videos.
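
To make the idea of a small discrete action vocabulary concrete, here is a minimal sketch of how a continuous latent could be snapped to one of 8 codes by nearest-neighbour lookup in a codebook. It is an illustration under stated assumptions, not Genie's implementation: the names `make_codebook`, `quantize_action`, and `infer_latent_actions`, the embedding width `LATENT_DIM`, the random codebook, and the use of a plain frame-embedding difference in place of a learned encoder are all hypothetical.

```python
import random

NUM_ACTIONS = 8  # size of the latent action vocabulary described above
LATENT_DIM = 4   # hypothetical embedding width, chosen only for illustration


def make_codebook(num_codes, dim, seed=0):
    """Create a toy codebook of randomly initialised code vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_action(latent, codebook):
    """Return the index of the codebook entry closest to `latent`."""
    return min(range(len(codebook)), key=lambda i: squared_distance(latent, codebook[i]))


def infer_latent_actions(frame_embeddings, codebook):
    """Assign one discrete latent action to each pair of consecutive frames.

    Assumption: the continuous 'action' is simply the difference between
    consecutive frame embeddings. A trained model would use a learned encoder.
    """
    actions = []
    for prev, nxt in zip(frame_embeddings, frame_embeddings[1:]):
        delta = [n - p for p, n in zip(prev, nxt)]
        actions.append(quantize_action(delta, codebook))
    return actions
```

Calling `infer_latent_actions` on a sequence of T frame embeddings returns T - 1 integers in the range 0 to 7, one per frame transition. No action labels are involved at any point, which mirrors the label-free setup described in the key points.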