Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
Abstract
In the field of text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, BERT-type models often lack the streaming output capability required for video game and multimedia applications, a feature inherent to GPT-type models, and they demonstrate weaker performance in out-of-distribution generation. To surpass the quality of BERT-type models while retaining a GPT-type structure, and without adding extra refinement models that complicate data scaling, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality, lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) an RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; 2) a Hierarchical Causal Transformer, responsible for generating the base motion sequence autoregressively while simultaneously inferring the residuals across different layers. Experimental results demonstrate that Mogo can generate continuous and cyclic motion sequences of up to 260 frames (13 seconds), surpassing the 196-frame (10-second) length limitation of existing datasets such as HumanML3D. On the HumanML3D test set, Mogo achieves an FID score of 0.079, outperforming the GPT-type models T2M-GPT (FID = 0.116) and AttT2M (FID = 0.112) as well as the BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best quantitative performance in out-of-distribution generation.
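The abstract's first component, the RVQ-VAE, discretizes continuous motion latents by quantizing successive residuals against a stack of codebooks, producing one token sequence per layer (a base layer plus progressively finer residual layers) for the transformer to predict. Below is a minimal PyTorch sketch of that residual-quantization idea under stated assumptions; the class name, layer count, codebook size, and latent dimension are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of residual vector quantization (RVQ), the idea behind an
# RVQ-VAE's quantizer. Layer count, codebook size, and dimensions are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn


class ResidualVQ(nn.Module):
    def __init__(self, num_layers: int = 6, codebook_size: int = 512, dim: int = 512):
        super().__init__()
        # One codebook per quantization layer; each layer quantizes the
        # residual left over by the layers before it.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, dim) continuous motion latents from the encoder.
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # Nearest-neighbour lookup of the current residual in this codebook.
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, T, K)
            idx = dists.argmin(dim=-1)                                          # (B, T)
            selected = codebook(idx)                                            # (B, T, dim)
            quantized = quantized + selected
            residual = residual - selected
            codes.append(idx)
        # `codes` holds one token sequence per layer: the base layer plus the
        # finer residual layers that a hierarchical transformer would predict.
        return quantized, torch.stack(codes, dim=0)


# Usage example: quantize a batch of 196-frame motion latents.
rvq = ResidualVQ()
latents = torch.randn(2, 196, 512)
recon, codes = rvq(latents)
print(recon.shape, codes.shape)  # torch.Size([2, 196, 512]) torch.Size([6, 2, 196])
```

Each successive layer only has to encode what the previous layers missed, which is what lets the quantizer discretize motion with high precision while keeping each individual codebook small.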
Community
Mogo is a GPT-type model for high-quality 3D human motion generation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoMA: Compositional Human Motion Generation with Multi-modal Agents (2024)
- Rethinking Diffusion for Text-Driven Human Motion Generation (2024)
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation (2024)
- AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward (2024)
- BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis (2024)
- Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation (2024)
- SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE (2024)