
	---
	thumbnail: "https://motionbert.github.io/assets/teaser.gif"
	tags:
	- 3D Human Pose Estimation
	- Skeleton-based Action Recognition
	- Mesh Recovery
	arxiv: "2210.06551"
	---

	# MotionBERT

	This is the official PyTorch implementation of the paper "[Learning Human Motion Representations: A Unified Perspective](https://arxiv.org/pdf/2210.06551.pdf)".

	<img src="https://motionbert.github.io/assets/teaser.gif" alt="" style="zoom: 60%;" />

	## Installation

	```bash
	conda create -n motionbert python=3.7 anaconda
	conda activate motionbert
	# Please install PyTorch according to your CUDA version.
	conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
	pip install -r requirements.txt
	```
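
After installation, a quick sanity check can confirm that the expected packages are visible from the active environment. The sketch below uses only the standard library. `check_environment` is a hypothetical helper (not part of this repository), and the default package list is an assumption taken from the install command above.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to True if it can be found."""
    # find_spec locates a top-level package without importing it.
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the activated `motionbert` environment. A `MISSING` entry usually means the corresponding install step was skipped or ran in a different environment.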

	## Getting Started

	| Task | Document |
	| --------------------------------- | ------------------------------------------------------------ |
	| Pretrain | [docs/pretrain.md](docs/pretrain.md) |
	| 3D human pose estimation | [docs/pose3d.md](docs/pose3d.md) |
	| Skeleton-based action recognition | [docs/action.md](docs/action.md) |
	| Mesh recovery | [docs/mesh.md](docs/mesh.md) |



	## Applications

	### In-the-wild inference (for custom videos)

	Please refer to [docs/inference.md](docs/inference.md).

	### Using MotionBERT for human-centric video representations

	```python
	# x: 2D skeletons
	#    type:  torch.Tensor
	#    shape: [batch_size, frames, joints (17), channels (3)]
	#
	# MotionBERT: pretrained human motion encoder
	#    type:  lib.model.DSTformer.DSTformer
	#
	# E: encoded motion representation
	#    type:  torch.Tensor
	#    shape: [batch_size, frames, joints (17), channels (512)]
	E = MotionBERT.get_representation(x)
	```
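
Because the encoder accepts at most 243 frames per call (see the hints in this section), a longer sequence has to be cut into clips along the frame axis before encoding. The following is a minimal sketch of one way to compute the clip boundaries, assuming simple non-overlapping windows. `split_into_clips` and `MAX_FRAMES` are hypothetical names for illustration and are not part of the repository.

```python
MAX_FRAMES = 243  # upper bound on input length stated in this README


def split_into_clips(num_frames, max_frames=MAX_FRAMES):
    """Return (start, end) frame ranges, end exclusive, covering the sequence in order."""
    if num_frames < 0 or max_frames <= 0:
        raise ValueError("num_frames must be >= 0 and max_frames must be > 0")
    # Step through the sequence in windows of max_frames; the last window may be shorter.
    return [
        (start, min(start + max_frames, num_frames))
        for start in range(0, num_frames, max_frames)
    ]


if __name__ == "__main__":
    for start, end in split_into_clips(500):
        print(f"frames {start}..{end - 1} ({end - start} frames)")
```

Each `(start, end)` pair can then be used to slice the frame axis, for example `x[:, start:end]`, and the per-clip outputs concatenated along the same axis. Non-overlapping windows are only one option: each clip is encoded independently, so no temporal context is shared across clip boundaries.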



	> Hints
	>
	> 1. The model can handle inputs of varying length (up to 243 frames). The input length does not need to be specified anywhere else.
	> 2. The model uses 17 body keypoints ([H36M format](https://github.com/JimmySuen/integral-human-pose/blob/master/pytorch_projects/common_pytorch/dataset/hm36.py#L32)). If your keypoints are in another format, convert them before feeding them to MotionBERT.
	> 3. Please refer to [model_action.py](lib/model/model_action.py) and [model_mesh.py](lib/model/model_mesh.py) for examples of (easily) adapting MotionBERT to different downstream tasks.
	> 4. For RGB videos, you need to extract 2D poses ([inference.md](docs/inference.md)), convert the keypoint format ([dataset_wild.py](lib/data/dataset_wild.py)), and then feed them to MotionBERT ([infer_wild.py](infer_wild.py)).
	>
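
To illustrate the keypoint conversion mentioned in the hints, here is a rough sketch that maps one frame of COCO-style keypoints to an H36M-style 17-joint layout. It is an approximation for illustration only: the joint orders listed in the comments are assumptions, joints that COCO lacks (pelvis, spine, thorax, head) are synthesized as midpoints, and `coco_to_h36m` is a hypothetical helper. The conversion the repository actually uses is in `lib/data/dataset_wild.py`.

```python
def midpoint(a, b):
    """Element-wise mean of two keypoints given as equal-length tuples."""
    return tuple((p + q) / 2 for p, q in zip(a, b))


def coco_to_h36m(kpts):
    """Map 17 COCO keypoints (one frame) to an assumed H36M 17-joint order.

    Assumed COCO order: nose, left/right eye, left/right ear,
    left/right shoulder, left/right elbow, left/right wrist,
    left/right hip, left/right knee, left/right ankle.
    """
    if len(kpts) != 17:
        raise ValueError("expected 17 COCO keypoints")
    k = [tuple(p) for p in kpts]
    pelvis = midpoint(k[11], k[12])   # between the two hips
    thorax = midpoint(k[5], k[6])     # between the two shoulders
    return [
        pelvis,                       # 0  pelvis (synthesized)
        k[12], k[14], k[16],          # 1-3   right hip, knee, ankle
        k[11], k[13], k[15],          # 4-6   left hip, knee, ankle
        midpoint(pelvis, thorax),     # 7  spine (synthesized)
        thorax,                       # 8  thorax (synthesized)
        k[0],                         # 9  neck/nose, taken from the COCO nose
        midpoint(k[1], k[2]),         # 10 head, approximated by the eye midpoint
        k[5], k[7], k[9],             # 11-13 left shoulder, elbow, wrist
        k[6], k[8], k[10],            # 14-16 right shoulder, elbow, wrist
    ]
```

Each keypoint here is a tuple such as `(x, y, score)`, so the midpoints also average the last channel. Applying the function to every frame and stacking the results yields the `[frames, 17, 3]` layout for a single sequence.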



	## Model Zoo

	<img src="https://motionbert.github.io/assets/demo.gif" alt="" style="zoom: 50%;" />

	| Model | Download Link | Config | Performance |
	| ------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ---------------- |
	| MotionBERT (162MB) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pretrain/MB_release/latest_epoch.bin) | [pretrain/MB_pretrain.yaml](configs/pretrain/MB_pretrain.yaml) | - |
	| MotionBERT-Lite (61MB) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pretrain/MB_lite/latest_epoch.bin) | [pretrain/MB_lite.yaml](configs/pretrain/MB_lite.yaml) | - |
	| 3D Pose (H36M-SH, scratch) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pose3d/MB_train_h36m/best_epoch.bin) | [pose3d/MB_train_h36m.yaml](configs/pose3d/MB_train_h36m.yaml) | 39.2mm (MPJPE) |
	| 3D Pose (H36M-SH, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pose3d/FT_MB_release_MB_ft_h36m/best_epoch.bin) | [pose3d/MB_ft_h36m.yaml](configs/pose3d/MB_ft_h36m.yaml) | 37.2mm (MPJPE) |
	| Action Recognition (x-sub, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/action/FT_MB_release_MB_ft_NTU60_xsub/best_epoch.bin) | [action/MB_ft_NTU60_xsub.yaml](configs/action/MB_ft_NTU60_xsub.yaml) | 97.2% (Top1 Acc) |
	| Action Recognition (x-view, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/action/FT_MB_release_MB_ft_NTU60_xview/best_epoch.bin) | [action/MB_ft_NTU60_xview.yaml](configs/action/MB_ft_NTU60_xview.yaml) | 93.0% (Top1 Acc) |
	| Mesh (with 3DPW, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/mesh/FT_MB_release_MB_ft_pw3d/best_epoch.bin) | [mesh/MB_ft_pw3d.yaml](configs/mesh/MB_ft_pw3d.yaml) | 88.1mm (MPVE) |

	In most use cases (especially with finetuning), `MotionBERT-Lite` gives similar performance at a lower computational cost.
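
The download links in the table point to the Hugging Face file viewer (`/blob/` URLs), which serves an HTML page, not the checkpoint itself. For scripted downloads, the same path with `/resolve/` in place of `/blob/` returns the raw file. The helper below is a small sketch of that rewrite; `blob_to_resolve` is a hypothetical name and not part of the repository.

```python
def blob_to_resolve(url):
    """Turn a Hugging Face file-viewer URL (/blob/) into a direct-download URL (/resolve/)."""
    marker = "/blob/"
    if marker not in url:
        raise ValueError("not a Hugging Face blob URL: " + url)
    # Replace only the first occurrence so a later "/blob/" in the file path is untouched.
    return url.replace(marker, "/resolve/", 1)


if __name__ == "__main__":
    viewer_url = (
        "https://huggingface.co/walterzhu/MotionBERT/blob/main/"
        "checkpoint/pretrain/MB_lite/latest_epoch.bin"
    )
    print(blob_to_resolve(viewer_url))
```

The resulting URL can be passed to any HTTP client, such as `wget` or `urllib.request.urlretrieve`. Where to place the downloaded file depends on the task, so check the corresponding document under `docs/`.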



	## TODO

	- [x] Scripts and docs for pretraining

	- [x] Demo for custom videos



	## Citation

	If you find our work useful for your project, please consider citing the paper:

	```bibtex
	@article{motionbert2022,
	title = {Learning Human Motion Representations: A Unified Perspective},
	author = {Zhu, Wentao and Ma, Xiaoxuan and Liu, Zhaoyang and Liu, Libin and Wu, Wayne and Wang, Yizhou},
	year = {2022},
	journal = {arXiv preprint arXiv:2210.06551},
	}
	```