Spaces:

kadirnar
/

Open-Sora

Runtime error

App Files Files Community

Open-Sora / docs /structure.md

kadirnar

Upload 98 files

e7d5680 verified 8 months ago

preview code

raw

history blame

7.71 kB

	# Repo & Config Structure

	## Repo Structure

	```plaintext
	Open-Sora
	├── README.md
	├── docs
	│ ├── acceleration.md -> Acceleration & Speed benchmark
	│ ├── command.md -> Commands for training & inference
	│ ├── datasets.md -> Datasets used in this project
	│ ├── structure.md -> This file
	│ └── report_v1.md -> Report for Open-Sora v1
	├── scripts
	│ ├── train.py -> diffusion training script
	│ └── inference.py -> Report for Open-Sora v1
	├── configs -> Configs for training & inference
	├── opensora
	│ ├── __init__.py
	│ ├── registry.py -> Registry helper
	│ ├── acceleration -> Acceleration related code
	│ ├── dataset -> Dataset related code
	│ ├── models
	│ │ ├── layers -> Common layers
	│ │ ├── vae -> VAE as image encoder
	│ │ ├── text_encoder -> Text encoder
	│ │ │ ├── classes.py -> Class id encoder (inference only)
	│ │ │ ├── clip.py -> CLIP encoder
	│ │ │ └── t5.py -> T5 encoder
	│ │ ├── dit
	│ │ ├── latte
	│ │ ├── pixart
	│ │ └── stdit -> Our STDiT related code
	│ ├── schedulers -> Diffusion shedulers
	│ │ ├── iddpm -> IDDPM for training and inference
	│ │ └── dpms -> DPM-Solver for fast inference
	│ └── utils
	└── tools -> Tools for data processing and more
	```

	## Configs

	Our config files follows [MMEgine](https://github.com/open-mmlab/mmengine). MMEngine will reads the config file (a `.py` file) and parse it into a dictionary-like object.

	```plaintext
	Open-Sora
	└── configs -> Configs for training & inference
	├── opensora -> STDiT related configs
	│ ├── inference
	│ │ ├── 16x256x256.py -> Sample videos 16 frames 256x256
	│ │ ├── 16x512x512.py -> Sample videos 16 frames 512x512
	│ │ └── 64x512x512.py -> Sample videos 64 frames 512x512
	│ └── train
	│ ├── 16x256x256.py -> Train on videos 16 frames 256x256
	│ ├── 16x256x256.py -> Train on videos 16 frames 256x256
	│ └── 64x512x512.py -> Train on videos 64 frames 512x512
	├── dit -> DiT related configs
	│ ├── inference
	│ │ ├── 1x256x256-class.py -> Sample images with ckpts from DiT
	│ │ ├── 1x256x256.py -> Sample images with clip condition
	│ │ └── 16x256x256.py -> Sample videos
	│ └── train
	│ ├── 1x256x256.py -> Train on images with clip condition
	│ └── 16x256x256.py -> Train on videos
	├── latte -> Latte related configs
	└── pixart -> PixArt related configs
	```

	## Inference config demos

	To change the inference settings, you can directly modify the corresponding config file. Or you can pass arguments to overwrite the config file ([config_utils.py](/opensora/utils/config_utils.py)). To change sampling prompts, you should modify the `.txt` file passed to the `--prompt_path` argument.

	```plaintext
	--prompt_path ./assets/texts/t2v_samples.txt -> prompt_path
	--ckpt-path ./path/to/your/ckpt.pth -> model["from_pretrained"]
	```

	The explanation of each field is provided below.

	```python
	# Define sampling size
	num_frames = 64 # number of frames
	fps = 24 // 2 # frames per second (divided by 2 for frame_interval=2)
	image_size = (512, 512) # image size (height, width)

	# Define model
	model = dict(
	type="STDiT-XL/2", # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
	space_scale=1.0, # (Optional) Space positional encoding scale (new height / old height)
	time_scale=2 / 3, # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
	enable_flashattn=True, # (Optional) Speed up training and inference with flash attention
	enable_layernorm_kernel=True, # (Optional) Speed up training and inference with fused kernel
	from_pretrained="PRETRAINED_MODEL", # (Optional) Load from pretrained model
	no_temporal_pos_emb=True, # (Optional) Disable temporal positional encoding (for image)
	)
	vae = dict(
	type="VideoAutoencoderKL", # Select VAE type
	from_pretrained="stabilityai/sd-vae-ft-ema", # Load from pretrained VAE
	micro_batch_size=128, # VAE with micro batch size to save memory
	)
	text_encoder = dict(
	type="t5", # Select text encoder type (t5, clip)
	from_pretrained="./pretrained_models/t5_ckpts", # Load from pretrained text encoder
	model_max_length=120, # Maximum length of input text
	)
	scheduler = dict(
	type="iddpm", # Select scheduler type (iddpm, dpm-solver)
	num_sampling_steps=100, # Number of sampling steps
	cfg_scale=7.0, # hyper-parameter for classifier-free diffusion
	)
	dtype = "fp16" # Computation type (fp16, fp32, bf16)

	# Other settings
	batch_size = 1 # batch size
	seed = 42 # random seed
	prompt_path = "./assets/texts/t2v_samples.txt" # path to prompt file
	save_dir = "./samples" # path to save samples
	```

	## Training config demos

	```python
	# Define sampling size
	num_frames = 64
	frame_interval = 2 # sample every 2 frames
	image_size = (512, 512)

	# Define dataset
	root = None # root path to the dataset
	data_path = "CSV_PATH" # path to the csv file
	use_image_transform = False # True if training on images
	num_workers = 4 # number of workers for dataloader

	# Define acceleration
	dtype = "bf16" # Computation type (fp16, bf16)
	grad_checkpoint = True # Use gradient checkpointing
	plugin = "zero2" # Plugin for distributed training (zero2, zero2-seq)
	sp_size = 1 # Sequence parallelism size (1 for no sequence parallelism)

	# Define model
	model = dict(
	type="STDiT-XL/2",
	space_scale=1.0,
	time_scale=2 / 3,
	from_pretrained="YOUR_PRETRAINED_MODEL",
	enable_flashattn=True, # Enable flash attention
	enable_layernorm_kernel=True, # Enable layernorm kernel
	)
	vae = dict(
	type="VideoAutoencoderKL",
	from_pretrained="stabilityai/sd-vae-ft-ema",
	micro_batch_size=128,
	)
	text_encoder = dict(
	type="t5",
	from_pretrained="./pretrained_models/t5_ckpts",
	model_max_length=120,
	shardformer=True, # Enable shardformer for T5 acceleration
	)
	scheduler = dict(
	type="iddpm",
	timestep_respacing="", # Default 1000 timesteps
	)

	# Others
	seed = 42
	outputs = "outputs" # path to save checkpoints
	wandb = False # Use wandb for logging

	epochs = 1000 # number of epochs (just large enough, kill when satisfied)
	log_every = 10
	ckpt_every = 250
	load = None # path to resume training

	batch_size = 4
	lr = 2e-5
	grad_clip = 1.0 # gradient clipping
	```