## Dataset Configuration

Please create a TOML file for dataset configuration.

Image and video datasets are supported. A single configuration file can define multiple datasets; each one is either an image or a video dataset, with captions supplied through caption text files or a metadata JSONL file.

### Sample for Image Dataset with Caption Text Files

```toml
# resolution, caption_extension, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets

# general configurations
[general]
resolution = [960, 544]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
image_directory = "/path/to/image_dir"

# other datasets can be added here. each dataset can have different configurations
```
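
Settings in `[general]` act as defaults that each `[[datasets]]` entry can override. The following is a minimal Python sketch of that lookup, assuming the TOML file has already been parsed into a plain dict. The helper name `resolve_dataset_settings`, the second dataset path, and the error handling are illustrative assumptions, not the project's actual loader; the list of required keys is taken from the comment in the sample.

```python
# Keys that the sample comment says must be set in either [general] or [[datasets]].
REQUIRED_KEYS = (
    "resolution",
    "caption_extension",
    "batch_size",
    "enable_bucket",
    "bucket_no_upscale",
)


def resolve_dataset_settings(config):
    """Merge [general] into each [[datasets]] entry (hypothetical helper)."""
    general = config.get("general", {})
    resolved = []
    for dataset in config.get("datasets", []):
        merged = {**general, **dataset}  # per-dataset values win over general ones
        missing = [key for key in REQUIRED_KEYS if key not in merged]
        if missing:
            raise ValueError(f"missing settings: {missing}")
        resolved.append(merged)
    return resolved


# Dict equivalent of the sample, plus a second (made-up) dataset that overrides batch_size.
config = {
    "general": {
        "resolution": [960, 544],
        "caption_extension": ".txt",
        "batch_size": 1,
        "enable_bucket": True,
        "bucket_no_upscale": False,
    },
    "datasets": [
        {"image_directory": "/path/to/image_dir"},
        {"image_directory": "/path/to/other_image_dir", "batch_size": 4},
    ],
}

for settings in resolve_dataset_settings(config):
    print(settings["image_directory"], settings["batch_size"], settings["resolution"])
```

The first dataset inherits every value from `[general]`, while the second keeps the shared resolution but uses its own batch size.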

### Sample for Image Dataset with Metadata JSONL File

```toml
# resolution, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets
# caption_extension is not required for metadata jsonl file
# cache_directory is required for each dataset with metadata jsonl file

# general configurations
[general]
resolution = [960, 544]
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
image_jsonl_file = "/path/to/metadata.jsonl"
cache_directory = "/path/to/cache_directory"

# other datasets can be added here. each dataset can have different configurations
```

JSONL file format for metadata:

```json
{"image_path": "/path/to/image1.jpg", "caption": "A caption for image1"}
{"image_path": "/path/to/image2.jpg", "caption": "A caption for image2"}
```
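
A metadata file in this format can be generated with a few lines of Python. The sketch below is only an illustration: the function name `write_metadata_jsonl` and the temporary output location are assumptions, and the record keys follow the format shown above.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_jsonl(pairs, output_path, path_key="image_path"):
    """Write one JSON object per line: {path_key: <path>, "caption": <caption>}."""
    with open(output_path, "w", encoding="utf-8") as f:
        for media_path, caption in pairs:
            record = {path_key: str(media_path), "caption": caption}
            # ensure_ascii=False keeps non-ASCII captions human-readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


pairs = [
    ("/path/to/image1.jpg", "A caption for image1"),
    ("/path/to/image2.jpg", "A caption for image2"),
]

# Written to a temporary directory here; point this at your own location in practice.
output_path = Path(tempfile.mkdtemp()) / "metadata.jsonl"
write_metadata_jsonl(pairs, output_path)
print(output_path.read_text(encoding="utf-8"))
```

Passing `path_key="video_path"` produces the equivalent file for a video dataset.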

### Sample for Video Dataset with Caption Text Files

```toml
# resolution, caption_extension, target_frames, frame_extraction, frame_stride, frame_sample, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets

# general configurations
[general]
resolution = [960, 544]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
video_directory = "/path/to/video_dir"
target_frames = [1, 25, 45]
frame_extraction = "head"

# other datasets can be added here. each dataset can have different configurations
```

### Sample for Video Dataset with Metadata JSONL File

```toml
# resolution, target_frames, frame_extraction, frame_stride, frame_sample, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets
# caption_extension is not required for metadata jsonl file
# cache_directory is required for each dataset with metadata jsonl file

# general configurations
[general]
resolution = [960, 544]
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
video_jsonl_file = "/path/to/metadata.jsonl"
target_frames = [1, 25, 45]
frame_extraction = "head"
cache_directory = "/path/to/cache_directory"

# same metadata jsonl file can be used for multiple datasets
[[datasets]]
video_jsonl_file = "/path/to/metadata.jsonl"
target_frames = [1]
frame_stride = 10
cache_directory = "/path/to/cache_directory"

# other datasets can be added here. each dataset can have different configurations
```

JSONL file format for metadata:

```json
{"video_path": "/path/to/video1.mp4", "caption": "A caption for video1"}
{"video_path": "/path/to/video2.mp4", "caption": "A caption for video2"}
```
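
Because each line must be a standalone JSON object, a malformed line is easy to miss. The following sketch checks lines against the format above before training; `check_metadata_lines` is a hypothetical helper, and skipping blank lines is a choice made for this example rather than documented loader behavior.

```python
import json


def check_metadata_lines(lines, path_key="video_path"):
    """Return (line_number, problem) tuples for lines that do not match the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored by this checker
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in (path_key, "caption"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems


sample_lines = [
    '{"video_path": "/path/to/video1.mp4", "caption": "A caption for video1"}',
    '{"video_path": "/path/to/video2.mp4"}',            # caption is missing
    '{"video_path": "/path/to/video3.mp4", "caption": ',  # truncated line
]

for number, problem in check_metadata_lines(sample_lines):
    print(f"line {number}: {problem}")
```

For an image dataset, call it with `path_key="image_path"`.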

### frame_extraction Options

In the descriptions below, N is each value in `target_frames`.

- `head`: Extract the first N frames from the video.
- `chunk`: Extract frames by splitting the video into chunks of N frames.
- `slide`: Extract N frames at a time from the video with a stride of `frame_stride`.
- `uniform`: Extract `frame_sample` samples of N frames each, spaced uniformly across the video.

For example, consider a video with 40 frames. The following diagrams illustrate each extraction:

```
Original Video, 40 frames: x = frame, o = no frame
oooooooooooooooooooooooooooooooooooooooo

head, target_frames = [1, 13, 25] -> extract head frames:
xooooooooooooooooooooooooooooooooooooooo
xxxxxxxxxxxxxooooooooooooooooooooooooooo
xxxxxxxxxxxxxxxxxxxxxxxxxooooooooooooooo

chunk, target_frames = [13, 25] -> split the video into chunks of 13 frames, then into chunks of 25 frames:
xxxxxxxxxxxxxooooooooooooooooooooooooooo
oooooooooooooxxxxxxxxxxxxxoooooooooooooo
ooooooooooooooooooooooooooxxxxxxxxxxxxxo
xxxxxxxxxxxxxxxxxxxxxxxxxooooooooooooooo

NOTE: Do not include 1 in target_frames when frame_extraction is "chunk". Doing so causes every frame to be extracted.

slide, target_frames = [1, 13, 25], frame_stride = 10 -> extract N frames with a stride of 10:
xooooooooooooooooooooooooooooooooooooooo
ooooooooooxooooooooooooooooooooooooooooo
ooooooooooooooooooooxooooooooooooooooooo
ooooooooooooooooooooooooooooooxooooooooo
xxxxxxxxxxxxxooooooooooooooooooooooooooo
ooooooooooxxxxxxxxxxxxxooooooooooooooooo
ooooooooooooooooooooxxxxxxxxxxxxxooooooo
xxxxxxxxxxxxxxxxxxxxxxxxxooooooooooooooo
ooooooooooxxxxxxxxxxxxxxxxxxxxxxxxxooooo

uniform, target_frames = [1, 13, 25], frame_sample = 4 -> extract `frame_sample` samples uniformly, N frames each:
xooooooooooooooooooooooooooooooooooooooo
oooooooooooooxoooooooooooooooooooooooooo
oooooooooooooooooooooooooxoooooooooooooo
ooooooooooooooooooooooooooooooooooooooox
xxxxxxxxxxxxxooooooooooooooooooooooooooo
oooooooooxxxxxxxxxxxxxoooooooooooooooooo
ooooooooooooooooooxxxxxxxxxxxxxooooooooo
oooooooooooooooooooooooooooxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxooooooooooooooo
oooooxxxxxxxxxxxxxxxxxxxxxxxxxoooooooooo
ooooooooooxxxxxxxxxxxxxxxxxxxxxxxxxooooo
oooooooooooooooxxxxxxxxxxxxxxxxxxxxxxxxx
```
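
The diagrams can also be read as lists of frame ranges. The sketch below computes those ranges for each mode. It is inferred from the diagrams only and is not the project's implementation: the function name `extraction_windows` is hypothetical, and the floor rounding used for `uniform` start positions is an assumption that may differ by a frame from the real behavior.

```python
def extraction_windows(frame_count, target_frames, frame_extraction="head",
                       frame_stride=1, frame_sample=1):
    """Return (start, end) frame ranges, end exclusive, one per extracted clip."""
    windows = []
    for n in target_frames:
        last_start = frame_count - n  # latest start that still fits n frames
        if last_start < 0:
            continue  # the video is shorter than this clip length
        if frame_extraction == "head":
            starts = [0]
        elif frame_extraction == "chunk":
            starts = list(range(0, last_start + 1, n))
        elif frame_extraction == "slide":
            starts = list(range(0, last_start + 1, frame_stride))
        elif frame_extraction == "uniform":
            if frame_sample == 1:
                starts = [0]  # a single sample behaves like "head"
            else:
                # evenly spaced starts from 0 to last_start (floor rounding assumed)
                starts = [i * last_start // (frame_sample - 1)
                          for i in range(frame_sample)]
        else:
            raise ValueError(f"unknown frame_extraction: {frame_extraction}")
        windows.extend((start, start + n) for start in starts)
    return windows


print(extraction_windows(40, [13, 25], "chunk"))
print(extraction_windows(40, [1, 13, 25], "slide", frame_stride=10))
print(extraction_windows(40, [13, 25], "uniform", frame_sample=4))
```

For the 40-frame example, the `chunk` call yields 13-frame clips starting at frames 0, 13, and 26, plus one 25-frame clip starting at frame 0, which matches the diagram. Calling it with `target_frames=[1]` and `"chunk"` returns one clip per frame, which is why the note above advises against that combination.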

## Specifications

```toml
# general configurations
[general]
resolution = [960, 544] # optional, [W, H], default is None. This is the default resolution for all datasets
caption_extension = ".txt" # optional, default is None. This is the default caption extension for all datasets
batch_size = 1 # optional, default is 1. This is the default batch size for all datasets
enable_bucket = true # optional, default is false. Enable bucketing for datasets
bucket_no_upscale = false # optional, default is false. Disable upscaling for bucketing. Ignored if enable_bucket is false

### Image Dataset

# sample image dataset with caption text files
[[datasets]]
image_directory = "/path/to/image_dir"
caption_extension = ".txt" # required for caption text files, if general caption extension is not set
resolution = [960, 544] # required if general resolution is not set
batch_size = 4 # optional, overrides the default batch size
enable_bucket = false # optional, overrides the default bucketing setting
bucket_no_upscale = true # optional, overrides the default bucketing setting
cache_directory = "/path/to/cache_directory" # optional, default is None, which uses the image directory. NOTE: caching is always enabled

# sample image dataset with metadata jsonl file
[[datasets]]
image_jsonl_file = "/path/to/metadata.jsonl" # includes pairs of image files and captions
resolution = [960, 544] # required if general resolution is not set
cache_directory = "/path/to/cache_directory" # required for metadata jsonl file
# caption_extension is not required for metadata jsonl file
# batch_size, enable_bucket, bucket_no_upscale are also available for metadata jsonl file

### Video Dataset

# sample video dataset with caption text files
[[datasets]]
video_directory = "/path/to/video_dir"
caption_extension = ".txt" # required for caption text files, if general caption extension is not set
resolution = [960, 544] # required if general resolution is not set

target_frames = [1, 25, 81] # required for video dataset. list of clip lengths (in frames) to extract. each element must be N*4+1 (N=0,1,2,...)

# NOTE: Do not include 1 in target_frames when frame_extraction is "chunk". Doing so causes every frame to be extracted.

frame_extraction = "head" # optional, one of "head", "chunk", "slide", "uniform". Default is "head"
frame_stride = 1 # optional, default is 1, available for "slide" frame extraction
frame_sample = 4 # optional, default is 1 (same as "head"), available for "uniform" frame extraction
# batch_size, enable_bucket, bucket_no_upscale, cache_directory are also available for video dataset

# sample video dataset with metadata jsonl file
[[datasets]]
video_jsonl_file = "/path/to/metadata.jsonl" # includes pairs of video files and captions

target_frames = [1, 81]

cache_directory = "/path/to/cache_directory" # required for metadata jsonl file
# frame_extraction, frame_stride, frame_sample are also available for metadata jsonl file
```
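
The two constraints on `target_frames` stated in the comments (each element must be N*4+1, and 1 should not be combined with `"chunk"`) can be checked before training starts. This is a small illustrative sketch; `check_target_frames` is a hypothetical helper and not part of the tool.

```python
def check_target_frames(target_frames, frame_extraction="head"):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for n in target_frames:
        # Valid lengths are 1, 5, 9, 13, ... (N*4+1 for N = 0, 1, 2, ...)
        if n < 1 or (n - 1) % 4 != 0:
            problems.append(f"{n} is not of the form N*4+1")
    if frame_extraction == "chunk" and 1 in target_frames:
        problems.append('1 in target_frames with "chunk" extracts every frame')
    return problems


print(check_target_frames([1, 25, 45]))           # all values satisfy N*4+1
print(check_target_frames([1, 25, 80]))           # 80 is flagged
print(check_target_frames([1, 13, 25], "chunk"))  # 1 with "chunk" is flagged
```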

<!--
# sample image dataset with lance
[[datasets]]
image_lance_dataset = "/path/to/lance_dataset"
resolution = [960, 544] # required if general resolution is not set
# batch_size, enable_bucket, bucket_no_upscale, cache_directory are also available for lance dataset
-->

Metadata in .json files will be supported in the near future.