+ Credits for Quantized version |
+ Methodology |
+ Capabilities |
+ License |
+ Citation
+
+
+
+## 1. Overview
+
+OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts. It is designed to be simple, flexible, and easy to use. We provide [inference code](#5-quick-start) so that everyone can explore more functionalities of OmniGen.
+
+Existing image generation models often require loading several additional network modules (such as ControlNet, IP-Adapter, Reference-Net, etc.) and performing extra preprocessing steps (e.g., face detection, pose estimation, cropping, etc.) to generate a satisfactory image. However, **we believe that the future image generation paradigm should be simpler and more flexible: generating various images directly from arbitrary multi-modal instructions, without additional plugins and operations, similar to how GPT works in language generation.**
+
+Due to limited resources, OmniGen still has room for improvement. We will continue to optimize it, and we hope it inspires more universal image generation models. You can also easily fine-tune OmniGen without worrying about designing networks for specific tasks; you just need to prepare the corresponding data and run the [script](docs/fine-tuning.md). Imagination is no longer limited; everyone can construct any image generation task, and perhaps we can achieve very interesting, wonderful, and creative things.
+
+If you have any questions, ideas, or interesting tasks you want OmniGen to accomplish, feel free to discuss with us: 2906698981@qq.com, wangyueze@tju.edu.cn, zhengliu1026@gmail.com. We welcome any feedback to help us improve the model.
+
+
+
+## 2. Credits for Quantized version
+- https://github.com/Manni1000
+
+
+
+## 3. Methodology
+
+You can see details in our [paper](https://arxiv.org/abs/2409.11340).
+
+
+## 4. What Can OmniGen do?
+
+
+OmniGen is a unified image generation model that you can use to perform various tasks, including but not limited to text-to-image generation, subject-driven generation, identity-preserving generation, image editing, and image-conditioned generation. **OmniGen doesn't need additional plugins or operations; it can automatically identify the features (e.g., required objects, human pose, depth map) in input images according to the text prompt.**
+We showcase some examples in [inference.ipynb](inference.ipynb). And in [inference_demo.ipynb](inference_demo.ipynb), we show an interesting pipeline to generate and modify an image.
+
+Here is an illustration of OmniGen's capabilities:
+- You can control the image generation flexibly via OmniGen
+![demo](./imgs/demo_cases.png)
+- Referring Expression Generation: You can generate images by simply referring to objects, and OmniGen will automatically recognize the required objects in the image.
+![demo](./imgs/referring.png)
+
+If you are not entirely satisfied with certain functionalities or wish to add new capabilities, you can try [fine-tuning OmniGen](docs/fine-tuning.md).
+
+
+
+## 5. Quick Start
+
+### Installation
+
+Please refer to this YouTube video for installation:
+
+https://www.youtube.com/watch?v=9ZXmXA2AJZ4
+
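+As a quick sanity check after installation, here is a minimal text-to-image example. It mirrors the snippet in [docs/inference.md](docs/inference.md); the model name and arguments below are the ones used there:
+
+```python
+from OmniGen import OmniGenPipeline
+
+# Load the pretrained OmniGen pipeline
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
+
+# Text-to-image generation
+images = pipe(
+    prompt="A curly-haired man in a red shirt is drinking tea.",
+    height=1024,
+    width=1024,
+    guidance_scale=2.5,
+    seed=0,
+)
+images[0].save("example_t2i.png")  # the output is a list of PIL Images
+```
+
+
+## License
+This repo is licensed under the [MIT License](LICENSE).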
+
+
+## Citation
+If you find this repository useful, please consider giving a star ⭐ and a citation
+```
+@article{xiao2024omnigen,
+  title={Omnigen: Unified image generation},
+  author={Xiao, Shitao and Wang, Yueze and Zhou, Junjie and Yuan, Huaying and Xing, Xingrun and Yan, Ruiran and Wang, Shuting and Huang, Tiejun and Liu, Zheng},
+  journal={arXiv preprint arXiv:2409.11340},
+  year={2024}
+}
+```
+
+
diff --git a/app.py b/app.py
new file mode 100644
index 0000000000000000000000000000000000000000..c20e87f959835ba4174f53f6aee56264fbdaa93a
--- /dev/null
+++ b/app.py
@@ -0,0 +1,359 @@
+import gradio as gr
+from PIL import Image
+import os
+from threading import Lock
+
+from OmniGen import OmniGenPipeline
+
+class OmniGenManager:
+    def __init__(self):
+        self.pipe = None
+        self.lock = Lock()
+        self.current_quantization = None
+
+    def get_pipeline(self, quantization: bool) -> OmniGenPipeline:
+        """
+        Get or initialize the pipeline with the specified quantization setting.
+        Uses a lock to ensure thread safety.
+        """
+        with self.lock:
+            # Only reinitialize if the quantization setting changed or no pipeline exists
+            if self.pipe is None or self.current_quantization != quantization:
+                # Release any existing pipeline before loading a new one
+                if self.pipe is not None:
+                    del self.pipe
+                    self.pipe = None
+
+                # Initialize a new pipeline
+                self.pipe = OmniGenPipeline.from_pretrained(
+                    "Shitao/OmniGen-v1",
+                    Quantization=quantization
+                )
+                self.current_quantization = quantization
+
+            return self.pipe
+
+# Create a single instance of the manager
+pipeline_manager = OmniGenManager()
+
+def generate_image(text, img1, img2, img3, height, width, guidance_scale, img_guidance_scale, inference_steps, seed, quantization):
+    input_images = [img1, img2, img3]
+    # Drop empty image slots; pass None if no image was provided
+    input_images = [img for img in input_images if img is not None]
+    if len(input_images) == 0:
+        input_images = None
+
+    # Get or initialize the pipeline with the current settings
+    pipe = pipeline_manager.get_pipeline(quantization)
+
+    # Generate the image
+    output = pipe(
+        prompt=text,
+        input_images=input_images,
+        height=height,
+        width=width,
+        guidance_scale=guidance_scale,
+        img_guidance_scale=img_guidance_scale,
+        num_inference_steps=inference_steps,
+        separate_cfg_infer=True,  # setting this to False can speed up inference at the cost of memory
+        use_kv_cache=False,
+        seed=seed,
+    )
+    img = output[0]
+    return img
+
+
+def get_example():
+    # Each row: [prompt, img1, img2, img3, height, width, guidance_scale, img_guidance_scale, inference_steps, seed]
+    case = [
+        ["A curly-haired man in a red shirt is drinking tea.", None, None, None, 1024, 1024, 2.5, 1.6, 50, 0],
+        ["The woman in <|image_1|> waves her hand happily in the crowd", "./imgs/test_cases/zhang.png", None, None, 1024, 1024, 2.5, 1.9, 50, 128],
+        ["A man in a black shirt is reading a book. The man is the right man in <|image_1|>.", "./imgs/test_cases/two_man.jpg", None, None, 1024, 1024, 2.5, 1.6, 50, 0],
+        ["Two women are raising fried chicken legs in a bar. One woman is <|image_1|>. The other woman is <|image_2|>.", "./imgs/test_cases/mckenna.jpg", "./imgs/test_cases/Amanda.jpg", None, 1024, 1024, 2.5, 1.8, 50, 168],
+        ["A man and a short-haired woman with a wrinkled face are standing in front of a bookshelf in a library. The man is the man in the middle of <|image_1|>, and the woman is the oldest woman in <|image_2|>", "./imgs/test_cases/1.jpg", "./imgs/test_cases/2.jpg", None, 1024, 1024, 2.5, 1.6, 50, 60],
+        ["A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <|image_1|>. The woman is the woman on the left of <|image_2|>", "./imgs/test_cases/3.jpg", "./imgs/test_cases/4.jpg", None, 1024, 1024, 2.5, 1.8, 50, 66],
+        ["The flower <|image_1|> is placed in the vase which is in the middle of <|image_2|> on a wooden table of a living room", "./imgs/test_cases/rose.jpg", "./imgs/test_cases/vase.jpg", None, 1024, 1024, 2.5, 1.6, 50, 0],
+        ["<|image_1|>\n Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola.", "./imgs/demo_cases/t2i_woman_with_book.png", None, None, 1024, 1024, 2.5, 1.6, 50, 222],
+        ["Detect the skeleton of the human in this image: <|image_1|>.", "./imgs/test_cases/control.jpg", None, None, 1024, 1024, 2.0, 1.6, 50, 0],
+        ["Generate a new photo using the following picture and text as conditions: <|image_1|>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him.", "./imgs/demo_cases/skeletal.png", None, None, 1024, 1024, 2.0, 1.6, 50, 42],
+        ["Following the pose of this image <|image_1|>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him.", "./imgs/demo_cases/edit.png", None, None, 1024, 1024, 2.0, 1.6, 50, 123],
+        ["Following the depth mapping of this image <|image_1|>, generate a new photo: A young girl is sitting on a sofa in the library, holding a book. Her hair is neatly combed, and a faint smile plays on her lips, with a few freckles scattered across her cheeks. The library is quiet, with rows of shelves filled with books stretching out behind her.", "./imgs/demo_cases/edit.png", None, None, 1024, 1024, 2.0, 1.6, 50, 1],
+        ["<|image_1|> What item can be used to see the current time? Please remove it.", "./imgs/test_cases/watch.jpg", None, None, 1024, 1024, 2.5, 1.6, 50, 0],
+        ["According to the following examples, generate an output for the input.\nInput: <|image_1|>\nOutput: <|image_2|>\n\nInput: <|image_3|>\nOutput: ", "./imgs/test_cases/icl1.jpg", "./imgs/test_cases/icl2.jpg", "./imgs/test_cases/icl3.jpg", 1024, 1024, 2.5, 1.6, 50, 1],
+    ]
+    return case
+
+def run_for_examples(text, img1, img2, img3, height, width, guidance_scale, img_guidance_scale, inference_steps, seed):
+    # Example rows don't include a quantization flag, so default to the low-VRAM setting
+    return generate_image(text, img1, img2, img3, height, width, guidance_scale, img_guidance_scale, inference_steps, seed, True)
+
+description = """
+OmniGen is a unified image generation model that you can use to perform various tasks, including but not limited to text-to-image generation, subject-driven generation, identity-preserving generation, and image-conditioned generation.
+
+For multi-modal to image generation, you should pass a string as `prompt` and a list of image paths as `input_images`. The placeholder in the prompt should be in the format of `<|image_*|>` (for the first image, the placeholder is <|image_1|>; for the second image, the placeholder is <|image_2|>).
+For example, use an image of a woman to generate a new image:
+prompt = "A woman holds a bouquet of flowers and faces the camera. The woman is <|image_1|>."
+
+Tips:
+- Oversaturated: If the image appears oversaturated, please reduce the `guidance_scale`.
+- Low quality: A more detailed prompt will lead to better results.
+- Anime style: If the generated images are in an anime style, try adding `photo` to the prompt.
+- Editing a generated image: If you generate an image with OmniGen and then want to edit it, do not reuse the same seed. For example, use seed=0 to generate the image and seed=1 to edit it.
+- For image editing tasks, we recommend placing the image before the editing instruction. For example, use `<|image_1|> remove suit`, rather than `remove suit <|image_1|>`.
+"""
+
+# Gradio interface
+with gr.Blocks() as demo:
+    gr.Markdown("# OmniGen: Unified Image Generation [paper](https://arxiv.org/abs/2409.11340) [code](https://github.com/VectorSpaceLab/OmniGen)")
+    gr.Markdown(description)
+    with gr.Row():
+        with gr.Column():
+            # Text prompt input
+            prompt_input = gr.Textbox(
+                label="Enter your prompt, use <|image_i|> to represent i-th input image", placeholder="Type your prompt here..."
+            )
+
+            with gr.Row(equal_height=True):
+                # Image upload boxes
+                image_input_1 = gr.Image(label="<|image_1|>", type="filepath")
+                image_input_2 = gr.Image(label="<|image_2|>", type="filepath")
+                image_input_3 = gr.Image(label="<|image_3|>", type="filepath")
+
+            # Height and width sliders
+            height_input = gr.Slider(
+                label="Height", minimum=256, maximum=2048, value=1024, step=16
+            )
+            width_input = gr.Slider(
+                label="Width", minimum=256, maximum=2048, value=1024, step=16
+            )
+
+            # Guidance scale inputs
+            guidance_scale_input = gr.Slider(
+                label="Guidance Scale", minimum=1.0, maximum=5.0, value=2.5, step=0.1
+            )
+
+            img_guidance_scale_input = gr.Slider(
+                label="Image Guidance Scale", minimum=1.0, maximum=2.0, value=1.6, step=0.1
+            )
+
+            num_inference_steps = gr.Slider(
+                label="Inference Steps", minimum=1, maximum=100, value=50, step=1
+            )
+
+            Quantization = gr.Checkbox(
+                label="Low VRAM (8-bit Quantization)", value=True
+            )
+
+            seed_input = gr.Slider(
+                label="Seed", minimum=0, maximum=2147483647, value=42, step=1
+            )
+
+            # Generate button
+            generate_button = gr.Button("Generate Image")
+
+        with gr.Column():
+            # Output image box
+            output_image = gr.Image(label="Output Image")
+
+    # Button click event
+    generate_button.click(
+        generate_image,
+        inputs=[
+            prompt_input,
+            image_input_1,
+            image_input_2,
+            image_input_3,
+            height_input,
+            width_input,
+            guidance_scale_input,
+            img_guidance_scale_input,
+            num_inference_steps,
+            seed_input,
+            Quantization,
+        ],
+        outputs=output_image,
+    )
+
+    gr.Examples(
+        examples=get_example(),
+        fn=run_for_examples,
+        inputs=[
+            prompt_input,
+            image_input_1,
+            image_input_2,
+            image_input_3,
+            height_input,
+            width_input,
+            guidance_scale_input,
+            img_guidance_scale_input,
+            num_inference_steps,
+            seed_input,
+        ],
+        outputs=output_image,
+    )
+
+# Launch the app
+demo.launch()
\ No newline at end of file
diff --git a/docs/fine-tuning.md b/docs/fine-tuning.md
new file mode 100644
index 0000000000000000000000000000000000000000..4f690c9fe2006014b78788de5e62fd6c03e840fe
--- /dev/null
+++ b/docs/fine-tuning.md
@@ -0,0 +1,172 @@
+# Fine-tuning OmniGen
+
+Fine-tuning OmniGen can better help you handle specific image generation tasks. For example, by fine-tuning on a person's images, you can generate multiple pictures of that person while maintaining identity consistency.
+
+A lot of previous work focused on designing new networks to facilitate specific tasks. For instance, ControlNet was proposed to handle image conditions, and IP-Adapter was constructed to maintain ID features. If you want to perform new tasks, you need to build new architectures and repeatedly debug them. Adding and adjusting extra network parameters is usually time-consuming and labor-intensive, which is neither user-friendly nor cost-efficient. However, with OmniGen, all of this becomes very simple.
+
+By comparison, OmniGen can accept multi-modal conditional inputs and has been pre-trained on various tasks. You can fine-tune it on any task without designing specialized networks like ControlNet or IP-Adapter for a specific task.
+
+**All you need to do is prepare the data and start training. You can break the limitations of previous models, allowing OmniGen to accomplish a variety of interesting tasks, even those that have never been done before.**
+
+
+## Installation
+
+```bash
+git clone https://github.com/VectorSpaceLab/OmniGen.git
+cd OmniGen
+pip install -e .
+```
+
+
+## Full fine-tuning
+
+### Fine-tuning command
+
+```bash
+accelerate launch \
+    --num_processes=1 \
+    --use_fsdp \
+    --fsdp_offload_params false \
+    --fsdp_sharding_strategy SHARD_GRAD_OP \
+    --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
+    --fsdp_transformer_layer_cls_to_wrap Phi3DecoderLayer \
+    --fsdp_state_dict_type FULL_STATE_DICT \
+    --fsdp_forward_prefetch false \
+    --fsdp_use_orig_params True \
+    --fsdp_cpu_ram_efficient_loading false \
+    --fsdp_sync_module_states True \
+    train.py \
+    --model_name_or_path Shitao/OmniGen-v1 \
+    --json_file ./toy_data/toy_data.jsonl \
+    --image_path ./toy_data/images \
+    --batch_size_per_device 1 \
+    --lr 2e-5 \
+    --keep_raw_resolution \
+    --max_image_size 1024 \
+    --gradient_accumulation_steps 1 \
+    --ckpt_every 100 \
+    --epochs 100 \
+    --log_every 1 \
+    --results_dir ./results/toy_finetune
+```
+
+Some important arguments:
+- `num_processes`: number of GPUs to use for training
+- `model_name_or_path`: path to the pretrained model
+- `json_file`: path to the json file containing the training data, e.g., ./toy_data/toy_data.jsonl
+- `image_path`: path to the image folder, e.g., ./toy_data/images
+- `batch_size_per_device`: batch size per device
+- `lr`: learning rate
+- `keep_raw_resolution`: whether to keep the original resolution of the image; if not, all images will be resized to (max_image_size, max_image_size)
+- `max_image_size`: max image size
+- `gradient_accumulation_steps`: number of steps to accumulate gradients
+- `ckpt_every`: number of steps between checkpoints
+- `epochs`: number of epochs
+- `log_every`: number of steps between log entries
+- `results_dir`: path to the results folder
+
+The data format of json_file is as follows:
+```
+{
+  "instruction": str,
+  "input_images": [str, str, ...],
+  "output_image": str
+}
+```
+You can see a toy example in `./toy_data/toy_data.jsonl`.
+
+If an OOM (out-of-memory) issue occurs, you can try to decrease the `batch_size_per_device` or `max_image_size`. You can also try to use LoRA instead of full fine-tuning.
+
+
+### Inference
+
+The checkpoint can be found at `{results_dir}/checkpoints/*`. You can use the following command to load a saved checkpoint:
+```python
+from OmniGen import OmniGenPipeline
+
+pipe = OmniGenPipeline.from_pretrained("checkpoint_path")  # e.g., ./results/toy_finetune/checkpoints/0000200
+```
+
+
+
+
+
+## LoRA fine-tuning
+LoRA fine-tuning is a simple way to fine-tune OmniGen with less GPU memory. To use LoRA, add `--use_lora` and `--lora_rank` to the command.
+
+```bash
+accelerate launch \
+    --num_processes=1 \
+    train.py \
+    --model_name_or_path Shitao/OmniGen-v1 \
+    --batch_size_per_device 2 \
+    --condition_dropout_prob 0.01 \
+    --lr 3e-4 \
+    --use_lora \
+    --lora_rank 8 \
+    --json_file ./toy_data/toy_data.jsonl \
+    --image_path ./toy_data/images \
+    --max_input_length_limit 18000 \
+    --keep_raw_resolution \
+    --max_image_size 1024 \
+    --gradient_accumulation_steps 1 \
+    --ckpt_every 100 \
+    --epochs 100 \
+    --log_every 1 \
+    --results_dir ./results/toy_finetune_lora
+```
+
+### Inference
+
+The checkpoint can be found at `{results_dir}/checkpoints/*`. You can use the following command to load the checkpoint:
+```python
+from OmniGen import OmniGenPipeline
+
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
+pipe.merge_lora("checkpoint_path")  # e.g., ./results/toy_finetune_lora/checkpoints/0000100
+```
+
+
+## A simple example
+
+Here is an example of learning a new concept: "sks dog". We use five images of one dog from [dog-example](https://huggingface.co/datasets/diffusers/dog-example).
+
+The json file is `./toy_data/toy_subject_data.jsonl`, and the images have been saved in `./toy_data/images`.
+
+```bash
+accelerate launch \
+    --num_processes=1 \
+    train.py \
+    --model_name_or_path Shitao/OmniGen-v1 \
+    --batch_size_per_device 2 \
+    --condition_dropout_prob 0.01 \
+    --lr 1e-3 \
+    --use_lora \
+    --lora_rank 8 \
+    --json_file ./toy_data/toy_subject_data.jsonl \
+    --image_path ./toy_data/images \
+    --max_input_length_limit 18000 \
+    --keep_raw_resolution \
+    --max_image_size 1024 \
+    --gradient_accumulation_steps 1 \
+    --ckpt_every 100 \
+    --epochs 200 \
+    --log_every 1 \
+    --results_dir ./results/toy_finetune_lora
+```
+
+After training, you can use the following command to generate images:
+```python
+from OmniGen import OmniGenPipeline
+
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
+pipe.merge_lora("checkpoint_path")  # e.g., ./results/toy_finetune_lora/checkpoints/0000200
+
+images = pipe(
+    prompt="a photo of sks dog running in the snow",
+    height=1024,
+    width=1024,
+    guidance_scale=3
+)
+images[0].save("example_sks_dog_snow.png")
+```
diff --git a/docs/inference.md b/docs/inference.md
new file mode 100644
index 0000000000000000000000000000000000000000..58b80fd4e5114f884a5968622689aa41ab3ac8d5
--- /dev/null
+++ b/docs/inference.md
@@ -0,0 +1,96 @@
+# Inference with OmniGen
+
+To handle some complex tasks, image generation models are becoming increasingly sophisticated, leading to more and more cumbersome workflows. Existing image generation models like SD and Flux require loading many additional network modules (such as ControlNet, IP-Adapter, Reference-Net) and extra preprocessing steps (e.g., face detection, pose detection, image cropping) to generate a satisfactory image. This complex workflow is not user-friendly. We believe that future image generation models should be simpler, generating various images directly through instructions, similar to how GPT works in language generation.
+
+Therefore, we propose OmniGen, a model capable of handling various image generation tasks within a single framework. The goal of OmniGen is to complete various image generation tasks without relying on any additional components or image preprocessing steps. OmniGen supports tasks including text-to-image generation, image editing, subject-driven image generation, and classical vision tasks, among others. More capabilities can be found in our examples. We provide inference code so you can explore more unknown functionalities yourself.
+
+
+
+## Install
+```bash
+git clone https://github.com/VectorSpaceLab/OmniGen.git
+cd OmniGen
+pip install -e .
+```
+
+
+
+## Generate Images
+You can use the following code to generate images:
+```python
+from OmniGen import OmniGenPipeline
+
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
+
+# Text to Image
+images = pipe(
+    prompt="A curly-haired man in a red shirt is drinking tea.",
+    height=1024,
+    width=1024,
+    guidance_scale=2.5,
+    seed=0,
+)
+images[0].save("example_t2i.png")  # save output PIL Image
+
+# Multi-modal to Image
+# In the prompt, we use a placeholder to represent each input image. The placeholder format is <|image_*|>.
+# You can add multiple images to input_images. Please ensure that each image has its own placeholder. For example, for the list input_images [img1_path, img2_path], the prompt needs two placeholders: <|image_1|>, <|image_2|>.
+images = pipe(
+    prompt="A man in a black shirt is reading a book. The man is the right man in <|image_1|>.",
+    input_images=["./imgs/test_cases/two_man.jpg"],
+    height=1024,
+    width=1024,
+    separate_cfg_infer=False,  # if OOM, you can set separate_cfg_infer=True
+    guidance_scale=2.5,
+    img_guidance_scale=1.6
+)
+images[0].save("example_ti2i.png")  # save output PIL image
+```
+
+Some important arguments:
+- `guidance_scale`: The strength of the guidance. Based on our experience, it is usually best to set it between 2 and 3. The higher the value, the more closely the generated image will follow the prompt. If the image appears oversaturated, please reduce the scale.
+- `height` and `width`: The height and width of the generated image. The default value is 1024x1024. OmniGen supports any size, but the height and width must be divisible by 16.
+- `num_inference_steps`: The number of steps in the diffusion process. The higher the value, the more detailed the generated image will be.
+- `separate_cfg_infer`: Whether to use a separate inference process for CFG guidance. If set to True, the memory cost will be lower but generation will be slower. Default is False.
+- `use_kv_cache`: Whether to use the key-value cache. Default is True.
+- `seed`: The seed for the random number generator.
+
+**For more examples, please refer to [inference.ipynb](../inference.ipynb)**
+
+
+#### Input data
+OmniGen can accept multi-modal input data. Specifically, you should pass two arguments: `prompt` and `input_images`.
+For text-to-image generation, you can pass a string as `prompt`, or pass a list of strings as `prompt` to generate multiple images.
+
+For multi-modal to image generation, you should pass a string as `prompt`, and a list of image paths as `input_images`. The placeholder in the prompt should be in the format of `<|image_*|>`.
+For example, if you want to generate an image with a person holding a bouquet of flowers, you can pass the following prompt:
+```
+prompt = "A woman holds a bouquet of flowers and faces the camera. The woman is <|image_1|>."
+input_images = ["./imgs/test_cases/liuyifei.png"]
+```
+The placeholder `<|image_1|>` will be replaced by the image at `input_images[0]`, i.e., `./imgs/test_cases/liuyifei.png`.
+
+If you want to generate multiple images, you can pass a list of prompts and a list of image-path lists (a runnable sketch of this batched call appears at the end of this document). For example:
+```
+prompt = ["A woman holds a bouquet of flowers and faces the camera.", "A woman holds a bouquet of flowers and faces the camera. The woman is <|image_1|>."]
+input_images = [[], ["./imgs/test_cases/liuyifei.png"]]
+```
+
+
+#### Gradio Demo
+We have built an online demo on [Hugging Face](https://huggingface.co/spaces/Shitao/OmniGen).
+
+For the local Gradio demo, you can run it with the following command:
+```bash
+python app.py
+```
+
+
+## Tips
+- OOM issue: If you encounter an OOM issue, you can try to set `separate_cfg_infer=True`. This will reduce memory usage but increase generation latency. You can also reduce the size of the image, e.g., `height=768, width=512`.
+- Oversaturated: If the image appears oversaturated, please reduce the `guidance_scale`.
+- Doesn't match the prompt: If the image does not match the prompt, please try to increase the `guidance_scale`.
+- Low quality: A more detailed prompt will lead to better results. Besides, a larger image size (`height` and `width`) will also help.
+- Anime style: If the generated images are in an anime style, you can try adding `photo` to the prompt.
+- Editing a generated image: If you generate an image with OmniGen and then want to edit it, you cannot use the same seed. For example, use seed=0 to generate the image, and use seed=1 to edit it.
+- For image editing tasks, we recommend placing the image before the editing instruction. For example, use `<|image_1|> remove suit`, rather than `remove suit <|image_1|>`.
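+
+
+## Batched multi-prompt example
+
+To make the batched usage above concrete, here is a minimal sketch. It reuses the prompts and image path from the Input data section; the argument names mirror the single-prompt examples above, and the assumption that the pipeline returns one PIL image per prompt follows the `images[0]` pattern used there:
+
+```python
+from OmniGen import OmniGenPipeline
+
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
+
+# One text-only prompt and one multi-modal prompt, batched together.
+# input_images needs one (possibly empty) list of image paths per prompt.
+prompt = [
+    "A woman holds a bouquet of flowers and faces the camera.",
+    "A woman holds a bouquet of flowers and faces the camera. The woman is <|image_1|>.",
+]
+input_images = [[], ["./imgs/test_cases/liuyifei.png"]]
+
+images = pipe(
+    prompt=prompt,
+    input_images=input_images,
+    height=1024,
+    width=1024,
+    guidance_scale=2.5,
+    img_guidance_scale=1.6,
+    seed=0,
+)
+for i, img in enumerate(images):
+    img.save(f"example_batch_{i}.png")
+```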
diff --git a/imgs/.DS_Store b/imgs/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..5008ddfcf53c02e82d7eee2e57c38e5672ef89f6 Binary files /dev/null and b/imgs/.DS_Store differ diff --git a/imgs/demo_cases.png b/imgs/demo_cases.png new file mode 100644 index 0000000000000000000000000000000000000000..ade55ef17d226244176febdaf15ec62b48ee0a20 --- /dev/null +++ b/imgs/demo_cases.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0517c97c947f8226f0f39b4ca2ac61b058e52faa59ec5085668062d0162dd21e +size 3416173 diff --git a/imgs/demo_cases/AI_Pioneers.jpg b/imgs/demo_cases/AI_Pioneers.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e77c07754b9e3a06b845b0a909e54348d71d9d41 Binary files /dev/null and b/imgs/demo_cases/AI_Pioneers.jpg differ diff --git a/imgs/demo_cases/edit.png b/imgs/demo_cases/edit.png new file mode 100644 index 0000000000000000000000000000000000000000..d3eded88294e81a8030fe427008ec6f739588a6c --- /dev/null +++ b/imgs/demo_cases/edit.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a83fc3b2ab185a93cb10d207a8776f3a04dc187739d87816cfb33f52d46af502 +size 1239640 diff --git a/imgs/demo_cases/entity.png b/imgs/demo_cases/entity.png new file mode 100644 index 0000000000000000000000000000000000000000..397dc628dc94b2557a94550e78e1470274330f71 --- /dev/null +++ b/imgs/demo_cases/entity.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5e18387fa43989515fd18dcb4ce8edeab0e32aa539d6c14ce374cb5790d8f64b +size 1284368 diff --git a/imgs/demo_cases/reasoning.png b/imgs/demo_cases/reasoning.png new file mode 100644 index 0000000000000000000000000000000000000000..cf842eecfbfb25ce2a242a8037f0ed2c9f814cd8 --- /dev/null +++ b/imgs/demo_cases/reasoning.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:eb510edcb5628c0def3871cef2e0351acc578a1ceef445ebbd72f8b6eb92fc9d +size 1243263 diff --git a/imgs/demo_cases/same_pose.png b/imgs/demo_cases/same_pose.png new file mode 100644 index 0000000000000000000000000000000000000000..2be56eedd6d401d122a260bb51357c45febc1a19 --- /dev/null +++ b/imgs/demo_cases/same_pose.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:beccbeabfc408f319661d9af1063005cbc21c977ba50b910491611ca3babd876 +size 1358837 diff --git a/imgs/demo_cases/skeletal.png b/imgs/demo_cases/skeletal.png new file mode 100644 index 0000000000000000000000000000000000000000..145fe152d3e08d3a4a237531c3f99bd72655e696 Binary files /dev/null and b/imgs/demo_cases/skeletal.png differ diff --git a/imgs/demo_cases/skeletal2img.png b/imgs/demo_cases/skeletal2img.png new file mode 100644 index 0000000000000000000000000000000000000000..b5524b2388b6b65810262a7e8a8154c4d0f74713 --- /dev/null +++ b/imgs/demo_cases/skeletal2img.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:86c21341018bb633f364d40afbf361b5e5690bf1e6539b99150e4aea0ed695b6 +size 1282673 diff --git a/imgs/demo_cases/t2i_woman_with_book.png b/imgs/demo_cases/t2i_woman_with_book.png new file mode 100644 index 0000000000000000000000000000000000000000..1d962e046ddef5c325d6dd5c81cb4bc4f58cf578 --- /dev/null +++ b/imgs/demo_cases/t2i_woman_with_book.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:624ae749478b4ced358c6482385fd35271cbfe25eea0581d2a323bffebde8b39 +size 1247523 diff --git a/imgs/overall.jpg b/imgs/overall.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..bd93b233deb28790a95bd3dd02568b85df0d15b6 --- /dev/null +++ b/imgs/overall.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ffa229632ac0bb248eee87cf823a0dc18c22c0a81a57d4c639e7fb1986d4e029 +size 1127850 diff --git a/imgs/referring.png b/imgs/referring.png new file mode 100644 index 0000000000000000000000000000000000000000..feee122bc0c64c8faeade4d7450fd0abd3123c45 --- /dev/null +++ b/imgs/referring.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a626b45c09ca3fa0ff78149bc63f0dc6bc1153125ca083ab21ae2dea9ad798cf +size 4087187 diff --git a/imgs/test_cases/1.jpg b/imgs/test_cases/1.jpg new file mode 100644 index 0000000000000000000000000000000000000000..87b8abe7751eabb04c0a76a82d9f32b5d8cdec0d --- /dev/null +++ b/imgs/test_cases/1.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d2dad7a81a5c609d136fbcccc2a71007c20474103d301ae5564fa63258b4a492 +size 1866247 diff --git a/imgs/test_cases/2.jpg b/imgs/test_cases/2.jpg new file mode 100644 index 0000000000000000000000000000000000000000..db416ca5bfa76b500f485eff7e4b61a37b1aefbd --- /dev/null +++ b/imgs/test_cases/2.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:919ec1a20515ce921d04a5a0f6dcbe5aa4288f41c04cc62a0bd59103957b45db +size 1149154 diff --git a/imgs/test_cases/3.jpg b/imgs/test_cases/3.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0d0bed469cd8db5300c9a8592d4ffc429c2cdaaf --- /dev/null +++ b/imgs/test_cases/3.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c8fef6b304efc3fc189991ec28b83bbe15c391af55b2bfd85276eb19d49194c9 +size 1202687 diff --git a/imgs/test_cases/4.jpg b/imgs/test_cases/4.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ca284d262eb85d115e0e0d7216af3cfc0762d68e --- /dev/null +++ b/imgs/test_cases/4.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:222e844198656a13facbf0f0afe327b074641a7f20d4120418fa1302e61db538 +size 1742551 diff --git a/imgs/test_cases/Amanda.jpg b/imgs/test_cases/Amanda.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ebcde3b59bfab39c9176305494b53c84281fc1f2 --- /dev/null +++ b/imgs/test_cases/Amanda.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c20a508b8619fca4d963f574bca51c7460f274218507c97c2853fa6eaea6d0cb +size 1654477 diff --git a/imgs/test_cases/control.jpg b/imgs/test_cases/control.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1c331c2c3e03d8f8abdd868a3d1c56a5ad623e09 --- /dev/null +++ b/imgs/test_cases/control.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5ca485995cb5f4b1b792e39a99e9647745291f92689eb40f1da925f19dfdc1b5 +size 1315895 diff --git a/imgs/test_cases/icl1.jpg b/imgs/test_cases/icl1.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fc7e56de341b8f60ba27670cd98633c0751aed37 Binary files /dev/null and b/imgs/test_cases/icl1.jpg differ diff --git a/imgs/test_cases/icl2.jpg b/imgs/test_cases/icl2.jpg new file mode 100644 index 0000000000000000000000000000000000000000..818bc93959fd666f27cae9eae3b473f7ed54367d Binary files /dev/null and b/imgs/test_cases/icl2.jpg differ diff --git a/imgs/test_cases/icl3.jpg b/imgs/test_cases/icl3.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a1d2ceaf079684962f0ba5f93f4201054126e464 Binary files /dev/null and b/imgs/test_cases/icl3.jpg differ diff --git 
a/imgs/test_cases/lecun.png b/imgs/test_cases/lecun.png new file mode 100644 index 0000000000000000000000000000000000000000..ca377e1d927272f017c169cba1e4b20440761bb9 Binary files /dev/null and b/imgs/test_cases/lecun.png differ diff --git a/imgs/test_cases/mckenna.jpg b/imgs/test_cases/mckenna.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6d56457a7fec4aad9e70984d048cf145ff7c2380 --- /dev/null +++ b/imgs/test_cases/mckenna.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bd20a5841f84114859e46c4000d9b8035a40378b5d40fbb2b559864813cd402f +size 1781438 diff --git a/imgs/test_cases/pose.png b/imgs/test_cases/pose.png new file mode 100644 index 0000000000000000000000000000000000000000..1805e1fad27afc6c343602cb40d30c6c3b8e9d14 Binary files /dev/null and b/imgs/test_cases/pose.png differ diff --git a/imgs/test_cases/rose.jpg b/imgs/test_cases/rose.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5cbd8fa77cae238ced7eb7f6679f5a83600e8b85 Binary files /dev/null and b/imgs/test_cases/rose.jpg differ diff --git a/imgs/test_cases/trump.png b/imgs/test_cases/trump.png new file mode 100644 index 0000000000000000000000000000000000000000..d230f3f87ab93e3e445860916474f017261ef986 Binary files /dev/null and b/imgs/test_cases/trump.png differ diff --git a/imgs/test_cases/turing.png b/imgs/test_cases/turing.png new file mode 100644 index 0000000000000000000000000000000000000000..e77c07754b9e3a06b845b0a909e54348d71d9d41 Binary files /dev/null and b/imgs/test_cases/turing.png differ diff --git a/imgs/test_cases/two_man.jpg b/imgs/test_cases/two_man.jpg new file mode 100644 index 0000000000000000000000000000000000000000..927aa91a42a5349c4e55e3ab9657577f142cca5a --- /dev/null +++ b/imgs/test_cases/two_man.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7c940253b06b1b32d472375b445474571bd7838a01fcd069b2ae34fb3e5d8d1f +size 1276640 diff --git a/imgs/test_cases/vase.jpg b/imgs/test_cases/vase.jpg new file mode 100644 index 0000000000000000000000000000000000000000..92f4c64ac3bc00c8efe396a597d6ef5cd9f5d509 Binary files /dev/null and b/imgs/test_cases/vase.jpg differ diff --git a/imgs/test_cases/watch.jpg b/imgs/test_cases/watch.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d7512871848a86ffecee5ccd73cd3d034b2c8cfd Binary files /dev/null and b/imgs/test_cases/watch.jpg differ diff --git a/imgs/test_cases/woman.png b/imgs/test_cases/woman.png new file mode 100644 index 0000000000000000000000000000000000000000..bc7b7f853385b5f9aa2140867b4c32916d48ae8d --- /dev/null +++ b/imgs/test_cases/woman.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:911f596d617d89bde63ecb8aaf1fde71d1fa183d4ac0cdac770a9fc8627a3b50 +size 1475294 diff --git a/imgs/test_cases/yifei2.png b/imgs/test_cases/yifei2.png new file mode 100644 index 0000000000000000000000000000000000000000..8ea709f615ba8cc914d2a9ccddf40fbb91b1c2e8 Binary files /dev/null and b/imgs/test_cases/yifei2.png differ diff --git a/imgs/test_cases/young_musk.jpg b/imgs/test_cases/young_musk.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1dc358245715b66218952cdebad3eebc32f6a944 Binary files /dev/null and b/imgs/test_cases/young_musk.jpg differ diff --git a/imgs/test_cases/young_trump.jpeg b/imgs/test_cases/young_trump.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..7b1d0f3fab08e960089692e6012fa73dbfcfb906 Binary files /dev/null and b/imgs/test_cases/young_trump.jpeg differ diff 
--git a/imgs/test_cases/zhang.png b/imgs/test_cases/zhang.png new file mode 100644 index 0000000000000000000000000000000000000000..4bda3956f6b7c742392e2b97cfcfed8d6c963233 Binary files /dev/null and b/imgs/test_cases/zhang.png differ diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..e9fb9e28374592d8427dae3b173bea458166ad63 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,12 @@ +--extra-index-url https://download.pytorch.org/whl/cu121 +torch==2.3.1+cu121 +torchvision==0.18.1+cu121 +transformers==4.45.2 +datasets==2.20.0 +accelerate==0.26.1 +diffusers==0.30.3 +timm==0.9.16 +peft==0.9.0 +safetensors==0.4.5 +gradio==5.4.0 +numpy==1.26.3 \ No newline at end of file diff --git a/toy_data/images/2.png b/toy_data/images/2.png new file mode 100644 index 0000000000000000000000000000000000000000..6997e98acfb412c67c19522d2f9866041e3f3332 Binary files /dev/null and b/toy_data/images/2.png differ diff --git a/toy_data/images/3.png b/toy_data/images/3.png new file mode 100644 index 0000000000000000000000000000000000000000..5c8fc62cce4134b0c2693e43822b2c602314f1ff Binary files /dev/null and b/toy_data/images/3.png differ diff --git a/toy_data/images/cat.png b/toy_data/images/cat.png new file mode 100644 index 0000000000000000000000000000000000000000..6d0fd20574bd8a102f5da32aabe4e880f753af59 --- /dev/null +++ b/toy_data/images/cat.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:94dfd913047ea5efe4b34bd413253f2d0c9b5a0f3afc00e932fe6572b455d77f +size 9312246 diff --git a/toy_data/images/clothes.png b/toy_data/images/clothes.png new file mode 100644 index 0000000000000000000000000000000000000000..38403de201a48957582222eb72f2fb5f4e5de3ed Binary files /dev/null and b/toy_data/images/clothes.png differ diff --git a/toy_data/images/dog1.jpeg b/toy_data/images/dog1.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..e839e5ece93db467631a9bf740358a9239ac6a5e Binary files /dev/null and b/toy_data/images/dog1.jpeg differ diff --git a/toy_data/images/dog2.jpeg b/toy_data/images/dog2.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..e48372be0defe9d68d9f29617e043cdc72e3393e --- /dev/null +++ b/toy_data/images/dog2.jpeg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9d8013d9efa2edb356e0f88c66de044f71247a99cab52b1628e753c2a08bb602 +size 1186464 diff --git a/toy_data/images/dog3.jpeg b/toy_data/images/dog3.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..ad02062741e5243f22ebc47602bb0a7828a86cc6 --- /dev/null +++ b/toy_data/images/dog3.jpeg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5c9805758a8f8950a35df820f3bfc32b3c6ca2a0e0e214a7978ea147a233bd54 +size 1167042 diff --git a/toy_data/images/dog4.jpeg b/toy_data/images/dog4.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..ad02062741e5243f22ebc47602bb0a7828a86cc6 --- /dev/null +++ b/toy_data/images/dog4.jpeg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5c9805758a8f8950a35df820f3bfc32b3c6ca2a0e0e214a7978ea147a233bd54 +size 1167042 diff --git a/toy_data/images/dog5.jpeg b/toy_data/images/dog5.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..bd97aa8fc4311f9dd268b63c528f6f8e6dcc99cb --- /dev/null +++ b/toy_data/images/dog5.jpeg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a65d3a853b7c65dd4d394cb6b209f77666351d2bae7c6670c5677d8eb5981644 +size 1163467 diff --git 
a/toy_data/images/edit_source_1.png b/toy_data/images/edit_source_1.png new file mode 100644 index 0000000000000000000000000000000000000000..191fadb634c23444288d892fb04eaa7a7ae0ba8b Binary files /dev/null and b/toy_data/images/edit_source_1.png differ diff --git a/toy_data/images/edit_target_1.png b/toy_data/images/edit_target_1.png new file mode 100644 index 0000000000000000000000000000000000000000..ffde9946310d4330d4ca3e67f543e6d1eaaf9c99 Binary files /dev/null and b/toy_data/images/edit_target_1.png differ diff --git a/toy_data/images/human_pose.png b/toy_data/images/human_pose.png new file mode 100644 index 0000000000000000000000000000000000000000..8e5ecdb10e028badaa40e1ba172c1e4365e7286e Binary files /dev/null and b/toy_data/images/human_pose.png differ diff --git a/toy_data/images/model.png b/toy_data/images/model.png new file mode 100644 index 0000000000000000000000000000000000000000..d23e110423f6250e3bebe93bf9b9d9106f4c1b39 Binary files /dev/null and b/toy_data/images/model.png differ diff --git a/toy_data/images/pose.png b/toy_data/images/pose.png new file mode 100644 index 0000000000000000000000000000000000000000..61a1ec210f491a8c2aba85b3e8eb62e065432ceb Binary files /dev/null and b/toy_data/images/pose.png differ diff --git a/toy_data/images/seg_input.png b/toy_data/images/seg_input.png new file mode 100644 index 0000000000000000000000000000000000000000..a3dc7dae8f2c90aebb3397f21209ee6d094ba53b Binary files /dev/null and b/toy_data/images/seg_input.png differ diff --git a/toy_data/images/seg_output.png b/toy_data/images/seg_output.png new file mode 100644 index 0000000000000000000000000000000000000000..df04c554c51d75d76338f636b15c1020b28ef2b0 Binary files /dev/null and b/toy_data/images/seg_output.png differ diff --git a/toy_data/images/subject_source_1.png b/toy_data/images/subject_source_1.png new file mode 100644 index 0000000000000000000000000000000000000000..6293a8816e102dae18376c79def2a4ce95808ed6 Binary files /dev/null and b/toy_data/images/subject_source_1.png differ diff --git a/toy_data/images/try_on.png b/toy_data/images/try_on.png new file mode 100644 index 0000000000000000000000000000000000000000..00e3f0147d24f299234c2108491a3b2da7c96417 Binary files /dev/null and b/toy_data/images/try_on.png differ diff --git a/toy_data/images/walking.png b/toy_data/images/walking.png new file mode 100644 index 0000000000000000000000000000000000000000..22895ddff064aa5b79fb1e83868b94e200a776fb --- /dev/null +++ b/toy_data/images/walking.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c618081c9cab71a29d78487491fa8b5986218f695cdccc7bd4b4d1529b3774b8 +size 4074422 diff --git a/toy_data/toy_data.jsonl b/toy_data/toy_data.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..9e1930efd1792686a39a01066772936cd6c363b1 --- /dev/null +++ b/toy_data/toy_data.jsonl @@ -0,0 +1,11 @@ +{"task_type":"text_to_iamge","instruction":"A white cat resting on a picnic table.","input_images":[],"output_image":"cat.png"} +{"task_type":"text_to_iamge","instruction":"a person walking on a suspension bridge.","input_images":[],"output_image":"walking.png"} +{"task_type":"image_edit","instruction":"<|image_1|> The umbrella should be red.","input_images":["edit_source_1.png"],"output_image":"edit_target_1.png"} +{"task_type":"segementation","instruction":"Find lamp in the picture <|image_1|> and color them blue.","input_images":["seg_input.png"],"output_image":"seg_output.png"} +{"task_type":"try-on","instruction":"<|image_1|> wears 
<|image_2|>.","input_images":["model.png","clothes.png"],"output_image":"try_on.png"} +{"task_type":"pose", "instruction": "Detect the skeleton of human in <|image_1|>", "input_images": ["human_pose.png"], "output_image": "pose.png"} +{"task_type":"text_to_iamge","instruction":"A white cat resting on a picnic table.","input_images":[],"output_image":"cat.png"} +{"task_type":"text_to_iamge","instruction":"a person walking on a suspension bridge.","input_images":[],"output_image":"walking.png"} +{"task_type":"image_edit","instruction":"<|image_1|> The umbrella should be red.","input_images":["edit_source_1.png"],"output_image":"edit_target_1.png"} +{"task_type":"segementation","instruction":"Find lamp in the picture <|image_1|> and color them blue.","input_images":["seg_input.png"],"output_image":"seg_output.png"} +{"task_type":"try-on","instruction":"<|image_1|> wears <|image_2|>.","input_images":["model.png","clothes.png"],"output_image":"try_on.png"} \ No newline at end of file diff --git a/toy_data/toy_subject_data.jsonl b/toy_data/toy_subject_data.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..9e7168002f300ff2c6e836354ba6bf8bc20ec3b0 --- /dev/null +++ b/toy_data/toy_subject_data.jsonl @@ -0,0 +1,5 @@ +{"task_type":"text_to_iamge","instruction":"a photo of sks dog","input_images":[],"output_image":"dog1.jpeg"} +{"task_type":"text_to_iamge","instruction":"sks dog","input_images":[],"output_image":"dog2.jpeg"} +{"task_type":"text_to_iamge","instruction":"a photo of sks dog. The background is orange.","input_images":[],"output_image":"dog3.jpeg"} +{"task_type":"text_to_iamge","instruction":"a photo of sks dog","input_images":[],"output_image":"dog4.jpeg"} +{"task_type":"text_to_iamge","instruction":"a photo of sks dog","input_images":[],"output_image":"dog5.jpeg"} \ No newline at end of file