---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
- Qwen/Qwen2-VL-7B-Instruct
library_name: diffusers
tags:
- flux
- qwen2vl
- stable-diffusion
- text-to-image
- image-to-image
- controlnet
pipeline_tag: text-to-image
inference: false
---
# Qwen2vl-Flux
<div align="center">
<img src="landing-1.png" alt="Qwen2vl-Flux Banner" width="100%">
</div>
Qwen2vl-Flux is a multimodal image generation model that augments FLUX with Qwen2VL's vision-language understanding. It generates high-quality images conditioned on text prompts, visual references, or both, which gives finer control over the output than text conditioning alone.
## Model Architecture
<div align="center">
<img src="flux-architecture.svg" alt="Flux Architecture" width="800px">
</div>
The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components include:
- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX backbone
- Multi-mode Generation Pipeline
- Structural Control Integration
## Features
- **Enhanced Vision-Language Understanding**: Leverages Qwen2VL for superior multimodal comprehension
- **Multiple Generation Modes**: Supports variation, img2img, inpainting, and controlnet-guided generation
- **Structural Control**: Integrates depth estimation and line detection for precise structural guidance
- **Flexible Attention Mechanism**: Supports focused generation with spatial attention control
- **High-Resolution Output**: Supports multiple aspect ratios at resolutions up to 1536 pixels on the long side (see Supported Image Sizes under Technical Specifications)
## Generation Examples
### Image Variation
Create diverse variations while maintaining the essence of the original image:
<div align="center">
<table>
<tr>
<td><img src="variation_1.png" alt="Variation Example 1" width="256px"></td>
<td><img src="variation_2.png" alt="Variation Example 2" width="256px"></td>
<td><img src="variation_3.png" alt="Variation Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="variation_4.png" alt="Variation Example 4" width="256px"></td>
<td><img src="variation_5.png" alt="Variation Example 5" width="256px"></td>
</tr>
</table>
</div>
### Image Blending
Seamlessly blend multiple images with intelligent style transfer:
<div align="center">
<table>
<tr>
<td><img src="blend_1.png" alt="Blend Example 1" width="256px"></td>
<td><img src="blend_2.png" alt="Blend Example 2" width="256px"></td>
<td><img src="blend_3.png" alt="Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="blend_4.png" alt="Blend Example 4" width="256px"></td>
<td><img src="blend_5.png" alt="Blend Example 5" width="256px"></td>
<td><img src="blend_6.png" alt="Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="blend_7.png" alt="Blend Example 7" width="256px"></td>
</tr>
</table>
</div>
### Text-Guided Image Blending
Control image generation with textual prompts:
<div align="center">
<table>
<tr>
<td><img src="textblend_1.png" alt="Text Blend Example 1" width="256px"></td>
<td><img src="textblend_2.png" alt="Text Blend Example 2" width="256px"></td>
<td><img src="textblend_3.png" alt="Text Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_4.png" alt="Text Blend Example 4" width="256px"></td>
<td><img src="textblend_5.png" alt="Text Blend Example 5" width="256px"></td>
<td><img src="textblend_6.png" alt="Text Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_7.png" alt="Text Blend Example 7" width="256px"></td>
<td><img src="textblend_8.png" alt="Text Blend Example 8" width="256px"></td>
<td><img src="textblend_9.png" alt="Text Blend Example 9" width="256px"></td>
</tr>
</table>
</div>
### Grid-Based Style Transfer
Apply fine-grained style control with grid attention:
<div align="center">
<table>
<tr>
<td><img src="griddot_1.png" alt="Grid Example 1" width="256px"></td>
<td><img src="griddot_2.png" alt="Grid Example 2" width="256px"></td>
<td><img src="griddot_3.png" alt="Grid Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_4.png" alt="Grid Example 4" width="256px"></td>
<td><img src="griddot_5.png" alt="Grid Example 5" width="256px"></td>
<td><img src="griddot_6.png" alt="Grid Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_7.png" alt="Grid Example 7" width="256px"></td>
<td><img src="griddot_8.png" alt="Grid Example 8" width="256px"></td>
<td><img src="griddot_9.png" alt="Grid Example 9" width="256px"></td>
</tr>
</table>
</div>
## Usage
The inference code is available in our [GitHub repository](https://github.com/erwold/qwen2vl-flux), which provides Python interfaces and usage examples.
### Installation
1. Clone the repository and install dependencies:
```bash
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
```
2. Download model checkpoints from Hugging Face:
```python
from huggingface_hub import snapshot_download
snapshot_download("Djrango/Qwen2vl-Flux")
```
### Basic Examples
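```python
from PIL import Image

from model import FluxModel

# Load inputs as RGB PIL images. The paths are placeholders, and this assumes
# generate() accepts PIL images; check the repository examples for the exact type.
input_image = Image.open("input.png").convert("RGB")
source_image = Image.open("source.png").convert("RGB")
reference_image = Image.open("reference.png").convert("RGB")
content_image = Image.open("content.png").convert("RGB")
style_image = Image.open("style.png").convert("RGB")

# Initialize model
model = FluxModel(device="cuda")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)

# Image Blending
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8
)

# Text-Guided Blending
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5
)

# Grid-Based Style Transfer
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True
)
```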
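
Because the Technical Specifications list 48GB+ of VRAM, it can be useful to check the GPU before initializing the model. The following is a minimal sketch, not part of the repository: `has_enough_vram` and `check_gpu` are hypothetical helper names, and the threshold is interpreted loosely as 48 decimal gigabytes, since cards sold as 48GB typically report slightly less than 48 GiB.

```python
REQUIRED_VRAM_GB = 48  # taken from the Technical Specifications section


def has_enough_vram(total_bytes: int, required_gb: float = REQUIRED_VRAM_GB) -> bool:
    """Return True if total_bytes is at least required_gb decimal gigabytes."""
    return total_bytes >= required_gb * 10**9


def check_gpu(device_index: int = 0) -> bool:
    """Check the given CUDA device; returns False if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    total = torch.cuda.get_device_properties(device_index).total_memory
    return has_enough_vram(total)


if __name__ == "__main__":
    if check_gpu():
        print("GPU memory looks sufficient for the listed requirement.")
    else:
        print("No CUDA device with enough memory was found.")
```

This only compares total device memory with the documented figure. It does not account for memory already used by other processes.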
## Technical Specifications
- **Framework**: PyTorch 2.4.1+
- **Base Models**:
  - FLUX.1-dev
  - Qwen2-VL-7B-Instruct
- **Memory Requirements**: 48GB+ VRAM
- **Supported Image Sizes** (width x height; the aspect ratios in parentheses are approximate):
  - 1024x1024 (1:1)
  - 1344x768 (16:9)
  - 768x1344 (9:16)
  - 1536x640 (2.4:1)
  - 896x1152 (3:4)
  - 1152x896 (4:3)
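
When an input image does not match one of the sizes listed above, one option is to pick the supported size with the closest aspect ratio before resizing. The sketch below is illustrative only: `nearest_supported_size` is a hypothetical helper, not a function from the repository, and whether the pipeline already resizes inputs internally is not stated here.

```python
import math

# Sizes copied from the Supported Image Sizes list, as (width, height).
SUPPORTED_SIZES = [
    (1024, 1024),
    (1344, 768),
    (768, 1344),
    (1536, 640),
    (896, 1152),
    (1152, 896),
]


def nearest_supported_size(width: int, height: int) -> tuple[int, int]:
    """Return the supported size whose aspect ratio is closest to width/height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Compare ratios in log space so that 2:1 and 1:2 are equally far from 1:1.
    target = math.log(width / height)
    return min(
        SUPPORTED_SIZES,
        key=lambda size: abs(math.log(size[0] / size[1]) - target),
    )


if __name__ == "__main__":
    # Example: find a target size for a 1920x1080 input.
    print(nearest_supported_size(1920, 1080))
```

Comparing ratios in log space treats landscape and portrait deviations symmetrically. A plain difference of ratios would favor portrait sizes.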
## Citation
```bibtex
@misc{erwold-2024-qwen2vl-flux,
title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation},
author={Pengqi Lu},
year={2024},
url={https://github.com/erwold/qwen2vl-flux}
}
```
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
## Acknowledgments
- Based on the FLUX architecture
- Integrates Qwen2VL for vision-language understanding
- Thanks to the open-source communities of FLUX and Qwen