---
license: mit
language:
  - en
base_model:
  - black-forest-labs/FLUX.1-dev
  - Qwen/Qwen2-VL-7B-Instruct
library_name: diffusers
tags:
  - flux
  - qwen2vl
  - stable-diffusion
  - text-to-image
  - image-to-image
  - controlnet
pipeline_tag: text-to-image
inference: false
---

# Qwen2vl-Flux

<div align="center">
  <img src="landing-1.png" alt="Qwen2vl-Flux Banner" width="100%">
</div>

Qwen2vl-Flux is a multimodal image generation model that enhances FLUX with Qwen2VL's vision-language understanding. It generates high-quality images from text prompts, visual references, or both, giving finer-grained multimodal control than text-only conditioning.

## Model Architecture

<div align="center">
  <img src="flux-architecture.svg" alt="Flux Architecture" width="800px">
</div>

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components, sketched in pseudocode below, include:
- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX backbone
- Multi-mode Generation Pipeline
- Structural Control Integration
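
At a high level, the generator conditions the FLUX denoiser on Qwen2VL's fused image-text embeddings instead of text embeddings alone. The following pseudocode is a conceptual sketch of that flow; names such as `encode_multimodal`, `structural_hints`, and `flux_backbone` are illustrative placeholders, not the repository's actual API.

```python
# Conceptual sketch of the Qwen2vl-Flux conditioning flow.
# All names here (qwen2vl.encode_multimodal, structural_hints,
# flux_backbone, vae) are illustrative placeholders, not the repo's API.

def generate_sketch(image_ref, prompt, mode="variation"):
    # 1. Qwen2VL encodes the reference image and the prompt into a
    #    single sequence of vision-language embeddings.
    cond = qwen2vl.encode_multimodal(images=[image_ref], text=prompt)

    # 2. Optional structural control: depth / line maps extracted from
    #    the reference constrain the layout (controlnet mode).
    hints = structural_hints(image_ref) if mode == "controlnet" else None

    # 3. The FLUX backbone denoises latents conditioned on the
    #    multimodal embeddings (and structural hints, if any).
    latents = flux_backbone.denoise(conditioning=cond, control=hints)
    return vae.decode(latents)
```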

## Features

- **Enhanced Vision-Language Understanding**: Leverages Qwen2VL for superior multimodal comprehension
- **Multiple Generation Modes**: Supports variation, img2img, inpainting, and controlnet-guided generation
- **Structural Control**: Integrates depth estimation and line detection for precise structural guidance
- **Flexible Attention Mechanism**: Supports focused generation with spatial attention control
- **High-Resolution Output**: Supports six aspect ratios at roughly one-megapixel resolution, from 1024x1024 up to 1536x640 (see Technical Specifications)

## Generation Examples

### Image Variation
Create diverse variations while maintaining the essence of the original image:

<div align="center">
  <table>
    <tr>
      <td><img src="variation_1.png" alt="Variation Example 1" width="256px"></td>
      <td><img src="variation_2.png" alt="Variation Example 2" width="256px"></td>
      <td><img src="variation_3.png" alt="Variation Example 3" width="256px"></td>
    </tr>
    <tr>
      <td><img src="variation_4.png" alt="Variation Example 4" width="256px"></td>
      <td><img src="variation_5.png" alt="Variation Example 5" width="256px"></td>
    </tr>
  </table>
</div>

### Image Blending
Seamlessly blend multiple images with intelligent style transfer:

<div align="center">
  <table>
    <tr>
      <td><img src="blend_1.png" alt="Blend Example 1" width="256px"></td>
      <td><img src="blend_2.png" alt="Blend Example 2" width="256px"></td>
      <td><img src="blend_3.png" alt="Blend Example 3" width="256px"></td>
    </tr>
    <tr>
      <td><img src="blend_4.png" alt="Blend Example 4" width="256px"></td>
      <td><img src="blend_5.png" alt="Blend Example 5" width="256px"></td>
      <td><img src="blend_6.png" alt="Blend Example 6" width="256px"></td>
    </tr>
    <tr>
      <td><img src="blend_7.png" alt="Blend Example 7" width="256px"></td>
    </tr>
  </table>
</div>

### Text-Guided Image Blending
Control image generation with textual prompts:

<div align="center">
  <table>
    <tr>
      <td><img src="textblend_1.png" alt="Text Blend Example 1" width="256px"></td>
      <td><img src="textblend_2.png" alt="Text Blend Example 2" width="256px"></td>
      <td><img src="textblend_3.png" alt="Text Blend Example 3" width="256px"></td>
    </tr>
    <tr>
      <td><img src="textblend_4.png" alt="Text Blend Example 4" width="256px"></td>
      <td><img src="textblend_5.png" alt="Text Blend Example 5" width="256px"></td>
      <td><img src="textblend_6.png" alt="Text Blend Example 6" width="256px"></td>
    </tr>
    <tr>
      <td><img src="textblend_7.png" alt="Text Blend Example 7" width="256px"></td>
      <td><img src="textblend_8.png" alt="Text Blend Example 8" width="256px"></td>
      <td><img src="textblend_9.png" alt="Text Blend Example 9" width="256px"></td>
    </tr>
  </table>
</div>

### Grid-Based Style Transfer
Apply fine-grained style control with grid attention:

<div align="center">
  <table>
    <tr>
      <td><img src="griddot_1.png" alt="Grid Example 1" width="256px"></td>
      <td><img src="griddot_2.png" alt="Grid Example 2" width="256px"></td>
      <td><img src="griddot_3.png" alt="Grid Example 3" width="256px"></td>
    </tr>
    <tr>
      <td><img src="griddot_4.png" alt="Grid Example 4" width="256px"></td>
      <td><img src="griddot_5.png" alt="Grid Example 5" width="256px"></td>
      <td><img src="griddot_6.png" alt="Grid Example 6" width="256px"></td>
    </tr>
    <tr>
      <td><img src="griddot_7.png" alt="Grid Example 7" width="256px"></td>
      <td><img src="griddot_8.png" alt="Grid Example 8" width="256px"></td>
      <td><img src="griddot_9.png" alt="Grid Example 9" width="256px"></td>
    </tr>
  </table>
</div>

## Usage

The inference code is available in our [GitHub repository](https://github.com/erwold/qwen2vl-flux), which provides Python interfaces and examples.

### Installation

1. Clone the repository and install dependencies:
```bash
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
```

2. Download model checkpoints from Hugging Face:
```python
from huggingface_hub import snapshot_download

snapshot_download("Djrango/Qwen2vl-Flux")
```
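
If you prefer the weights in a specific directory (for example, next to the cloned repo), `snapshot_download` also accepts a `local_dir` argument; the path below is only an example:

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint into a chosen folder.
# "./checkpoints/qwen2vl-flux" is an example path, not a required location.
snapshot_download(
    repo_id="Djrango/Qwen2vl-Flux",
    local_dir="./checkpoints/qwen2vl-flux",
)
```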

### Basic Examples

```python
from model import FluxModel

# Initialize model
model = FluxModel(device="cuda")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)

# Image Blending
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8
)

# Text-Guided Blending
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5
)

# Grid-Based Style Transfer
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True
)
```
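
The attention and resolution features listed above under Features are reached through extra `generate` arguments. The parameter names below (`aspect_ratio`, `center_x`, `center_y`, `radius`) are assumptions inferred from the feature list; verify them against `model.py` in the repository before use.

```python
# Hedged sketch: focused generation with spatial attention control and a
# non-square output. The parameter names here are assumptions; check
# model.py for the exact signature in your checkout.
outputs = model.generate(
    input_image_a=input_image,
    prompt="A portrait in watercolor style",
    mode="variation",
    aspect_ratio="16:9",  # e.g. rendered at 1344x768
    center_x=0.5,         # normalized focus center (assumed convention)
    center_y=0.4,
    radius=0.3,           # assumed attention falloff radius
)
```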

## Technical Specifications

- **Framework**: PyTorch 2.4.1+
- **Base Models**: 
  - FLUX.1-dev
  - Qwen2-VL-7B-Instruct
- **Memory Requirements**: 48GB+ VRAM
- **Supported Image Sizes** (a size-snapping helper follows this list):
  - 1024x1024 (1:1)
  - 1344x768 (16:9)
  - 768x1344 (9:16)
  - 1536x640 (2.4:1)
  - 896x1152 (3:4)
  - 1152x896 (4:3)
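
As a convenience, an arbitrary target size can be snapped to the nearest supported resolution before generation. The helper below is not part of the repository; it simply encodes the list above:

```python
# Utility (not part of the repo): map a desired width/height to the
# closest supported resolution by aspect ratio.
SUPPORTED_SIZES = [
    (1024, 1024),  # 1:1
    (1344, 768),   # 16:9
    (768, 1344),   # 9:16
    (1536, 640),   # 2.4:1
    (896, 1152),   # 3:4
    (1152, 896),   # 4:3
]

def closest_supported_size(width: int, height: int) -> tuple[int, int]:
    target = width / height
    return min(SUPPORTED_SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

print(closest_supported_size(1920, 1080))  # -> (1344, 768)
```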


## Citation

```bibtex
@misc{erwold-2024-qwen2vl-flux,
      title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation}, 
      author={Pengqi Lu},
      year={2024},
      url={https://github.com/erwold/qwen2vl-flux}
}
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

- Based on the FLUX architecture
- Integrates Qwen2VL for vision-language understanding
- Thanks to the open-source communities of FLUX and Qwen