Quantization scripts

#1
by WaveCut - opened

Could you share the scripts you used to quantize both the transformer and text_encoder_2? I want to reproduce it with a different merged Flux checkpoint.

Thanks in advance!


This is the fastest code I have tried so far: 30 seconds to generate a 1024x1024 image on an RTX 3080. That's faster than SDXL and many times better quality. Pretty amazing, really. I think @HighCWu has something here. It could probably use some way to add LoRAs. GGUF support would be really awesome.

Could you share the scripts you used to quantize both the transformer and text_encoder_2? I want to reproduce it with a different merged Flux checkpoint.

Thanks in advance!

You can load any transformer with this; just re-use @HighCWu's text_encoder_2:

import torch
from diffusers import FluxPipeline

flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=None,
    text_encoder_2=None,
    torch_dtype=torch.bfloat16,
)

from model import T5EncoderModel  # HighCWu's class; better to run the non-quantized version of this if you can
text_encoder_2: T5EncoderModel = T5EncoderModel.from_pretrained(
    "HighCWu/FLUX.1-dev-4bit",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)
flux.text_encoder_2 = text_encoder_2

model_id = "your other flux model"  # <---------------- any flux model
from model import FluxTransformer2DModel  # HighCWu's transformer class
transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
flux.transformer = transformer
flux.enable_model_cpu_offload()
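For reference, a minimal generation call with the pipeline assembled above might look like this; the prompt, resolution, and step count are placeholders I chose, not values from the original post:

image = flux(
    "a photo of a cat wearing a space suit",  # placeholder prompt
    height=1024,
    width=1024,
    num_inference_steps=20,
    guidance_scale=3.5,
).images[0]
image.save("flux_out.png")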

OK, I got it working, thanks for the hints.

  1. You need to convert the fp8 model to diffusers format with https://github.com/huggingface/diffusers/blob/main/scripts/convert_flux_to_diffusers.py. This may require adding "model.diffusion_model" to each key before mapping. Make sure to save it in bf16; I tried fp8 formats and they are not compatible.
  2. You need to load it with this codebase, passing quantization_config (a BitsAndBytesConfig) to FluxTransformer2DModel.from_pretrained, as in the sketch after this list.
  3. Save the model with .save_pretrained.
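A rough sketch of steps 2 and 3, assuming HighCWu's FluxTransformer2DModel.from_pretrained forwards quantization_config the same way the stock transformers/diffusers loaders do, and that the converted checkpoint uses the standard diffusers folder layout; the paths are placeholders:

import torch
from transformers import BitsAndBytesConfig
from model import FluxTransformer2DModel  # HighCWu's class from https://github.com/HighCWu/flux-4bit

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Step 2: load the converted-to-diffusers checkpoint with 4-bit quantization.
transformer = FluxTransformer2DModel.from_pretrained(
    "path/to/converted_diffusers_checkpoint",  # placeholder: output of convert_flux_to_diffusers.py
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

# Step 3: save the quantized weights so they can be reloaded directly later.
transformer.save_pretrained("path/to/quantized_transformer")  # placeholder output folder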

I went the wrong way at first and tried quantizing with the official bitsandbytes branch (https://github.com/huggingface/diffusers/pull/9213); it is bugged and produces wrong layer shapes after saving.

For civitai models, you can do this after you download the file from civitai:

f = FluxPipeline.from_single_file(
    filepath_to_local_file,  # path to the downloaded .safetensors file
    scheduler=None,
    tokenizer=None,
    tokenizer_2=None,
    # transformer is the only component we actually load; everything else is set to None
    text_encoder=None,
    vae=None,
    text_encoder_2=None,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
)

f.save_pretrained("yournewfluxfolder/" + your_model_name)

This will save only the transformer, creating a transformer/ subfolder. Then load it by itself from that subfolder, just as you normally would:

transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    "yournewfluxfolder/" + your_model_name,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)

flux.transformer = transformer

I found the results are way better if you can run the non-quantized t5_xxl model (text_encoder_2); I was able to do this with my second 10 GB GPU. That said, running only HighCWu's 4-bit version still looks just as good as the full version in almost all cases, and it only takes about 20 seconds to generate a 1024x1024 image on my RTX 3080. Even the full version of t5_xxl is severely limited, so I don't expect much from it (especially for NSFW). Until someone trains a better t5 for this, we are stuck with it. I hear it's really hard to train certain content, and I am pretty sure that's because of the t5.
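If you have the VRAM (or a second GPU), loading the full-precision text encoder could look roughly like this; the device index and the use of the base FLUX.1-dev repo are my assumptions, and you would manage device placement yourself rather than rely on enable_model_cpu_offload():

import torch
from transformers import T5EncoderModel  # the stock transformers class, not HighCWu's 4-bit wrapper

# Full bf16 t5_xxl from the base repo, placed on a second GPU.
text_encoder_2_full = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
).to("cuda:1")

flux.text_encoder_2 = text_encoder_2_full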

Another thing I was able to do is img2img, using this as a refiner for SDXL models. The latents won't convert, but the PIL image will: you can run SDXL for n steps and then do 4-8 steps with Flux for the final image. I had to use the small HighCWu version of t5 for this, though, because img2img takes up more memory than my 3080 can handle. You don't really need the full t5 as much here anyway, since you're mostly going off the pregenerated image.

from diffusers import FluxImg2ImgPipeline

flux_img2img = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder_2=text_encoder_2_small,  # use HighCWu's text_encoder_2 for this
    transformer=transformer,  # whatever 4bit transformer you are using
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
)
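A rough sketch of the SDXL-then-Flux refiner flow described above; the SDXL checkpoint, prompt, strength, and step counts are placeholders I picked, not values from the original post:

import torch
from diffusers import StableDiffusionXLPipeline

# Base image from any SDXL checkpoint (placeholder model id).
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
sdxl.enable_model_cpu_offload()
base_image = sdxl("a lighthouse at dusk", num_inference_steps=25).images[0]

# Refine the PIL image with a handful of Flux img2img steps
# (latents do not transfer between the models, but the PIL image does).
flux_img2img.enable_model_cpu_offload()
refined = flux_img2img(
    prompt="a lighthouse at dusk",
    image=base_image,
    strength=0.3,            # roughly strength * num_inference_steps actual denoising steps
    num_inference_steps=20,  # about 6 effective steps at strength 0.3
).images[0]
refined.save("refined.png")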

@HighCWu @megachad I tried loading the text_encoder_2 (T5EncoderModel) using this piece of code.

text_encoder_2 = T5EncoderModel.from_pretrained(
    "HighCWu/FLUX.1-dev-4bit",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16
)

I get this error while running it on Kaggle. This seems related to the HQQ quant config; I tried writing the config and passing it to the T5EncoderModel class, but that didn't work either. I will share the error here.

UnboundLocalError                         Traceback (most recent call last)
Cell In[7], line 25
     12 transformer_nf4 = FluxTransformer2DModel.from_pretrained(
     13     bfl_repo,
     14     subfolder="transformer",
     15     quantization_config=nf4_config,
     16     torch_dtype=torch.bfloat16
     17 )
     19 # text_encoder_2_fp8 = T5EncoderModel.from_pretrained(
     20 #     bfl_repo,
     21 #     subfolder="text_encoder_2",
     22 #     torch_dtype=torch.bfloat16
     23 # )
---> 25 text_encoder_2 = T5EncoderModel.from_pretrained(
     26     "HighCWu/FLUX.1-dev-4bit",
     27     subfolder="text_encoder_2",
     28     torch_dtype=torch.bfloat16
     29 )

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:3647, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
   3645 if pre_quantized or quantization_config is not None:
   3646     if pre_quantized:
-> 3647         config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
   3648             config.quantization_config, quantization_config
   3649         )
   3650     else:
   3651         config.quantization_config = quantization_config

File /opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py:173, in AutoHfQuantizer.merge_quantization_configs(cls, quantization_config, quantization_config_from_args)
    170     warning_msg = ""
    172 if isinstance(quantization_config, dict):
--> 173     quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
    175 if (
    176     isinstance(quantization_config, (GPTQConfig, AwqConfig, FbgemmFp8Config))
    177     and quantization_config_from_args is not None
    178 ):
    179     # special case for GPTQ / AWQ / FbgemmFp8 config collision
    180     loading_attr_dict = quantization_config_from_args.get_loading_attributes()

File /opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py:103, in AutoQuantizationConfig.from_dict(cls, quantization_config_dict)
     97     raise ValueError(
     98         f"Unknown quantization type, got {quant_method} - supported types are:"
     99         f" {list(AUTO_QUANTIZER_MAPPING.keys())}"
    100     )
    102 target_cls = AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method]
--> 103 return target_cls.from_dict(quantization_config_dict)

File /opt/conda/lib/python3.10/site-packages/transformers/utils/quantization_config.py:269, in HqqConfig.from_dict(cls, config)
    264 @classmethod
    265 def from_dict(cls, config: Dict[str, Any]):
    266     """
    267     Override from_dict, used in AutoQuantizationConfig.from_dict in quantizers/auto.py
    268     """
--> 269     instance = cls()
    270     instance.quant_config = config["quant_config"]
    271     instance.skip_modules = config["skip_modules"]

File /opt/conda/lib/python3.10/site-packages/transformers/utils/quantization_config.py:244, in HqqConfig.__init__(self, nbits, group_size, view_as_float, axis, dynamic_config, skip_modules, **kwargs)
    242         self.quant_config[key] = HQQBaseQuantizeConfig(**dynamic_config[key])
    243 else:
--> 244     self.quant_config = HQQBaseQuantizeConfig(
    245         **{
    246             "nbits": nbits,
    247             "group_size": group_size,
    248             "view_as_float": view_as_float,
    249             "axis": axis,
    250         }
    251     )
    253 self.quant_method = QuantizationMethod.HQQ
    254 self.skip_modules = skip_modules

UnboundLocalError: local variable 'HQQBaseQuantizeConfig' referenced before assignment

Please guide me here. Thanks!!

You have to import HighCWu's model.py from GitHub:

from model import T5EncoderModel, FluxTransformer2DModel

https://github.com/HighCWu/flux-4bit
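For example, after cloning that repo so model.py is importable, the snippet from the earlier comment becomes something like:

# Assumes https://github.com/HighCWu/flux-4bit has been cloned and model.py is on the Python path.
import torch
from model import T5EncoderModel  # HighCWu's wrapper from the repo above, not the stock transformers class

text_encoder_2 = T5EncoderModel.from_pretrained(
    "HighCWu/FLUX.1-dev-4bit",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)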

It does not run on a Colab T4, ever. I did not try the compression (quantization) code; I only tried the inference code, and it did not work.

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Does this work on a Colab T4?
