CUDA out of memory for flux1-dev-fp8.safetensors

#30
by pritam-tam - opened

I have an NVIDIA Tesla T4 GPU. I have downloaded the fp8 safetensors file and the FLUX-dev model locally. This is the code that loads the model:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

bfl_repo = "app/utilities/flux_model"
dtype = torch.bfloat16

transformer = FluxTransformer2DModel.from_single_file(
    f"{bfl_repo}/flux1-dev-fp8.safetensors",
    torch_dtype=dtype
).to("cuda")

pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=transformer,
    torch_dtype=dtype
)
pipe.enable_sequential_cpu_offload()

When I run the code, I get this error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 15.57 GiB of 
which 57.38 MiB is free. Including non-PyTorch memory, this process has 15.51 GiB memory in use. Of the allocated memory 
15.40 GiB is allocated by PyTorch, and 16.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory 
is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

That code casts the weights back to bf16. The dtype should be torch.float8_e4m3fn, not torch.bfloat16.
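To see why the cast matters on a T4, here is a rough back-of-the-envelope estimate of the transformer weights alone. It is only a sketch: the parameter count of about 12 billion is an assumption taken from the FLUX.1-dev model card, the helper name `weight_gib` is made up for illustration, and the estimate ignores activations, the text encoders, and the VAE.

```python
GIB = 1024 ** 3


def weight_gib(num_params, bytes_per_param):
    """Return the size of the raw weights in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: roughly 12 billion parameters in the FLUX.1-dev transformer.
FLUX_TRANSFORMER_PARAMS = 12e9
# Total capacity reported in the error message above.
T4_CAPACITY_GIB = 15.57

for name, nbytes in [("bfloat16", 2), ("float8_e4m3fn", 1)]:
    size = weight_gib(FLUX_TRANSFORMER_PARAMS, nbytes)
    verdict = "fits" if size < T4_CAPACITY_GIB else "does not fit"
    print(f"{name}: about {size:.1f} GiB of weights, {verdict} in {T4_CAPACITY_GIB} GiB")
```

Under that assumption, the bf16 weights come to roughly 22 GiB, which is more than the card holds, while the fp8 weights come to roughly 11 GiB. Calling .to("cuda") on the bf16 transformer therefore runs out of memory before any offloading is enabled.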

I have CUDA 12.6, and my PyTorch dependencies are:

torch = "^2.4.1"
torchvision = "^0.19.1"
torchaudio = "^2.4.1"

I am getting this error:

TypeError: couldn't find storage object Float8_e4m3fnStorage
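A caret constraint such as "^2.4.1" allows any 2.x release at or above 2.4.1, so it may help to check what is actually installed before digging into the error. The following is a minimal diagnostic sketch, not a fix. The function name `describe_torch_env` is made up for illustration, and it only reports facts about the local install.

```python
import importlib.util


def describe_torch_env():
    """Collect version and float8 facts about the local PyTorch install, if any."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch

    info["torch_version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["has_float8_e4m3fn"] = hasattr(torch, "float8_e4m3fn")
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) tuple, e.g. (7, 5) for a T4
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Posting this output alongside the traceback makes it easier to tell whether the float8 dtype is missing from the installed build or whether the failure comes from somewhere else in the loading path.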

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

dtype = torch.float8_e4m3fn

transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors",
dtype=dtype).to("cuda")
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
transformer=transformer,
token="<REDACTED>",
torch_dtype=dtype
).to("cuda")

crashes with
OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 50.81 MiB is free. Process 591371 has 39.51 GiB memory in use. Of the allocated memory 39.10 GiB is allocated by PyTorch, and 5.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
