Efficient inference

#10 opened by cuongnguyenxuan

I fine-tuned Florence-2-base on my task. I can run inference with the fine-tuned model on both CPU and GPU without flash_attn, but both cases take more than 3 GB of memory. Is this normal, and can I reduce memory usage at inference time? By the way, fine-tuning the large version required more than 20 GB.
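For example, would loading the weights in float16 and generating under `torch.inference_mode()` be the right way to cut memory? A minimal sketch of what I have in mind (assuming a CUDA GPU; the checkpoint name would be replaced by the path to my fine-tuned model, and the blank image is just a placeholder to exercise the pipeline):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load in float16 instead of the default float32, which should
# roughly halve the memory taken by the weights.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",  # placeholder: swap in the fine-tuned checkpoint path
    torch_dtype=torch.float16,
    trust_remote_code=True,       # Florence-2 ships custom modeling code
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

image = Image.new("RGB", (512, 512))  # placeholder image for illustration
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt").to(
    "cuda", torch.float16
)

# inference_mode() disables autograd bookkeeping, lowering peak memory.
with torch.inference_mode():
    ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=False)[0])
```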
