Image resolutions that will work well?
Thank you for all the hard work that went into creating this model and providing it to the community!
The model card could be improved by making it clear what resolutions your model supports/will perform well with/was trained on. This is the most basic information for a vision LLM: what inputs will work (well) with it? For some reason almost everyone releasing vision LLMs makes this very hard to figure out.
I'm guessing it is like your 2.0, up to 12 tiles of 448x448 pixels? Some things that weren't clear to me with that were:
-What if one of the dimensions of your image isn't divisible by 448?
-What if your image would require more than 12 tiles?
-If inputs violating those constraints aren't outright rejected, what happens? (e.g., do the tiles overlap, or is the image resized or cropped?) Is the model trained on such images?
Thanks again!
Thank you for your kind words and valuable feedback! We appreciate your suggestion to clarify supported resolutions in the model card. Here's the detailed information:
- If one of the dimensions of your image isn't divisible by 448, the image will be resized to the nearest dimensions divisible by 448, which might introduce some slight distortion.
- You can control the resolution and tiling behavior using the `max_num` parameter. By default, we set `max_num=12`, but you can adjust this to 18 or 24 tiles to process higher-resolution images.
- If an input violates these constraints (e.g., exceeds the maximum number of tiles), the model may resize or crop the image to fit within the supported tiling limits. The model has been trained on such cases to ensure robust performance.
Additionally, you can refer to the `dynamic_preprocess` function in the README for more details on how preprocessing is handled dynamically.
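
To make the resizing rule above more concrete, here is a minimal sketch of how a target resolution could be chosen for a given image and `max_num`. It is not the README's `dynamic_preprocess`. The helper names `pick_tile_grid` and `target_size` are hypothetical, and both the closest-aspect-ratio selection and the tie-break by pixel area are assumptions made for illustration.

```python
def pick_tile_grid(width, height, tile=448, max_num=12):
    """Pick a (cols, rows) tile grid for an image (illustrative only).

    Assumption: choose the grid whose aspect ratio is closest to the
    image's, using at most `max_num` tiles. Ties are broken by picking
    the grid whose pixel area is closest to the original image area.
    """
    aspect = width / height
    area = width * height
    # Every grid shape that fits within the tile budget.
    candidates = [
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if cols * rows <= max_num
    ]

    def score(grid):
        cols, rows = grid
        ratio_error = abs(aspect - cols / rows)
        area_error = abs(cols * rows * tile * tile - area)
        return (ratio_error, area_error)

    return min(candidates, key=score)


def target_size(width, height, tile=448, max_num=12):
    """Return the (width, height) the image would be resized to."""
    cols, rows = pick_tile_grid(width, height, tile, max_num)
    return cols * tile, rows * tile


if __name__ == "__main__":
    for size in [(448, 448), (1000, 1000), (1920, 1080)]:
        print(size, "->", target_size(*size))
    # A larger tile budget allows a grid closer to 16:9.
    print((1920, 1080), "max_num=24 ->", target_size(1920, 1080, max_num=24))
```

Under these assumptions, a 1920x1080 image maps to a 4x2 grid (1792x896) with `max_num=12`, which changes the aspect ratio slightly; this is the kind of distortion mentioned above. Raising `max_num` to 24 allows a 5x3 grid (2240x1344), which is closer to the original 16:9 shape.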