Phi-3 Vision visual encoder

#2 · opened by the-future-dev

Was a paper published about this vision model?
Which visual encoder was used?

It looks like a CLIP encoder.

Microsoft org • edited May 21

We use CLIP-L, the paper will be released later today.
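
For reference, the encoder named here corresponds to the public CLIP ViT-L/14-336 checkpoint. A minimal sketch of running that checkpoint on its own (not Phi-3-Vision's own forward pass) to see the token layout it produces:

```python
# Sketch: run the public CLIP-L/14-336 encoder on a dummy image.
# This is the standalone OpenAI checkpoint, not Phi-3-Vision's pipeline.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(name)
encoder = CLIPVisionModel.from_pretrained(name)

image = Image.new("RGB", (1344, 1344))  # dummy image at the stated maximum size
pixels = processor(images=image, return_tensors="pt").pixel_values  # resized/cropped to 336x336

with torch.no_grad():
    out = encoder(pixel_values=pixels)

print(out.last_hidden_state.shape)  # torch.Size([1, 577, 1024]): CLS + 24*24 patches, hidden dim 1024
```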

What is the resolution of the image input?

Microsoft org

The resolution is dynamic, based on the input image's aspect ratio. The maximum resolution is 1344x1344.
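
As a rough illustration of that answer: 1344 is exactly 4 × 336, the CLIP-L/14-336 input size, so the dynamic resolution plausibly corresponds to up to a 4×4 grid of 336-pixel crops. The `crop_grid` rule below is an assumption made for illustration, not the released preprocessing code:

```python
# Hedged sketch of the tiling arithmetic implied by the answer above.
import math

CLIP_CROP = 336   # CLIP-L/14-336 input size
MAX_SIDE = 1344   # stated maximum resolution (4 * 336)

def crop_grid(width: int, height: int) -> tuple[int, int]:
    """Hypothetical rule: how many 336px crops tile the image, capped at 1344px per side."""
    w, h = min(width, MAX_SIDE), min(height, MAX_SIDE)
    return math.ceil(w / CLIP_CROP), math.ceil(h / CLIP_CROP)

print(crop_grid(672, 336))    # (2, 1) -> two crops for a wide 672x336 image
print(crop_grid(4000, 3000))  # (4, 4) -> clamped to the 1344x1344 maximum
```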


Where is the paper? Please share the link.


Is the visual encoder frozen during training?


Are you going to release the paper and the fine-tuning code?

```
'img_processor': {'image_dim_out': 1024, 'model_name': 'openai/clip-vit-large-patch14-336', 'name': 'clip_vision_model', 'num_img_tokens': 144}
```
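
A hedged cross-check of those numbers against the public CLIP-L/14-336 config: `image_dim_out` matches the ViT-L hidden size, and `num_img_tokens` = 144 is consistent with a 4× reduction of the 24×24 patch grid (the exact token-reduction step is an assumption here; the technical report describes the actual design):

```python
# Cross-check the img_processor values above against the public CLIP-L/14-336 config.
from transformers import CLIPVisionConfig

cfg = CLIPVisionConfig.from_pretrained("openai/clip-vit-large-patch14-336")

patches_per_side = cfg.image_size // cfg.patch_size  # 336 // 14 = 24
print(cfg.hidden_size)             # 1024 == image_dim_out
print(patches_per_side ** 2)       # 576 patch tokens per 336x336 crop
print(patches_per_side ** 2 // 4)  # 144 == num_img_tokens (assuming a 2x2 merge of patches)
```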

Please share the paper URL here or in the model card.

Microsoft org

The updated Phi-3 Technical Report is available at https://arxiv.org/pdf/2404.14219

nguyenbh changed discussion status to closed
