GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing
Ruizhe Ou1 · Yuan Hu2,* · Fan Zhang2 · Jiaxin Chen1 · Yu Liu2,3
1Beijing University of Posts and Telecommunications · 2Peking University · 3Peking University Ordos Research Institute of Energy *corresponding author
GeoPix is a new state-of-the-art pixel-level multi-modal large language model (MLLM) for the remote sensing (RS) domain, supporting referring image segmentation and other tasks.
Release
- [2025.02.20] We release the pre-trained checkpoints, inference code, and Gradio demo! Github
- [2025.01.12] We release the paper.
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [Arxiv]
Abstract
In this work, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM’s segmentation token embeddings. For more details, please refer to the paper.
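To make the idea of "masks conditioned on a segmentation token embedding" more concrete, here is a minimal toy sketch. It is not the GeoPix mask predictor, whose actual architecture is described in the paper. It assumes a deliberately simplified scheme in which each pixel's feature vector is scored against the embedding with a dot product, and positive scores become mask pixels. The function name `predict_mask` and the toy shapes are hypothetical.

```python
def predict_mask(pixel_features, seg_embedding):
    """Toy mask prediction (illustrative assumption, not the GeoPix design).

    pixel_features: H x W nested list, each entry a length-C feature vector.
    seg_embedding:  length-C vector standing in for the LLM's segmentation
                    token embedding.
    Returns an H x W nested list of booleans (True = pixel is in the mask).
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            # Dot product between the pixel feature and the embedding.
            logit = sum(f * e for f, e in zip(feature, seg_embedding))
            # A positive logit corresponds to sigmoid(logit) > 0.5.
            mask_row.append(logit > 0)
        mask.append(mask_row)
    return mask


# A 2 x 2 "image" with 2-dimensional features per pixel.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 1.0], [0.0, 0.0]],
]
embedding = [1.0, -1.0]
mask = predict_mask(features, embedding)
```

The point of the sketch is only that the same visual features can yield different masks depending on the embedding the language model emits, which is what lets a text query select the region to segment.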
Download
You can download the model directly from Huggingface, ModelScope, or OpenXLab. You can also download it with a Python script, using either of the two options below:
# Huggingface
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Norman-ou/GeoPix-ft-sior_rsicap", local_dir="./pretrained_models")
# ModelScope
from modelscope import snapshot_download
model_dir = snapshot_download("NormanOU/GeoPix-ft-sior_rsicap", local_dir="./pretrained_models")
Once you have prepared all models, the folder tree should look like this:
.
├── ...
├── model
├── pretrained_models
├── app.py
├── engine.py
├── ...
└── README.md
Citation
@misc{ou2025geopixmultimodallargelanguage,
title={GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing},
author={Ruizhe Ou and Yuan Hu and Fan Zhang and Jiaxin Chen and Yu Liu},
year={2025},
eprint={2501.06828},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.06828},
}