ImageBind with SAM

This is an experimental demo that combines ImageBind and SAM to generate masks from inputs in different modalities.

The basic idea follows IEA: Image Editing Anything and CLIP-SAM, which generate the referring mask in the following steps:

  • Step 1: Generate automatic masks with SamAutomaticMaskGenerator
  • Step 2: Crop the bounding-box region of each mask from the image
  • Step 3: Compute the similarity between the cropped images and the inputs from the other modalities
  • Step 4: Merge the mask regions with the highest similarity
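
Steps 2 to 4 can be sketched with NumPy alone. This is a minimal illustration, not the demo's actual code: it assumes each mask is a dict with a boolean `segmentation` array and an XYWH `bbox`, as SamAutomaticMaskGenerator returns, and it replaces the ImageBind encoder with a hypothetical `toy_embed` function (mean color of the crop) so that it runs without any checkpoints. The helper names and the thresholded union in `merge_masks` are assumptions made for illustration.

```python
import numpy as np

def crop_boxes(image, masks):
    """Step 2: crop the bounding-box region of every mask (bbox is XYWH)."""
    crops = []
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        crops.append(image[y:y + h, x:x + w])
    return crops

def cosine_similarity(crop_embs, query_emb):
    """Step 3: cosine similarity between each crop embedding and the query."""
    crop_embs = crop_embs / np.linalg.norm(crop_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    return crop_embs @ query_emb

def merge_masks(masks, scores, threshold):
    """Step 4: union of all masks whose score reaches the threshold."""
    merged = np.zeros_like(masks[0]["segmentation"], dtype=bool)
    for m, s in zip(masks, scores):
        if s >= threshold:
            merged |= m["segmentation"]
    return merged

def toy_embed(crop):
    """Hypothetical stand-in for an ImageBind embedding: mean RGB of the crop."""
    return crop.reshape(-1, 3).mean(axis=0)

# Toy image: left half red, right half blue.
image = np.zeros((4, 4, 3), dtype=np.float32)
image[:, :2, 0] = 1.0
image[:, 2:, 2] = 1.0

# Two masks in the SamAutomaticMaskGenerator output layout.
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = ~left
masks = [
    {"segmentation": left, "bbox": [0, 0, 2, 4]},
    {"segmentation": right, "bbox": [2, 0, 2, 4]},
]

crops = crop_boxes(image, masks)
crop_embs = np.stack([toy_embed(c) for c in crops])
query_emb = np.array([1.0, 0.0, 0.0])  # hypothetical "red" query embedding
scores = cosine_similarity(crop_embs, query_emb)
merged = merge_masks(masks, scores, threshold=0.5)
```

In this toy setup only the red region matches the query, so `merged` equals the left mask. In a real pipeline the crops and the text or audio query would be embedded by ImageBind instead of `toy_embed`.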

Installation

  • Download the pretrained checkpoints
cd playground/ImageBind_SAM

mkdir .checkpoints
cd .checkpoints

# download the ImageBind and SAM weights
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Run the demo

python demo.py

This demo implements text-guided and audio-guided segmentation (Text Seg and Audio Seg). The generated masks are saved as text_sam_merged_mask.jpg and audio_sam_merged_mask.jpg:

Input Modality      Generated Mask
car audio           audio_sam_merged_mask.jpg
"A car" (text)      text_sam_merged_mask.jpg

The choice of threshold can strongly affect the final results.
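
To see why the threshold matters, consider the following sketch. It assumes the per-mask similarities are normalized with a softmax and that a mask is kept when its probability reaches the threshold; both the normalization and the `kept_mask_indices` helper are assumptions for illustration, and the similarity values are made up.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kept_mask_indices(similarities, threshold):
    """Indices of masks whose softmax probability reaches the threshold."""
    probs = softmax(np.asarray(similarities, dtype=np.float64))
    return [i for i, p in enumerate(probs) if p >= threshold]

similarities = [2.0, 1.0, 0.0]  # hypothetical similarity scores for three masks

for t in (0.05, 0.2, 0.5):
    print(t, kept_mask_indices(similarities, t))
# 0.05 [0, 1, 2]
# 0.2 [0, 1]
# 0.5 [0]
```

With these three scores, raising the threshold from 0.05 to 0.5 shrinks the merged region from all three masks to a single one, so it is worth trying a few values for each input.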

Run image referring segmentation demo

# download the referring image
cd .assets
wget https://github.com/IDEA-Research/detrex-storage/releases/download/grounded-sam-storage/referring_car_image.jpg
cd ..

python image_referring_seg_demo.py

Run audio referring segmentation demo

python audio_referring_seg_demo.py

Run text referring segmentation demo

python text_referring_seg_demo.py