
ImageBind with SAM

This is an experimental demo that aims to combine ImageBind and SAM to generate masks from different modalities.

The basic idea follows IEA: Image Editing Anything and CLIP-SAM, which generate the referring mask with the following steps:

Step 1: Generate automatic masks with SamAutomaticMaskGenerator
Step 2: Crop the bounding-box region of each mask
Step 3: Compute the similarity between the cropped images and the reference input (audio, text, or image)
Step 4: Merge the mask regions with the highest similarity
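
The four steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code in demo.py. The helper names crop_boxes, similarity_scores, and merge_masks are hypothetical, and the embeddings are assumed to have been computed already (for example by ImageBind). Scoring with a softmax over dot products and selecting masks by a score threshold are also assumptions. The masks are assumed to follow the list-of-dicts format returned by SamAutomaticMaskGenerator, with a boolean "segmentation" array and a "bbox" in XYWH order.

```python
import numpy as np


def crop_boxes(image, masks):
    """Step 2: crop the bounding-box region of every mask from the image."""
    crops = []
    for m in masks:
        # SAM reports boxes as (x, y, width, height).
        x, y, w, h = (int(v) for v in m["bbox"])
        crops.append(image[y:y + h, x:x + w])
    return crops


def similarity_scores(crop_embeddings, query_embedding):
    """Step 3: softmax over dot products between crop and query embeddings.

    crop_embeddings has shape (num_masks, dim); query_embedding has shape (dim,).
    """
    logits = crop_embeddings @ query_embedding
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


def merge_masks(masks, scores, threshold):
    """Step 4: union of all masks scoring above the threshold.

    Falls back to the single best mask when nothing passes the threshold.
    """
    keep = [i for i, s in enumerate(scores) if s > threshold]
    if not keep:
        keep = [int(np.argmax(scores))]
    merged = np.zeros_like(masks[0]["segmentation"], dtype=bool)
    for i in keep:
        merged |= masks[i]["segmentation"]
    return merged
```

In a real pipeline the crops from crop_boxes would be passed through the image encoder to obtain crop_embeddings, and the query embedding would come from the audio, text, or image encoder.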

Installation
ImageBind-SAM Demo
Audio Referring Segment
Text Referring Segment
Image Referring Segment

Installation

Download the pretrained checkpoints

cd playground/ImageBind_SAM

mkdir .checkpoints
cd .checkpoints

# download the ImageBind and SAM weights
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Install ImageBind by following the official installation guide.
Install Grounded-SAM by following the Grounded-SAM installation instructions.

Run the demo

python demo.py

This demo implements text-referred and audio-referred segmentation. The generated masks are saved as text_sam_merged_mask.jpg and audio_sam_merged_mask.jpg:

Input Model	Modality	Generated Mask
(image not shown)	car audio	(image not shown)
(image not shown)	"A car"	(image not shown)

The choice of threshold can strongly affect the final results.
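
To illustrate why the threshold matters, here is a small sketch. It assumes the demo keeps every mask whose similarity score is above the threshold. The scores below are made-up values for five candidate masks, and indices_above_threshold is a hypothetical helper, not necessarily the function used in demo.py.

```python
import numpy as np


def indices_above_threshold(scores, threshold):
    """Return the indices of masks whose score is strictly above the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical softmax scores for five candidate masks (they sum to 1).
scores = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

# A low threshold keeps almost every mask; a high one keeps only the best match.
for threshold in (0.01, 0.1, 0.3):
    kept = indices_above_threshold(scores, threshold)
    print(f"threshold={threshold}: keeps masks {kept}")
```

With these illustrative scores, a low threshold merges all five masks into the output, while a high one keeps only the single best-matching region.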

Run image referring segmentation demo

# download the referring image
cd .assets
wget https://github.com/IDEA-Research/detrex-storage/releases/download/grounded-sam-storage/referring_car_image.jpg
cd ..

python image_referring_seg_demo.py

Run audio referring segmentation demo

python audio_referring_seg_demo.py

Run text referring segmentation demo

python text_referring_seg_demo.py
