mart9992
/

nervn

Inference Endpoints

Model card Files Files and versions Community

nervn / playground /ImageBind_SAM /README.md

mart9992's picture

m

b793f0c 10 months ago

|

3.07 kB

	## ImageBind with SAM

	This is an experimental demo aims to combine [ImageBind](https://github.com/facebookresearch/ImageBind) and [SAM](https://github.com/facebookresearch/segment-anything) to generate mask with different modalities.

	This basic idea is followed with [IEA: Image Editing Anything](https://github.com/feizc/IEA) and [CLIP-SAM](https://github.com/maxi-w/CLIP-SAM) which generate the referring mask with the following steps:

	- Step 1: Generate auto masks with `SamAutomaticMaskGenerator`
	- Step 2: Crop all the box region from the masks
	- Step 3: Compute the similarity with cropped images and different modalities
	- Step 4: Merge the highest similarity mask region

	## Table of contents
	- [Installation](#installation)
	- [ImageBind-SAM Demo](#run-the-demo)
	- [Audio Referring Segment](#run-audio-referring-segment-demo)
	- [Text Referring Segment](#run-text-referring-segment-demo)
	- [Image Referring Segment](#run-image-referring-segmentation-demo)



	## Installation
	- Download the pretrained checkpoints

	```bash
	cd playground/ImageBind_SAM

	mkdir .checkpoints
	cd .checkpoints

	# download imagebind weights
	wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
	wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
	```

	- Install ImageBind follow the [official installation guidance](https://github.com/facebookresearch/ImageBind#usage).
	- Install Grounded-SAM follow [install Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything#installation).


	## Run the demo
	```bash
	python demo.py
	```

	We implement `Text Seg` and `Audio Seg` in this demo, the generate masks will be saved as `text_sam_merged_mask.jpg` and `audio_sam_merged_mask.jpg`:

	<div align="center">

	\| Input Model \| Modality \| Generate Mask \|
	\|:----:\|:----:\|:----:\|
	\| ![](./.assets/car_image.jpg) \| [car audio](./.assets/car_audio.wav) \| ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/imagebind_sam/audio_sam_merged_mask_new.jpg?raw=true) \|
	\| ![](./.assets/car_image.jpg) \| "A car" \| ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/imagebind_sam/text_sam_merged_mask.jpg?raw=true) \|
	\| ![](./.assets/car_image.jpg) \| <div style="text-align: center"> <img src="https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/imagebind_sam/referring_car_image.jpg?raw=true" width=55%></div> \| ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/imagebind_sam/image_referring_sam_merged_mask.jpg?raw=true) \|


	</div>

	By setting different threshold may influence a lot on the final results.

	## Run image referring segmentation demo
	```bash
	# download the referring image
	cd .assets
	wget https://github.com/IDEA-Research/detrex-storage/releases/download/grounded-sam-storage/referring_car_image.jpg
	cd ..

	python image_referring_seg_demo.py
	```

	## Run audio referring segmentation demo
	```bash
	python audio_referring_seg_demo.py
	```

	## Run text referring segmentation demo
	```bash
	python text_referring_seg_demo.py
	```