medsam-vit-base / README.md
license: apache-2.0

Model Card for Segment Anything Model (SAM) - ViT Base (ViT-B) version, fine-tuned for medical image segmentation

Figure: Detailed architecture of the Segment Anything Model (SAM).

Table of Contents

  1. TL;DR
  2. Model Details
  3. Usage
  4. Citation

TL;DR

  • Link to original SAM repository
  • Link to original MedSAM repository


The Segment Anything Model (SAM) produces high-quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks. The abstract of the paper states:

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Disclaimer: The content of this model card was written by the Hugging Face team, and parts of it were copied from the original SAM model card.

Model Details

The SAM model is made up of the following modules:

  • The VisionEncoder: a ViT-based image encoder. It computes the image embeddings using attention on patches of the image. Relative positional embeddings are used.
  • The PromptEncoder: generates embeddings for points and bounding boxes.
  • The MaskDecoder: a two-way transformer which performs cross-attention between the image embeddings and the point embeddings, and between the point embeddings and the image embeddings. The outputs are fed to the Neck.
  • The Neck: predicts the output masks based on the contextualized masks produced by the MaskDecoder.
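
To make the "two-way" idea in the MaskDecoder more concrete, here is a minimal pure-Python sketch of one bidirectional cross-attention step. It is an illustration only, not the actual SAM implementation: the function names (`cross_attention`, `two_way_block`) are hypothetical, and learned projections, multiple heads, self-attention, residual connections, layer normalization, and the MLP are all left out.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one row per token, one column per embedding dimension


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Scaled dot-product attention in which each query row becomes a
    softmax-weighted average of the key/value rows.

    Simplification (assumption): keys and values are the same vectors and
    no learned projection matrices are applied.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def two_way_block(prompt_tokens: Matrix, image_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """One hypothetical two-way step: prompts attend to the image, then the
    image attends to the updated prompts."""
    # Direction 1: prompt embeddings gather information from the image embeddings.
    prompt_tokens = cross_attention(prompt_tokens, image_tokens)
    # Direction 2: image embeddings gather information from the prompt embeddings.
    image_tokens = cross_attention(image_tokens, prompt_tokens)
    return prompt_tokens, image_tokens
```

Calling `two_way_block(prompts, image)` with a list of prompt vectors and a list of image-patch vectors of the same width returns updated versions of both, with the number of tokens and the embedding width unchanged. That both sides are updated is the point of the two-way design: the prompt tokens become image-aware, and the image embedding becomes prompt-aware.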

Usage

Refer to the demo notebooks:

  • this one showcasing inference with MedSAM
  • this one showcasing general usage of SAM,

as well as the docs.
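
For quick orientation, the following is a minimal sketch of box-prompted inference with the `transformers` SAM classes (`SamModel`, `SamProcessor`), written against the 4.x API. Several things here are assumptions rather than statements from this card: the Hub id in `MODEL_ID` is a guess and should be replaced with the checkpoint's actual id, the helper names `format_box_prompt` and `segment_with_box` are hypothetical, and the post-processing shown is the generic SAM one rather than a MedSAM-specific recipe. The demo notebooks remain the authoritative reference.

```python
from typing import List, Sequence

# Assumption: guessed Hub id for this checkpoint; replace it if it differs.
MODEL_ID = "wanglab/medsam-vit-base"


def format_box_prompt(box: Sequence[float]) -> List[List[List[float]]]:
    """Wrap one [x_min, y_min, x_max, y_max] box in the nesting used for
    `input_boxes`: (images in batch, boxes per image, 4 coordinates)."""
    x_min, y_min, x_max, y_max = (float(v) for v in box)
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("box must satisfy x_min < x_max and y_min < y_max")
    return [[[x_min, y_min, x_max, y_max]]]


def segment_with_box(image, box: Sequence[float], model_id: str = MODEL_ID, device: str = "cpu"):
    """Return a binary (H, W) mask for one box prompt on one PIL image.

    Heavy dependencies are imported lazily so the helper above can be used
    without torch or transformers installed.
    """
    import torch
    from transformers import SamModel, SamProcessor

    processor = SamProcessor.from_pretrained(model_id)
    model = SamModel.from_pretrained(model_id).to(device)
    model.eval()

    inputs = processor(
        image, input_boxes=format_box_prompt(box), return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        # Ask for a single mask per prompt instead of SAM's default three candidates.
        outputs = model(**inputs, multimask_output=False)

    # Resize the low-resolution mask logits back to the original image size
    # and threshold them into a boolean mask.
    masks = processor.image_processor.post_process_masks(
        outputs.pred_masks.cpu(),
        inputs["original_sizes"].cpu(),
        inputs["reshaped_input_sizes"].cpu(),
    )
    return masks[0][0, 0].numpy()
```

A typical call would look like `mask = segment_with_box(Image.open("scan.png").convert("RGB"), [95, 255, 190, 350])`, where the file name and box coordinates are placeholders. Loading the processor and model inside the function keeps the sketch short; for repeated calls, load them once and reuse them.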

Citation

If you use this model, please cite it with the following BibTeX entry.

@article{kirillov2023segany,
  title={Segment Anything},
  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv:2304.02643},
  year={2023}
}