Commit 123489f by github-actions[bot]
Sync to HuggingFace Spaces
Files changed:
- .gitattributes +35 -0
- .github/workflows/main.yml +25 -0
- .gitignore +4 -0
- CHANGELOG.md +12 -0
- LICENSE +21 -0
- README.md +42 -0
- app.py +880 -0
- data/sample-1.mp4 +3 -0
- data/sample-2.mp4 +3 -0
- export_onnx_model.py +201 -0
- packages.txt +1 -0
- poetry.lock +0 -0
- pyproject.toml +31 -0
- requirements.txt +106 -0
- track_anything.py +88 -0
- tracker/base_tracker.py +142 -0
- tracker/config/config.yaml +15 -0
- tracker/inference/__init__.py +0 -0
- tracker/inference/inference_core.py +149 -0
- tracker/inference/kv_memory_store.py +234 -0
- tracker/inference/memory_manager.py +373 -0
- tracker/model/__init__.py +0 -0
- tracker/model/aggregate.py +16 -0
- tracker/model/cbam.py +119 -0
- tracker/model/group_modules.py +92 -0
- tracker/model/losses.py +76 -0
- tracker/model/memory_util.py +87 -0
- tracker/model/modules.py +261 -0
- tracker/model/network.py +241 -0
- tracker/model/resnet.py +191 -0
- tracker/model/trainer.py +302 -0
- tracker/util/__init__.py +0 -0
- tracker/util/mask_mapper.py +87 -0
- tracker/util/range_transform.py +12 -0
- tracker/util/tensor_util.py +50 -0
- utils/base_segmenter.py +149 -0
- utils/blur.py +81 -0
- utils/interact_tools.py +109 -0
- utils/painter.py +360 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
.github/workflows/main.yml
ADDED
@@ -0,0 +1,25 @@
on:
  push:
    branches:
      - main
jobs:
  huggingface-sync:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Hugging Face Sync
        uses: JacobLinCool/huggingface-sync@v1
        with:
          user: Y-T-G
          space: Blur-Anything
          emoji: 💻
          token: ${{ secrets.HF_TOKEN }}
          github: ${{ secrets.GITHUB_TOKEN }}
          colorFrom: yellow
          colorTo: pino
          sdk: gradio
          app_file: app.py
          pinned: false
          license: mit
.gitignore
ADDED
@@ -0,0 +1,4 @@
checkpoints/*
output/*
notebook.ipynb
*.pyc
CHANGELOG.md
ADDED
@@ -0,0 +1,12 @@
# Changelog

## v0.2.0 - 2023-08-11

### MobileSAM
- Added quantized ONNX MobileSAM model. Pass `--sam_model_type vit_t` to use it.

## v0.1.0 - 2023-05-06

### Blur-Anything Initial Release
- Added blur implementation
- Using pims instead of storing frames in memory for better memory usage
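The v0.2.0 entry above refers to the quantized ONNX MobileSAM decoder that `app.py` produces at startup. As a point of reference, a minimal sketch of that dynamic-quantization step, mirroring `convert_to_onnx` in `app.py` further down in this diff (the file paths here are illustrative, not fixed by the repository):

```python
# Minimal sketch of the dynamic-quantization step used for the MobileSAM
# decoder export (see convert_to_onnx in app.py). Paths are illustrative.
from onnxruntime.quantization import QuantType
from onnxruntime.quantization.quantize import quantize_dynamic

quantize_dynamic(
    model_input="checkpoints/mobile_sam.onnx",       # exported decoder
    model_output="checkpoints/mobile_sam_quant.onnx",
    optimize_model=True,
    per_channel=False,
    reduce_range=False,
    weight_type=QuantType.QUInt8,                    # 8-bit weights
)
```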
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Mohammed Yasin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
@@ -0,0 +1,42 @@
---
title: Blur Anything
emoji: 💻
colorFrom: yellow
colorTo: pino
sdk: gradio
app_file: app.py
pinned: false
---

# Blur Anything For Videos

Blur Anything is an adaptation of the excellent [Track Anything](https://github.com/gaomingqi/Track-Anything) project which is in turn based on Meta's Segment Anything and XMem. It allows you to blur anything in a video, including faces, license plates, etc.

<div>
  <a src="https://img.shields.io/badge/%F0%9F%A4%97-Open_in_Spaces-informational.svg?style=flat-square" href="https://huggingface.co/spaces/Y-T-G/Blur-Anything">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97-Open_in_Spaces-informational.svg?style=flat-square">
  </a>
</div>

## Get Started
```shell
# Clone the repository:
git clone https://github.com/Y-T-G/Blur-Anything.git
cd Blur-Anything

# Install dependencies:
pip install -r requirements.txt

# Run the Blur-Anything gradio demo.
python app.py --device cuda:0
# python app.py --device cuda:0 --sam_model_type vit_b # for lower memory usage
```

## To Do
- [x] Add a gradio demo
- [ ] Add support to use YouTube video URL
- [ ] Add option to completely black out the object

## Acknowledgements

The project is an adaptation of [Track Anything](https://github.com/gaomingqi/Track-Anything) which is based on [Segment Anything](https://github.com/facebookresearch/segment-anything) and [XMem](https://github.com/hkchengrex/XMem).
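The README above describes blurring selected objects, but the per-frame blur helper (`utils/blur.py`, listed in the file summary) is not rendered in this diff. As an illustration only, and not the repository's actual implementation, the core idea of masked blurring can be sketched as:

```python
# Illustrative sketch (not the repo's utils/blur.py, which is not shown here):
# blur a frame only where a segmentation mask is non-zero.
import cv2
import numpy as np

def blur_masked_region(frame: np.ndarray, mask: np.ndarray, strength: int = 9) -> np.ndarray:
    """Apply a Gaussian blur of the given (odd) kernel size inside the mask."""
    blurred = cv2.GaussianBlur(frame, (strength, strength), 0)
    out = frame.copy()
    out[mask > 0] = blurred[mask > 0]  # composite blurred pixels under the mask
    return out
```

The blur-strength slider in `app.py` runs from 3 to 15 in steps of 2, which keeps the kernel size odd as OpenCV requires.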
app.py
ADDED
@@ -0,0 +1,880 @@
import os
import time
import requests
import sys
import json

import gradio as gr
import numpy as np
import torch
import torchvision
import pims

from export_onnx_model import run_export
from onnxruntime.quantization import QuantType
from onnxruntime.quantization.quantize import quantize_dynamic

sys.path.append(sys.path[0] + "/tracker")
sys.path.append(sys.path[0] + "/tracker/model")

from track_anything import TrackingAnything
from track_anything import parse_augment

from utils.painter import mask_painter
from utils.blur import blur_frames_and_write


# download checkpoints
def download_checkpoint(url, folder, filename):
    os.makedirs(folder, exist_ok=True)
    filepath = os.path.join(folder, filename)

    if not os.path.exists(filepath):
        print("Downloading checkpoints...")
        response = requests.get(url, stream=True)
        with open(filepath, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

        print("Download successful.")

    return filepath


# convert points input to prompt state
def get_prompt(click_state, click_input):
    inputs = json.loads(click_input)
    points = click_state[0]
    labels = click_state[1]
    for input in inputs:
        points.append(input[:2])
        labels.append(input[2])
    click_state[0] = points
    click_state[1] = labels
    prompt = {
        "prompt_type": ["click"],
        "input_point": click_state[0],
        "input_label": click_state[1],
        "multimask_output": "False",
    }
    return prompt


# extract frames from upload video
def get_frames_from_video(video_input, video_state):
    """
    Args:
        video_path:str
        timestamp:float64
    Return
        [[0:nearest_frame], [nearest_frame:], nearest_frame]
    """
    video_path = video_input
    frames = []
    user_name = time.time()
    operation_log = [
        ("", ""),
        (
            "Video uploaded. Click the image for adding targets to track and blur.",
            "Normal",
        ),
    ]
    try:
        frames = pims.Video(video_path)
        fps = frames.frame_rate
        image_size = (frames.shape[1], frames.shape[2])

    except (OSError, TypeError, ValueError, KeyError, SyntaxError) as e:
        print("read_frame_source:{} error. {}\n".format(video_path, str(e)))

    # initialize video_state
    video_state = {
        "user_name": user_name,
        "video_name": os.path.split(video_path)[-1],
        "origin_images": frames,
        "painted_images": [0] * len(frames),
        "masks": [0] * len(frames),
        "logits": [None] * len(frames),
        "select_frame_number": 0,
        "fps": fps,
    }
    video_info = "Video Name: {}, FPS: {}, Total Frames: {}, Image Size:{}".format(
        video_state["video_name"], video_state["fps"], len(frames), image_size
    )
    model.samcontroler.sam_controler.reset_image()
    model.samcontroler.sam_controler.set_image(video_state["origin_images"][0])
    return (
        video_state,
        video_info,
        video_state["origin_images"][0],
        gr.update(visible=True, maximum=len(frames), value=1),
        gr.update(visible=True, maximum=len(frames), value=len(frames)),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=True, value=operation_log),
    )


def run_example(example):
    return video_input


# get the select frame from gradio slider
def select_template(image_selection_slider, video_state, interactive_state):
    # images = video_state[1]
    image_selection_slider -= 1
    video_state["select_frame_number"] = image_selection_slider

    # once select a new template frame, set the image in sam

    model.samcontroler.sam_controler.reset_image()
    model.samcontroler.sam_controler.set_image(
        video_state["origin_images"][image_selection_slider]
    )

    # update the masks when select a new template frame
    operation_log = [
        ("", ""),
        (
            "Select frame {}. Try click image and add mask for tracking.".format(
                image_selection_slider
            ),
            "Normal",
        ),
    ]

    return (
        video_state["painted_images"][image_selection_slider],
        video_state,
        interactive_state,
        operation_log,
    )


# set the tracking end frame
def set_end_number(track_pause_number_slider, video_state, interactive_state):
    interactive_state["track_end_number"] = track_pause_number_slider
    operation_log = [
        ("", ""),
        (
            "Set the tracking finish at frame {}".format(track_pause_number_slider),
            "Normal",
        ),
    ]

    return (
        interactive_state,
        operation_log,
    )


def get_resize_ratio(resize_ratio_slider, interactive_state):
    interactive_state["resize_ratio"] = resize_ratio_slider

    return interactive_state


def get_blur_strength(blur_strength_slider, interactive_state):
    interactive_state["blur_strength"] = blur_strength_slider

    return interactive_state


# use sam to get the mask
def sam_refine(
    video_state, point_prompt, click_state, interactive_state, evt: gr.SelectData
):
    """
    Args:
        template_frame: PIL.Image
        point_prompt: flag for positive or negative button click
        click_state: [[points], [labels]]
    """
    if point_prompt == "Positive":
        coordinate = "[[{},{},1]]".format(evt.index[0], evt.index[1])
        interactive_state["positive_click_times"] += 1
    else:
        coordinate = "[[{},{},0]]".format(evt.index[0], evt.index[1])
        interactive_state["negative_click_times"] += 1

    # prompt for sam model
    model.samcontroler.sam_controler.reset_image()
    model.samcontroler.sam_controler.set_image(
        video_state["origin_images"][video_state["select_frame_number"]]
    )
    prompt = get_prompt(click_state=click_state, click_input=coordinate)

    mask, logit, painted_image = model.first_frame_click(
        image=video_state["origin_images"][video_state["select_frame_number"]],
        points=np.array(prompt["input_point"]),
        labels=np.array(prompt["input_label"]),
        multimask=prompt["multimask_output"],
    )

    video_state["masks"][video_state["select_frame_number"]] = mask
    video_state["logits"][video_state["select_frame_number"]] = logit
    video_state["painted_images"][video_state["select_frame_number"]] = painted_image

    operation_log = [
        ("", ""),
        (
            "Use SAM for segment. You can try add positive and negative points by clicking. Or press Clear clicks button to refresh the image. Press Add mask button when you are satisfied with the segment",
            "Normal",
        ),
    ]
    return painted_image, video_state, interactive_state, operation_log


def add_multi_mask(video_state, interactive_state, mask_dropdown):
    try:
        mask = video_state["masks"][video_state["select_frame_number"]]
        interactive_state["multi_mask"]["masks"].append(mask)
        interactive_state["multi_mask"]["mask_names"].append(
            "mask_{:03d}".format(len(interactive_state["multi_mask"]["masks"]))
        )
        mask_dropdown.append(
            "mask_{:03d}".format(len(interactive_state["multi_mask"]["masks"]))
        )
        select_frame, run_status = show_mask(
            video_state, interactive_state, mask_dropdown
        )

        operation_log = [
            ("", ""),
            (
                "Added a mask, use the mask select for target tracking or blurring.",
                "Normal",
            ),
        ]
    except Exception:
        operation_log = [
            ("Please click the left image to generate mask.", "Error"),
            ("", ""),
        ]
    return (
        interactive_state,
        gr.update(
            choices=interactive_state["multi_mask"]["mask_names"], value=mask_dropdown
        ),
        select_frame,
        [[], []],
        operation_log,
    )


def clear_click(video_state, click_state):
    click_state = [[], []]
    template_frame = video_state["origin_images"][video_state["select_frame_number"]]
    operation_log = [
        ("", ""),
        ("Clear points history and refresh the image.", "Normal"),
    ]
    return template_frame, click_state, operation_log


def remove_multi_mask(interactive_state, mask_dropdown):
    interactive_state["multi_mask"]["mask_names"] = []
    interactive_state["multi_mask"]["masks"] = []

    operation_log = [("", ""), ("Remove all mask, please add new masks", "Normal")]
    return interactive_state, gr.update(choices=[], value=[]), operation_log


def show_mask(video_state, interactive_state, mask_dropdown):
    mask_dropdown.sort()
    select_frame = video_state["origin_images"][video_state["select_frame_number"]]

    for i in range(len(mask_dropdown)):
        mask_number = int(mask_dropdown[i].split("_")[1]) - 1
        mask = interactive_state["multi_mask"]["masks"][mask_number]
        select_frame = mask_painter(
            select_frame, mask.astype("uint8"), mask_color=mask_number + 2
        )

    operation_log = [
        ("", ""),
        ("Select {} for tracking or blurring".format(mask_dropdown), "Normal"),
    ]
    return select_frame, operation_log


# tracking vos
def vos_tracking_video(video_state, interactive_state, mask_dropdown):
    operation_log = [
        ("", ""),
        (
            "Track the selected masks, and then you can select the masks for blurring.",
            "Normal",
        ),
    ]
    model.xmem.clear_memory()
    if interactive_state["track_end_number"]:
        following_frames = video_state["origin_images"][
            video_state["select_frame_number"]: interactive_state["track_end_number"]
        ]
    else:
        following_frames = video_state["origin_images"][
            video_state["select_frame_number"]:
        ]

    if interactive_state["multi_mask"]["masks"]:
        if len(mask_dropdown) == 0:
            mask_dropdown = ["mask_001"]
        mask_dropdown.sort()
        template_mask = interactive_state["multi_mask"]["masks"][
            int(mask_dropdown[0].split("_")[1]) - 1
        ] * (int(mask_dropdown[0].split("_")[1]))
        for i in range(1, len(mask_dropdown)):
            mask_number = int(mask_dropdown[i].split("_")[1]) - 1
            template_mask = np.clip(
                template_mask
                + interactive_state["multi_mask"]["masks"][mask_number]
                * (mask_number + 1),
                0,
                mask_number + 1,
            )
        video_state["masks"][video_state["select_frame_number"]] = template_mask
    else:
        template_mask = video_state["masks"][video_state["select_frame_number"]]

    # operation error
    if len(np.unique(template_mask)) == 1:
        template_mask[0][0] = 1
        operation_log = [
            (
                "Error! Please add at least one mask to track by clicking the left image.",
                "Error",
            ),
            ("", ""),
        ]
        # return video_output, video_state, interactive_state, operation_error
    output_path = "./output/track/{}".format(video_state["video_name"])
    fps = video_state["fps"]
    masks, logits, painted_images = model.generator(
        images=following_frames, template_mask=template_mask, write=True, fps=fps, output_path=output_path
    )
    # clear GPU memory
    model.xmem.clear_memory()

    if interactive_state["track_end_number"]:
        video_state["masks"][
            video_state["select_frame_number"]: interactive_state["track_end_number"]
        ] = masks
        video_state["logits"][
            video_state["select_frame_number"]: interactive_state["track_end_number"]
        ] = logits
        video_state["painted_images"][
            video_state["select_frame_number"]: interactive_state["track_end_number"]
        ] = painted_images
    else:
        video_state["masks"][video_state["select_frame_number"]:] = masks
        video_state["logits"][video_state["select_frame_number"]:] = logits
        video_state["painted_images"][
            video_state["select_frame_number"]:
        ] = painted_images

    interactive_state["inference_times"] += 1

    print(
        "For generating this tracking result, inference times: {}, click times: {}, positive: {}, negative: {}".format(
            interactive_state["inference_times"],
            interactive_state["positive_click_times"]
            + interactive_state["negative_click_times"],
            interactive_state["positive_click_times"],
            interactive_state["negative_click_times"],
        )
    )

    return output_path, video_state, interactive_state, operation_log


def blur_video(video_state, interactive_state, mask_dropdown):
    operation_log = [("", ""), ("Removed the selected masks.", "Normal")]

    frames = np.asarray(video_state["origin_images"])[
        video_state["select_frame_number"]:interactive_state["track_end_number"]
    ]
    fps = video_state["fps"]
    output_path = "./output/blur/{}".format(video_state["video_name"])
    blur_masks = np.asarray(video_state["masks"][video_state["select_frame_number"]:interactive_state["track_end_number"]])
    if len(mask_dropdown) == 0:
        mask_dropdown = ["mask_001"]
    mask_dropdown.sort()
    # convert mask_dropdown to mask numbers
    blur_mask_numbers = [
        int(mask_dropdown[i].split("_")[1]) for i in range(len(mask_dropdown))
    ]
    # interate through all masks and remove the masks that are not in mask_dropdown
    unique_masks = np.unique(blur_masks)
    num_masks = len(unique_masks) - 1
    for i in range(1, num_masks + 1):
        if i in blur_mask_numbers:
            continue
        blur_masks[blur_masks == i] = 0

    # blur video
    try:
        blur_frames_and_write(
            frames,
            blur_masks,
            ratio=interactive_state["resize_ratio"],
            strength=interactive_state["blur_strength"],
            fps=fps,
            output_path=output_path
        )
    except Exception as e:
        print("Exception ", e)
        operation_log = [
            (
                "Error! You are trying to blur without masks input. Please track the selected mask first, and then press blur. To speed up, please use the resize ratio to scale down the image size.",
                "Error",
            ),
            ("", ""),
        ]

    return output_path, video_state, interactive_state, operation_log


# generate video after vos inference
def generate_video_from_frames(frames, output_path, fps=30):
    """
    Generates a video from a list of frames.

    Args:
        frames (list of numpy arrays): The frames to include in the video.
        output_path (str): The path to save the generated video.
        fps (int, optional): The frame rate of the output video. Defaults to 30.
    """

    frames = torch.from_numpy(np.asarray(frames))
    if not os.path.exists(os.path.dirname(output_path)):
        os.makedirs(os.path.dirname(output_path))
    torchvision.io.write_video(output_path, frames, fps=fps, video_codec="libx264")
    return output_path


# convert to onnx quantized model
def convert_to_onnx(args, checkpoint, quantized=True):
    """
    Convert the model to onnx format.

    Args:
        model (nn.Module): The model to convert.
        output_path (str): The path to save the onnx model.
        input_shape (tuple): The input shape of the model.
        quantized (bool, optional): Whether to quantize the model. Defaults to True.
    """
    onnx_output_path = f"{checkpoint.split('.')[-2]}.onnx"
    quant_output_path = f"{checkpoint.split('.')[-2]}_quant.onnx"

    print("Converting to ONNX quantized model...")

    if not (os.path.exists(onnx_output_path)):
        run_export(
            model_type=args.sam_model_type,
            checkpoint=checkpoint,
            opset=16,
            output=onnx_output_path,
            return_single_mask=True
        )

    if quantized and not (os.path.exists(quant_output_path)):
        quantize_dynamic(
            model_input=onnx_output_path,
            model_output=quant_output_path,
            optimize_model=True,
            per_channel=False,
            reduce_range=False,
            weight_type=QuantType.QUInt8,
        )

    return quant_output_path if quantized else onnx_output_path


# args, defined in track_anything.py
args = parse_augment()

# check and download checkpoints if needed
SAM_checkpoint_dict = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
    "vit_t": "mobile_sam.pt",
}
SAM_checkpoint_url_dict = {
    "vit_h": "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
    "vit_l": "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth",
    "vit_b": "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth",
    "vit_t": "https://github.com/ChaoningZhang/MobileSAM/raw/master/weights/mobile_sam.pt",
}
sam_checkpoint = SAM_checkpoint_dict[args.sam_model_type]
sam_checkpoint_url = SAM_checkpoint_url_dict[args.sam_model_type]
xmem_checkpoint = "XMem-s012.pth"
xmem_checkpoint_url = (
    "https://github.com/hkchengrex/XMem/releases/download/v1.0/XMem-s012.pth"
)

# initialize SAM, XMem
folder = "checkpoints"
sam_pt_checkpoint = download_checkpoint(sam_checkpoint_url, folder, sam_checkpoint)
xmem_checkpoint = download_checkpoint(xmem_checkpoint_url, folder, xmem_checkpoint)

if args.sam_model_type == "vit_t":
    sam_onnx_checkpoint = convert_to_onnx(args, sam_pt_checkpoint, quantized=True)
else:
    sam_onnx_checkpoint = ""

model = TrackingAnything(sam_pt_checkpoint, sam_onnx_checkpoint, xmem_checkpoint, args)

title = """<p><h1 align="center">Blur-Anything</h1></p>
"""
description = """<p>Gradio demo for Blur Anything, a flexible and interactive
tool for video object tracking, segmentation, and blurring. To
use it, simply upload your video, or click one of the examples to
load them. Code: <a
href="https://github.com/Y-T-G/Blur-Anything">https://github.com/Y-T-G/Blur-Anything</a>
<a
href="https://huggingface.co/spaces/Y-T-G/Blur-Anything?duplicate=true"><img
style="display: inline; margin-top: 0em; margin-bottom: 0em"
src="https://bit.ly/3gLdBN6" alt="Duplicate Space" /></a></p>"""


with gr.Blocks() as iface:
    """
    state for
    """
    click_state = gr.State([[], []])
    interactive_state = gr.State(
        {
            "inference_times": 0,
            "negative_click_times": 0,
            "positive_click_times": 0,
            "mask_save": args.mask_save,
            "multi_mask": {"mask_names": [], "masks": []},
            "track_end_number": None,
            "resize_ratio": 1,
            "blur_strength": 3,
        }
    )

    video_state = gr.State(
        {
            "user_name": "",
            "video_name": "",
            "origin_images": None,
            "painted_images": None,
            "masks": None,
            "blur_masks": None,
            "logits": None,
            "select_frame_number": 0,
            "fps": 30,
        }
    )
    gr.Markdown(title)
    gr.Markdown(description)
    with gr.Row():
        # for user video input
        with gr.Column():
            with gr.Row():
                video_input = gr.Video()
                with gr.Column():
                    video_info = gr.Textbox(label="Video Info")
                    resize_info = gr.Textbox(
                        value="You can use the resize ratio slider to scale down the original image to around 360P resolution for faster processing.",
                        label="Tips for running this demo.",
                    )
                    resize_ratio_slider = gr.Slider(
                        minimum=0.02,
                        maximum=1,
                        step=0.02,
                        value=1,
                        label="Resize ratio",
                        visible=True,
                    )

            with gr.Row():
                # put the template frame under the radio button
                with gr.Column():
                    # extract frames
                    with gr.Column():
                        extract_frames_button = gr.Button(
                            value="Get video info", interactive=True, variant="primary"
                        )

                    # click points settins, negative or positive, mode continuous or single
                    with gr.Row():
                        with gr.Row():
                            point_prompt = gr.Radio(
                                choices=["Positive", "Negative"],
                                value="Positive",
                                label="Point Prompt",
                                interactive=True,
                                visible=False,
                            )
                            remove_mask_button = gr.Button(
                                value="Remove mask", interactive=True, visible=False
                            )
                            clear_button_click = gr.Button(
                                value="Clear Clicks", interactive=True, visible=False
                            )
                            Add_mask_button = gr.Button(
                                value="Add mask", interactive=True, visible=False
                            )
                    template_frame = gr.Image(
                        type="pil",
                        interactive=True,
                        elem_id="template_frame",
                        visible=False,
                    )
                    image_selection_slider = gr.Slider(
                        minimum=1,
                        maximum=100,
                        step=1,
                        value=1,
                        label="Image Selection",
                        visible=False,
                    )
                    track_pause_number_slider = gr.Slider(
                        minimum=1,
                        maximum=100,
                        step=1,
                        value=1,
                        label="Track end frames",
                        visible=False,
                    )

                with gr.Column():
                    run_status = gr.HighlightedText(
                        value=[
                            ("Text", "Error"),
                            ("to be", "Label 2"),
                            ("highlighted", "Label 3"),
                        ],
                        visible=False,
                    )
                    mask_dropdown = gr.Dropdown(
                        multiselect=True,
                        value=[],
                        label="Mask selection",
                        info=".",
                        visible=False,
                    )
                    video_output = gr.Video(visible=False)
                    with gr.Row():
                        tracking_video_predict_button = gr.Button(
                            value="Tracking", visible=False
                        )
                        blur_video_predict_button = gr.Button(
                            value="Blur", visible=False
                        )
                    with gr.Row():
                        blur_strength_slider = gr.Slider(
                            minimum=3,
                            maximum=15,
                            step=2,
                            value=3,
                            label="Blur Strength",
                            visible=False,
                        )

    # first step: get the video information
    extract_frames_button.click(
        fn=get_frames_from_video,
        inputs=[video_input, video_state],
        outputs=[
            video_state,
            video_info,
            template_frame,
            image_selection_slider,
            track_pause_number_slider,
            point_prompt,
            clear_button_click,
            Add_mask_button,
            template_frame,
            tracking_video_predict_button,
            video_output,
            mask_dropdown,
            remove_mask_button,
            blur_video_predict_button,
            blur_strength_slider,
            run_status,
        ],
    )

    # second step: select images from slider
    image_selection_slider.release(
        fn=select_template,
        inputs=[image_selection_slider, video_state, interactive_state],
        outputs=[template_frame, video_state, interactive_state, run_status],
        api_name="select_image",
    )
    track_pause_number_slider.release(
        fn=set_end_number,
        inputs=[track_pause_number_slider, video_state, interactive_state],
        outputs=[interactive_state, run_status],
        api_name="end_image",
    )
    resize_ratio_slider.release(
        fn=get_resize_ratio,
        inputs=[resize_ratio_slider, interactive_state],
        outputs=[interactive_state],
        api_name="resize_ratio",
    )

    blur_strength_slider.release(
        fn=get_blur_strength,
        inputs=[blur_strength_slider, interactive_state],
        outputs=[interactive_state],
        api_name="blur_strength",
    )

    # click select image to get mask using sam
    template_frame.select(
        fn=sam_refine,
        inputs=[video_state, point_prompt, click_state, interactive_state],
        outputs=[template_frame, video_state, interactive_state, run_status],
    )

    # add different mask
    Add_mask_button.click(
        fn=add_multi_mask,
        inputs=[video_state, interactive_state, mask_dropdown],
        outputs=[
            interactive_state,
            mask_dropdown,
            template_frame,
            click_state,
            run_status,
        ],
    )

    remove_mask_button.click(
        fn=remove_multi_mask,
        inputs=[interactive_state, mask_dropdown],
        outputs=[interactive_state, mask_dropdown, run_status],
    )

    # tracking video from select image and mask
    tracking_video_predict_button.click(
        fn=vos_tracking_video,
        inputs=[video_state, interactive_state, mask_dropdown],
        outputs=[video_output, video_state, interactive_state, run_status],
    )

    # tracking video from select image and mask
    blur_video_predict_button.click(
        fn=blur_video,
        inputs=[video_state, interactive_state, mask_dropdown],
        outputs=[video_output, video_state, interactive_state, run_status],
    )

    # click to get mask
    mask_dropdown.change(
        fn=show_mask,
        inputs=[video_state, interactive_state, mask_dropdown],
        outputs=[template_frame, run_status],
    )

    # clear input
    video_input.clear(
        lambda: (
            {
                "user_name": "",
                "video_name": "",
                "origin_images": None,
                "painted_images": None,
                "masks": None,
                "blur_masks": None,
                "logits": None,
                "select_frame_number": 0,
                "fps": 30,
            },
            {
                "inference_times": 0,
                "negative_click_times": 0,
                "positive_click_times": 0,
                "mask_save": args.mask_save,
                "multi_mask": {"mask_names": [], "masks": []},
                "track_end_number": 0,
                "resize_ratio": 1,
                "blur_strength": 3,
            },
            [[], []],
            None,
            None,
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False, value=[]),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        ),
        [],
        [
            video_state,
            interactive_state,
            click_state,
            video_output,
            template_frame,
            tracking_video_predict_button,
            image_selection_slider,
            track_pause_number_slider,
            point_prompt,
            clear_button_click,
            Add_mask_button,
            template_frame,
            tracking_video_predict_button,
            video_output,
            mask_dropdown,
            remove_mask_button,
            blur_video_predict_button,
            blur_strength_slider,
            run_status,
        ],
        queue=False,
        show_progress=False,
    )

    # points clear
    clear_button_click.click(
        fn=clear_click,
        inputs=[
            video_state,
            click_state,
        ],
        outputs=[template_frame, click_state, run_status],
    )
    # set example
    gr.Markdown("## Examples")
    gr.Examples(
        examples=[
            os.path.join(os.path.dirname(__file__), "./data/", test_sample)
            for test_sample in [
                "sample-1.mp4",
                "sample-2.mp4",
            ]
        ],
        fn=run_example,
        inputs=[video_input],
        outputs=[video_input],
    )
    iface.queue(concurrency_count=1)
    iface.launch(
        debug=True, enable_queue=True
    )
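One non-obvious step in `vos_tracking_video` above is how several selected binary masks are merged into a single labeled template mask before tracking. A small worked example of that `np.clip` composition (array shapes are illustrative):

```python
# Worked example of the template-mask composition in vos_tracking_video:
# binary masks selected in the dropdown are merged into one labeled mask,
# where pixels of mask_001 become label 1, mask_002 become label 2, and
# overlaps are clipped to the higher label.
import numpy as np

mask_001 = np.array([[1, 1, 0, 0]])  # first selected mask
mask_002 = np.array([[0, 1, 1, 0]])  # second selected mask

template_mask = mask_001 * 1                              # label 1
template_mask = np.clip(template_mask + mask_002 * 2, 0, 2)

print(template_mask)  # [[1 2 2 0]] -> the overlapping pixel takes label 2
```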
data/sample-1.mp4
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dc49f2d9f5f00775248b8a66228f3e42304bbc391013d23ac66d21ba1f0e5fd2
size 664422
data/sample-2.mp4
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45ba5eb410e9d25744946afe61abff9e2ab0916d2f206637636ae30d0decd5e9
size 1369798
export_onnx_model.py
ADDED
@@ -0,0 +1,201 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import torch

from mobile_sam import sam_model_registry
from mobile_sam.utils.onnx import SamOnnxModel

import argparse
import warnings

try:
    import onnxruntime  # type: ignore

    onnxruntime_exists = True
except ImportError:
    onnxruntime_exists = False

parser = argparse.ArgumentParser(
    description="Export the SAM prompt encoder and mask decoder to an ONNX model."
)

parser.add_argument(
    "--checkpoint", type=str, required=True, help="The path to the SAM model checkpoint."
)

parser.add_argument(
    "--output", type=str, required=True, help="The filename to save the ONNX model to."
)

parser.add_argument(
    "--model-type",
    type=str,
    required=True,
    help="In ['default', 'vit_h', 'vit_l', 'vit_b']. Which type of SAM model to export.",
)

parser.add_argument(
    "--return-single-mask",
    action="store_true",
    help=(
        "If true, the exported ONNX model will only return the best mask, "
        "instead of returning multiple masks. For high resolution images "
        "this can improve runtime when upscaling masks is expensive."
    ),
)

parser.add_argument(
    "--opset",
    type=int,
    default=16,
    help="The ONNX opset version to use. Must be >=11",
)

parser.add_argument(
    "--quantize-out",
    type=str,
    default=None,
    help=(
        "If set, will quantize the model and save it with this name. "
        "Quantization is performed with quantize_dynamic from onnxruntime.quantization.quantize."
    ),
)

parser.add_argument(
    "--gelu-approximate",
    action="store_true",
    help=(
        "Replace GELU operations with approximations using tanh. Useful "
        "for some runtimes that have slow or unimplemented erf ops, used in GELU."
    ),
)

parser.add_argument(
    "--use-stability-score",
    action="store_true",
    help=(
        "Replaces the model's predicted mask quality score with the stability "
        "score calculated on the low resolution masks using an offset of 1.0. "
    ),
)

parser.add_argument(
    "--return-extra-metrics",
    action="store_true",
    help=(
        "The model will return five results: (masks, scores, stability_scores, "
        "areas, low_res_logits) instead of the usual three. This can be "
        "significantly slower for high resolution outputs."
    ),
)


def run_export(
    model_type: str,
    checkpoint: str,
    output: str,
    opset: int,
    return_single_mask: bool,
    gelu_approximate: bool = False,
    use_stability_score: bool = False,
    return_extra_metrics=False,
):
    print("Loading model...")
    sam = sam_model_registry[model_type](checkpoint=checkpoint)

    onnx_model = SamOnnxModel(
        model=sam,
        return_single_mask=return_single_mask,
        use_stability_score=use_stability_score,
        return_extra_metrics=return_extra_metrics,
    )

    if gelu_approximate:
        for n, m in onnx_model.named_modules():
            if isinstance(m, torch.nn.GELU):
                m.approximate = "tanh"

    dynamic_axes = {
        "point_coords": {1: "num_points"},
        "point_labels": {1: "num_points"},
    }

    embed_dim = sam.prompt_encoder.embed_dim
    embed_size = sam.prompt_encoder.image_embedding_size
    mask_input_size = [4 * x for x in embed_size]
    dummy_inputs = {
        "image_embeddings": torch.randn(1, embed_dim, *embed_size, dtype=torch.float),
        "point_coords": torch.randint(low=0, high=1024, size=(1, 5, 2), dtype=torch.float),
        "point_labels": torch.randint(low=0, high=4, size=(1, 5), dtype=torch.float),
        "mask_input": torch.randn(1, 1, *mask_input_size, dtype=torch.float),
        "has_mask_input": torch.tensor([1], dtype=torch.float),
        "orig_im_size": torch.tensor([1500, 2250], dtype=torch.float),
    }

    _ = onnx_model(**dummy_inputs)

    output_names = ["masks", "iou_predictions", "low_res_masks"]

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=torch.jit.TracerWarning)
        warnings.filterwarnings("ignore", category=UserWarning)
        with open(output, "wb") as f:
            print(f"Exporting onnx model to {output}...")
            torch.onnx.export(
                onnx_model,
                tuple(dummy_inputs.values()),
                f,
                export_params=True,
                verbose=False,
                opset_version=opset,
                do_constant_folding=True,
                input_names=list(dummy_inputs.keys()),
                output_names=output_names,
                dynamic_axes=dynamic_axes,
            )

    if onnxruntime_exists:
        ort_inputs = {k: to_numpy(v) for k, v in dummy_inputs.items()}
        # set cpu provider default
        providers = ["CPUExecutionProvider"]
        ort_session = onnxruntime.InferenceSession(output, providers=providers)
        _ = ort_session.run(None, ort_inputs)
        print("Model has successfully been run with ONNXRuntime.")


def to_numpy(tensor):
    return tensor.cpu().numpy()


if __name__ == "__main__":
    args = parser.parse_args()
    run_export(
        model_type=args.model_type,
        checkpoint=args.checkpoint,
        output=args.output,
        opset=args.opset,
        return_single_mask=args.return_single_mask,
        gelu_approximate=args.gelu_approximate,
        use_stability_score=args.use_stability_score,
        return_extra_metrics=args.return_extra_metrics,
    )

    if args.quantize_out is not None:
        assert onnxruntime_exists, "onnxruntime is required to quantize the model."
        from onnxruntime.quantization import QuantType  # type: ignore
        from onnxruntime.quantization.quantize import quantize_dynamic  # type: ignore

        print(f"Quantizing model and writing to {args.quantize_out}...")
        quantize_dynamic(
            model_input=args.output,
            model_output=args.quantize_out,
            optimize_model=True,
            per_channel=False,
            reduce_range=False,
            weight_type=QuantType.QUInt8,
        )
    print("Done!")
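For context, the exporter above fixes the decoder's input names and shapes via `dummy_inputs`. A hedged sketch of running the exported decoder with `onnxruntime` (the file path, image size, and the random placeholder embedding are assumptions; a real embedding would come from the SAM image encoder):

```python
# Hedged sketch of running the exported SAM decoder with onnxruntime.
# Input names mirror dummy_inputs above; the embedding is placeholder data.
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("mobile_sam.onnx", providers=["CPUExecutionProvider"])

embed_dim, embed_size = 256, (64, 64)  # assumed decoder embedding shape
ort_inputs = {
    "image_embeddings": np.random.randn(1, embed_dim, *embed_size).astype(np.float32),
    "point_coords": np.array([[[400.0, 300.0], [0.0, 0.0]]], dtype=np.float32),
    "point_labels": np.array([[1.0, -1.0]], dtype=np.float32),  # 1 = foreground click, -1 = padding
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.array([0.0], dtype=np.float32),
    "orig_im_size": np.array([720.0, 1280.0], dtype=np.float32),
}
masks, iou_predictions, low_res_masks = session.run(None, ort_inputs)
```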
packages.txt
ADDED
@@ -0,0 +1 @@
python3-opencv
poetry.lock
ADDED
The diff for this file is too large to render.
pyproject.toml
ADDED
@@ -0,0 +1,31 @@
[tool.poetry]
name = "Blur-Anything"
version = "0.1.0"
description = "Track and blur any object or person in a video."
authors = ["Y-T-G <yaseensinbox@gmail.com>"]
license = "MIT"
readme = "README.md"
packages = [{include = "blur_anything"}]

[tool.poetry.dependencies]
python = "^3.9"
gradio = "^3.28.1"
numpy = "^1.24.3"
av = "^10.0.0"
torch = "^2.0.0"
opencv-python = "^4.7.0.72"
psutil = "^5.9.5"
tqdm = "^4.65.0"
matplotlib = "^3.7.1"
segment-anything = {git = "https://github.com/facebookresearch/segment-anything.git"}
torchvision = "^0.15.1"
pims = "^0.6.1"
mobile-sam = {git = "https://github.com/ChaoningZhang/MobileSAM.git"}
onnxruntime = "^1.15.1"
timm = "^0.9.5"
onnx = "^1.14.0"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
requirements.txt
ADDED
@@ -0,0 +1,106 @@
aiofiles==23.1.0
aiohttp==3.8.4
aiosignal==1.3.1
altair==4.2.2
anyio==3.6.2
async-timeout==4.0.2
attrs==23.1.0
av==10.0.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.0.7
cycler==0.11.0
entrypoints==0.4
fastapi==0.95.1
ffmpy==0.3.0
filelock==3.12.0
flatbuffers==23.5.26
fonttools==4.39.3
frozenlist==1.3.3
fsspec==2023.4.0
gradio-client==0.1.4
gradio==3.28.3
h11==0.14.0
httpcore==0.17.0
httpx==0.24.0
huggingface-hub==0.14.1
humanfriendly==10.0
idna==3.4
imageio==2.28.1
importlib-resources==5.12.0
jinja2==3.1.2
jsonschema==4.17.3
kiwisolver==1.4.4
linkify-it-py==2.0.2
lit==16.0.2
markdown-it-py==2.2.0
markdown-it-py[linkify]==2.2.0
markupsafe==2.1.2
matplotlib==3.7.1
mdit-py-plugins==0.3.3
mdurl==0.1.2
mobile-sam @ git+https://github.com/ChaoningZhang/MobileSAM.git
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
onnx==1.14.0
onnxruntime==1.15.1
opencv-python==4.7.0.72
orjson==3.8.11
packaging==23.1
pandas==2.0.1
pillow==9.5.0
pims==0.6.1
protobuf==4.24.0
psutil==5.9.5
pydantic==1.10.7
pydub==0.25.1
pygments==2.15.1
pyparsing==3.0.9
pyreadline3==3.4.1
pyrsistent==0.19.3
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3
pyyaml==6.0
requests==2.30.0
safetensors==0.3.2
segment-anything @ git+https://github.com/facebookresearch/segment-anything.git
semantic-version==2.10.0
setuptools==67.7.2
six==1.16.0
slicerator==1.1.0
sniffio==1.3.0
starlette==0.26.1
sympy==1.11.1
timm==0.9.5
toolz==0.12.0
torch==2.0.0
torchvision==0.15.1
tqdm==4.65.0
triton==2.0.0
typing-extensions==4.5.0
tzdata==2023.3
uc-micro-py==1.0.2
urllib3==2.0.2
uvicorn==0.22.0
websockets==11.0.2
wheel==0.40.0
yarl==1.9.2
zipp==3.15.0
track_anything.py
ADDED
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
+import os
+from tqdm import tqdm
+
+from utils.interact_tools import SamControler
+from tracker.base_tracker import BaseTracker
+import numpy as np
+import argparse
+import cv2
+
+from typing import Optional
+
+
+class TrackingAnything:
+    def __init__(self, sam_pt_checkpoint, sam_onnx_checkpoint, xmem_checkpoint, args):
+        self.args = args
+        self.sam_pt_checkpoint = sam_pt_checkpoint
+        self.sam_onnx_checkpoint = sam_onnx_checkpoint
+        self.xmem_checkpoint = xmem_checkpoint
+        self.samcontroler = SamControler(
+            self.sam_pt_checkpoint, self.sam_onnx_checkpoint, args.sam_model_type, args.device
+        )
+        self.xmem = BaseTracker(self.xmem_checkpoint, device=args.device)
+
+    def first_frame_click(
+        self, image: np.ndarray, points: np.ndarray, labels: np.ndarray, multimask=True
+    ):
+        mask, logit, painted_image = self.samcontroler.first_frame_click(
+            image, points, labels, multimask
+        )
+        return mask, logit, painted_image
+
+    def generator(
+        self,
+        images: list,
+        template_mask: np.ndarray,
+        write: Optional[bool] = False,
+        fps: Optional[int] = 30,
+        output_path: Optional[str] = "tracking.mp4",
+    ):
+        masks = []
+        logits = []
+        painted_images = []
+
+        if write:
+            size = images[0].shape[:2][::-1]
+            if not os.path.exists(os.path.dirname(output_path)):
+                os.makedirs(os.path.dirname(output_path))
+            writer = cv2.VideoWriter(
+                output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size
+            )
+
+        for i in tqdm(range(len(images)), desc="Tracking image"):
+            if i == 0:
+                mask, logit, painted_image = self.xmem.track(images[i], template_mask)
+            else:
+                mask, logit, painted_image = self.xmem.track(images[i])
+
+            masks.append(mask)
+            logits.append(logit)
+
+            if write:
+                writer.write(painted_image[:, :, ::-1])
+            else:
+                painted_images.append(painted_image)
+
+        if write:
+            writer.release()
+
+        return masks, logits, painted_images
+
+
+def parse_augment():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--device", type=str, default="cpu")
+    parser.add_argument("--sam_model_type", type=str, default="vit_t")
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=6080,
+        help="only useful when running gradio applications",
+    )
+    parser.add_argument("--debug", action="store_true")
+    parser.add_argument("--mask_save", default=False)
+    args = parser.parse_args()
+
+    if args.debug:
+        print(args)
+    return args
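For reference, a minimal usage sketch of the class above (not part of the commit). The checkpoint paths are placeholders and the frames are random noise; real SAM/ONNX/XMem weights would have to be downloaded for this to run.

# Illustrative sketch only: checkpoint paths below are assumptions, not files shipped in this commit.
import numpy as np
from track_anything import TrackingAnything, parse_augment

args = parse_augment()                       # defaults: --device cpu, --sam_model_type vit_t
model = TrackingAnything(
    "checkpoints/mobile_sam.pt",             # assumed SAM (PyTorch) checkpoint path
    "checkpoints/sam_decoder.onnx",          # assumed SAM ONNX decoder path
    "checkpoints/XMem-s012.pth",             # assumed XMem checkpoint path
    args,
)

frames = [(np.random.rand(480, 854, 3) * 255).astype(np.uint8) for _ in range(5)]
template_mask = np.zeros((480, 854), dtype=np.uint8)
template_mask[200:300, 300:400] = 1          # toy annotation for object 1 on the first frame

masks, logits, painted = model.generator(frames, template_mask)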
tracker/base_tracker.py
ADDED
@@ -0,0 +1,142 @@
+# import for debugging
+import os
+import glob
+import numpy as np
+from PIL import Image
+
+# import for base_tracker
+import torch
+import yaml
+import torch.nn.functional as F
+from tracker.model.network import XMem
+from inference.inference_core import InferenceCore
+from tracker.util.mask_mapper import MaskMapper
+from torchvision import transforms
+from tracker.util.range_transform import im_normalization
+
+from utils.painter import mask_painter
+
+dir_path = os.path.dirname(os.path.realpath(__file__))
+
+
+class BaseTracker:
+    def __init__(
+        self, xmem_checkpoint, device, sam_model=None, model_type=None
+    ) -> None:
+        """
+        device: model device
+        xmem_checkpoint: checkpoint of XMem model
+        """
+        # load configurations
+        with open(f"{dir_path}/config/config.yaml", "r") as stream:
+            config = yaml.safe_load(stream)
+        # initialise XMem
+        network = XMem(config, xmem_checkpoint, map_location=device).eval()
+        # initialise InferenceCore
+        self.tracker = InferenceCore(network, config)
+        # data transformation
+        self.im_transform = transforms.Compose(
+            [
+                transforms.ToTensor(),
+                im_normalization,
+            ]
+        )
+        self.device = device
+
+        # changeable properties
+        self.mapper = MaskMapper()
+        self.initialised = False
+
+        # # SAM-based refinement
+        # self.sam_model = sam_model
+        # self.resizer = Resize([256, 256])
+
+    @torch.no_grad()
+    def resize_mask(self, mask):
+        # mask transform is applied AFTER mapper, so we need to post-process it in eval.py
+        h, w = mask.shape[-2:]
+        min_hw = min(h, w)
+        return F.interpolate(
+            mask,
+            (int(h / min_hw * self.size), int(w / min_hw * self.size)),
+            mode="nearest",
+        )
+
+    @torch.no_grad()
+    def track(self, frame, first_frame_annotation=None):
+        """
+        Input:
+        frames: numpy arrays (H, W, 3)
+        logit: numpy array (H, W), logit
+
+        Output:
+        mask: numpy arrays (H, W)
+        logit: numpy arrays, probability map (H, W)
+        painted_image: numpy array (H, W, 3)
+        """
+
+        if first_frame_annotation is not None:  # first frame mask
+            # initialisation
+            mask, labels = self.mapper.convert_mask(first_frame_annotation)
+            mask = torch.Tensor(mask).to(self.device)
+            self.tracker.set_all_labels(list(self.mapper.remappings.values()))
+        else:
+            mask = None
+            labels = None
+        # prepare inputs
+        frame_tensor = self.im_transform(frame).to(self.device)
+        # track one frame
+        probs, _ = self.tracker.step(frame_tensor, mask, labels)  # logits 2 (bg fg) H W
+        # # refine
+        # if first_frame_annotation is None:
+        #     out_mask = self.sam_refinement(frame, logits[1], ti)
+
+        # convert to mask
+        out_mask = torch.argmax(probs, dim=0)
+        out_mask = (out_mask.detach().cpu().numpy()).astype(np.uint8)
+
+        final_mask = np.zeros_like(out_mask)
+
+        # map back
+        for k, v in self.mapper.remappings.items():
+            final_mask[out_mask == v] = k
+
+        num_objs = final_mask.max()
+        painted_image = frame
+        for obj in range(1, num_objs + 1):
+            if np.max(final_mask == obj) == 0:
+                continue
+            painted_image = mask_painter(
+                painted_image, (final_mask == obj).astype("uint8"), mask_color=obj + 1
+            )
+
+        # print(f'max memory allocated: {torch.cuda.max_memory_allocated()/(2**20)} MB')
+
+        return final_mask, final_mask, painted_image
+
+    @torch.no_grad()
+    def sam_refinement(self, frame, logits, ti):
+        """
+        refine segmentation results with mask prompt
+        """
+        # convert to 1, 256, 256
+        self.sam_model.set_image(frame)
+        mode = "mask"
+        logits = logits.unsqueeze(0)
+        logits = self.resizer(logits).cpu().numpy()
+        prompts = {"mask_input": logits}  # 1 256 256
+        masks, scores, logits = self.sam_model.predict(
+            prompts, mode, multimask=True
+        )  # masks (n, h, w), scores (n,), logits (n, 256, 256)
+        painted_image = mask_painter(
+            frame, masks[np.argmax(scores)].astype("uint8"), mask_alpha=0.8
+        )
+        painted_image = Image.fromarray(painted_image)
+        painted_image.save(f"/ssd1/gaomingqi/refine/{ti:05d}.png")
+        self.sam_model.reset_image()
+
+    @torch.no_grad()
+    def clear_memory(self):
+        self.tracker.clear_memory()
+        self.mapper.clear_labels()
+        torch.cuda.empty_cache()
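A sketch of driving BaseTracker frame by frame, assuming the repository layout above and a downloaded XMem checkpoint (the path below is an assumption). The tracker modules import each other as `inference.*` / `model.*`, so tracker/ is added to the path here.

# Sketch: propagate a first-frame mask through a short clip (frames are random placeholders).
import sys
sys.path.append("tracker")   # so that `from inference.inference_core import ...` resolves
import numpy as np
from tracker.base_tracker import BaseTracker

tracker = BaseTracker("checkpoints/XMem-s012.pth", device="cpu")   # assumed checkpoint path

frames = [(np.random.rand(480, 854, 3) * 255).astype(np.uint8) for _ in range(5)]
first_mask = np.zeros((480, 854), dtype=np.uint8)
first_mask[200:300, 300:400] = 1                 # object 1 annotated on frame 0

for i, frame in enumerate(frames):
    if i == 0:
        mask, logit, painted = tracker.track(frame, first_mask)   # initialise with the mask
    else:
        mask, logit, painted = tracker.track(frame)               # mask-free propagation

tracker.clear_memory()                           # reset before processing another video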
tracker/config/config.yaml
ADDED
@@ -0,0 +1,15 @@
+# config info for XMem
+benchmark: False
+disable_long_term: False
+max_mid_term_frames: 10
+min_mid_term_frames: 5
+max_long_term_elements: 1000
+num_prototypes: 128
+top_k: 30
+mem_every: 5
+deep_update_every: -1
+save_scores: False
+flip: False
+size: 480
+enable_long_term: True
+enable_long_term_count_usage: True
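These keys feed directly into InferenceCore and MemoryManager further down. A small sketch of loading and tweaking the file at runtime; the override values are arbitrary examples, not recommended settings.

# Sketch: load the tracker config and tighten the memory schedule.
import yaml

with open("tracker/config/config.yaml") as f:
    config = yaml.safe_load(f)

config["mem_every"] = 3    # memorise every 3rd frame instead of every 5th
config["top_k"] = 20       # sharper top-k softmax during memory readout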
tracker/inference/__init__.py
ADDED
File without changes
tracker/inference/inference_core.py
ADDED
@@ -0,0 +1,149 @@
+from inference.memory_manager import MemoryManager
+from model.network import XMem
+from model.aggregate import aggregate
+
+from tracker.util.tensor_util import pad_divide_by, unpad
+
+
+class InferenceCore:
+    def __init__(self, network: XMem, config):
+        self.config = config
+        self.network = network
+        self.mem_every = config["mem_every"]
+        self.deep_update_every = config["deep_update_every"]
+        self.enable_long_term = config["enable_long_term"]
+
+        # if deep_update_every < 0, synchronize deep update with memory frame
+        self.deep_update_sync = self.deep_update_every < 0
+
+        self.clear_memory()
+        self.all_labels = None
+
+    def clear_memory(self):
+        self.curr_ti = -1
+        self.last_mem_ti = 0
+        if not self.deep_update_sync:
+            self.last_deep_update_ti = -self.deep_update_every
+        self.memory = MemoryManager(config=self.config)
+
+    def update_config(self, config):
+        self.mem_every = config["mem_every"]
+        self.deep_update_every = config["deep_update_every"]
+        self.enable_long_term = config["enable_long_term"]
+
+        # if deep_update_every < 0, synchronize deep update with memory frame
+        self.deep_update_sync = self.deep_update_every < 0
+        self.memory.update_config(config)
+
+    def set_all_labels(self, all_labels):
+        # self.all_labels = [l.item() for l in all_labels]
+        self.all_labels = all_labels
+
+    def step(self, image, mask=None, valid_labels=None, end=False):
+        # image: 3*H*W
+        # mask: num_objects*H*W or None
+        self.curr_ti += 1
+        image, self.pad = pad_divide_by(image, 16)
+        image = image.unsqueeze(0)  # add the batch dimension
+
+        is_mem_frame = (
+            (self.curr_ti - self.last_mem_ti >= self.mem_every) or (mask is not None)
+        ) and (not end)
+        need_segment = (self.curr_ti > 0) and (
+            (valid_labels is None) or (len(self.all_labels) != len(valid_labels))
+        )
+        is_deep_update = (
+            (self.deep_update_sync and is_mem_frame)
+            or (  # synchronized
+                not self.deep_update_sync
+                and self.curr_ti - self.last_deep_update_ti >= self.deep_update_every
+            )  # no-sync
+        ) and (not end)
+        is_normal_update = (not self.deep_update_sync or not is_deep_update) and (
+            not end
+        )
+
+        key, shrinkage, selection, f16, f8, f4 = self.network.encode_key(
+            image, need_ek=(self.enable_long_term or need_segment), need_sk=is_mem_frame
+        )
+        multi_scale_features = (f16, f8, f4)
+
+        # segment the current frame if needed
+        if need_segment:
+            memory_readout = self.memory.match_memory(key, selection).unsqueeze(0)
+
+            hidden, pred_logits_with_bg, pred_prob_with_bg = self.network.segment(
+                multi_scale_features,
+                memory_readout,
+                self.memory.get_hidden(),
+                h_out=is_normal_update,
+                strip_bg=False,
+            )
+            # remove batch dim
+            pred_prob_with_bg = pred_prob_with_bg[0]
+            pred_prob_no_bg = pred_prob_with_bg[1:]
+
+            pred_logits_with_bg = pred_logits_with_bg[0]
+            pred_logits_no_bg = pred_logits_with_bg[1:]
+
+            if is_normal_update:
+                self.memory.set_hidden(hidden)
+        else:
+            pred_prob_no_bg = (
+                pred_prob_with_bg
+            ) = pred_logits_with_bg = pred_logits_no_bg = None
+
+        # use the input mask if any
+        if mask is not None:
+            mask, _ = pad_divide_by(mask, 16)
+
+            if pred_prob_no_bg is not None:
+                # if we have a predicted mask, we work on it
+                # make pred_prob_no_bg consistent with the input mask
+                mask_regions = mask.sum(0) > 0.5
+                pred_prob_no_bg[:, mask_regions] = 0
+                # shift by 1 because mask/pred_prob_no_bg do not contain background
+                mask = mask.type_as(pred_prob_no_bg)
+                if valid_labels is not None:
+                    shift_by_one_non_labels = [
+                        i
+                        for i in range(pred_prob_no_bg.shape[0])
+                        if (i + 1) not in valid_labels
+                    ]
+                    # non-labelled objects are copied from the predicted mask
+                    mask[shift_by_one_non_labels] = pred_prob_no_bg[
+                        shift_by_one_non_labels
+                    ]
+            pred_prob_with_bg = aggregate(mask, dim=0)
+
+            # also create new hidden states
+            self.memory.create_hidden_state(len(self.all_labels), key)
+
+        # save as memory if needed
+        if is_mem_frame:
+            value, hidden = self.network.encode_value(
+                image,
+                f16,
+                self.memory.get_hidden(),
+                pred_prob_with_bg[1:].unsqueeze(0),
+                is_deep_update=is_deep_update,
+            )
+            self.memory.add_memory(
+                key,
+                shrinkage,
+                value,
+                self.all_labels,
+                selection=selection if self.enable_long_term else None,
+            )
+            self.last_mem_ti = self.curr_ti
+
+            if is_deep_update:
+                self.memory.set_hidden(hidden)
+                self.last_deep_update_ti = self.curr_ti
+
+        if pred_logits_with_bg is None:
+            return unpad(pred_prob_with_bg, self.pad), None
+        else:
+            return unpad(pred_prob_with_bg, self.pad), unpad(
+                pred_logits_with_bg, self.pad
+            )
tracker/inference/kv_memory_store.py
ADDED
@@ -0,0 +1,234 @@
+import torch
+from typing import List
+
+
+class KeyValueMemoryStore:
+    """
+    Works for key/value pairs type storage
+    e.g., working and long-term memory
+    """
+
+    """
+    An object group is created when new objects enter the video
+    Objects in the same group share the same temporal extent
+    i.e., objects initialized in the same frame are in the same group
+    For DAVIS/interactive, there is only one object group
+    For YouTubeVOS, there can be multiple object groups
+    """
+
+    def __init__(self, count_usage: bool):
+        self.count_usage = count_usage
+
+        # keys are stored in a single tensor and are shared between groups/objects
+        # values are stored as a list indexed by object groups
+        self.k = None
+        self.v = []
+        self.obj_groups = []
+        # for debugging only
+        self.all_objects = []
+
+        # shrinkage and selection are also single tensors
+        self.s = self.e = None
+
+        # usage
+        if self.count_usage:
+            self.use_count = self.life_count = None
+
+    def add(self, key, value, shrinkage, selection, objects: List[int]):
+        new_count = torch.zeros(
+            (key.shape[0], 1, key.shape[2]), device=key.device, dtype=torch.float32
+        )
+        new_life = (
+            torch.zeros(
+                (key.shape[0], 1, key.shape[2]), device=key.device, dtype=torch.float32
+            )
+            + 1e-7
+        )
+
+        # add the key
+        if self.k is None:
+            self.k = key
+            self.s = shrinkage
+            self.e = selection
+            if self.count_usage:
+                self.use_count = new_count
+                self.life_count = new_life
+        else:
+            self.k = torch.cat([self.k, key], -1)
+            if shrinkage is not None:
+                self.s = torch.cat([self.s, shrinkage], -1)
+            if selection is not None:
+                self.e = torch.cat([self.e, selection], -1)
+            if self.count_usage:
+                self.use_count = torch.cat([self.use_count, new_count], -1)
+                self.life_count = torch.cat([self.life_count, new_life], -1)
+
+        # add the value
+        if objects is not None:
+            # When objects is given, v is a tensor; used in working memory
+            assert isinstance(value, torch.Tensor)
+            # First consume objects that are already in the memory bank
+            # cannot use set here because we need to preserve order
+            # shift by one as background is not part of value
+            remaining_objects = [obj - 1 for obj in objects]
+            for gi, group in enumerate(self.obj_groups):
+                for obj in group:
+                    # should properly raise an error if there are overlaps in obj_groups
+                    remaining_objects.remove(obj)
+                self.v[gi] = torch.cat([self.v[gi], value[group]], -1)
+
+            # If there are remaining objects, add them as a new group
+            if len(remaining_objects) > 0:
+                new_group = list(remaining_objects)
+                self.v.append(value[new_group])
+                self.obj_groups.append(new_group)
+                self.all_objects.extend(new_group)
+
+            assert (
+                sorted(self.all_objects) == self.all_objects
+            ), "Objects MUST be inserted in sorted order "
+        else:
+            # When objects is not given, v is a list that already has the object groups sorted
+            # used in long-term memory
+            assert isinstance(value, list)
+            for gi, gv in enumerate(value):
+                if gv is None:
+                    continue
+                if gi < self.num_groups:
+                    self.v[gi] = torch.cat([self.v[gi], gv], -1)
+                else:
+                    self.v.append(gv)
+
+    def update_usage(self, usage):
+        # increase all life count by 1
+        # increase use of indexed elements
+        if not self.count_usage:
+            return
+
+        self.use_count += usage.view_as(self.use_count)
+        self.life_count += 1
+
+    def sieve_by_range(self, start: int, end: int, min_size: int):
+        # keep only the elements *outside* of this range (with some boundary conditions)
+        # i.e., concat (a[:start], a[end:])
+        # min_size is only used for values, we do not sieve values under this size
+        # (because they are not consolidated)
+
+        if end == 0:
+            # negative 0 would not work as the end index!
+            self.k = self.k[:, :, :start]
+            if self.count_usage:
+                self.use_count = self.use_count[:, :, :start]
+                self.life_count = self.life_count[:, :, :start]
+            if self.s is not None:
+                self.s = self.s[:, :, :start]
+            if self.e is not None:
+                self.e = self.e[:, :, :start]
+
+            for gi in range(self.num_groups):
+                if self.v[gi].shape[-1] >= min_size:
+                    self.v[gi] = self.v[gi][:, :, :start]
+        else:
+            self.k = torch.cat([self.k[:, :, :start], self.k[:, :, end:]], -1)
+            if self.count_usage:
+                self.use_count = torch.cat(
+                    [self.use_count[:, :, :start], self.use_count[:, :, end:]], -1
+                )
+                self.life_count = torch.cat(
+                    [self.life_count[:, :, :start], self.life_count[:, :, end:]], -1
+                )
+            if self.s is not None:
+                self.s = torch.cat([self.s[:, :, :start], self.s[:, :, end:]], -1)
+            if self.e is not None:
+                self.e = torch.cat([self.e[:, :, :start], self.e[:, :, end:]], -1)
+
+            for gi in range(self.num_groups):
+                if self.v[gi].shape[-1] >= min_size:
+                    self.v[gi] = torch.cat(
+                        [self.v[gi][:, :, :start], self.v[gi][:, :, end:]], -1
+                    )
+
+    def remove_obsolete_features(self, max_size: int):
+        # normalize with life duration
+        usage = self.get_usage().flatten()
+
+        values, _ = torch.topk(
+            usage, k=(self.size - max_size), largest=False, sorted=True
+        )
+        survived = usage > values[-1]
+
+        self.k = self.k[:, :, survived]
+        self.s = self.s[:, :, survived] if self.s is not None else None
+        # Long-term memory does not store ek so this should not be needed
+        self.e = self.e[:, :, survived] if self.e is not None else None
+        if self.num_groups > 1:
+            raise NotImplementedError(
+                """The current data structure does not support feature removal with
+                multiple object groups (e.g., some objects start to appear later in the video)
+                The indices for "survived" is based on keys but not all values are present for every key
+                Basically we need to remap the indices for keys to values
+                """
+            )
+        for gi in range(self.num_groups):
+            self.v[gi] = self.v[gi][:, :, survived]
+
+        self.use_count = self.use_count[:, :, survived]
+        self.life_count = self.life_count[:, :, survived]
+
+    def get_usage(self):
+        # return normalized usage
+        if not self.count_usage:
+            raise RuntimeError("I did not count usage!")
+        else:
+            usage = self.use_count / self.life_count
+            return usage
+
+    def get_all_sliced(self, start: int, end: int):
+        # return k, sk, ek, usage in order, sliced by start and end
+
+        if end == 0:
+            # negative 0 would not work as the end index!
+            k = self.k[:, :, start:]
+            sk = self.s[:, :, start:] if self.s is not None else None
+            ek = self.e[:, :, start:] if self.e is not None else None
+            usage = self.get_usage()[:, :, start:]
+        else:
+            k = self.k[:, :, start:end]
+            sk = self.s[:, :, start:end] if self.s is not None else None
+            ek = self.e[:, :, start:end] if self.e is not None else None
+            usage = self.get_usage()[:, :, start:end]
+
+        return k, sk, ek, usage
+
+    def get_v_size(self, ni: int):
+        return self.v[ni].shape[2]
+
+    def engaged(self):
+        return self.k is not None
+
+    @property
+    def size(self):
+        if self.k is None:
+            return 0
+        else:
+            return self.k.shape[-1]
+
+    @property
+    def num_groups(self):
+        return len(self.v)
+
+    @property
+    def key(self):
+        return self.k
+
+    @property
+    def value(self):
+        return self.v
+
+    @property
+    def shrinkage(self):
+        return self.s
+
+    @property
+    def selection(self):
+        return self.e
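A toy exercise of the store above with made-up feature sizes (Ck=64, Cv=512, N=100 memory elements); it only depends on torch, so it runs from the repo root as-is.

# Sketch: two objects inserted as one group, then a usage update.
import torch
from tracker.inference.kv_memory_store import KeyValueMemoryStore

store = KeyValueMemoryStore(count_usage=True)
key       = torch.randn(1, 64, 100)    # B x Ck x N
shrinkage = torch.rand(1, 1, 100)      # B x 1  x N
selection = torch.rand(1, 64, 100)     # B x Ck x N
value     = torch.randn(2, 512, 100)   # num_objects x Cv x N
store.add(key, value, shrinkage, selection, objects=[1, 2])   # object labels are 1-based

store.update_usage(torch.rand(1, 1, 100))    # e.g. affinity mass per memory element
print(store.size, store.num_groups)          # 100 elements, 1 object group
print(store.get_usage().shape)               # torch.Size([1, 1, 100])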
tracker/inference/memory_manager.py
ADDED
@@ -0,0 +1,373 @@
+import torch
+import warnings
+
+from inference.kv_memory_store import KeyValueMemoryStore
+from model.memory_util import *
+
+
+class MemoryManager:
+    """
+    Manages all three memory stores and the transition between working/long-term memory
+    """
+
+    def __init__(self, config):
+        self.hidden_dim = config["hidden_dim"]
+        self.top_k = config["top_k"]
+
+        self.enable_long_term = config["enable_long_term"]
+        self.enable_long_term_usage = config["enable_long_term_count_usage"]
+        if self.enable_long_term:
+            self.max_mt_frames = config["max_mid_term_frames"]
+            self.min_mt_frames = config["min_mid_term_frames"]
+            self.num_prototypes = config["num_prototypes"]
+            self.max_long_elements = config["max_long_term_elements"]
+
+        # dimensions will be inferred from input later
+        self.CK = self.CV = None
+        self.H = self.W = None
+
+        # The hidden state will be stored in a single tensor for all objects
+        # B x num_objects x CH x H x W
+        self.hidden = None
+
+        self.work_mem = KeyValueMemoryStore(count_usage=self.enable_long_term)
+        if self.enable_long_term:
+            self.long_mem = KeyValueMemoryStore(count_usage=self.enable_long_term_usage)
+
+        self.reset_config = True
+
+    def update_config(self, config):
+        self.reset_config = True
+        self.hidden_dim = config["hidden_dim"]
+        self.top_k = config["top_k"]
+
+        assert self.enable_long_term == config["enable_long_term"], "cannot update this"
+        assert (
+            self.enable_long_term_usage == config["enable_long_term_count_usage"]
+        ), "cannot update this"
+
+        self.enable_long_term_usage = config["enable_long_term_count_usage"]
+        if self.enable_long_term:
+            self.max_mt_frames = config["max_mid_term_frames"]
+            self.min_mt_frames = config["min_mid_term_frames"]
+            self.num_prototypes = config["num_prototypes"]
+            self.max_long_elements = config["max_long_term_elements"]
+
+    def _readout(self, affinity, v):
+        # this function is for a single object group
+        return v @ affinity
+
+    def match_memory(self, query_key, selection):
+        # query_key: B x C^k x H x W
+        # selection: B x C^k x H x W
+        num_groups = self.work_mem.num_groups
+        h, w = query_key.shape[-2:]
+
+        query_key = query_key.flatten(start_dim=2)
+        selection = selection.flatten(start_dim=2) if selection is not None else None
+
+        """
+        Memory readout using keys
+        """
+
+        if self.enable_long_term and self.long_mem.engaged():
+            # Use long-term memory
+            long_mem_size = self.long_mem.size
+            memory_key = torch.cat([self.long_mem.key, self.work_mem.key], -1)
+            shrinkage = torch.cat(
+                [self.long_mem.shrinkage, self.work_mem.shrinkage], -1
+            )
+
+            similarity = get_similarity(memory_key, shrinkage, query_key, selection)
+            work_mem_similarity = similarity[:, long_mem_size:]
+            long_mem_similarity = similarity[:, :long_mem_size]
+
+            # get the usage with the first group
+            # the first group always has all the keys valid
+            affinity, usage = do_softmax(
+                torch.cat(
+                    [
+                        long_mem_similarity[:, -self.long_mem.get_v_size(0) :],
+                        work_mem_similarity,
+                    ],
+                    1,
+                ),
+                top_k=self.top_k,
+                inplace=True,
+                return_usage=True,
+            )
+            affinity = [affinity]
+
+            # compute affinity group by group as later groups only have a subset of keys
+            for gi in range(1, num_groups):
+                if gi < self.long_mem.num_groups:
+                    # merge working and lt similarities before softmax
+                    affinity_one_group = do_softmax(
+                        torch.cat(
+                            [
+                                long_mem_similarity[:, -self.long_mem.get_v_size(gi) :],
+                                work_mem_similarity[:, -self.work_mem.get_v_size(gi) :],
+                            ],
+                            1,
+                        ),
+                        top_k=self.top_k,
+                        inplace=True,
+                    )
+                else:
+                    # no long-term memory for this group
+                    affinity_one_group = do_softmax(
+                        work_mem_similarity[:, -self.work_mem.get_v_size(gi) :],
+                        top_k=self.top_k,
+                        inplace=(gi == num_groups - 1),
+                    )
+                affinity.append(affinity_one_group)
+
+            all_memory_value = []
+            for gi, gv in enumerate(self.work_mem.value):
+                # merge the working and lt values before readout
+                if gi < self.long_mem.num_groups:
+                    all_memory_value.append(
+                        torch.cat(
+                            [self.long_mem.value[gi], self.work_mem.value[gi]], -1
+                        )
+                    )
+                else:
+                    all_memory_value.append(gv)
+
+            """
+            Record memory usage for working and long-term memory
+            """
+            # ignore the index return for long-term memory
+            work_usage = usage[:, long_mem_size:]
+            self.work_mem.update_usage(work_usage.flatten())
+
+            if self.enable_long_term_usage:
+                # ignore the index return for working memory
+                long_usage = usage[:, :long_mem_size]
+                self.long_mem.update_usage(long_usage.flatten())
+        else:
+            # No long-term memory
+            similarity = get_similarity(
+                self.work_mem.key, self.work_mem.shrinkage, query_key, selection
+            )
+
+            if self.enable_long_term:
+                affinity, usage = do_softmax(
+                    similarity,
+                    inplace=(num_groups == 1),
+                    top_k=self.top_k,
+                    return_usage=True,
+                )
+
+                # Record memory usage for working memory
+                self.work_mem.update_usage(usage.flatten())
+            else:
+                affinity = do_softmax(
+                    similarity,
+                    inplace=(num_groups == 1),
+                    top_k=self.top_k,
+                    return_usage=False,
+                )
+
+            affinity = [affinity]
+
+            # compute affinity group by group as later groups only have a subset of keys
+            for gi in range(1, num_groups):
+                affinity_one_group = do_softmax(
+                    similarity[:, -self.work_mem.get_v_size(gi) :],
+                    top_k=self.top_k,
+                    inplace=(gi == num_groups - 1),
+                )
+                affinity.append(affinity_one_group)
+
+            all_memory_value = self.work_mem.value
+
+        # Shared affinity within each group
+        all_readout_mem = torch.cat(
+            [self._readout(affinity[gi], gv) for gi, gv in enumerate(all_memory_value)],
+            0,
+        )
+
+        return all_readout_mem.view(all_readout_mem.shape[0], self.CV, h, w)
+
+    def add_memory(self, key, shrinkage, value, objects, selection=None):
+        # key: 1*C*H*W
+        # value: 1*num_objects*C*H*W
+        # objects contain a list of object indices
+        if self.H is None or self.reset_config:
+            self.reset_config = False
+            self.H, self.W = key.shape[-2:]
+            self.HW = self.H * self.W
+            if self.enable_long_term:
+                # convert from num. frames to num. nodes
+                self.min_work_elements = self.min_mt_frames * self.HW
+                self.max_work_elements = self.max_mt_frames * self.HW
+
+        # key: 1*C*N
+        # value: num_objects*C*N
+        key = key.flatten(start_dim=2)
+        shrinkage = shrinkage.flatten(start_dim=2)
+        value = value[0].flatten(start_dim=2)
+
+        self.CK = key.shape[1]
+        self.CV = value.shape[1]
+
+        if selection is not None:
+            if not self.enable_long_term:
+                warnings.warn(
+                    "the selection factor is only needed in long-term mode", UserWarning
+                )
+            selection = selection.flatten(start_dim=2)
+
+        self.work_mem.add(key, value, shrinkage, selection, objects)
+
+        # long-term memory cleanup
+        if self.enable_long_term:
+            # Do memory compression if needed
+            if self.work_mem.size >= self.max_work_elements:
+                # print('remove memory')
+                # Remove obsolete features if needed
+                if self.long_mem.size >= (self.max_long_elements - self.num_prototypes):
+                    self.long_mem.remove_obsolete_features(
+                        self.max_long_elements - self.num_prototypes
+                    )
+
+                self.compress_features()
+
+    def create_hidden_state(self, n, sample_key):
+        # n is the TOTAL number of objects
+        h, w = sample_key.shape[-2:]
+        if self.hidden is None:
+            self.hidden = torch.zeros(
+                (1, n, self.hidden_dim, h, w), device=sample_key.device
+            )
+        elif self.hidden.shape[1] != n:
+            self.hidden = torch.cat(
+                [
+                    self.hidden,
+                    torch.zeros(
+                        (1, n - self.hidden.shape[1], self.hidden_dim, h, w),
+                        device=sample_key.device,
+                    ),
+                ],
+                1,
+            )
+
+        assert self.hidden.shape[1] == n
+
+    def set_hidden(self, hidden):
+        self.hidden = hidden
+
+    def get_hidden(self):
+        return self.hidden
+
+    def compress_features(self):
+        HW = self.HW
+        candidate_value = []
+        total_work_mem_size = self.work_mem.size
+        for gv in self.work_mem.value:
+            # Some object groups might be added later in the video
+            # So not all keys have values associated with all objects
+            # We need to keep track of the key->value validity
+            mem_size_in_this_group = gv.shape[-1]
+            if mem_size_in_this_group == total_work_mem_size:
+                # full LT
+                candidate_value.append(gv[:, :, HW : -self.min_work_elements + HW])
+            else:
+                # mem_size is smaller than total_work_mem_size, but at least HW
+                assert HW <= mem_size_in_this_group < total_work_mem_size
+                if mem_size_in_this_group > self.min_work_elements + HW:
+                    # part of this object group still goes into LT
+                    candidate_value.append(gv[:, :, HW : -self.min_work_elements + HW])
+                else:
+                    # this object group cannot go to the LT at all
+                    candidate_value.append(None)
+
+        # perform memory consolidation
+        prototype_key, prototype_value, prototype_shrinkage = self.consolidation(
+            *self.work_mem.get_all_sliced(HW, -self.min_work_elements + HW),
+            candidate_value
+        )
+
+        # remove consolidated working memory
+        self.work_mem.sieve_by_range(
+            HW, -self.min_work_elements + HW, min_size=self.min_work_elements + HW
+        )
+
+        # add to long-term memory
+        self.long_mem.add(
+            prototype_key,
+            prototype_value,
+            prototype_shrinkage,
+            selection=None,
+            objects=None,
+        )
+        # print(f'long memory size: {self.long_mem.size}')
+        # print(f'work memory size: {self.work_mem.size}')
+
+    def consolidation(
+        self,
+        candidate_key,
+        candidate_shrinkage,
+        candidate_selection,
+        usage,
+        candidate_value,
+    ):
+        # keys: 1*C*N
+        # values: num_objects*C*N
+        N = candidate_key.shape[-1]
+
+        # find the indices with max usage
+        _, max_usage_indices = torch.topk(
+            usage, k=self.num_prototypes, dim=-1, sorted=True
+        )
+        prototype_indices = max_usage_indices.flatten()
+
+        # Prototypes are invalid for out-of-bound groups
+        validity = [
+            prototype_indices >= (N - gv.shape[2]) if gv is not None else None
+            for gv in candidate_value
+        ]
+
+        prototype_key = candidate_key[:, :, prototype_indices]
+        prototype_selection = (
+            candidate_selection[:, :, prototype_indices]
+            if candidate_selection is not None
+            else None
+        )
+
+        """
+        Potentiation step
+        """
+        similarity = get_similarity(
+            candidate_key, candidate_shrinkage, prototype_key, prototype_selection
+        )
+
+        # convert similarity to affinity
+        # need to do it group by group since the softmax normalization would be different
+        affinity = [
+            do_softmax(similarity[:, -gv.shape[2] :, validity[gi]])
+            if gv is not None
+            else None
+            for gi, gv in enumerate(candidate_value)
+        ]
+
+        # some values can have all False validity. Weed them out.
+        affinity = [
+            aff if aff is None or aff.shape[-1] > 0 else None for aff in affinity
+        ]
+
+        # readout the values
+        prototype_value = [
+            self._readout(affinity[gi], gv) if affinity[gi] is not None else None
+            for gi, gv in enumerate(candidate_value)
+        ]
+
+        # readout the shrinkage term
+        prototype_shrinkage = (
+            self._readout(affinity[0], candidate_shrinkage)
+            if candidate_shrinkage is not None
+            else None
+        )
+
+        return prototype_key, prototype_value, prototype_shrinkage
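A sketch of constructing MemoryManager directly. Note that `hidden_dim` is not in config.yaml; in this repo it appears to be filled in when the XMem weights are loaded, so the value below is an assumption for illustration only.

# Sketch: hand-built MemoryManager from the shipped config plus an assumed hidden_dim.
import sys
sys.path.append("tracker")   # tracker modules import each other as `inference.*` / `model.*`
import yaml
from inference.memory_manager import MemoryManager

with open("tracker/config/config.yaml") as f:
    config = yaml.safe_load(f)
config["hidden_dim"] = 64    # assumed; normally injected by the loaded network

memory = MemoryManager(config)
print(memory.top_k, memory.max_mt_frames, memory.num_prototypes)   # 30 10 128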
tracker/model/__init__.py
ADDED
File without changes
tracker/model/aggregate.py
ADDED
@@ -0,0 +1,16 @@
+import torch
+import torch.nn.functional as F
+
+
+# Soft aggregation from STM
+def aggregate(prob, dim, return_logits=False):
+    new_prob = torch.cat(
+        [torch.prod(1 - prob, dim=dim, keepdim=True), prob], dim
+    ).clamp(1e-7, 1 - 1e-7)
+    logits = torch.log((new_prob / (1 - new_prob)))
+    prob = F.softmax(logits, dim=dim)
+
+    if return_logits:
+        return logits, prob
+    else:
+        return prob
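A toy check of the soft aggregation above: per-object probabilities at a single pixel are turned into a background+objects distribution that sums to one.

# Sketch: two object probabilities at one pixel.
import torch
from tracker.model.aggregate import aggregate

prob = torch.tensor([[0.6], [0.3]])   # num_objects x N (here N = 1 pixel)
out = aggregate(prob, dim=0)          # row 0 becomes the background channel
print(out.shape)                      # torch.Size([3, 1]) -> [bg, obj1, obj2]
print(out.sum(dim=0))                 # ~1.0 after the softmax renormalisation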
tracker/model/cbam.py
ADDED
@@ -0,0 +1,119 @@
+# Modified from https://github.com/Jongchan/attention-module/blob/master/MODELS/cbam.py
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class BasicConv(nn.Module):
+    def __init__(
+        self,
+        in_planes,
+        out_planes,
+        kernel_size,
+        stride=1,
+        padding=0,
+        dilation=1,
+        groups=1,
+        bias=True,
+    ):
+        super(BasicConv, self).__init__()
+        self.out_channels = out_planes
+        self.conv = nn.Conv2d(
+            in_planes,
+            out_planes,
+            kernel_size=kernel_size,
+            stride=stride,
+            padding=padding,
+            dilation=dilation,
+            groups=groups,
+            bias=bias,
+        )
+
+    def forward(self, x):
+        x = self.conv(x)
+        return x
+
+
+class Flatten(nn.Module):
+    def forward(self, x):
+        return x.view(x.size(0), -1)
+
+
+class ChannelGate(nn.Module):
+    def __init__(self, gate_channels, reduction_ratio=16, pool_types=["avg", "max"]):
+        super(ChannelGate, self).__init__()
+        self.gate_channels = gate_channels
+        self.mlp = nn.Sequential(
+            Flatten(),
+            nn.Linear(gate_channels, gate_channels // reduction_ratio),
+            nn.ReLU(),
+            nn.Linear(gate_channels // reduction_ratio, gate_channels),
+        )
+        self.pool_types = pool_types
+
+    def forward(self, x):
+        channel_att_sum = None
+        for pool_type in self.pool_types:
+            if pool_type == "avg":
+                avg_pool = F.avg_pool2d(
+                    x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))
+                )
+                channel_att_raw = self.mlp(avg_pool)
+            elif pool_type == "max":
+                max_pool = F.max_pool2d(
+                    x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))
+                )
+                channel_att_raw = self.mlp(max_pool)
+
+            if channel_att_sum is None:
+                channel_att_sum = channel_att_raw
+            else:
+                channel_att_sum = channel_att_sum + channel_att_raw
+
+        scale = torch.sigmoid(channel_att_sum).unsqueeze(2).unsqueeze(3).expand_as(x)
+        return x * scale
+
+
+class ChannelPool(nn.Module):
+    def forward(self, x):
+        return torch.cat(
+            (torch.max(x, 1)[0].unsqueeze(1), torch.mean(x, 1).unsqueeze(1)), dim=1
+        )
+
+
+class SpatialGate(nn.Module):
+    def __init__(self):
+        super(SpatialGate, self).__init__()
+        kernel_size = 7
+        self.compress = ChannelPool()
+        self.spatial = BasicConv(
+            2, 1, kernel_size, stride=1, padding=(kernel_size - 1) // 2
+        )
+
+    def forward(self, x):
+        x_compress = self.compress(x)
+        x_out = self.spatial(x_compress)
+        scale = torch.sigmoid(x_out)  # broadcasting
+        return x * scale
+
+
+class CBAM(nn.Module):
+    def __init__(
+        self,
+        gate_channels,
+        reduction_ratio=16,
+        pool_types=["avg", "max"],
+        no_spatial=False,
+    ):
+        super(CBAM, self).__init__()
+        self.ChannelGate = ChannelGate(gate_channels, reduction_ratio, pool_types)
+        self.no_spatial = no_spatial
+        if not no_spatial:
+            self.SpatialGate = SpatialGate()
+
+    def forward(self, x):
+        x_out = self.ChannelGate(x)
+        if not self.no_spatial:
+            x_out = self.SpatialGate(x_out)
+        return x_out
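A quick shape check of the attention block above; it only needs torch, so it runs from the repo root as-is.

# Sketch: CBAM is shape-preserving on standard NCHW feature maps.
import torch
from tracker.model.cbam import CBAM

block = CBAM(gate_channels=64)
feat = torch.randn(2, 64, 24, 24)
print(block(feat).shape)   # torch.Size([2, 64, 24, 24])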
tracker/model/group_modules.py
ADDED
@@ -0,0 +1,92 @@
+"""
+Group-specific modules
+They handle features that also depend on the mask.
+Features are typically of shape
+    batch_size * num_objects * num_channels * H * W
+
+All of them are permutation equivariant w.r.t. the num_objects dimension
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+def interpolate_groups(g, ratio, mode, align_corners):
+    batch_size, num_objects = g.shape[:2]
+    g = F.interpolate(
+        g.flatten(start_dim=0, end_dim=1),
+        scale_factor=ratio,
+        mode=mode,
+        align_corners=align_corners,
+    )
+    g = g.view(batch_size, num_objects, *g.shape[1:])
+    return g
+
+
+def upsample_groups(g, ratio=2, mode="bilinear", align_corners=False):
+    return interpolate_groups(g, ratio, mode, align_corners)
+
+
+def downsample_groups(g, ratio=1 / 2, mode="area", align_corners=None):
+    return interpolate_groups(g, ratio, mode, align_corners)
+
+
+class GConv2D(nn.Conv2d):
+    def forward(self, g):
+        batch_size, num_objects = g.shape[:2]
+        g = super().forward(g.flatten(start_dim=0, end_dim=1))
+        return g.view(batch_size, num_objects, *g.shape[1:])
+
+
+class GroupResBlock(nn.Module):
+    def __init__(self, in_dim, out_dim):
+        super().__init__()
+
+        if in_dim == out_dim:
+            self.downsample = None
+        else:
+            self.downsample = GConv2D(in_dim, out_dim, kernel_size=3, padding=1)
+
+        self.conv1 = GConv2D(in_dim, out_dim, kernel_size=3, padding=1)
+        self.conv2 = GConv2D(out_dim, out_dim, kernel_size=3, padding=1)
+
+    def forward(self, g):
+        out_g = self.conv1(F.relu(g))
+        out_g = self.conv2(F.relu(out_g))
+
+        if self.downsample is not None:
+            g = self.downsample(g)
+
+        return out_g + g
+
+
+class MainToGroupDistributor(nn.Module):
+    def __init__(self, x_transform=None, method="cat", reverse_order=False):
+        super().__init__()
+
+        self.x_transform = x_transform
+        self.method = method
+        self.reverse_order = reverse_order
+
+    def forward(self, x, g):
+        num_objects = g.shape[1]
+
+        if self.x_transform is not None:
+            x = self.x_transform(x)
+
+        if self.method == "cat":
+            if self.reverse_order:
+                g = torch.cat(
+                    [g, x.unsqueeze(1).expand(-1, num_objects, -1, -1, -1)], 2
+                )
+            else:
+                g = torch.cat(
+                    [x.unsqueeze(1).expand(-1, num_objects, -1, -1, -1), g], 2
+                )
+        elif self.method == "add":
+            g = x.unsqueeze(1).expand(-1, num_objects, -1, -1, -1) + g
+        else:
+            raise NotImplementedError
+
+        return g
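A shape check for the group (per-object) wrappers above; the tensors carry an extra num_objects dimension and the module flattens/unflattens it internally.

# Sketch: per-object convolution, distribution and upsampling on toy tensors.
import torch
from tracker.model.group_modules import GConv2D, MainToGroupDistributor, upsample_groups

g = torch.randn(2, 3, 16, 24, 24)            # B x num_objects x C x H x W
x = torch.randn(2, 32, 24, 24)               # B x C x H x W (image-only feature)

conv = GConv2D(16, 8, kernel_size=3, padding=1)
print(conv(g).shape)                          # torch.Size([2, 3, 8, 24, 24])
print(MainToGroupDistributor()(x, g).shape)   # torch.Size([2, 3, 48, 24, 24]) - cat along C
print(upsample_groups(g).shape)               # torch.Size([2, 3, 16, 48, 48])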
tracker/model/losses.py
ADDED
@@ -0,0 +1,76 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from collections import defaultdict
+
+
+def dice_loss(input_mask, cls_gt):
+    num_objects = input_mask.shape[1]
+    losses = []
+    for i in range(num_objects):
+        mask = input_mask[:, i].flatten(start_dim=1)
+        # background not in mask, so we add one to cls_gt
+        gt = (cls_gt == (i + 1)).float().flatten(start_dim=1)
+        numerator = 2 * (mask * gt).sum(-1)
+        denominator = mask.sum(-1) + gt.sum(-1)
+        loss = 1 - (numerator + 1) / (denominator + 1)
+        losses.append(loss)
+    return torch.cat(losses).mean()
+
+
+# https://stackoverflow.com/questions/63735255/how-do-i-compute-bootstrapped-cross-entropy-loss-in-pytorch
+class BootstrappedCE(nn.Module):
+    def __init__(self, start_warm, end_warm, top_p=0.15):
+        super().__init__()
+
+        self.start_warm = start_warm
+        self.end_warm = end_warm
+        self.top_p = top_p
+
+    def forward(self, input, target, it):
+        if it < self.start_warm:
+            return F.cross_entropy(input, target), 1.0
+
+        raw_loss = F.cross_entropy(input, target, reduction="none").view(-1)
+        num_pixels = raw_loss.numel()
+
+        if it > self.end_warm:
+            this_p = self.top_p
+        else:
+            this_p = self.top_p + (1 - self.top_p) * (
+                (self.end_warm - it) / (self.end_warm - self.start_warm)
+            )
+        loss, _ = torch.topk(raw_loss, int(num_pixels * this_p), sorted=False)
+        return loss.mean(), this_p
+
+
+class LossComputer:
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.bce = BootstrappedCE(config["start_warm"], config["end_warm"])
+
+    def compute(self, data, num_objects, it):
+        losses = defaultdict(int)
+
+        b, t = data["rgb"].shape[:2]
+
+        losses["total_loss"] = 0
+        for ti in range(1, t):
+            for bi in range(b):
+                loss, p = self.bce(
+                    data[f"logits_{ti}"][bi : bi + 1, : num_objects[bi] + 1],
+                    data["cls_gt"][bi : bi + 1, ti, 0],
+                    it,
+                )
+                losses["p"] += p / b / (t - 1)
+                losses[f"ce_loss_{ti}"] += loss / b
+
+            losses["total_loss"] += losses["ce_loss_%d" % ti]
+            losses[f"dice_loss_{ti}"] = dice_loss(
+                data[f"masks_{ti}"], data["cls_gt"][:, ti, 0]
+            )
+            losses["total_loss"] += losses[f"dice_loss_{ti}"]
+
+        return losses
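A toy call of dice_loss to make the expected shapes concrete: soft per-object masks against an integer ground-truth map whose labels start at 1 (0 is background).

# Sketch: one object, 4x4 image, partial overlap.
import torch
from tracker.model.losses import dice_loss

pred = torch.zeros(1, 1, 4, 4)      # B x num_objects x H x W, soft mask for object 1
pred[0, 0, :2, :2] = 1.0
gt = torch.zeros(1, 4, 4, dtype=torch.long)
gt[0, :2, :3] = 1
print(dice_loss(pred, gt))          # ~0.18 for this partial overlap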
tracker/model/memory_util.py
ADDED
@@ -0,0 +1,87 @@
+import math
+import numpy as np
+import torch
+from typing import Optional
+
+
+def get_similarity(mk, ms, qk, qe):
+    # used for training/inference and memory reading/memory potentiation
+    # mk: B x CK x [N]    - Memory keys
+    # ms: B x  1 x [N]    - Memory shrinkage
+    # qk: B x CK x [HW/P] - Query keys
+    # qe: B x CK x [HW/P] - Query selection
+    # Dimensions in [] are flattened
+    CK = mk.shape[1]
+    mk = mk.flatten(start_dim=2)
+    ms = ms.flatten(start_dim=1).unsqueeze(2) if ms is not None else None
+    qk = qk.flatten(start_dim=2)
+    qe = qe.flatten(start_dim=2) if qe is not None else None
+
+    if qe is not None:
+        # See appendix for derivation
+        # or you can just trust me ヽ(ー_ー )ノ
+        mk = mk.transpose(1, 2)
+        a_sq = mk.pow(2) @ qe
+        two_ab = 2 * (mk @ (qk * qe))
+        b_sq = (qe * qk.pow(2)).sum(1, keepdim=True)
+        similarity = -a_sq + two_ab - b_sq
+    else:
+        # similar to STCN if we don't have the selection term
+        a_sq = mk.pow(2).sum(1).unsqueeze(2)
+        two_ab = 2 * (mk.transpose(1, 2) @ qk)
+        similarity = -a_sq + two_ab
+
+    if ms is not None:
+        similarity = similarity * ms / math.sqrt(CK)  # B*N*HW
+    else:
+        similarity = similarity / math.sqrt(CK)  # B*N*HW
+
+    return similarity
+
+
+def do_softmax(
+    similarity, top_k: Optional[int] = None, inplace=False, return_usage=False
+):
+    # normalize similarity with top-k softmax
+    # similarity: B x N x [HW/P]
+    # use inplace with care
+    if top_k is not None:
+        values, indices = torch.topk(similarity, k=top_k, dim=1)
+
+        x_exp = values.exp_()
+        x_exp /= torch.sum(x_exp, dim=1, keepdim=True)
+        if inplace:
+            similarity.zero_().scatter_(1, indices, x_exp)  # B*N*HW
+            affinity = similarity
+        else:
+            affinity = torch.zeros_like(similarity).scatter_(
+                1, indices, x_exp
+            )  # B*N*HW
+    else:
+        maxes = torch.max(similarity, dim=1, keepdim=True)[0]
+        x_exp = torch.exp(similarity - maxes)
+        x_exp_sum = torch.sum(x_exp, dim=1, keepdim=True)
+        affinity = x_exp / x_exp_sum
+        indices = None
+
+    if return_usage:
+        return affinity, affinity.sum(dim=2)
+
+    return affinity
+
+
+def get_affinity(mk, ms, qk, qe):
+    # shorthand used in training with no top-k
+    similarity = get_similarity(mk, ms, qk, qe)
+    affinity = do_softmax(similarity)
+    return affinity
+
+
+def readout(affinity, mv):
+    B, CV, T, H, W = mv.shape
+
+    mo = mv.view(B, CV, T * H * W)
+    mem = torch.bmm(mo, affinity)
+    mem = mem.view(B, CV, H, W)
+
+    return mem
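A shape walk-through of the key matching above with toy sizes (100 memory elements, 50 query positions); the feature dimension 64 is arbitrary.

# Sketch: similarity then top-k softmax readout weights.
import torch
from tracker.model.memory_util import get_similarity, do_softmax

mk = torch.randn(1, 64, 100)    # B x CK x N      (memory keys)
ms = torch.rand(1, 1, 100)      # B x 1  x N      (shrinkage)
qk = torch.randn(1, 64, 50)     # B x CK x HW     (query keys)
qe = torch.rand(1, 64, 50)      # B x CK x HW     (selection)

sim = get_similarity(mk, ms, qk, qe)
print(sim.shape)                               # torch.Size([1, 100, 50])
aff, usage = do_softmax(sim, top_k=30, return_usage=True)
print(aff.sum(dim=1)[0, :3], usage.shape)      # columns sum to ~1; usage: [1, 100]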
tracker/model/modules.py
ADDED
@@ -0,0 +1,261 @@
+"""
+modules.py - This file stores the rather boring network blocks.
+
+x - usually means features that only depend on the image
+g - usually means features that also depend on the mask.
+    They might have an extra "group" or "num_objects" dimension, hence
+    batch_size * num_objects * num_channels * H * W
+
+The trailing number of a variable usually denotes the stride
+
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from model.group_modules import *
+from model import resnet
+from model.cbam import CBAM
+
+
+class FeatureFusionBlock(nn.Module):
+    def __init__(self, x_in_dim, g_in_dim, g_mid_dim, g_out_dim):
+        super().__init__()
+
+        self.distributor = MainToGroupDistributor()
+        self.block1 = GroupResBlock(x_in_dim + g_in_dim, g_mid_dim)
+        self.attention = CBAM(g_mid_dim)
+        self.block2 = GroupResBlock(g_mid_dim, g_out_dim)
+
+    def forward(self, x, g):
+        batch_size, num_objects = g.shape[:2]
+
+        g = self.distributor(x, g)
+        g = self.block1(g)
+        r = self.attention(g.flatten(start_dim=0, end_dim=1))
+        r = r.view(batch_size, num_objects, *r.shape[1:])
+
+        g = self.block2(g + r)
+
+        return g
+
+
+class HiddenUpdater(nn.Module):
+    # Used in the decoder, multi-scale feature + GRU
+    def __init__(self, g_dims, mid_dim, hidden_dim):
+        super().__init__()
+        self.hidden_dim = hidden_dim
+
+        self.g16_conv = GConv2D(g_dims[0], mid_dim, kernel_size=1)
+        self.g8_conv = GConv2D(g_dims[1], mid_dim, kernel_size=1)
+        self.g4_conv = GConv2D(g_dims[2], mid_dim, kernel_size=1)
+
+        self.transform = GConv2D(
+            mid_dim + hidden_dim, hidden_dim * 3, kernel_size=3, padding=1
+        )
+
+        nn.init.xavier_normal_(self.transform.weight)
+
+    def forward(self, g, h):
+        g = (
+            self.g16_conv(g[0])
+            + self.g8_conv(downsample_groups(g[1], ratio=1 / 2))
+            + self.g4_conv(downsample_groups(g[2], ratio=1 / 4))
+        )
+
+        g = torch.cat([g, h], 2)
+
+        # defined slightly differently than standard GRU,
+        # namely the new value is generated before the forget gate.
+        # might provide better gradient but frankly it was initially just an
+        # implementation error that I never bothered fixing
+        values = self.transform(g)
+        forget_gate = torch.sigmoid(values[:, :, : self.hidden_dim])
+        update_gate = torch.sigmoid(values[:, :, self.hidden_dim : self.hidden_dim * 2])
+        new_value = torch.tanh(values[:, :, self.hidden_dim * 2 :])
+        new_h = forget_gate * h * (1 - update_gate) + update_gate * new_value
+
+        return new_h
+
+
+class HiddenReinforcer(nn.Module):
+    # Used in the value encoder, a single GRU
+    def __init__(self, g_dim, hidden_dim):
+        super().__init__()
+        self.hidden_dim = hidden_dim
+        self.transform = GConv2D(
+            g_dim + hidden_dim, hidden_dim * 3, kernel_size=3, padding=1
+        )
+
+        nn.init.xavier_normal_(self.transform.weight)
+
+    def forward(self, g, h):
+        g = torch.cat([g, h], 2)
+
+        # defined slightly differently than standard GRU,
+        # namely the new value is generated before the forget gate.
+        # might provide better gradient but frankly it was initially just an
+        # implementation error that I never bothered fixing
+        values = self.transform(g)
+        forget_gate = torch.sigmoid(values[:, :, : self.hidden_dim])
+        update_gate = torch.sigmoid(values[:, :, self.hidden_dim : self.hidden_dim * 2])
+        new_value = torch.tanh(values[:, :, self.hidden_dim * 2 :])
+        new_h = forget_gate * h * (1 - update_gate) + update_gate * new_value
+
+        return new_h
+
+
+class ValueEncoder(nn.Module):
+    def __init__(self, value_dim, hidden_dim, single_object=False):
+        super().__init__()
+
+        self.single_object = single_object
+        network = resnet.resnet18(pretrained=True, extra_dim=1 if single_object else 2)
+        self.conv1 = network.conv1
+        self.bn1 = network.bn1
+        self.relu = network.relu  # 1/2, 64
+        self.maxpool = network.maxpool
+
+        self.layer1 = network.layer1  # 1/4, 64
+        self.layer2 = network.layer2  # 1/8, 128
+        self.layer3 = network.layer3  # 1/16, 256
+
+        self.distributor = MainToGroupDistributor()
+        self.fuser = FeatureFusionBlock(1024, 256, value_dim, value_dim)
+        if hidden_dim > 0:
+            self.hidden_reinforce = HiddenReinforcer(value_dim, hidden_dim)
|
128 |
+
else:
|
129 |
+
self.hidden_reinforce = None
|
130 |
+
|
131 |
+
def forward(self, image, image_feat_f16, h, masks, others, is_deep_update=True):
|
132 |
+
# image_feat_f16 is the feature from the key encoder
|
133 |
+
if not self.single_object:
|
134 |
+
g = torch.stack([masks, others], 2)
|
135 |
+
else:
|
136 |
+
g = masks.unsqueeze(2)
|
137 |
+
g = self.distributor(image, g)
|
138 |
+
|
139 |
+
batch_size, num_objects = g.shape[:2]
|
140 |
+
g = g.flatten(start_dim=0, end_dim=1)
|
141 |
+
|
142 |
+
g = self.conv1(g)
|
143 |
+
g = self.bn1(g) # 1/2, 64
|
144 |
+
g = self.maxpool(g) # 1/4, 64
|
145 |
+
g = self.relu(g)
|
146 |
+
|
147 |
+
g = self.layer1(g) # 1/4
|
148 |
+
g = self.layer2(g) # 1/8
|
149 |
+
g = self.layer3(g) # 1/16
|
150 |
+
|
151 |
+
g = g.view(batch_size, num_objects, *g.shape[1:])
|
152 |
+
g = self.fuser(image_feat_f16, g)
|
153 |
+
|
154 |
+
if is_deep_update and self.hidden_reinforce is not None:
|
155 |
+
h = self.hidden_reinforce(g, h)
|
156 |
+
|
157 |
+
return g, h
|
158 |
+
|
159 |
+
|
160 |
+
class KeyEncoder(nn.Module):
|
161 |
+
def __init__(self):
|
162 |
+
super().__init__()
|
163 |
+
network = resnet.resnet50(pretrained=True)
|
164 |
+
self.conv1 = network.conv1
|
165 |
+
self.bn1 = network.bn1
|
166 |
+
self.relu = network.relu # 1/2, 64
|
167 |
+
self.maxpool = network.maxpool
|
168 |
+
|
169 |
+
self.res2 = network.layer1 # 1/4, 256
|
170 |
+
self.layer2 = network.layer2 # 1/8, 512
|
171 |
+
self.layer3 = network.layer3 # 1/16, 1024
|
172 |
+
|
173 |
+
def forward(self, f):
|
174 |
+
x = self.conv1(f)
|
175 |
+
x = self.bn1(x)
|
176 |
+
x = self.relu(x) # 1/2, 64
|
177 |
+
x = self.maxpool(x) # 1/4, 64
|
178 |
+
f4 = self.res2(x) # 1/4, 256
|
179 |
+
f8 = self.layer2(f4) # 1/8, 512
|
180 |
+
f16 = self.layer3(f8) # 1/16, 1024
|
181 |
+
|
182 |
+
return f16, f8, f4
|
183 |
+
|
184 |
+
|
185 |
+
class UpsampleBlock(nn.Module):
|
186 |
+
def __init__(self, skip_dim, g_up_dim, g_out_dim, scale_factor=2):
|
187 |
+
super().__init__()
|
188 |
+
self.skip_conv = nn.Conv2d(skip_dim, g_up_dim, kernel_size=3, padding=1)
|
189 |
+
self.distributor = MainToGroupDistributor(method="add")
|
190 |
+
self.out_conv = GroupResBlock(g_up_dim, g_out_dim)
|
191 |
+
self.scale_factor = scale_factor
|
192 |
+
|
193 |
+
def forward(self, skip_f, up_g):
|
194 |
+
skip_f = self.skip_conv(skip_f)
|
195 |
+
g = upsample_groups(up_g, ratio=self.scale_factor)
|
196 |
+
g = self.distributor(skip_f, g)
|
197 |
+
g = self.out_conv(g)
|
198 |
+
return g
|
199 |
+
|
200 |
+
|
201 |
+
class KeyProjection(nn.Module):
|
202 |
+
def __init__(self, in_dim, keydim):
|
203 |
+
super().__init__()
|
204 |
+
|
205 |
+
self.key_proj = nn.Conv2d(in_dim, keydim, kernel_size=3, padding=1)
|
206 |
+
# shrinkage
|
207 |
+
self.d_proj = nn.Conv2d(in_dim, 1, kernel_size=3, padding=1)
|
208 |
+
# selection
|
209 |
+
self.e_proj = nn.Conv2d(in_dim, keydim, kernel_size=3, padding=1)
|
210 |
+
|
211 |
+
nn.init.orthogonal_(self.key_proj.weight.data)
|
212 |
+
nn.init.zeros_(self.key_proj.bias.data)
|
213 |
+
|
214 |
+
def forward(self, x, need_s, need_e):
|
215 |
+
shrinkage = self.d_proj(x) ** 2 + 1 if (need_s) else None
|
216 |
+
selection = torch.sigmoid(self.e_proj(x)) if (need_e) else None
|
217 |
+
|
218 |
+
return self.key_proj(x), shrinkage, selection
|
219 |
+
|
220 |
+
|
221 |
+
class Decoder(nn.Module):
|
222 |
+
def __init__(self, val_dim, hidden_dim):
|
223 |
+
super().__init__()
|
224 |
+
|
225 |
+
self.fuser = FeatureFusionBlock(1024, val_dim + hidden_dim, 512, 512)
|
226 |
+
if hidden_dim > 0:
|
227 |
+
self.hidden_update = HiddenUpdater([512, 256, 256 + 1], 256, hidden_dim)
|
228 |
+
else:
|
229 |
+
self.hidden_update = None
|
230 |
+
|
231 |
+
self.up_16_8 = UpsampleBlock(512, 512, 256) # 1/16 -> 1/8
|
232 |
+
self.up_8_4 = UpsampleBlock(256, 256, 256) # 1/8 -> 1/4
|
233 |
+
|
234 |
+
self.pred = nn.Conv2d(256, 1, kernel_size=3, padding=1, stride=1)
|
235 |
+
|
236 |
+
def forward(self, f16, f8, f4, hidden_state, memory_readout, h_out=True):
|
237 |
+
batch_size, num_objects = memory_readout.shape[:2]
|
238 |
+
|
239 |
+
if self.hidden_update is not None:
|
240 |
+
g16 = self.fuser(f16, torch.cat([memory_readout, hidden_state], 2))
|
241 |
+
else:
|
242 |
+
g16 = self.fuser(f16, memory_readout)
|
243 |
+
|
244 |
+
g8 = self.up_16_8(f8, g16)
|
245 |
+
g4 = self.up_8_4(f4, g8)
|
246 |
+
logits = self.pred(F.relu(g4.flatten(start_dim=0, end_dim=1)))
|
247 |
+
|
248 |
+
if h_out and self.hidden_update is not None:
|
249 |
+
g4 = torch.cat(
|
250 |
+
[g4, logits.view(batch_size, num_objects, 1, *logits.shape[-2:])], 2
|
251 |
+
)
|
252 |
+
hidden_state = self.hidden_update([g16, g8, g4], hidden_state)
|
253 |
+
else:
|
254 |
+
hidden_state = None
|
255 |
+
|
256 |
+
logits = F.interpolate(
|
257 |
+
logits, scale_factor=4, mode="bilinear", align_corners=False
|
258 |
+
)
|
259 |
+
logits = logits.view(batch_size, num_objects, *logits.shape[-2:])
|
260 |
+
|
261 |
+
return hidden_state, logits
|
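HiddenUpdater and HiddenReinforcer share the GRU-style gating described in the comments above, where the candidate value is produced alongside the gates and the old state is additionally damped by (1 - update_gate). A small sketch that lifts just that gating math out of the modules; the shapes below are illustrative:

import torch

def gru_variant_update(values, h, hidden_dim):
    # values packs forget gate, update gate and candidate along the channel axis
    forget_gate = torch.sigmoid(values[:, :, :hidden_dim])
    update_gate = torch.sigmoid(values[:, :, hidden_dim:hidden_dim * 2])
    new_value = torch.tanh(values[:, :, hidden_dim * 2:])
    # the old state is scaled by both gates before the candidate is mixed in
    return forget_gate * h * (1 - update_gate) + update_gate * new_value

hidden_dim = 64
values = torch.randn(1, 2, 3 * hidden_dim, 24, 24)   # batch * num_objects * 3C^h * H * W
h = torch.randn(1, 2, hidden_dim, 24, 24)
print(gru_variant_update(values, h, hidden_dim).shape)  # torch.Size([1, 2, 64, 24, 24])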
tracker/model/network.py
ADDED
@@ -0,0 +1,241 @@
1 |
+
"""
|
2 |
+
This file defines XMem, the highest level nn.Module interface
|
3 |
+
During training, it is used by trainer.py
|
4 |
+
During evaluation, it is used by inference_core.py
|
5 |
+
|
6 |
+
It further depends on modules.py which gives more detailed implementations of sub-modules
|
7 |
+
"""
|
8 |
+
|
9 |
+
import torch
|
10 |
+
import torch.nn as nn
|
11 |
+
|
12 |
+
from model.aggregate import aggregate
|
13 |
+
from model.modules import *
|
14 |
+
from model.memory_util import *
|
15 |
+
|
16 |
+
|
17 |
+
class XMem(nn.Module):
|
18 |
+
def __init__(self, config, model_path=None, map_location=None):
|
19 |
+
"""
|
20 |
+
model_path/map_location are used in evaluation only
|
21 |
+
map_location is for converting models saved in cuda to cpu
|
22 |
+
"""
|
23 |
+
super().__init__()
|
24 |
+
model_weights = self.init_hyperparameters(config, model_path, map_location)
|
25 |
+
|
26 |
+
self.single_object = config.get("single_object", False)
|
27 |
+
print(f"Single object mode: {self.single_object}")
|
28 |
+
|
29 |
+
self.key_encoder = KeyEncoder()
|
30 |
+
self.value_encoder = ValueEncoder(
|
31 |
+
self.value_dim, self.hidden_dim, self.single_object
|
32 |
+
)
|
33 |
+
|
34 |
+
# Projection from f16 feature space to key/value space
|
35 |
+
self.key_proj = KeyProjection(1024, self.key_dim)
|
36 |
+
|
37 |
+
self.decoder = Decoder(self.value_dim, self.hidden_dim)
|
38 |
+
|
39 |
+
if model_weights is not None:
|
40 |
+
self.load_weights(model_weights, init_as_zero_if_needed=True)
|
41 |
+
|
42 |
+
def encode_key(self, frame, need_sk=True, need_ek=True):
|
43 |
+
# Determine input shape
|
44 |
+
if len(frame.shape) == 5:
|
45 |
+
# shape is b*t*c*h*w
|
46 |
+
need_reshape = True
|
47 |
+
b, t = frame.shape[:2]
|
48 |
+
# flatten so that we can feed them into a 2D CNN
|
49 |
+
frame = frame.flatten(start_dim=0, end_dim=1)
|
50 |
+
elif len(frame.shape) == 4:
|
51 |
+
# shape is b*c*h*w
|
52 |
+
need_reshape = False
|
53 |
+
else:
|
54 |
+
raise NotImplementedError
|
55 |
+
|
56 |
+
f16, f8, f4 = self.key_encoder(frame)
|
57 |
+
key, shrinkage, selection = self.key_proj(f16, need_sk, need_ek)
|
58 |
+
|
59 |
+
if need_reshape:
|
60 |
+
# B*C*T*H*W
|
61 |
+
key = key.view(b, t, *key.shape[-3:]).transpose(1, 2).contiguous()
|
62 |
+
if shrinkage is not None:
|
63 |
+
shrinkage = (
|
64 |
+
shrinkage.view(b, t, *shrinkage.shape[-3:])
|
65 |
+
.transpose(1, 2)
|
66 |
+
.contiguous()
|
67 |
+
)
|
68 |
+
if selection is not None:
|
69 |
+
selection = (
|
70 |
+
selection.view(b, t, *selection.shape[-3:])
|
71 |
+
.transpose(1, 2)
|
72 |
+
.contiguous()
|
73 |
+
)
|
74 |
+
|
75 |
+
# B*T*C*H*W
|
76 |
+
f16 = f16.view(b, t, *f16.shape[-3:])
|
77 |
+
f8 = f8.view(b, t, *f8.shape[-3:])
|
78 |
+
f4 = f4.view(b, t, *f4.shape[-3:])
|
79 |
+
|
80 |
+
return key, shrinkage, selection, f16, f8, f4
|
81 |
+
|
82 |
+
def encode_value(self, frame, image_feat_f16, h16, masks, is_deep_update=True):
|
83 |
+
num_objects = masks.shape[1]
|
84 |
+
if num_objects != 1:
|
85 |
+
others = torch.cat(
|
86 |
+
[
|
87 |
+
torch.sum(
|
88 |
+
masks[:, [j for j in range(num_objects) if i != j]],
|
89 |
+
dim=1,
|
90 |
+
keepdim=True,
|
91 |
+
)
|
92 |
+
for i in range(num_objects)
|
93 |
+
],
|
94 |
+
1,
|
95 |
+
)
|
96 |
+
else:
|
97 |
+
others = torch.zeros_like(masks)
|
98 |
+
|
99 |
+
g16, h16 = self.value_encoder(
|
100 |
+
frame, image_feat_f16, h16, masks, others, is_deep_update
|
101 |
+
)
|
102 |
+
|
103 |
+
return g16, h16
|
104 |
+
|
105 |
+
# Used in training only.
|
106 |
+
# This step is replaced by MemoryManager in test time
|
107 |
+
def read_memory(
|
108 |
+
self, query_key, query_selection, memory_key, memory_shrinkage, memory_value
|
109 |
+
):
|
110 |
+
"""
|
111 |
+
query_key : B * CK * H * W
|
112 |
+
query_selection : B * CK * H * W
|
113 |
+
memory_key : B * CK * T * H * W
|
114 |
+
memory_shrinkage: B * 1 * T * H * W
|
115 |
+
memory_value : B * num_objects * CV * T * H * W
|
116 |
+
"""
|
117 |
+
batch_size, num_objects = memory_value.shape[:2]
|
118 |
+
memory_value = memory_value.flatten(start_dim=1, end_dim=2)
|
119 |
+
|
120 |
+
affinity = get_affinity(
|
121 |
+
memory_key, memory_shrinkage, query_key, query_selection
|
122 |
+
)
|
123 |
+
memory = readout(affinity, memory_value)
|
124 |
+
memory = memory.view(
|
125 |
+
batch_size, num_objects, self.value_dim, *memory.shape[-2:]
|
126 |
+
)
|
127 |
+
|
128 |
+
return memory
|
129 |
+
|
130 |
+
def segment(
|
131 |
+
self,
|
132 |
+
multi_scale_features,
|
133 |
+
memory_readout,
|
134 |
+
hidden_state,
|
135 |
+
selector=None,
|
136 |
+
h_out=True,
|
137 |
+
strip_bg=True,
|
138 |
+
):
|
139 |
+
|
140 |
+
hidden_state, logits = self.decoder(
|
141 |
+
*multi_scale_features, hidden_state, memory_readout, h_out=h_out
|
142 |
+
)
|
143 |
+
prob = torch.sigmoid(logits)
|
144 |
+
if selector is not None:
|
145 |
+
prob = prob * selector
|
146 |
+
|
147 |
+
logits, prob = aggregate(prob, dim=1, return_logits=True)
|
148 |
+
if strip_bg:
|
149 |
+
# Strip away the background
|
150 |
+
prob = prob[:, 1:]
|
151 |
+
|
152 |
+
return hidden_state, logits, prob
|
153 |
+
|
154 |
+
def forward(self, mode, *args, **kwargs):
|
155 |
+
if mode == "encode_key":
|
156 |
+
return self.encode_key(*args, **kwargs)
|
157 |
+
elif mode == "encode_value":
|
158 |
+
return self.encode_value(*args, **kwargs)
|
159 |
+
elif mode == "read_memory":
|
160 |
+
return self.read_memory(*args, **kwargs)
|
161 |
+
elif mode == "segment":
|
162 |
+
return self.segment(*args, **kwargs)
|
163 |
+
else:
|
164 |
+
raise NotImplementedError
|
165 |
+
|
166 |
+
def init_hyperparameters(self, config, model_path=None, map_location=None):
|
167 |
+
"""
|
168 |
+
Init three hyperparameters: key_dim, value_dim, and hidden_dim
|
169 |
+
If model_path is provided, we load these from the model weights
|
170 |
+
The actual parameters are then updated to the config in-place
|
171 |
+
|
172 |
+
Otherwise we load it either from the config or default
|
173 |
+
"""
|
174 |
+
if model_path is not None:
|
175 |
+
# load the model and key/value/hidden dimensions with some hacks
|
176 |
+
# config is updated with the loaded parameters
|
177 |
+
model_weights = torch.load(model_path, map_location=map_location)
|
178 |
+
self.key_dim = model_weights["key_proj.key_proj.weight"].shape[0]
|
179 |
+
self.value_dim = model_weights[
|
180 |
+
"value_encoder.fuser.block2.conv2.weight"
|
181 |
+
].shape[0]
|
182 |
+
self.disable_hidden = (
|
183 |
+
"decoder.hidden_update.transform.weight" not in model_weights
|
184 |
+
)
|
185 |
+
if self.disable_hidden:
|
186 |
+
self.hidden_dim = 0
|
187 |
+
else:
|
188 |
+
self.hidden_dim = (
|
189 |
+
model_weights["decoder.hidden_update.transform.weight"].shape[0]
|
190 |
+
// 3
|
191 |
+
)
|
192 |
+
print(
|
193 |
+
f"Hyperparameters read from the model weights: "
|
194 |
+
f"C^k={self.key_dim}, C^v={self.value_dim}, C^h={self.hidden_dim}"
|
195 |
+
)
|
196 |
+
else:
|
197 |
+
model_weights = None
|
198 |
+
# load dimensions from config or default
|
199 |
+
if "key_dim" not in config:
|
200 |
+
self.key_dim = 64
|
201 |
+
print(f"key_dim not found in config. Set to default {self.key_dim}")
|
202 |
+
else:
|
203 |
+
self.key_dim = config["key_dim"]
|
204 |
+
|
205 |
+
if "value_dim" not in config:
|
206 |
+
self.value_dim = 512
|
207 |
+
print(f"value_dim not found in config. Set to default {self.value_dim}")
|
208 |
+
else:
|
209 |
+
self.value_dim = config["value_dim"]
|
210 |
+
|
211 |
+
if "hidden_dim" not in config:
|
212 |
+
self.hidden_dim = 64
|
213 |
+
print(
|
214 |
+
f"hidden_dim not found in config. Set to default {self.hidden_dim}"
|
215 |
+
)
|
216 |
+
else:
|
217 |
+
self.hidden_dim = config["hidden_dim"]
|
218 |
+
|
219 |
+
self.disable_hidden = self.hidden_dim <= 0
|
220 |
+
|
221 |
+
config["key_dim"] = self.key_dim
|
222 |
+
config["value_dim"] = self.value_dim
|
223 |
+
config["hidden_dim"] = self.hidden_dim
|
224 |
+
|
225 |
+
return model_weights
|
226 |
+
|
227 |
+
def load_weights(self, src_dict, init_as_zero_if_needed=False):
|
228 |
+
# Maps SO weight (without other_mask) to MO weight (with other_mask)
|
229 |
+
for k in list(src_dict.keys()):
|
230 |
+
if k == "value_encoder.conv1.weight":
|
231 |
+
if src_dict[k].shape[1] == 4:
|
232 |
+
print("Converting weights from single object to multiple objects.")
|
233 |
+
pads = torch.zeros((64, 1, 7, 7), device=src_dict[k].device)
|
234 |
+
if not init_as_zero_if_needed:
|
235 |
+
print("Randomly initialized padding.")
|
236 |
+
nn.init.orthogonal_(pads)
|
237 |
+
else:
|
238 |
+
print("Zero-initialized padding.")
|
239 |
+
src_dict[k] = torch.cat([src_dict[k], pads], 1)
|
240 |
+
|
241 |
+
self.load_state_dict(src_dict)
|
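XMem.forward dispatches on a mode string, so the module is driven by calls such as net("encode_key", ...). A minimal usage sketch with random frames and the default dimensions; the weights here are untrained (although the ResNet backbones are still fetched from torchvision's model zoo), so only the shapes are meaningful:

import torch
from model.network import XMem

config = {"single_object": False, "key_dim": 64, "value_dim": 512, "hidden_dim": 64}
net = XMem(config).eval()

frames = torch.randn(1, 3, 3, 384, 384)   # b * t * c * h * w
with torch.no_grad():
    key, shrinkage, selection, f16, f8, f4 = net("encode_key", frames)
print(key.shape)   # torch.Size([1, 64, 3, 24, 24]) -> B * C^k * T * H/16 * W/16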
tracker/model/resnet.py
ADDED
@@ -0,0 +1,191 @@
1 |
+
"""
|
2 |
+
resnet.py - A modified ResNet structure
|
3 |
+
We append extra channels to the first conv by some network surgery
|
4 |
+
"""
|
5 |
+
|
6 |
+
from collections import OrderedDict
|
7 |
+
import math
|
8 |
+
|
9 |
+
import torch
|
10 |
+
import torch.nn as nn
|
11 |
+
from torch.utils import model_zoo
|
12 |
+
|
13 |
+
|
14 |
+
def load_weights_add_extra_dim(target, source_state, extra_dim=1):
|
15 |
+
new_dict = OrderedDict()
|
16 |
+
|
17 |
+
for k1, v1 in target.state_dict().items():
|
18 |
+
if not "num_batches_tracked" in k1:
|
19 |
+
if k1 in source_state:
|
20 |
+
tar_v = source_state[k1]
|
21 |
+
|
22 |
+
if v1.shape != tar_v.shape:
|
23 |
+
# Init the new segmentation channel with zeros
|
24 |
+
# print(v1.shape, tar_v.shape)
|
25 |
+
c, _, w, h = v1.shape
|
26 |
+
pads = torch.zeros((c, extra_dim, w, h), device=tar_v.device)
|
27 |
+
nn.init.orthogonal_(pads)
|
28 |
+
tar_v = torch.cat([tar_v, pads], 1)
|
29 |
+
|
30 |
+
new_dict[k1] = tar_v
|
31 |
+
|
32 |
+
target.load_state_dict(new_dict)
|
33 |
+
|
34 |
+
|
35 |
+
model_urls = {
|
36 |
+
"resnet18": "https://download.pytorch.org/models/resnet18-5c106cde.pth",
|
37 |
+
"resnet50": "https://download.pytorch.org/models/resnet50-19c8e357.pth",
|
38 |
+
}
|
39 |
+
|
40 |
+
|
41 |
+
def conv3x3(in_planes, out_planes, stride=1, dilation=1):
|
42 |
+
return nn.Conv2d(
|
43 |
+
in_planes,
|
44 |
+
out_planes,
|
45 |
+
kernel_size=3,
|
46 |
+
stride=stride,
|
47 |
+
padding=dilation,
|
48 |
+
dilation=dilation,
|
49 |
+
bias=False,
|
50 |
+
)
|
51 |
+
|
52 |
+
|
53 |
+
class BasicBlock(nn.Module):
|
54 |
+
expansion = 1
|
55 |
+
|
56 |
+
def __init__(self, inplanes, planes, stride=1, downsample=None, dilation=1):
|
57 |
+
super(BasicBlock, self).__init__()
|
58 |
+
self.conv1 = conv3x3(inplanes, planes, stride=stride, dilation=dilation)
|
59 |
+
self.bn1 = nn.BatchNorm2d(planes)
|
60 |
+
self.relu = nn.ReLU(inplace=True)
|
61 |
+
self.conv2 = conv3x3(planes, planes, stride=1, dilation=dilation)
|
62 |
+
self.bn2 = nn.BatchNorm2d(planes)
|
63 |
+
self.downsample = downsample
|
64 |
+
self.stride = stride
|
65 |
+
|
66 |
+
def forward(self, x):
|
67 |
+
residual = x
|
68 |
+
|
69 |
+
out = self.conv1(x)
|
70 |
+
out = self.bn1(out)
|
71 |
+
out = self.relu(out)
|
72 |
+
|
73 |
+
out = self.conv2(out)
|
74 |
+
out = self.bn2(out)
|
75 |
+
|
76 |
+
if self.downsample is not None:
|
77 |
+
residual = self.downsample(x)
|
78 |
+
|
79 |
+
out += residual
|
80 |
+
out = self.relu(out)
|
81 |
+
|
82 |
+
return out
|
83 |
+
|
84 |
+
|
85 |
+
class Bottleneck(nn.Module):
|
86 |
+
expansion = 4
|
87 |
+
|
88 |
+
def __init__(self, inplanes, planes, stride=1, downsample=None, dilation=1):
|
89 |
+
super(Bottleneck, self).__init__()
|
90 |
+
self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
|
91 |
+
self.bn1 = nn.BatchNorm2d(planes)
|
92 |
+
self.conv2 = nn.Conv2d(
|
93 |
+
planes,
|
94 |
+
planes,
|
95 |
+
kernel_size=3,
|
96 |
+
stride=stride,
|
97 |
+
dilation=dilation,
|
98 |
+
padding=dilation,
|
99 |
+
bias=False,
|
100 |
+
)
|
101 |
+
self.bn2 = nn.BatchNorm2d(planes)
|
102 |
+
self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
|
103 |
+
self.bn3 = nn.BatchNorm2d(planes * 4)
|
104 |
+
self.relu = nn.ReLU(inplace=True)
|
105 |
+
self.downsample = downsample
|
106 |
+
self.stride = stride
|
107 |
+
|
108 |
+
def forward(self, x):
|
109 |
+
residual = x
|
110 |
+
|
111 |
+
out = self.conv1(x)
|
112 |
+
out = self.bn1(out)
|
113 |
+
out = self.relu(out)
|
114 |
+
|
115 |
+
out = self.conv2(out)
|
116 |
+
out = self.bn2(out)
|
117 |
+
out = self.relu(out)
|
118 |
+
|
119 |
+
out = self.conv3(out)
|
120 |
+
out = self.bn3(out)
|
121 |
+
|
122 |
+
if self.downsample is not None:
|
123 |
+
residual = self.downsample(x)
|
124 |
+
|
125 |
+
out += residual
|
126 |
+
out = self.relu(out)
|
127 |
+
|
128 |
+
return out
|
129 |
+
|
130 |
+
|
131 |
+
class ResNet(nn.Module):
|
132 |
+
def __init__(self, block, layers=(3, 4, 23, 3), extra_dim=0):
|
133 |
+
self.inplanes = 64
|
134 |
+
super(ResNet, self).__init__()
|
135 |
+
self.conv1 = nn.Conv2d(
|
136 |
+
3 + extra_dim, 64, kernel_size=7, stride=2, padding=3, bias=False
|
137 |
+
)
|
138 |
+
self.bn1 = nn.BatchNorm2d(64)
|
139 |
+
self.relu = nn.ReLU(inplace=True)
|
140 |
+
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
|
141 |
+
self.layer1 = self._make_layer(block, 64, layers[0])
|
142 |
+
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
|
143 |
+
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
|
144 |
+
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
|
145 |
+
|
146 |
+
for m in self.modules():
|
147 |
+
if isinstance(m, nn.Conv2d):
|
148 |
+
n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
|
149 |
+
m.weight.data.normal_(0, math.sqrt(2.0 / n))
|
150 |
+
elif isinstance(m, nn.BatchNorm2d):
|
151 |
+
m.weight.data.fill_(1)
|
152 |
+
m.bias.data.zero_()
|
153 |
+
|
154 |
+
def _make_layer(self, block, planes, blocks, stride=1, dilation=1):
|
155 |
+
downsample = None
|
156 |
+
if stride != 1 or self.inplanes != planes * block.expansion:
|
157 |
+
downsample = nn.Sequential(
|
158 |
+
nn.Conv2d(
|
159 |
+
self.inplanes,
|
160 |
+
planes * block.expansion,
|
161 |
+
kernel_size=1,
|
162 |
+
stride=stride,
|
163 |
+
bias=False,
|
164 |
+
),
|
165 |
+
nn.BatchNorm2d(planes * block.expansion),
|
166 |
+
)
|
167 |
+
|
168 |
+
layers = [block(self.inplanes, planes, stride, downsample)]
|
169 |
+
self.inplanes = planes * block.expansion
|
170 |
+
for i in range(1, blocks):
|
171 |
+
layers.append(block(self.inplanes, planes, dilation=dilation))
|
172 |
+
|
173 |
+
return nn.Sequential(*layers)
|
174 |
+
|
175 |
+
|
176 |
+
def resnet18(pretrained=True, extra_dim=0):
|
177 |
+
model = ResNet(BasicBlock, [2, 2, 2, 2], extra_dim)
|
178 |
+
if pretrained:
|
179 |
+
load_weights_add_extra_dim(
|
180 |
+
model, model_zoo.load_url(model_urls["resnet18"]), extra_dim
|
181 |
+
)
|
182 |
+
return model
|
183 |
+
|
184 |
+
|
185 |
+
def resnet50(pretrained=True, extra_dim=0):
|
186 |
+
model = ResNet(Bottleneck, [3, 4, 6, 3], extra_dim)
|
187 |
+
if pretrained:
|
188 |
+
load_weights_add_extra_dim(
|
189 |
+
model, model_zoo.load_url(model_urls["resnet50"]), extra_dim
|
190 |
+
)
|
191 |
+
return model
|
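The extra_dim surgery widens conv1 so the value encoder can take the mask channels next to the RGB image; with pretrained weights the new channels are initialized orthogonally by load_weights_add_extra_dim. A quick shape check (pretrained=False here to avoid the download, so all weights are randomly initialized):

import torch
from model import resnet

net = resnet.resnet18(pretrained=False, extra_dim=2)   # RGB + mask + sum of other masks
print(net.conv1.weight.shape)                          # torch.Size([64, 5, 7, 7])

x = torch.randn(1, 5, 64, 64)
print(net.conv1(x).shape)                              # torch.Size([1, 64, 32, 32]), stride 2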
tracker/model/trainer.py
ADDED
@@ -0,0 +1,302 @@
1 |
+
"""
|
2 |
+
trainer.py - wrapper and utility functions for network training
|
3 |
+
Compute loss, back-prop, update parameters, logging, etc.
|
4 |
+
"""
|
5 |
+
import datetime
|
6 |
+
import os
|
7 |
+
import time
|
8 |
+
import numpy as np
|
9 |
+
import torch
|
10 |
+
import torch.nn as nn
|
11 |
+
import torch.optim as optim
|
12 |
+
|
13 |
+
from model.network import XMem
|
14 |
+
from model.losses import LossComputer
|
15 |
+
from util.log_integrator import Integrator
|
16 |
+
from util.image_saver import pool_pairs
|
17 |
+
|
18 |
+
|
19 |
+
class XMemTrainer:
|
20 |
+
def __init__(self, config, logger=None, save_path=None, local_rank=0, world_size=1):
|
21 |
+
self.config = config
|
22 |
+
self.num_frames = config["num_frames"]
|
23 |
+
self.num_ref_frames = config["num_ref_frames"]
|
24 |
+
self.deep_update_prob = config["deep_update_prob"]
|
25 |
+
self.local_rank = local_rank
|
26 |
+
|
27 |
+
self.XMem = nn.parallel.DistributedDataParallel(
|
28 |
+
XMem(config).cuda(),
|
29 |
+
device_ids=[local_rank],
|
30 |
+
output_device=local_rank,
|
31 |
+
broadcast_buffers=False,
|
32 |
+
)
|
33 |
+
|
34 |
+
# Set up logger when local_rank=0
|
35 |
+
self.logger = logger
|
36 |
+
self.save_path = save_path
|
37 |
+
if logger is not None:
|
38 |
+
self.last_time = time.time()
|
39 |
+
self.logger.log_string(
|
40 |
+
"model_size",
|
41 |
+
str(sum([param.nelement() for param in self.XMem.parameters()])),
|
42 |
+
)
|
43 |
+
self.train_integrator = Integrator(
|
44 |
+
self.logger, distributed=True, local_rank=local_rank, world_size=world_size
|
45 |
+
)
|
46 |
+
self.loss_computer = LossComputer(config)
|
47 |
+
|
48 |
+
self.train()
|
49 |
+
self.optimizer = optim.AdamW(
|
50 |
+
filter(lambda p: p.requires_grad, self.XMem.parameters()),
|
51 |
+
lr=config["lr"],
|
52 |
+
weight_decay=config["weight_decay"],
|
53 |
+
)
|
54 |
+
self.scheduler = optim.lr_scheduler.MultiStepLR(
|
55 |
+
self.optimizer, config["steps"], config["gamma"]
|
56 |
+
)
|
57 |
+
if config["amp"]:
|
58 |
+
self.scaler = torch.cuda.amp.GradScaler()
|
59 |
+
|
60 |
+
# Logging info
|
61 |
+
self.log_text_interval = config["log_text_interval"]
|
62 |
+
self.log_image_interval = config["log_image_interval"]
|
63 |
+
self.save_network_interval = config["save_network_interval"]
|
64 |
+
self.save_checkpoint_interval = config["save_checkpoint_interval"]
|
65 |
+
if config["debug"]:
|
66 |
+
self.log_text_interval = self.log_image_interval = 1
|
67 |
+
|
68 |
+
def do_pass(self, data, max_it, it=0):
|
69 |
+
# No need to store the gradient outside training
|
70 |
+
torch.set_grad_enabled(self._is_train)
|
71 |
+
|
72 |
+
for k, v in data.items():
|
73 |
+
if type(v) != list and type(v) != dict and type(v) != int:
|
74 |
+
data[k] = v.cuda(non_blocking=True)
|
75 |
+
|
76 |
+
out = {}
|
77 |
+
frames = data["rgb"]
|
78 |
+
first_frame_gt = data["first_frame_gt"].float()
|
79 |
+
b = frames.shape[0]
|
80 |
+
num_filled_objects = [o.item() for o in data["info"]["num_objects"]]
|
81 |
+
num_objects = first_frame_gt.shape[2]
|
82 |
+
selector = data["selector"].unsqueeze(2).unsqueeze(2)
|
83 |
+
|
84 |
+
global_avg = 0
|
85 |
+
|
86 |
+
with torch.cuda.amp.autocast(enabled=self.config["amp"]):
|
87 |
+
# image features never change, compute once
|
88 |
+
key, shrinkage, selection, f16, f8, f4 = self.XMem("encode_key", frames)
|
89 |
+
|
90 |
+
filler_one = torch.zeros(1, dtype=torch.int64)
|
91 |
+
hidden = torch.zeros(
|
92 |
+
(b, num_objects, self.config["hidden_dim"], *key.shape[-2:])
|
93 |
+
)
|
94 |
+
v16, hidden = self.XMem(
|
95 |
+
"encode_value", frames[:, 0], f16[:, 0], hidden, first_frame_gt[:, 0]
|
96 |
+
)
|
97 |
+
values = v16.unsqueeze(3) # add the time dimension
|
98 |
+
|
99 |
+
for ti in range(1, self.num_frames):
|
100 |
+
if ti <= self.num_ref_frames:
|
101 |
+
ref_values = values
|
102 |
+
ref_keys = key[:, :, :ti]
|
103 |
+
ref_shrinkage = (
|
104 |
+
shrinkage[:, :, :ti] if shrinkage is not None else None
|
105 |
+
)
|
106 |
+
else:
|
107 |
+
# pick num_ref_frames random frames
|
108 |
+
# this is not very efficient but I think we would
|
109 |
+
# need broadcasting in gather which we don't have
|
110 |
+
indices = [
|
111 |
+
torch.cat(
|
112 |
+
[
|
113 |
+
filler_one,
|
114 |
+
torch.randperm(ti - 1)[: self.num_ref_frames - 1] + 1,
|
115 |
+
]
|
116 |
+
)
|
117 |
+
for _ in range(b)
|
118 |
+
]
|
119 |
+
ref_values = torch.stack(
|
120 |
+
[values[bi, :, :, indices[bi]] for bi in range(b)], 0
|
121 |
+
)
|
122 |
+
ref_keys = torch.stack(
|
123 |
+
[key[bi, :, indices[bi]] for bi in range(b)], 0
|
124 |
+
)
|
125 |
+
ref_shrinkage = (
|
126 |
+
torch.stack(
|
127 |
+
[shrinkage[bi, :, indices[bi]] for bi in range(b)], 0
|
128 |
+
)
|
129 |
+
if shrinkage is not None
|
130 |
+
else None
|
131 |
+
)
|
132 |
+
|
133 |
+
# Segment frame ti
|
134 |
+
memory_readout = self.XMem(
|
135 |
+
"read_memory",
|
136 |
+
key[:, :, ti],
|
137 |
+
selection[:, :, ti] if selection is not None else None,
|
138 |
+
ref_keys,
|
139 |
+
ref_shrinkage,
|
140 |
+
ref_values,
|
141 |
+
)
|
142 |
+
hidden, logits, masks = self.XMem(
|
143 |
+
"segment",
|
144 |
+
(f16[:, ti], f8[:, ti], f4[:, ti]),
|
145 |
+
memory_readout,
|
146 |
+
hidden,
|
147 |
+
selector,
|
148 |
+
h_out=(ti < (self.num_frames - 1)),
|
149 |
+
)
|
150 |
+
|
151 |
+
# No need to encode the last frame
|
152 |
+
if ti < (self.num_frames - 1):
|
153 |
+
is_deep_update = np.random.rand() < self.deep_update_prob
|
154 |
+
v16, hidden = self.XMem(
|
155 |
+
"encode_value",
|
156 |
+
frames[:, ti],
|
157 |
+
f16[:, ti],
|
158 |
+
hidden,
|
159 |
+
masks,
|
160 |
+
is_deep_update=is_deep_update,
|
161 |
+
)
|
162 |
+
values = torch.cat([values, v16.unsqueeze(3)], 3)
|
163 |
+
|
164 |
+
out[f"masks_{ti}"] = masks
|
165 |
+
out[f"logits_{ti}"] = logits
|
166 |
+
|
167 |
+
if self._do_log or self._is_train:
|
168 |
+
losses = self.loss_computer.compute(
|
169 |
+
{**data, **out}, num_filled_objects, it
|
170 |
+
)
|
171 |
+
|
172 |
+
# Logging
|
173 |
+
if self._do_log:
|
174 |
+
self.integrator.add_dict(losses)
|
175 |
+
if self._is_train:
|
176 |
+
if it % self.log_image_interval == 0 and it != 0:
|
177 |
+
if self.logger is not None:
|
178 |
+
images = {**data, **out}
|
179 |
+
size = (384, 384)
|
180 |
+
self.logger.log_cv2(
|
181 |
+
"train/pairs",
|
182 |
+
pool_pairs(images, size, num_filled_objects),
|
183 |
+
it,
|
184 |
+
)
|
185 |
+
|
186 |
+
if self._is_train:
|
187 |
+
|
188 |
+
if (it) % self.log_text_interval == 0 and it != 0:
|
189 |
+
time_spent = time.time() - self.last_time
|
190 |
+
|
191 |
+
if self.logger is not None:
|
192 |
+
self.logger.log_scalar(
|
193 |
+
"train/lr", self.scheduler.get_last_lr()[0], it
|
194 |
+
)
|
195 |
+
self.logger.log_metrics(
|
196 |
+
"train", "time", (time_spent) / self.log_text_interval, it
|
197 |
+
)
|
198 |
+
|
199 |
+
global_avg = 0.5 * (global_avg) + 0.5 * (time_spent)
|
200 |
+
eta_seconds = global_avg * (max_it - it) / 100
|
201 |
+
eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
|
202 |
+
print(f"ETA: {eta_string}")
|
203 |
+
|
204 |
+
self.last_time = time.time()
|
205 |
+
self.train_integrator.finalize("train", it)
|
206 |
+
self.train_integrator.reset_except_hooks()
|
207 |
+
|
208 |
+
if it % self.save_network_interval == 0 and it != 0:
|
209 |
+
if self.logger is not None:
|
210 |
+
self.save_network(it)
|
211 |
+
|
212 |
+
if it % self.save_checkpoint_interval == 0 and it != 0:
|
213 |
+
if self.logger is not None:
|
214 |
+
self.save_checkpoint(it)
|
215 |
+
|
216 |
+
# Backward pass
|
217 |
+
self.optimizer.zero_grad(set_to_none=True)
|
218 |
+
if self.config["amp"]:
|
219 |
+
self.scaler.scale(losses["total_loss"]).backward()
|
220 |
+
self.scaler.step(self.optimizer)
|
221 |
+
self.scaler.update()
|
222 |
+
else:
|
223 |
+
losses["total_loss"].backward()
|
224 |
+
self.optimizer.step()
|
225 |
+
|
226 |
+
self.scheduler.step()
|
227 |
+
|
228 |
+
def save_network(self, it):
|
229 |
+
if self.save_path is None:
|
230 |
+
print("Saving has been disabled.")
|
231 |
+
return
|
232 |
+
|
233 |
+
os.makedirs(os.path.dirname(self.save_path), exist_ok=True)
|
234 |
+
model_path = f"{self.save_path}_{it}.pth"
|
235 |
+
torch.save(self.XMem.module.state_dict(), model_path)
|
236 |
+
print(f"Network saved to {model_path}.")
|
237 |
+
|
238 |
+
def save_checkpoint(self, it):
|
239 |
+
if self.save_path is None:
|
240 |
+
print("Saving has been disabled.")
|
241 |
+
return
|
242 |
+
|
243 |
+
os.makedirs(os.path.dirname(self.save_path), exist_ok=True)
|
244 |
+
checkpoint_path = f"{self.save_path}_checkpoint_{it}.pth"
|
245 |
+
checkpoint = {
|
246 |
+
"it": it,
|
247 |
+
"network": self.XMem.module.state_dict(),
|
248 |
+
"optimizer": self.optimizer.state_dict(),
|
249 |
+
"scheduler": self.scheduler.state_dict(),
|
250 |
+
}
|
251 |
+
torch.save(checkpoint, checkpoint_path)
|
252 |
+
print(f"Checkpoint saved to {checkpoint_path}.")
|
253 |
+
|
254 |
+
def load_checkpoint(self, path):
|
255 |
+
# This method loads everything and should be used to resume training
|
256 |
+
map_location = "cuda:%d" % self.local_rank
|
257 |
+
checkpoint = torch.load(path, map_location={"cuda:0": map_location})
|
258 |
+
|
259 |
+
it = checkpoint["it"]
|
260 |
+
network = checkpoint["network"]
|
261 |
+
optimizer = checkpoint["optimizer"]
|
262 |
+
scheduler = checkpoint["scheduler"]
|
263 |
+
|
264 |
+
map_location = "cuda:%d" % self.local_rank
|
265 |
+
self.XMem.module.load_state_dict(network)
|
266 |
+
self.optimizer.load_state_dict(optimizer)
|
267 |
+
self.scheduler.load_state_dict(scheduler)
|
268 |
+
|
269 |
+
print("Network weights, optimizer states, and scheduler states loaded.")
|
270 |
+
|
271 |
+
return it
|
272 |
+
|
273 |
+
def load_network_in_memory(self, src_dict):
|
274 |
+
self.XMem.module.load_weights(src_dict)
|
275 |
+
print("Network weight loaded from memory.")
|
276 |
+
|
277 |
+
def load_network(self, path):
|
278 |
+
# This method loads only the network weight and should be used to load a pretrained model
|
279 |
+
map_location = "cuda:%d" % self.local_rank
|
280 |
+
src_dict = torch.load(path, map_location={"cuda:0": map_location})
|
281 |
+
|
282 |
+
self.load_network_in_memory(src_dict)
|
283 |
+
print(f"Network weight loaded from {path}")
|
284 |
+
|
285 |
+
def train(self):
|
286 |
+
self._is_train = True
|
287 |
+
self._do_log = True
|
288 |
+
self.integrator = self.train_integrator
|
289 |
+
self.XMem.eval()
|
290 |
+
return self
|
291 |
+
|
292 |
+
def val(self):
|
293 |
+
self._is_train = False
|
294 |
+
self._do_log = True
|
295 |
+
self.XMem.eval()
|
296 |
+
return self
|
297 |
+
|
298 |
+
def test(self):
|
299 |
+
self._is_train = False
|
300 |
+
self._do_log = False
|
301 |
+
self.XMem.eval()
|
302 |
+
return self
|
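The backward pass in do_pass switches between a plain backward() and mixed precision with a GradScaler depending on config["amp"]. A standalone sketch of the AMP branch with a toy model (not the XMem trainer itself; it needs a CUDA device):

import torch
import torch.nn as nn

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scaler = torch.cuda.amp.GradScaler()

x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.amp.autocast(enabled=True):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()   # scale the loss so fp16 gradients do not underflow
scaler.step(optimizer)          # unscales first and skips the step on inf/NaN gradients
scaler.update()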
tracker/util/__init__.py
ADDED
File without changes
|
tracker/util/mask_mapper.py
ADDED
@@ -0,0 +1,87 @@
1 |
+
import numpy as np
|
2 |
+
import torch
|
3 |
+
|
4 |
+
|
5 |
+
def all_to_onehot(masks, labels):
|
6 |
+
if len(masks.shape) == 3:
|
7 |
+
Ms = np.zeros(
|
8 |
+
(len(labels), masks.shape[0], masks.shape[1], masks.shape[2]),
|
9 |
+
dtype=np.uint8,
|
10 |
+
)
|
11 |
+
else:
|
12 |
+
Ms = np.zeros((len(labels), masks.shape[0], masks.shape[1]), dtype=np.uint8)
|
13 |
+
|
14 |
+
for ni, l in enumerate(labels):
|
15 |
+
Ms[ni] = (masks == l).astype(np.uint8)
|
16 |
+
|
17 |
+
return Ms
|
18 |
+
|
19 |
+
|
20 |
+
class MaskMapper:
|
21 |
+
"""
|
22 |
+
This class is used to convert an indexed mask to a one-hot representation.
|
23 |
+
It also takes care of remapping non-continuous indices
|
24 |
+
It has two modes:
|
25 |
+
1. Default. Only masks with new indices are supposed to go into the remapper.
|
26 |
+
This is also the case for YouTubeVOS.
|
27 |
+
i.e., regions with index 0 are not "background", but "don't care".
|
28 |
+
|
29 |
+
2. Exhaustive. Regions with index 0 are considered "background".
|
30 |
+
Every single pixel is considered to be "labeled".
|
31 |
+
"""
|
32 |
+
|
33 |
+
def __init__(self):
|
34 |
+
self.labels = []
|
35 |
+
self.remappings = {}
|
36 |
+
|
37 |
+
# if coherent, no mapping is required
|
38 |
+
self.coherent = True
|
39 |
+
|
40 |
+
def clear_labels(self):
|
41 |
+
self.labels = []
|
42 |
+
self.remappings = {}
|
43 |
+
# if coherent, no mapping is required
|
44 |
+
self.coherent = True
|
45 |
+
|
46 |
+
def convert_mask(self, mask, exhaustive=False):
|
47 |
+
# mask is in index representation, H*W numpy array
|
48 |
+
labels = np.unique(mask).astype(np.uint8)
|
49 |
+
labels = labels[labels != 0].tolist()
|
50 |
+
|
51 |
+
new_labels = list(set(labels) - set(self.labels))
|
52 |
+
if not exhaustive:
|
53 |
+
assert len(new_labels) == len(
|
54 |
+
labels
|
55 |
+
), "Old labels found in non-exhaustive mode"
|
56 |
+
|
57 |
+
# add new remappings
|
58 |
+
for i, l in enumerate(new_labels):
|
59 |
+
self.remappings[l] = i + len(self.labels) + 1
|
60 |
+
if self.coherent and i + len(self.labels) + 1 != l:
|
61 |
+
self.coherent = False
|
62 |
+
|
63 |
+
if exhaustive:
|
64 |
+
new_mapped_labels = range(1, len(self.labels) + len(new_labels) + 1)
|
65 |
+
else:
|
66 |
+
if self.coherent:
|
67 |
+
new_mapped_labels = new_labels
|
68 |
+
else:
|
69 |
+
new_mapped_labels = range(
|
70 |
+
len(self.labels) + 1, len(self.labels) + len(new_labels) + 1
|
71 |
+
)
|
72 |
+
|
73 |
+
self.labels.extend(new_labels)
|
74 |
+
mask = torch.from_numpy(all_to_onehot(mask, self.labels)).float()
|
75 |
+
|
76 |
+
# mask num_objects*H*W
|
77 |
+
return mask, new_mapped_labels
|
78 |
+
|
79 |
+
def remap_index_mask(self, mask):
|
80 |
+
# mask is in index representation, H*W numpy array
|
81 |
+
if self.coherent:
|
82 |
+
return mask
|
83 |
+
|
84 |
+
new_mask = np.zeros_like(mask)
|
85 |
+
for l, i in self.remappings.items():
|
86 |
+
new_mask[mask == i] = l
|
87 |
+
return new_mask
|
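The docstring above describes the two remapping modes. A small usage sketch of the default mode with non-contiguous labels, assuming the tracker directory is on the import path:

import numpy as np
from util.mask_mapper import MaskMapper

mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 0], mask[1, 1] = 3, 7             # non-contiguous object indices

mapper = MaskMapper()
onehot, mapped_labels = mapper.convert_mask(mask)
print(onehot.shape, list(mapped_labels))  # torch.Size([2, 4, 4]) [1, 2]

# remap_index_mask sends the internal indices 1, 2 back to the original labels 3, 7
print(mapper.remap_index_mask(np.array([[0, 1], [2, 0]], dtype=np.uint8)))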
tracker/util/range_transform.py
ADDED
@@ -0,0 +1,12 @@
1 |
+
import torchvision.transforms as transforms
|
2 |
+
|
3 |
+
im_mean = (124, 116, 104)
|
4 |
+
|
5 |
+
im_normalization = transforms.Normalize(
|
6 |
+
mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
|
7 |
+
)
|
8 |
+
|
9 |
+
inv_im_trans = transforms.Normalize(
|
10 |
+
mean=[-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225],
|
11 |
+
std=[1 / 0.229, 1 / 0.224, 1 / 0.225],
|
12 |
+
)
|
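inv_im_trans is the algebraic inverse of im_normalization: undoing (x - m) / s is y * s + m, which Normalize expresses as mean = -m/s and std = 1/s. A one-line round-trip check:

import torch
from util.range_transform import im_normalization, inv_im_trans

img = torch.rand(3, 8, 8)
print(torch.allclose(inv_im_trans(im_normalization(img)), img, atol=1e-6))  # True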
tracker/util/tensor_util.py
ADDED
@@ -0,0 +1,50 @@
1 |
+
import torch.nn.functional as F
|
2 |
+
|
3 |
+
|
4 |
+
def compute_tensor_iu(seg, gt):
|
5 |
+
intersection = (seg & gt).float().sum()
|
6 |
+
union = (seg | gt).float().sum()
|
7 |
+
|
8 |
+
return intersection, union
|
9 |
+
|
10 |
+
|
11 |
+
def compute_tensor_iou(seg, gt):
|
12 |
+
intersection, union = compute_tensor_iu(seg, gt)
|
13 |
+
iou = (intersection + 1e-6) / (union + 1e-6)
|
14 |
+
|
15 |
+
return iou
|
16 |
+
|
17 |
+
|
18 |
+
# STM
|
19 |
+
def pad_divide_by(in_img, d):
|
20 |
+
h, w = in_img.shape[-2:]
|
21 |
+
|
22 |
+
if h % d > 0:
|
23 |
+
new_h = h + d - h % d
|
24 |
+
else:
|
25 |
+
new_h = h
|
26 |
+
if w % d > 0:
|
27 |
+
new_w = w + d - w % d
|
28 |
+
else:
|
29 |
+
new_w = w
|
30 |
+
lh, uh = int((new_h - h) / 2), int(new_h - h) - int((new_h - h) / 2)
|
31 |
+
lw, uw = int((new_w - w) / 2), int(new_w - w) - int((new_w - w) / 2)
|
32 |
+
pad_array = (int(lw), int(uw), int(lh), int(uh))
|
33 |
+
out = F.pad(in_img, pad_array)
|
34 |
+
return out, pad_array
|
35 |
+
|
36 |
+
|
37 |
+
def unpad(img, pad):
|
38 |
+
if len(img.shape) == 4:
|
39 |
+
if pad[2] + pad[3] > 0:
|
40 |
+
img = img[:, :, pad[2] : -pad[3], :]
|
41 |
+
if pad[0] + pad[1] > 0:
|
42 |
+
img = img[:, :, :, pad[0] : -pad[1]]
|
43 |
+
elif len(img.shape) == 3:
|
44 |
+
if pad[2] + pad[3] > 0:
|
45 |
+
img = img[:, pad[2] : -pad[3], :]
|
46 |
+
if pad[0] + pad[1] > 0:
|
47 |
+
img = img[:, :, pad[0] : -pad[1]]
|
48 |
+
else:
|
49 |
+
raise NotImplementedError
|
50 |
+
return img
|
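pad_divide_by pads an image symmetrically so both spatial sides become multiples of d (16 for the memory stride), and unpad crops the padding back off. A round-trip sketch:

import torch
from util.tensor_util import pad_divide_by, unpad

x = torch.randn(1, 3, 30, 45)
padded, pad = pad_divide_by(x, 16)
print(padded.shape)              # torch.Size([1, 3, 32, 48])
print(unpad(padded, pad).shape)  # torch.Size([1, 3, 30, 45])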
utils/base_segmenter.py
ADDED
@@ -0,0 +1,149 @@
1 |
+
import torch
|
2 |
+
import numpy as np
|
3 |
+
|
4 |
+
|
5 |
+
class BaseSegmenter:
|
6 |
+
def __init__(self, sam_pt_checkpoint, sam_onnx_checkpoint, model_type, device="cuda:0"):
|
7 |
+
"""
|
8 |
+
device: model device
|
9 |
+
sam_pt_checkpoint / sam_onnx_checkpoint: paths of the SAM checkpoints (the ONNX one is only used for vit_t)
|
10 |
+
model_type: vit_b, vit_l, vit_h, vit_t
|
11 |
+
"""
|
12 |
+
print(f"Initializing BaseSegmenter to {device}")
|
13 |
+
assert model_type in [
|
14 |
+
"vit_b",
|
15 |
+
"vit_l",
|
16 |
+
"vit_h",
|
17 |
+
"vit_t",
|
18 |
+
], "model_type must be vit_b, vit_l, vit_h or vit_t"
|
19 |
+
|
20 |
+
self.device = device
|
21 |
+
self.torch_dtype = torch.float16 if "cuda" in device else torch.float32
|
22 |
+
|
23 |
+
if (model_type == "vit_t"):
|
24 |
+
from mobile_sam import sam_model_registry, SamPredictor
|
25 |
+
from onnxruntime import InferenceSession
|
26 |
+
self.ort_session = InferenceSession(sam_onnx_checkpoint)
|
27 |
+
self.predict = self.predict_onnx
|
28 |
+
else:
|
29 |
+
from segment_anything import sam_model_registry, SamPredictor
|
30 |
+
self.predict = self.predict_pt
|
31 |
+
|
32 |
+
self.model = sam_model_registry[model_type](checkpoint=sam_pt_checkpoint)
|
33 |
+
self.model.to(device=self.device)
|
34 |
+
self.predictor = SamPredictor(self.model)
|
35 |
+
self.embedded = False
|
36 |
+
|
37 |
+
@torch.no_grad()
|
38 |
+
def set_image(self, image: np.ndarray):
|
39 |
+
# PIL.open(image_path) 3channel: RGB
|
40 |
+
# image embedding: avoid encode the same image multiple times
|
41 |
+
self.orignal_image = image
|
42 |
+
if self.embedded:
|
43 |
+
print("repeat embedding, please reset_image.")
|
44 |
+
return
|
45 |
+
self.predictor.set_image(image)
|
46 |
+
self.image_embedding = self.predictor.get_image_embedding().cpu().numpy()
|
47 |
+
self.embedded = True
|
48 |
+
return
|
49 |
+
|
50 |
+
@torch.no_grad()
|
51 |
+
def reset_image(self):
|
52 |
+
# reset image embedding
|
53 |
+
self.predictor.reset_image()
|
54 |
+
self.embedded = False
|
55 |
+
|
56 |
+
def predict_pt(self, prompts, mode, multimask=True):
|
57 |
+
"""
|
58 |
+
image: numpy array, h, w, 3
|
59 |
+
prompts: dictionary, 3 keys: 'point_coords', 'point_labels', 'mask_input'
|
60 |
+
prompts['point_coords']: numpy array [N,2]
|
61 |
+
prompts['point_labels']: numpy array [1,N]
|
62 |
+
prompts['mask_input']: numpy array [1,256,256]
|
63 |
+
mode: 'point' (points only), 'mask' (mask only), 'both' (consider both)
|
64 |
+
mask_outputs: True (return 3 masks), False (return 1 mask only)
|
65 |
+
when mask_outputs=True, mask_input=logits[np.argmax(scores), :, :][None, :, :]
|
66 |
+
"""
|
67 |
+
assert (
|
68 |
+
self.embedded
|
69 |
+
), "prediction is called before set_image (feature embedding)."
|
70 |
+
assert mode in ["point", "mask", "both"], "mode must be point, mask, or both"
|
71 |
+
|
72 |
+
if mode == "point":
|
73 |
+
masks, scores, logits = self.predictor.predict(
|
74 |
+
point_coords=prompts["point_coords"],
|
75 |
+
point_labels=prompts["point_labels"],
|
76 |
+
multimask_output=multimask,
|
77 |
+
)
|
78 |
+
elif mode == "mask":
|
79 |
+
masks, scores, logits = self.predictor.predict(
|
80 |
+
mask_input=prompts["mask_input"], multimask_output=multimask
|
81 |
+
)
|
82 |
+
elif mode == "both": # both
|
83 |
+
masks, scores, logits = self.predictor.predict(
|
84 |
+
point_coords=prompts["point_coords"],
|
85 |
+
point_labels=prompts["point_labels"],
|
86 |
+
mask_input=prompts["mask_input"],
|
87 |
+
multimask_output=multimask,
|
88 |
+
)
|
89 |
+
else:
|
90 |
+
raise ("Not implement now!")
|
91 |
+
# masks (n, h, w), scores (n,), logits (n, 256, 256)
|
92 |
+
return masks, scores, logits
|
93 |
+
|
94 |
+
def predict_onnx(self, prompts, mode, multimask=True):
|
95 |
+
"""
|
96 |
+
image: numpy array, h, w, 3
|
97 |
+
prompts: dictionary, 3 keys: 'point_coords', 'point_labels', 'mask_input'
|
98 |
+
prompts['point_coords']: numpy array [N,2]
|
99 |
+
prompts['point_labels']: numpy array [1,N]
|
100 |
+
prompts['mask_input']: numpy array [1,256,256]
|
101 |
+
mode: 'point' (points only), 'mask' (mask only), 'both' (consider both)
|
102 |
+
mask_outputs: True (return 3 masks), False (return 1 mask only)
|
103 |
+
when mask_outputs=True, mask_input=logits[np.argmax(scores), :, :][None, :, :]
|
104 |
+
"""
|
105 |
+
assert (
|
106 |
+
self.embedded
|
107 |
+
), "prediction is called before set_image (feature embedding)."
|
108 |
+
assert mode in ["point", "mask", "both"], "mode must be point, mask, or both"
|
109 |
+
|
110 |
+
if mode == "point":
|
111 |
+
ort_inputs = {
|
112 |
+
"image_embeddings": self.image_embedding,
|
113 |
+
"point_coords": prompts["point_coords"],
|
114 |
+
"point_labels": prompts["point_labels"],
|
115 |
+
"mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
|
116 |
+
"has_mask_input": np.zeros(1, dtype=np.float32),
|
117 |
+
"orig_im_size": prompts["orig_im_size"],
|
118 |
+
}
|
119 |
+
masks, scores, logits = self.ort_session.run(None, ort_inputs)
|
120 |
+
masks = masks > self.predictor.model.mask_threshold
|
121 |
+
|
122 |
+
elif mode == "mask":
|
123 |
+
ort_inputs = {
|
124 |
+
"image_embeddings": self.image_embedding,
|
125 |
+
"point_coords": np.zeros((len(prompts["point_labels"]), 2), dtype=np.float32),
|
126 |
+
"point_labels": prompts["point_labels"],
|
127 |
+
"mask_input": prompts["mask_input"],
|
128 |
+
"has_mask_input": np.ones(1, dtype=np.float32),
|
129 |
+
"orig_im_size": prompts["orig_im_size"],
|
130 |
+
}
|
131 |
+
masks, scores, logits = self.ort_session.run(None, ort_inputs)
|
132 |
+
masks = masks > self.predictor.model.mask_threshold
|
133 |
+
|
134 |
+
elif mode == "both": # both
|
135 |
+
ort_inputs = {
|
136 |
+
"image_embeddings": self.image_embedding,
|
137 |
+
"point_coords": prompts["point_coords"],
|
138 |
+
"point_labels": prompts["point_labels"],
|
139 |
+
"mask_input": prompts["mask_input"],
|
140 |
+
"has_mask_input": np.ones(1, dtype=np.float32),
|
141 |
+
"orig_im_size": prompts["orig_im_size"],
|
142 |
+
}
|
143 |
+
masks, scores, logits = self.ort_session.run(None, ort_inputs)
|
144 |
+
masks = masks > self.predictor.model.mask_threshold
|
145 |
+
|
146 |
+
else:
|
147 |
+
raise ("Not implement now!")
|
148 |
+
# masks (n, h, w), scores (n,), logits (n, 256, 256)
|
149 |
+
return masks[0], scores[0], logits[0]
|
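For the vit_t path, predict_onnx expects the prompt dictionary laid out for the exported ONNX decoder: transformed point coordinates padded with a (0, 0) / -1 dummy point, plus orig_im_size. A usage sketch; the checkpoint paths are placeholders (the ONNX file would come from export_onnx_model.py), so this only runs with real files in place:

import numpy as np
from utils.base_segmenter import BaseSegmenter

segmenter = BaseSegmenter(
    sam_pt_checkpoint="checkpoints/mobile_sam.pt",      # placeholder path
    sam_onnx_checkpoint="checkpoints/mobile_sam.onnx",  # placeholder path
    model_type="vit_t",
    device="cpu",
)
image = np.zeros((480, 640, 3), dtype=np.uint8)
segmenter.set_image(image)

# one positive click plus the padding point the ONNX decoder expects
coords = np.array([[[320.0, 240.0], [0.0, 0.0]]], dtype=np.float32)
labels = np.array([[1, -1]], dtype=np.float32)
coords = segmenter.predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)

prompts = {
    "point_coords": coords,
    "point_labels": labels,
    "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
}
mask, score, logit = segmenter.predict(prompts, "point")
print(mask.shape)  # boolean mask(s) at the original image size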
utils/blur.py
ADDED
@@ -0,0 +1,81 @@
1 |
+
import os
|
2 |
+
import cv2
|
3 |
+
import numpy as np
|
4 |
+
|
5 |
+
|
6 |
+
# resize frames
|
7 |
+
def resize_frames(frames, size=None):
|
8 |
+
"""
|
9 |
+
size: (w, h)
|
10 |
+
"""
|
11 |
+
if size is not None:
|
12 |
+
frames = [cv2.resize(f, size) for f in frames]
|
13 |
+
frames = np.stack(frames, 0)
|
14 |
+
|
15 |
+
return frames
|
16 |
+
|
17 |
+
|
18 |
+
# resize masks
|
19 |
+
def resize_masks(masks, size=None):
|
20 |
+
"""
|
21 |
+
size: (w, h)
|
22 |
+
"""
|
23 |
+
if size is not None:
|
24 |
+
masks = [np.expand_dims(cv2.resize(m, size), 2) for m in masks]
|
25 |
+
masks = np.stack(masks, 0)
|
26 |
+
|
27 |
+
return masks
|
28 |
+
|
29 |
+
|
30 |
+
# apply Gaussian blur to a frame with the given (odd) kernel size
|
31 |
+
def apply_blur(frame, strength):
|
32 |
+
blurred = cv2.GaussianBlur(frame, (strength, strength), 0)
|
33 |
+
return blurred
|
34 |
+
|
35 |
+
|
36 |
+
# blur frames
|
37 |
+
def blur_frames_and_write(
|
38 |
+
frames, masks, ratio, strength, dilate_radius=15, fps=30, output_path="blurred.mp4"
|
39 |
+
):
|
40 |
+
assert frames.shape[:3] == masks.shape, "different size between frames and masks"
|
41 |
+
assert ratio > 0 and ratio <= 1, "ratio must be in (0, 1]"
|
42 |
+
|
43 |
+
# --------------------
|
44 |
+
# pre-processing
|
45 |
+
# --------------------
|
46 |
+
masks = masks.copy()
|
47 |
+
masks = np.clip(masks, 0, 1)
|
48 |
+
kernel = cv2.getStructuringElement(2, (dilate_radius, dilate_radius))
|
49 |
+
masks = np.stack([cv2.dilate(mask, kernel) for mask in masks], 0)
|
50 |
+
T, H, W = masks.shape
|
51 |
+
masks = np.expand_dims(masks, axis=3) # expand to T, H, W, 1
|
52 |
+
# size: (w, h)
|
53 |
+
if ratio == 1:
|
54 |
+
size = (W, H)
|
55 |
+
binary_masks = masks
|
56 |
+
else:
|
57 |
+
size = [int(W * ratio), int(H * ratio)]
|
58 |
+
size = [
|
59 |
+
si + 1 if si % 2 > 0 else si for si in size
|
60 |
+
] # only consider even values
|
61 |
+
# shortest side should be larger than 50
|
62 |
+
if min(size) < 50:
|
63 |
+
ratio = 50.0 / min(H, W)
|
64 |
+
size = [int(W * ratio), int(H * ratio)]
|
65 |
+
binary_masks = resize_masks(masks, tuple(size))
|
66 |
+
frames = resize_frames(frames, tuple(size)) # T, H, W, 3
|
67 |
+
|
68 |
+
if not os.path.exists(os.path.dirname(output_path)):
|
69 |
+
os.makedirs(os.path.dirname(output_path))
|
70 |
+
writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
|
71 |
+
|
72 |
+
for frame, mask in zip(frames, binary_masks):
|
73 |
+
blurred_frame = apply_blur(frame, strength)
|
74 |
+
masked = cv2.bitwise_or(blurred_frame, blurred_frame, mask=mask)
|
75 |
+
processed = np.where(masked == (0, 0, 0), frame, masked)
|
76 |
+
|
77 |
+
writer.write(processed[:, :, ::-1])
|
78 |
+
|
79 |
+
writer.release()
|
80 |
+
|
81 |
+
return output_path
|
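blur_frames_and_write dilates the masks, optionally downscales everything by ratio (rounded to even sizes with a 50-pixel lower bound on the short side), blurs each frame, and composites the blurred pixels back only where the mask is set before writing an mp4. A sketch with synthetic data; note that strength is the Gaussian kernel size and must be odd:

import numpy as np
from utils.blur import blur_frames_and_write

frames = np.random.randint(0, 255, (10, 240, 320, 3), dtype=np.uint8)  # T, H, W, 3 (RGB)
masks = np.zeros((10, 240, 320), dtype=np.uint8)
masks[:, 60:180, 80:240] = 1                                           # region to blur

out = blur_frames_and_write(
    frames, masks, ratio=1, strength=25, fps=24, output_path="output/blurred.mp4"
)
print(out)  # output/blurred.mp4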
utils/interact_tools.py
ADDED
@@ -0,0 +1,109 @@
1 |
+
from PIL import Image
|
2 |
+
import numpy as np
|
3 |
+
from .base_segmenter import BaseSegmenter
|
4 |
+
from .painter import mask_painter, point_painter
|
5 |
+
|
6 |
+
|
7 |
+
mask_color = 3
|
8 |
+
mask_alpha = 0.7
|
9 |
+
contour_color = 1
|
10 |
+
contour_width = 5
|
11 |
+
point_color_ne = 8
|
12 |
+
point_color_ps = 50
|
13 |
+
point_alpha = 0.9
|
14 |
+
point_radius = 15
|
15 |
+
contour_color = 2
|
16 |
+
contour_width = 5
|
17 |
+
|
18 |
+
|
19 |
+
class SamControler:
|
20 |
+
def __init__(self, sam_pt_checkpoint, sam_onnx_checkpoint, model_type, device):
|
21 |
+
"""
|
22 |
+
initialize sam controler
|
23 |
+
"""
|
24 |
+
|
25 |
+
self.sam_controler = BaseSegmenter(sam_pt_checkpoint, sam_onnx_checkpoint, model_type, device)
|
26 |
+
self.onnx = model_type == "vit_t"
|
27 |
+
|
28 |
+
def first_frame_click(
|
29 |
+
self,
|
30 |
+
image: np.ndarray,
|
31 |
+
points: np.ndarray,
|
32 |
+
labels: np.ndarray,
|
33 |
+
multimask=True,
|
34 |
+
mask_color=3,
|
35 |
+
):
|
36 |
+
"""
|
37 |
+
Used on the first frame of a video.
|
38 |
+
return: mask, logit, painted image(mask+point)
|
39 |
+
"""
|
40 |
+
# self.sam_controler.set_image(image)
|
41 |
+
neg_flag = labels[-1]
|
42 |
+
|
43 |
+
if self.onnx:
|
44 |
+
onnx_coord = np.concatenate([points, np.array([[0.0, 0.0]])], axis=0)[None, :, :]
|
45 |
+
onnx_label = np.concatenate([labels, np.array([-1])], axis=0)[None, :].astype(np.float32)
|
46 |
+
onnx_coord = self.sam_controler.predictor.transform.apply_coords(onnx_coord, image.shape[:2]).astype(np.float32)
|
47 |
+
prompts = {
|
48 |
+
"point_coords": onnx_coord,
|
49 |
+
"point_labels": onnx_label,
|
50 |
+
"orig_im_size": np.array(image.shape[:2], dtype=np.float32),
|
51 |
+
}
|
52 |
+
|
53 |
+
else:
|
54 |
+
prompts = {
|
55 |
+
"point_coords": points,
|
56 |
+
"point_labels": labels,
|
57 |
+
}
|
58 |
+
|
59 |
+
if neg_flag == 1:
|
60 |
+
# find positive
|
61 |
+
masks, scores, logits = self.sam_controler.predict(
|
62 |
+
prompts, "point", multimask
|
63 |
+
)
|
64 |
+
mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
|
65 |
+
|
66 |
+
prompts["mask_input"] = np.expand_dims(logit[None, :, :], 0)
|
67 |
+
masks, scores, logits = self.sam_controler.predict(
|
68 |
+
prompts, "both", multimask
|
69 |
+
)
|
70 |
+
mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
|
71 |
+
|
72 |
+
else:
|
73 |
+
# find neg
|
74 |
+
masks, scores, logits = self.sam_controler.predict(
|
75 |
+
prompts, "point", multimask
|
76 |
+
)
|
77 |
+
mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
|
78 |
+
|
79 |
+
assert len(points) == len(labels)
|
80 |
+
|
81 |
+
painted_image = mask_painter(
|
82 |
+
image,
|
83 |
+
mask.astype("uint8"),
|
84 |
+
mask_color,
|
85 |
+
mask_alpha,
|
86 |
+
contour_color,
|
87 |
+
contour_width,
|
88 |
+
)
|
89 |
+
painted_image = point_painter(
|
90 |
+
painted_image,
|
91 |
+
np.squeeze(points[np.argwhere(labels > 0)], axis=1),
|
92 |
+
point_color_ne,
|
93 |
+
point_alpha,
|
94 |
+
point_radius,
|
95 |
+
contour_color,
|
96 |
+
contour_width,
|
97 |
+
)
|
98 |
+
painted_image = point_painter(
|
99 |
+
painted_image,
|
100 |
+
np.squeeze(points[np.argwhere(labels < 1)], axis=1),
|
101 |
+
point_color_ps,
|
102 |
+
point_alpha,
|
103 |
+
point_radius,
|
104 |
+
contour_color,
|
105 |
+
contour_width,
|
106 |
+
)
|
107 |
+
painted_image = Image.fromarray(painted_image)
|
108 |
+
|
109 |
+
return mask, logit, painted_image
|
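For context, a hypothetical sketch of how the controller above is driven (not part of the committed file): the checkpoint path, model_type value, and device string are assumptions, and the set_image call mirrors the commented-out line inside first_frame_click.

# Hypothetical usage sketch -- checkpoint path, model type, and device are assumptions.
import numpy as np
from utils.interact_tools import SamControler

controler = SamControler(
    sam_pt_checkpoint="./checkpoints/sam_vit_h_4b8939.pth",  # assumed local path
    sam_onnx_checkpoint=None,   # only used for the vit_t / ONNX path
    model_type="vit_h",
    device="cuda:0",
)
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stands in for the first video frame
points = np.array([[320, 240]])                  # one click, given as (x, y)
labels = np.array([1])                           # 1 = positive click, 0 = negative
controler.sam_controler.set_image(frame)         # mirrors the commented-out call in first_frame_click
mask, logit, painted = controler.first_frame_click(frame, points, labels)

first_frame_click then returns the highest-scoring mask, its logit, and a PIL image with the mask and clicks painted on top.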
utils/painter.py
ADDED
@@ -0,0 +1,360 @@
import cv2
import numpy as np
from PIL import Image


def colormap(rgb=True):
    # Detectron-style palette, listed as flattened (R, G, B) triplets in [0, 1]
    color_list = np.array(
        [
            0.000, 0.000, 0.000,
            1.000, 1.000, 1.000,
            1.000, 0.498, 0.313,
            0.392, 0.581, 0.929,
            0.000, 0.447, 0.741,
            0.850, 0.325, 0.098,
            0.929, 0.694, 0.125,
            0.494, 0.184, 0.556,
            0.466, 0.674, 0.188,
            0.301, 0.745, 0.933,
            0.635, 0.078, 0.184,
            0.300, 0.300, 0.300,
            0.600, 0.600, 0.600,
            1.000, 0.000, 0.000,
            1.000, 0.500, 0.000,
            0.749, 0.749, 0.000,
            0.000, 1.000, 0.000,
            0.000, 0.000, 1.000,
            0.667, 0.000, 1.000,
            0.333, 0.333, 0.000,
            0.333, 0.667, 0.000,
            0.333, 1.000, 0.000,
            0.667, 0.333, 0.000,
            0.667, 0.667, 0.000,
            0.667, 1.000, 0.000,
            1.000, 0.333, 0.000,
            1.000, 0.667, 0.000,
            1.000, 1.000, 0.000,
            0.000, 0.333, 0.500,
            0.000, 0.667, 0.500,
            0.000, 1.000, 0.500,
            0.333, 0.000, 0.500,
            0.333, 0.333, 0.500,
            0.333, 0.667, 0.500,
            0.333, 1.000, 0.500,
            0.667, 0.000, 0.500,
            0.667, 0.333, 0.500,
            0.667, 0.667, 0.500,
            0.667, 1.000, 0.500,
            1.000, 0.000, 0.500,
            1.000, 0.333, 0.500,
            1.000, 0.667, 0.500,
            1.000, 1.000, 0.500,
            0.000, 0.333, 1.000,
            0.000, 0.667, 1.000,
            0.000, 1.000, 1.000,
            0.333, 0.000, 1.000,
            0.333, 0.333, 1.000,
            0.333, 0.667, 1.000,
            0.333, 1.000, 1.000,
            0.667, 0.000, 1.000,
            0.667, 0.333, 1.000,
            0.667, 0.667, 1.000,
            0.667, 1.000, 1.000,
            1.000, 0.000, 1.000,
            1.000, 0.333, 1.000,
            1.000, 0.667, 1.000,
            0.167, 0.000, 0.000,
            0.333, 0.000, 0.000,
            0.500, 0.000, 0.000,
            0.667, 0.000, 0.000,
            0.833, 0.000, 0.000,
            1.000, 0.000, 0.000,
            0.000, 0.167, 0.000,
            0.000, 0.333, 0.000,
            0.000, 0.500, 0.000,
            0.000, 0.667, 0.000,
            0.000, 0.833, 0.000,
            0.000, 1.000, 0.000,
            0.000, 0.000, 0.167,
            0.000, 0.000, 0.333,
            0.000, 0.000, 0.500,
            0.000, 0.000, 0.667,
            0.000, 0.000, 0.833,
            0.000, 0.000, 1.000,
            0.143, 0.143, 0.143,
            0.286, 0.286, 0.286,
            0.429, 0.429, 0.429,
            0.571, 0.571, 0.571,
            0.714, 0.714, 0.714,
            0.857, 0.857, 0.857,
        ]
    ).astype(np.float32)
    color_list = color_list.reshape((-1, 3)) * 255
    if not rgb:
        color_list = color_list[:, ::-1]
    return color_list


color_list = colormap()
color_list = color_list.astype("uint8").tolist()


def vis_add_mask(image, mask, color, alpha):
    # alpha-blend the palette color into the image wherever mask > 0.5
    color = np.array(color_list[color])
    mask = mask > 0.5
    image[mask] = image[mask] * (1 - alpha) + color * alpha
    return image.astype("uint8")


def point_painter(
    input_image,
    input_points,
    point_color=5,
    point_alpha=0.9,
    point_radius=15,
    contour_color=2,
    contour_width=5,
):
    h, w = input_image.shape[:2]
    point_mask = np.zeros((h, w)).astype("uint8")
    for point in input_points:
        point_mask[point[1], point[0]] = 1

    kernel = cv2.getStructuringElement(2, (point_radius, point_radius))  # 2 == cv2.MORPH_ELLIPSE
    point_mask = cv2.dilate(point_mask, kernel)

    contour_radius = (contour_width - 1) // 2
    dist_transform_fore = cv2.distanceTransform(point_mask, cv2.DIST_L2, 3)
    dist_transform_back = cv2.distanceTransform(1 - point_mask, cv2.DIST_L2, 3)
    dist_map = dist_transform_fore - dist_transform_back
    # widen the contour band slightly so it stays visible
    contour_radius += 2
    contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
    contour_mask = contour_mask / np.max(contour_mask)
    contour_mask[contour_mask > 0.5] = 1.0

    # paint mask
    painted_image = vis_add_mask(
        input_image.copy(), point_mask, point_color, point_alpha
    )
    # paint contour
    painted_image = vis_add_mask(
        painted_image.copy(), 1 - contour_mask, contour_color, 1
    )
    return painted_image


def mask_painter(
    input_image,
    input_mask,
    mask_color=5,
    mask_alpha=0.7,
    contour_color=1,
    contour_width=3,
):
    assert (
        input_image.shape[:2] == input_mask.shape
    ), "different shape between image and mask"
    # 0: background, 1: foreground
    mask = np.clip(input_mask, 0, 1)
    contour_radius = (contour_width - 1) // 2

    dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
    dist_transform_back = cv2.distanceTransform(1 - mask, cv2.DIST_L2, 3)
    dist_map = dist_transform_fore - dist_transform_back
    # widen the contour band slightly so it stays visible
    contour_radius += 2
    contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
    contour_mask = contour_mask / np.max(contour_mask)
    contour_mask[contour_mask > 0.5] = 1.0

    # paint mask
    painted_image = vis_add_mask(
        input_image.copy(), mask.copy(), mask_color, mask_alpha
    )
    # paint contour
    painted_image = vis_add_mask(
        painted_image.copy(), 1 - contour_mask, contour_color, 1
    )

    return painted_image


def background_remover(input_image, input_mask):
    """
    input_image: H, W, 3, np.array
    input_mask: H, W, np.array

    returns image_wo_background: PIL.Image (RGBA, background transparent)
    """
    assert (
        input_image.shape[:2] == input_mask.shape
    ), "different shape between image and mask"
    # 0: background, 1: foreground
    mask = np.expand_dims(np.clip(input_mask, 0, 1), axis=2) * 255
    image_wo_background = np.concatenate([input_image, mask], axis=2)  # H, W, 4
    image_wo_background = Image.fromarray(image_wo_background).convert("RGBA")

    return image_wo_background
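The painters above are plain NumPy/OpenCV helpers, so they can be exercised without SAM. A minimal, hypothetical sketch (not part of the committed file; the synthetic frame, toy rectangular mask, and output file name are assumptions):

# Hypothetical usage sketch -- synthetic inputs stand in for a real frame and SAM mask.
import numpy as np
from utils.painter import mask_painter, point_painter, background_remover

frame = np.full((480, 640, 3), 40, dtype=np.uint8)   # synthetic gray RGB frame
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:200, 150:300] = 1                           # pretend this is the segmented object

overlay = mask_painter(frame, mask, mask_color=3, mask_alpha=0.7)         # colored overlay plus contour
overlay = point_painter(overlay, np.array([[200, 150]]), point_color=8)   # mark a click at (x=200, y=150)
cutout = background_remover(frame, mask)             # RGBA PIL image, background made transparent
cutout.save("object_cutout.png")

mask_painter and point_painter both return uint8 RGB arrays, so the calls can be chained; background_remover is the only helper that converts to a PIL image.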