Spaces: Running on Zero
Commit 195fd31 · vikhyatk committed
Parent(s): initial commit

Browse files:
- .gitignore +51 -0
- README.md +199 -0
- app.py +214 -0
- main.py +742 -0
- packages.txt +2 -0
- requirements.txt +15 -0

.gitignore
ADDED
@@ -0,0 +1,51 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/
.venv/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Project specific
inputs/*
outputs/*
!inputs/.gitkeep
!outputs/.gitkeep
inputs/
outputs/

# Model files
*.pth
*.onnx
*.pt

# Logs
*.log

certificate.pem
README.md
ADDED
@@ -0,0 +1,199 @@
---
title: Video Redaction
emoji: 🐨
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
pinned: false
---

# Promptable Video Redaction with Moondream

This tool uses Moondream 2B, a powerful yet lightweight vision-language model, to detect and redact objects from videos. Moondream can recognize a wide variety of objects, people, text, and more with high accuracy while being much smaller than traditional models.

[Try it now.](https://huggingface.co/spaces/moondream/promptable-video-redaction)

## About Moondream

Moondream is a tiny yet powerful vision-language model that can analyze images and answer questions about them. It's designed to be lightweight and efficient while maintaining high accuracy. Some key features:

- Only 2B parameters
- Fast inference with minimal resource requirements
- Supports CPU and GPU execution
- Open source and free to use
- Can detect almost anything you can describe in natural language

Links:
- [GitHub Repository](https://github.com/vikhyat/moondream)
- [Hugging Face](https://huggingface.co/vikhyatk/moondream2)
- [Build with Moondream](http://docs.moondream.ai/)

## Features

- Real-time object detection in videos using Moondream
- Multiple visualization styles:
  - Censor: black boxes over detected objects
  - Bounding Box: traditional bounding boxes with labels
  - Hitmarker: Call of Duty-style crosshair markers
- Optional grid-based detection for improved accuracy
- Flexible object type detection using natural language
- Frame-by-frame processing with IoU-based merging (see the sketch after this list)
- Batch processing of multiple videos
- Web-compatible output format
- User-friendly web interface
- Command-line interface for automation
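
The IoU-based merging works like the non-maximum-suppression pass in `merge_tile_detections` in `main.py`: overlapping boxes reported by neighbouring tiles are collapsed, keeping the larger box. A minimal sketch of the idea, assuming boxes are normalized `[x1, y1, x2, y2]` lists; the helper names here are illustrative and not part of the repo:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, iou_threshold=0.5):
    """Greedy NMS: keep larger boxes, drop smaller ones that overlap them too much."""
    boxes = sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    kept = []
    for box in boxes:
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept
```
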
## Requirements

- Python 3.8+
- OpenCV (cv2)
- PyTorch
- Transformers
- Pillow (PIL)
- tqdm
- ffmpeg
- numpy
- gradio (for web interface)

## Installation

1. Clone this repository and create a new virtual environment:

   ```bash
   git clone https://github.com/vikhyat/moondream/blob/main/recipes/promptable-video-redaction
   python -m venv .venv
   source .venv/bin/activate
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Install ffmpeg and libvips:
   - On Ubuntu/Debian: `sudo apt-get install ffmpeg libvips`
   - On macOS: `brew install ffmpeg`
   - On Windows: download from [ffmpeg.org](https://ffmpeg.org/download.html)

   > Installing libvips on Windows requires some additional steps; see [here](https://docs.moondream.ai/quick-start).

## Usage

### Web Interface

1. Start the web interface:

   ```bash
   python app.py
   ```

2. Open the provided URL in your browser.

3. Use the interface to:
   - Upload your video
   - Specify what to censor (e.g., face, logo, text)
   - Adjust processing speed and quality
   - Configure grid size for detection
   - Process and download the censored video

### Command Line Interface

1. Create an `inputs` directory in the same folder as the script:

   ```bash
   mkdir inputs
   ```

2. Place your video files in the `inputs` directory. Supported formats:
   - .mp4
   - .avi
   - .mov
   - .mkv
   - .webm

3. Run the script:

   ```bash
   python main.py
   ```

### Optional Arguments

- `--test`: Process only the first 3 seconds of each video (useful for testing detection settings).

  ```bash
  python main.py --test
  ```

- `--preset`: Choose the FFmpeg encoding preset (trades output quality against speed).

  ```bash
  python main.py --preset ultrafast  # Fastest, lower quality
  python main.py --preset veryslow   # Slowest, highest quality
  ```

- `--detect`: Specify what object type to detect (using natural language).

  ```bash
  python main.py --detect person                  # Detect people
  python main.py --detect "red car"               # Detect red cars
  python main.py --detect "person wearing a hat"  # Detect people with hats
  ```

- `--box-style`: Choose the visualization style.

  ```bash
  python main.py --box-style censor        # Black boxes (default)
  python main.py --box-style bounding-box  # Bounding boxes with labels
  python main.py --box-style hitmarker     # COD-style hitmarkers
  ```

- `--rows` and `--cols`: Enable grid-based detection by splitting frames into tiles (a sketch of the tile splitting follows these examples).

  ```bash
  python main.py --rows 2 --cols 2  # Split each frame into a 2x2 grid
  python main.py --rows 3 --cols 3  # Split each frame into a 3x3 grid
  ```

You can combine arguments:

```bash
python main.py --detect "person wearing sunglasses" --box-style bounding-box --test --preset "fast" --rows 2 --cols 2
```
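
Grid-based detection runs Moondream once per tile and then maps each tile's normalized boxes back into full-frame coordinates (this is what `split_frame_into_tiles` and `convert_tile_coords_to_frame` do in `main.py`). A simplified sketch of that coordinate mapping; the function and example values below are illustrative, not part of the repo:

```python
def tile_box_to_frame(box, tile_pos, frame_w, frame_h):
    """Map a normalized [x1, y1, x2, y2] box from tile space to frame space."""
    tx1, ty1, tx2, ty2 = tile_pos          # tile corners in pixels
    tw, th = tx2 - tx1, ty2 - ty1
    x1 = (tx1 + box[0] * tw) / frame_w
    y1 = (ty1 + box[1] * th) / frame_h
    x2 = (tx1 + box[2] * tw) / frame_w
    y2 = (ty1 + box[3] * th) / frame_h
    return [x1, y1, x2, y2]

# A box covering the right half of the top-left tile of a 2x2 grid on a 1920x1080 frame
print(tile_box_to_frame([0.5, 0.0, 1.0, 1.0], (0, 0, 960, 540), 1920, 1080))
# -> [0.25, 0.0, 0.5, 0.5]
```
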
### Visualization Styles

The tool supports three different visualization styles for detected objects:

1. **Censor** (default)
   - Places solid black rectangles over detected objects
   - Best for privacy and content moderation
   - Completely obscures the detected region

2. **Bounding Box**
   - Traditional object detection style
   - Red bounding box around detected objects
   - Label showing the object type above the box
   - Good for analysis and debugging

3. **Hitmarker**
   - Call of Duty-inspired visualization
   - White crosshair marker at the center of detected objects
   - Small label above the marker
   - Stylistic choice for gaming-inspired visualization

Choose the style that best fits your use case using the `--box-style` argument.

## Output

Processed videos are saved in the `outputs` directory with the format
`[style]_[object_type]_[original_filename].mp4`.

For example:

- `censor_face_video.mp4`
- `bounding-box_person_video.mp4`
- `hitmarker_car_video.mp4`

The output videos include:

- The original video content
- The selected visualization style for detected objects
- Web-compatible H.264 encoding

## Notes

- Processing time depends on video length, grid size, and GPU availability
- A GPU is strongly recommended for faster processing
- Sufficient disk space is required for temporary files
- Detection quality varies with video quality and Moondream's ability to recognize the specified object
- Grid-based detection impacts performance significantly; use it only when needed
- The web interface shows progress updates and errors
- Choose the visualization style that fits your use case
- Moondream can detect almost anything you can describe in natural language
app.py
ADDED
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
import gradio as gr
import os
from main import load_moondream, process_video
import tempfile
import shutil
import torch

# import spaces

# Get absolute path to workspace root
WORKSPACE_ROOT = os.path.dirname(os.path.abspath(__file__))

# Check CUDA availability (we want to see True) and the GPU name
print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")

# Initialize model globally for reuse
print("Loading Moondream model...")
model, tokenizer = load_moondream()


# Uncomment for Hugging Face Spaces
# @spaces.GPU(duration=120)
def process_video_file(
    video_file, detect_keyword, box_style, ffmpeg_preset, rows, cols, test_mode
):
    """Process a video file through the Gradio interface."""
    try:
        if not video_file:
            raise gr.Error("Please upload a video file")

        # Ensure input/output directories exist using absolute paths
        inputs_dir = os.path.join(WORKSPACE_ROOT, "inputs")
        outputs_dir = os.path.join(WORKSPACE_ROOT, "outputs")
        os.makedirs(inputs_dir, exist_ok=True)
        os.makedirs(outputs_dir, exist_ok=True)

        # Copy uploaded video to inputs directory
        video_filename = f"input_{os.path.basename(video_file)}"
        input_video_path = os.path.join(inputs_dir, video_filename)
        shutil.copy2(video_file, input_video_path)

        try:
            # Process the video
            output_path = process_video(
                input_video_path,
                detect_keyword,
                test_mode=test_mode,
                ffmpeg_preset=ffmpeg_preset,
                rows=rows,
                cols=cols,
                box_style=box_style,
            )

            # Verify output exists and is readable
            if not output_path or not os.path.exists(output_path):
                print(f"Warning: Output path {output_path} does not exist")
                # Try to find the output based on expected naming convention
                expected_output = os.path.join(
                    outputs_dir, f"{box_style}_{detect_keyword}_{video_filename}"
                )
                if os.path.exists(expected_output):
                    output_path = expected_output
                else:
                    # Try searching in outputs directory for any matching file
                    matching_files = [
                        f
                        for f in os.listdir(outputs_dir)
                        if f.startswith(f"{box_style}_{detect_keyword}_")
                    ]
                    if matching_files:
                        output_path = os.path.join(outputs_dir, matching_files[0])
                    else:
                        raise gr.Error("Failed to locate output video")

            # Convert output path to absolute path if it isn't already
            if not os.path.isabs(output_path):
                output_path = os.path.join(WORKSPACE_ROOT, output_path)

            print(f"Returning output path: {output_path}")
            return output_path

        finally:
            # Clean up input file
            try:
                if os.path.exists(input_video_path):
                    os.remove(input_video_path)
            except Exception:
                pass

    except Exception as e:
        print(f"Error in process_video_file: {str(e)}")
        raise gr.Error(f"Error processing video: {str(e)}")


# Create the Gradio interface
with gr.Blocks(title="Promptable Video Redaction") as app:
    gr.Markdown("# Promptable Video Redaction with Moondream")
    gr.Markdown(
        """
        [Moondream 2B](https://github.com/vikhyat/moondream) is a lightweight vision model that detects and visualizes objects in videos. It can identify objects, people, text and more.

        Upload a video and specify what to detect. The app will process each frame and apply your chosen visualization style. For help, join the [Moondream Discord](https://discord.com/invite/tRUdpjDQfH).
        """
    )

    with gr.Row():
        with gr.Column():
            # Input components
            video_input = gr.Video(label="Upload Video")

            detect_input = gr.Textbox(
                label="What to Detect",
                placeholder="e.g. face, logo, text, person, car, dog, etc.",
                value="face",
                info="Moondream can detect anything that you can describe in natural language",
            )

            gr.Examples(
                examples=[
                    ["examples/homealone.mp4", "face"],
                    ["examples/soccer.mp4", "ball"],
                    ["examples/rally.mp4", "license plate"],
                ],
                inputs=[video_input, detect_input],
                label="Try these examples",
            )

            process_btn = gr.Button("Process Video", variant="primary")

            with gr.Accordion("Advanced Settings", open=False):
                box_style_input = gr.Radio(
                    choices=["censor", "bounding-box", "hitmarker"],
                    value="censor",
                    label="Visualization Style",
                    info="Choose how to display detections",
                )
                preset_input = gr.Dropdown(
                    choices=[
                        "ultrafast",
                        "superfast",
                        "veryfast",
                        "faster",
                        "fast",
                        "medium",
                        "slow",
                        "slower",
                        "veryslow",
                    ],
                    value="medium",
                    label="Processing Speed (faster = lower quality)",
                )
                with gr.Row():
                    rows_input = gr.Slider(
                        minimum=1, maximum=4, value=1, step=1, label="Grid Rows"
                    )
                    cols_input = gr.Slider(
                        minimum=1, maximum=4, value=1, step=1, label="Grid Columns"
                    )

                test_mode_input = gr.Checkbox(
                    label="Test Mode (Process first 3 seconds only)",
                    value=True,
                    info="Enable to quickly test settings on a short clip before processing the full video (recommended)",
                )

                gr.Markdown(
                    """
                    Note: Test mode processes only the first 3 seconds of the video and is recommended for trying out settings.
                    """
                )

                gr.Markdown(
                    """
                    You can roughly estimate processing time by multiplying the video's framerate by its length in seconds and by the number of grid rows and columns, assuming about 0.12 seconds of processing per detection.
                    For example, a 3-second video at 30 fps with a 2x2 grid takes roughly 3 * 30 * 2 * 2 * 0.12 = 43.2 seconds (tested on a 4090 GPU).
                    """
                )

        with gr.Column():
            # Output components
            video_output = gr.Video(label="Processed Video")

            # About section under the video output
            gr.Markdown(
                """
                ### Links:
                - [GitHub Repository](https://github.com/vikhyat/moondream)
                - [Hugging Face](https://huggingface.co/vikhyatk/moondream2)
                - [Python Package](https://pypi.org/project/moondream/)
                - [Moondream Recipes](https://docs.moondream.ai/recipes)
                """
            )

    # Event handlers
    process_btn.click(
        fn=process_video_file,
        inputs=[
            video_input,
            detect_input,
            box_style_input,
            preset_input,
            rows_input,
            cols_input,
            test_mode_input,
        ],
        outputs=video_output,
    )

if __name__ == "__main__":
    app.launch(share=True)
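The processing-time note embedded in the interface above is plain arithmetic: frames processed × tiles per frame × seconds per detection. A small sketch reproducing that estimate, using the 0.12 s-per-detection figure quoted in the app text (the helper name is illustrative, not part of the repo):

```python
def estimate_processing_seconds(fps, duration_s, rows=1, cols=1, secs_per_detection=0.12):
    """Rough estimate: one Moondream detection call per tile per frame."""
    return fps * duration_s * rows * cols * secs_per_detection

# The example from the interface text: a 3-second clip at 30 fps with a 2x2 grid
print(estimate_processing_seconds(30, 3, rows=2, cols=2))  # 43.2
```
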
main.py
ADDED
@@ -0,0 +1,742 @@
#!/usr/bin/env python3
import cv2, os, subprocess, argparse
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm
import numpy as np
from datetime import datetime

# Constants
TEST_MODE_DURATION = 3  # Process only first 3 seconds in test mode
FFMPEG_PRESETS = [
    "ultrafast",
    "superfast",
    "veryfast",
    "faster",
    "fast",
    "medium",
    "slow",
    "slower",
    "veryslow",
]
FONT = cv2.FONT_HERSHEY_SIMPLEX  # Font for bounding-box-style labels

# Detection parameters
IOU_THRESHOLD = 0.5  # IoU threshold for considering boxes related

# Hitmarker parameters
HITMARKER_SIZE = 20  # Size of the hitmarker in pixels
HITMARKER_GAP = 3  # Size of the empty space in the middle (reduced from 8)
HITMARKER_THICKNESS = 2  # Thickness of hitmarker lines
HITMARKER_COLOR = (255, 255, 255)  # White color for hitmarker
HITMARKER_SHADOW_COLOR = (80, 80, 80)  # Lighter gray for shadow effect
HITMARKER_SHADOW_OFFSET = 1  # Smaller shadow offset


def load_moondream():
    """Load Moondream model and tokenizer."""
    model = AutoModelForCausalLM.from_pretrained(
        "vikhyatk/moondream2", trust_remote_code=True, device_map={"": "cuda"}
    )
    tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")
    return model, tokenizer


def get_video_properties(video_path):
    """Get basic video properties."""
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    video.release()
    return {"fps": fps, "frame_count": frame_count, "width": width, "height": height}


def is_valid_box(box):
    """Check if box coordinates are reasonable."""
    x1, y1, x2, y2 = box
    width = x2 - x1
    height = y2 - y1

    # Reject boxes that are too large (over 90% of frame in both dimensions)
    if width > 0.9 and height > 0.9:
        return False

    # Reject boxes that are too small (less than 1% of frame)
    if width < 0.01 or height < 0.01:
        return False

    return True


def split_frame_into_tiles(frame, rows, cols):
    """Split a frame into a grid of tiles."""
    height, width = frame.shape[:2]
    tile_height = height // rows
    tile_width = width // cols
    tiles = []
    tile_positions = []

    for i in range(rows):
        for j in range(cols):
            y1 = i * tile_height
            y2 = (i + 1) * tile_height if i < rows - 1 else height
            x1 = j * tile_width
            x2 = (j + 1) * tile_width if j < cols - 1 else width

            tile = frame[y1:y2, x1:x2]
            tiles.append(tile)
            tile_positions.append((x1, y1, x2, y2))

    return tiles, tile_positions


def convert_tile_coords_to_frame(box, tile_pos, frame_shape):
    """Convert coordinates from tile space to frame space."""
    frame_height, frame_width = frame_shape[:2]
    tile_x1, tile_y1, tile_x2, tile_y2 = tile_pos
    tile_width = tile_x2 - tile_x1
    tile_height = tile_y2 - tile_y1

    x1_tile_abs = box[0] * tile_width
    y1_tile_abs = box[1] * tile_height
    x2_tile_abs = box[2] * tile_width
    y2_tile_abs = box[3] * tile_height

    x1_frame_abs = tile_x1 + x1_tile_abs
    y1_frame_abs = tile_y1 + y1_tile_abs
    x2_frame_abs = tile_x1 + x2_tile_abs
    y2_frame_abs = tile_y1 + y2_tile_abs

    x1_norm = x1_frame_abs / frame_width
    y1_norm = y1_frame_abs / frame_height
    x2_norm = x2_frame_abs / frame_width
    y2_norm = y2_frame_abs / frame_height

    x1_norm = max(0.0, min(1.0, x1_norm))
    y1_norm = max(0.0, min(1.0, y1_norm))
    x2_norm = max(0.0, min(1.0, x2_norm))
    y2_norm = max(0.0, min(1.0, y2_norm))

    return [x1_norm, y1_norm, x2_norm, y2_norm]


def merge_tile_detections(tile_detections, iou_threshold=0.5):
    """Merge detections from different tiles using NMS-like approach."""
    if not tile_detections:
        return []

    all_boxes = []
    all_keywords = []

    # Collect all boxes and their keywords
    for detections in tile_detections:
        for box, keyword in detections:
            all_boxes.append(box)
            all_keywords.append(keyword)

    if not all_boxes:
        return []

    # Convert to numpy for easier processing
    boxes = np.array(all_boxes)

    # Calculate areas
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    # Sort boxes by area
    order = areas.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)

        if order.size == 1:
            break

        # Calculate IoU with rest of boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1)
        h = np.maximum(0.0, yy2 - yy1)
        inter = w * h

        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        # Get indices of boxes with IoU less than threshold
        inds = np.where(ovr <= iou_threshold)[0]
        order = order[inds + 1]

    return [(all_boxes[i], all_keywords[i]) for i in keep]


def detect_ads_in_frame(model, tokenizer, image, detect_keyword, rows=1, cols=1):
    """Detect objects in a frame using grid-based detection."""
    if rows == 1 and cols == 1:
        return detect_ads_in_frame_single(model, tokenizer, image, detect_keyword)

    # Convert numpy array to PIL Image if needed
    if not isinstance(image, Image.Image):
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Split frame into tiles
    tiles, tile_positions = split_frame_into_tiles(image, rows, cols)

    # Process each tile
    tile_detections = []
    for tile, tile_pos in zip(tiles, tile_positions):
        # Convert tile to PIL Image
        tile_pil = Image.fromarray(tile)

        # Detect objects in tile
        response = model.detect(tile_pil, detect_keyword)

        if response and "objects" in response and response["objects"]:
            objects = response["objects"]
            tile_objects = []

            for obj in objects:
                if all(k in obj for k in ["x_min", "y_min", "x_max", "y_max"]):
                    box = [obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"]]

                    if is_valid_box(box):
                        # Convert tile coordinates to frame coordinates
                        frame_box = convert_tile_coords_to_frame(
                            box, tile_pos, image.shape
                        )
                        tile_objects.append((frame_box, detect_keyword))

            if tile_objects:  # Only append if we found valid objects
                tile_detections.append(tile_objects)

    # Merge detections from all tiles
    merged_detections = merge_tile_detections(tile_detections)
    return merged_detections


def detect_ads_in_frame_single(model, tokenizer, image, detect_keyword):
    """Single-frame detection function."""
    detected_objects = []

    # Convert numpy array to PIL Image if needed
    if not isinstance(image, Image.Image):
        image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Detect objects
    response = model.detect(image, detect_keyword)

    # Check if we have valid objects
    if response and "objects" in response and response["objects"]:
        objects = response["objects"]

        for obj in objects:
            if all(k in obj for k in ["x_min", "y_min", "x_max", "y_max"]):
                box = [obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"]]
                # If box is valid (not full-frame), add it
                if is_valid_box(box):
                    detected_objects.append((box, detect_keyword))

    return detected_objects


def draw_hitmarker(
    frame, center_x, center_y, size=HITMARKER_SIZE, color=HITMARKER_COLOR, shadow=True
):
    """Draw a COD-style hitmarker cross with more space in the middle."""
    half_size = size // 2

    # Draw shadow first if enabled
    if shadow:
        # Top-left to center shadow
        cv2.line(
            frame,
            (
                center_x - half_size + HITMARKER_SHADOW_OFFSET,
                center_y - half_size + HITMARKER_SHADOW_OFFSET,
            ),
            (
                center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
                center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
            ),
            HITMARKER_SHADOW_COLOR,
            HITMARKER_THICKNESS,
        )
        # Top-right to center shadow
        cv2.line(
            frame,
            (
                center_x + half_size + HITMARKER_SHADOW_OFFSET,
                center_y - half_size + HITMARKER_SHADOW_OFFSET,
            ),
            (
                center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
                center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
            ),
            HITMARKER_SHADOW_COLOR,
            HITMARKER_THICKNESS,
        )
        # Bottom-left to center shadow
        cv2.line(
            frame,
            (
                center_x - half_size + HITMARKER_SHADOW_OFFSET,
                center_y + half_size + HITMARKER_SHADOW_OFFSET,
            ),
            (
                center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
                center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
            ),
            HITMARKER_SHADOW_COLOR,
            HITMARKER_THICKNESS,
        )
        # Bottom-right to center shadow
        cv2.line(
            frame,
            (
                center_x + half_size + HITMARKER_SHADOW_OFFSET,
                center_y + half_size + HITMARKER_SHADOW_OFFSET,
            ),
            (
                center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
                center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
            ),
            HITMARKER_SHADOW_COLOR,
            HITMARKER_THICKNESS,
        )

    # Draw main hitmarker
    # Top-left to center
    cv2.line(
        frame,
        (center_x - half_size, center_y - half_size),
        (center_x - HITMARKER_GAP, center_y - HITMARKER_GAP),
        color,
        HITMARKER_THICKNESS,
    )
    # Top-right to center
    cv2.line(
        frame,
        (center_x + half_size, center_y - half_size),
        (center_x + HITMARKER_GAP, center_y - HITMARKER_GAP),
        color,
        HITMARKER_THICKNESS,
    )
    # Bottom-left to center
    cv2.line(
        frame,
        (center_x - half_size, center_y + half_size),
        (center_x - HITMARKER_GAP, center_y + HITMARKER_GAP),
        color,
        HITMARKER_THICKNESS,
    )
    # Bottom-right to center
    cv2.line(
        frame,
        (center_x + half_size, center_y + half_size),
        (center_x + HITMARKER_GAP, center_y + HITMARKER_GAP),
        color,
        HITMARKER_THICKNESS,
    )


def draw_ad_boxes(frame, detected_objects, detect_keyword, box_style="censor"):
    """Draw detection visualizations over detected objects.

    Args:
        frame: The video frame to draw on
        detected_objects: List of (box, keyword) tuples
        detect_keyword: The detection keyword
        box_style: Visualization style ('censor', 'bounding-box', or 'hitmarker')
    """
    height, width = frame.shape[:2]

    for box, keyword in detected_objects:
        try:
            # Convert normalized coordinates to pixel coordinates
            x1 = int(box[0] * width)
            y1 = int(box[1] * height)
            x2 = int(box[2] * width)
            y2 = int(box[3] * height)

            # Ensure coordinates are within frame boundaries
            x1 = max(0, min(x1, width - 1))
            y1 = max(0, min(y1, height - 1))
            x2 = max(0, min(x2, width - 1))
            y2 = max(0, min(y2, height - 1))

            # Only draw if box has reasonable size
            if x2 > x1 and y2 > y1:
                if box_style == "censor":
                    # Draw solid black rectangle
                    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)
                elif box_style == "bounding-box":
                    # Draw red rectangle with thicker line
                    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)

                    # Add label with background
                    label = detect_keyword  # Use exact capitalization
                    label_size = cv2.getTextSize(label, FONT, 0.7, 2)[0]
                    cv2.rectangle(
                        frame, (x1, y1 - 25), (x1 + label_size[0], y1), (0, 0, 255), -1
                    )
                    cv2.putText(
                        frame,
                        label,
                        (x1, y1 - 6),
                        FONT,
                        0.7,
                        (255, 255, 255),
                        2,
                        cv2.LINE_AA,
                    )
                elif box_style == "hitmarker":
                    # Calculate center of the box
                    center_x = (x1 + x2) // 2
                    center_y = (y1 + y2) // 2

                    # Draw hitmarker at the center
                    draw_hitmarker(frame, center_x, center_y)

                    # Optional: Add small label above hitmarker
                    label = detect_keyword  # Use exact capitalization
                    label_size = cv2.getTextSize(label, FONT, 0.5, 1)[0]
                    cv2.putText(
                        frame,
                        label,
                        (center_x - label_size[0] // 2, center_y - HITMARKER_SIZE - 5),
                        FONT,
                        0.5,
                        HITMARKER_COLOR,
                        1,
                        cv2.LINE_AA,
                    )
        except Exception as e:
            print(f"Error drawing {box_style} style box: {str(e)}")

    return frame


def filter_temporal_outliers(detections_dict):
    """Filter out extremely large detections that take up most of the frame.
    Only keeps detections that are reasonable in size.

    Args:
        detections_dict: Dictionary of {frame_number: [(box, keyword), ...]}
    """
    filtered_detections = {}

    for t, detections in detections_dict.items():
        # Only keep detections that aren't too large
        valid_detections = []
        for box, keyword in detections:
            # Calculate box size as percentage of frame
            width = box[2] - box[0]
            height = box[3] - box[1]
            area = width * height

            # If box is less than 90% of frame, keep it
            if area < 0.9:
                valid_detections.append((box, keyword))

        if valid_detections:
            filtered_detections[t] = valid_detections

    return filtered_detections


def describe_frames(
    video_path, model, tokenizer, detect_keyword, test_mode=False, rows=1, cols=1
):
    """Extract and detect objects in frames."""
    props = get_video_properties(video_path)
    fps = props["fps"]

    # If in test mode, only process first 3 seconds
    if test_mode:
        frame_count = min(int(fps * TEST_MODE_DURATION), props["frame_count"])
    else:
        frame_count = props["frame_count"]

    ad_detections = {}  # Store detection results by frame number

    print("Extracting frames and detecting objects...")
    video = cv2.VideoCapture(video_path)

    # Process every frame
    frame_count_processed = 0
    with tqdm(total=frame_count) as pbar:
        while frame_count_processed < frame_count:
            ret, frame = video.read()
            if not ret:
                break

            # Detect objects in the frame
            detected_objects = detect_ads_in_frame(
                model, tokenizer, frame, detect_keyword, rows=rows, cols=cols
            )

            # Store results for every frame, even if empty
            ad_detections[frame_count_processed] = detected_objects

            frame_count_processed += 1
            pbar.update(1)

    video.release()

    if frame_count_processed == 0:
        print("No frames could be read from video")
        return {}

    # Filter out only extremely large detections
    ad_detections = filter_temporal_outliers(ad_detections)
    return ad_detections


def create_detection_video(
    video_path,
    ad_detections,
    detect_keyword,
    output_path=None,
    ffmpeg_preset="medium",
    test_mode=False,
    box_style="censor",
):
    """Create video with detection boxes."""
    if output_path is None:
        # Create outputs directory if it doesn't exist
        outputs_dir = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), "outputs"
        )
        os.makedirs(outputs_dir, exist_ok=True)

        # Clean the detect_keyword for filename
        safe_keyword = "".join(
            x for x in detect_keyword if x.isalnum() or x in (" ", "_", "-")
        )
        safe_keyword = safe_keyword.replace(" ", "_")

        # Create output filename
        base_name = os.path.splitext(os.path.basename(video_path))[0]
        output_path = os.path.join(
            outputs_dir, f"{box_style}_{safe_keyword}_{base_name}.mp4"
        )

    print(f"Will save output to: {output_path}")

    props = get_video_properties(video_path)
    fps, width, height = props["fps"], props["width"], props["height"]

    # If in test mode, only process first few seconds
    if test_mode:
        frame_count = min(int(fps * TEST_MODE_DURATION), props["frame_count"])
    else:
        frame_count = props["frame_count"]

    video = cv2.VideoCapture(video_path)

    # Create temp output path by adding _temp before the extension
    base, ext = os.path.splitext(output_path)
    temp_output = f"{base}_temp{ext}"

    out = cv2.VideoWriter(
        temp_output, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
    )

    print("Creating detection video...")
    frame_count_processed = 0

    with tqdm(total=frame_count) as pbar:
        while frame_count_processed < frame_count:
            ret, frame = video.read()
            if not ret:
                break

            # Get detections for this exact frame
            if frame_count_processed in ad_detections:
                current_detections = ad_detections[frame_count_processed]
                if current_detections:
                    frame = draw_ad_boxes(
                        frame, current_detections, detect_keyword, box_style=box_style
                    )

            out.write(frame)
            frame_count_processed += 1
            pbar.update(1)

    video.release()
    out.release()

    # Convert to web-compatible format more efficiently
    try:
        subprocess.run(
            [
                "ffmpeg",
                "-y",
                "-i",
                temp_output,
                "-c:v",
                "libx264",
                "-preset",
                ffmpeg_preset,
                "-crf",
                "23",
                "-movflags",
                "+faststart",  # Better web playback
                "-loglevel",
                "error",
                output_path,
            ],
            check=True,
        )

        os.remove(temp_output)  # Remove the temporary file

        if not os.path.exists(output_path):
            print(
                f"Warning: FFmpeg completed but output file not found at {output_path}"
            )
            return None

        return output_path

    except subprocess.CalledProcessError as e:
        print(f"Error running FFmpeg: {str(e)}")
        if os.path.exists(temp_output):
            os.remove(temp_output)
        return None


def process_video(
    video_path,
    detect_keyword,
    test_mode=False,
    ffmpeg_preset="medium",
    rows=1,
    cols=1,
    box_style="censor",
):
    """Process a single video file."""
    print(f"\nProcessing: {video_path}")
    print(f"Looking for: {detect_keyword}")

    # Load model
    print("Loading Moondream model...")
    model, tokenizer = load_moondream()

    # Process video - detect objects
    ad_detections = describe_frames(
        video_path, model, tokenizer, detect_keyword, test_mode, rows, cols
    )

    # Create video with detection boxes
    output_path = create_detection_video(
        video_path,
        ad_detections,
        detect_keyword,
        ffmpeg_preset=ffmpeg_preset,
        test_mode=test_mode,
        box_style=box_style,
    )

    if output_path is None:
        print("\nError: Failed to create output video")
        return None

    print(f"\nOutput saved to: {output_path}")
    return output_path


def main():
    """Process all videos in the inputs directory."""
    parser = argparse.ArgumentParser(
        description="Detect objects in videos using Moondream2"
    )
    parser.add_argument(
        "--test", action="store_true", help="Process only first 3 seconds of each video"
    )
    parser.add_argument(
        "--preset",
        choices=FFMPEG_PRESETS,
        default="medium",
        help="FFmpeg encoding preset (default: medium). Faster presets = lower quality",
    )
    parser.add_argument(
        "--detect",
        type=str,
        default="face",
        help='Object to detect in the video (default: face, use --detect "thing to detect" to override)',
    )
    parser.add_argument(
        "--rows",
        type=int,
        default=1,
        help="Number of rows to split each frame into (default: 1)",
    )
    parser.add_argument(
        "--cols",
        type=int,
        default=1,
        help="Number of columns to split each frame into (default: 1)",
    )
    parser.add_argument(
        "--box-style",
        choices=["censor", "bounding-box", "hitmarker"],
        default="censor",
        help="Style of detection visualization (default: censor)",
    )
    args = parser.parse_args()

    input_dir = "inputs"
    os.makedirs(input_dir, exist_ok=True)
    os.makedirs("outputs", exist_ok=True)

    video_files = [
        f
        for f in os.listdir(input_dir)
        if f.lower().endswith((".mp4", ".avi", ".mov", ".mkv", ".webm"))
    ]

    if not video_files:
        print("No video files found in 'inputs' directory")
        return

    print(f"Found {len(video_files)} videos to process")
    print(f"Will detect: {args.detect}")
    if args.test:
        print("Running in test mode - processing only first 3 seconds of each video")
    print(f"Using FFmpeg preset: {args.preset}")
    print(f"Grid size: {args.rows}x{args.cols}")
    print(f"Box style: {args.box_style}")

    success_count = 0
    for video_file in video_files:
        video_path = os.path.join(input_dir, video_file)
        output_path = process_video(
            video_path,
            args.detect,
            test_mode=args.test,
            ffmpeg_preset=args.preset,
            rows=args.rows,
            cols=args.cols,
            box_style=args.box_style,
        )
        if output_path:
            success_count += 1

    print(
        f"\nProcessing complete. Successfully processed {success_count} out of {len(video_files)} videos."
    )


if __name__ == "__main__":
    main()
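`process_video` above can also be driven from Python instead of the CLI. A minimal usage sketch; the input path and keyword below are placeholders rather than files shipped with the repo:

```python
from main import process_video

# Hypothetical input placed in inputs/ by hand; keyword and settings are illustrative.
output = process_video(
    "inputs/example.mp4",
    "license plate",
    test_mode=True,        # only the first 3 seconds
    ffmpeg_preset="fast",
    rows=2,
    cols=2,
    box_style="censor",
)
print(output)  # e.g. .../outputs/censor_license_plate_example.mp4
```
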
packages.txt
ADDED
@@ -0,0 +1,2 @@
libvips
ffmpeg
requirements.txt
ADDED
@@ -0,0 +1,15 @@
gradio>=4.0.0
torch
transformers
opencv-python
pillow
numpy
tqdm
ffmpeg-python
einops
pyvips
accelerate
# for spaces
--extra-index-url https://download.pytorch.org/whl/cu113
torch
spaces