vikhyatk committed
Commit 195fd31 · 0 Parent(s)

initial commit

Files changed (6)
  1. .gitignore +51 -0
  2. README.md +199 -0
  3. app.py +214 -0
  4. main.py +742 -0
  5. packages.txt +2 -0
  6. requirements.txt +15 -0
.gitignore ADDED
@@ -0,0 +1,51 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual Environment
24
+ venv/
25
+ env/
26
+ ENV/
27
+ .venv/
28
+
29
+ # IDE
30
+ .idea/
31
+ .vscode/
32
+ *.swp
33
+ *.swo
34
+
35
+ # Project specific
36
+ inputs/*
37
+ outputs/*
38
+ !inputs/.gitkeep
39
+ !outputs/.gitkeep
42
+
43
+ # Model files
44
+ *.pth
45
+ *.onnx
46
+ *.pt
47
+
48
+ # Logs
49
+ *.log
50
+
51
+ certificate.pem
README.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ title: Video Redaction
3
+ emoji: 🐨
4
+ colorFrom: yellow
5
+ colorTo: gray
6
+ sdk: gradio
7
+ sdk_version: 5.14.0
8
+ app_file: app.py
9
+ pinned: false
10
+ ---
11
+
12
+ # Promptable Video Redaction with Moondream
13
+
14
+ This tool uses Moondream 2B, a powerful yet lightweight vision-language model, to detect and redact objects from videos. Moondream can recognize a wide variety of objects, people, text, and more with high accuracy while being much smaller than traditional models.
15
+
16
+ [Try it now.](https://huggingface.co/spaces/moondream/promptable-video-redaction)
17
+
18
+ ## About Moondream
19
+
20
+ Moondream is a tiny yet powerful vision-language model that can analyze images and answer questions about them. It's designed to be lightweight and efficient while maintaining high accuracy. Some key features:
21
+
22
+ - Only 2B parameters
23
+ - Fast inference with minimal resource requirements
24
+ - Supports CPU and GPU execution
25
+ - Open source and free to use
26
+ - Can detect almost anything you can describe in natural language
27
+
28
+ Links:
29
+ - [GitHub Repository](https://github.com/vikhyat/moondream)
30
+ - [Hugging Face](https://huggingface.co/vikhyatk/moondream2)
31
+ - [Build with Moondream](http://docs.moondream.ai/)
32
+
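+ As a quick illustration of the detection interface this tool is built on, here is a minimal sketch of a single-image call. It mirrors how `main.py` uses the model (`model.detect(image, keyword)` returning boxes normalized to [0, 1]); the image path and keyword are placeholders, and the exact response schema may vary between moondream2 revisions.
+
+ ```python
+ from PIL import Image
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "vikhyatk/moondream2", trust_remote_code=True
+ )
+
+ image = Image.open("frame.jpg")        # placeholder image
+ response = model.detect(image, "face")  # natural-language keyword
+
+ # Each detection is a dict with x_min/y_min/x_max/y_max normalized to [0, 1]
+ for obj in response.get("objects", []):
+     print([obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"]])
+ ```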
33
+ ## Features
34
+
35
+ - Real-time object detection in videos using Moondream
36
+ - Multiple visualization styles:
37
+ - Censor: Black boxes over detected objects
38
+ - Bounding Box: Traditional bounding boxes with labels
39
+ - Hitmarker: Call of Duty style crosshair markers
40
+ - Optional grid-based detection for improved accuracy
41
+ - Flexible object type detection using natural language
42
+ - Frame-by-frame processing with IoU-based merging
43
+ - Batch processing of multiple videos
44
+ - Web-compatible output format
45
+ - User-friendly web interface
46
+ - Command-line interface for automation
47
+
48
+ ## Requirements
49
+
50
+ - Python 3.8+
51
+ - OpenCV (cv2)
52
+ - PyTorch
53
+ - Transformers
54
+ - Pillow (PIL)
55
+ - tqdm
56
+ - ffmpeg
57
+ - numpy
58
+ - gradio (for web interface)
59
+
60
+ ## Installation
61
+
62
+ 1. Clone this repository and create a new virtual environment
63
+ ```bash
64
+ git clone https://github.com/vikhyat/moondream.git
+ cd moondream/recipes/promptable-video-redaction
65
+ python -m venv .venv
66
+ source .venv/bin/activate
67
+ ```
68
+ 2. Install the required packages:
69
+ ```bash
70
+ pip install -r requirements.txt
71
+ ```
72
+ 3. Install ffmpeg and libvips:
73
+ - On Ubuntu/Debian: `sudo apt-get install ffmpeg libvips`
74
+ - On macOS: `brew install ffmpeg`
75
+ - On Windows: Download from [ffmpeg.org](https://ffmpeg.org/download.html)
76
+ > Installing libvips on Windows requires some additional steps; see [here](https://docs.moondream.ai/quick-start)
77
+
78
+ ## Usage
79
+
80
+ ### Web Interface
81
+
82
+ 1. Start the web interface:
83
+ ```bash
84
+ python app.py
85
+ ```
86
+
87
+ 2. Open the provided URL in your browser
88
+
89
+ 3. Use the interface to:
90
+ - Upload your video
91
+ - Specify what to censor (e.g., face, logo, text)
92
+ - Adjust processing speed and quality
93
+ - Configure grid size for detection
94
+ - Process and download the censored video
95
+
96
+ ### Command Line Interface
97
+
98
+ 1. Create an `inputs` directory in the same folder as the script:
99
+ ```bash
100
+ mkdir inputs
101
+ ```
102
+
103
+ 2. Place your video files in the `inputs` directory. Supported formats:
104
+ - .mp4
105
+ - .avi
106
+ - .mov
107
+ - .mkv
108
+ - .webm
109
+
110
+ 3. Run the script:
111
+ ```bash
112
+ python main.py
113
+ ```
114
+
115
+ ### Optional Arguments
116
+ - `--test`: Process only first 3 seconds of each video (useful for testing detection settings)
117
+ ```bash
118
+ python main.py --test
119
+ ```
120
+
121
+ - `--preset`: Choose FFmpeg encoding preset (affects output quality vs. speed)
122
+ ```bash
123
+ python main.py --preset ultrafast # Fastest, lower quality
124
+ python main.py --preset veryslow # Slowest, highest quality
125
+ ```
126
+
127
+ - `--detect`: Specify what object type to detect (using natural language)
128
+ ```bash
129
+ python main.py --detect person # Detect people
130
+ python main.py --detect "red car" # Detect red cars
131
+ python main.py --detect "person wearing a hat" # Detect people with hats
132
+ ```
133
+
134
+ - `--box-style`: Choose visualization style
135
+ ```bash
136
+ python main.py --box-style censor # Black boxes (default)
137
+ python main.py --box-style bounding-box # Red bounding boxes with labels
138
+ python main.py --box-style hitmarker # COD-style hitmarkers
139
+ ```
140
+
141
+ - `--rows` and `--cols`: Enable grid-based detection by splitting frames
142
+ ```bash
143
+ python main.py --rows 2 --cols 2 # Split each frame into 2x2 grid
144
+ python main.py --rows 3 --cols 3 # Split each frame into 3x3 grid
145
+ ```
146
+
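+ With a grid enabled, each frame is split into tiles, detection runs on every tile, the tile-relative boxes are mapped back to full-frame coordinates, and overlapping boxes from neighbouring tiles are merged with an IoU check. The sketch below is a simplified version of `convert_tile_coords_to_frame` and the IoU test in `merge_tile_detections` from `main.py`, shown here only to make the mechanics concrete:
+
+ ```python
+ def tile_box_to_frame(box, tile_pos, frame_w, frame_h):
+     """Map a box normalized to one tile back to frame-normalized coordinates."""
+     tx1, ty1, tx2, ty2 = tile_pos              # tile corners in pixels
+     tw, th = tx2 - tx1, ty2 - ty1
+     x1 = (tx1 + box[0] * tw) / frame_w
+     y1 = (ty1 + box[1] * th) / frame_h
+     x2 = (tx1 + box[2] * tw) / frame_w
+     y2 = (ty1 + box[3] * th) / frame_h
+     return [max(0.0, min(1.0, v)) for v in (x1, y1, x2, y2)]
+
+ def iou(a, b):
+     """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
+     ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
+     ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
+     inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
+     union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
+     return inter / union if union else 0.0
+
+ # Boxes from different tiles with IoU above 0.5 are treated as the same object.
+ ```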
147
+ You can combine arguments:
148
+ ```bash
149
+ python main.py --detect "person wearing sunglasses" --box-style bounding-box --test --preset "fast" --rows 2 --cols 2
150
+ ```
151
+
152
+ ### Visualization Styles
153
+
154
+ The tool supports three different visualization styles for detected objects:
155
+
156
+ 1. **Censor** (default)
157
+ - Places solid black rectangles over detected objects
158
+ - Best for privacy and content moderation
159
+ - Completely obscures the detected region
160
+
161
+ 2. **Bounding Box**
162
+ - Traditional object detection style
163
+ - Red bounding box around detected objects
164
+ - Label showing object type above the box
165
+ - Good for analysis and debugging
166
+
167
+ 3. **Hitmarker**
168
+ - Call of Duty inspired visualization
169
+ - White crosshair marker at center of detected objects
170
+ - Small label above the marker
171
+ - Stylistic choice for gaming-inspired visualization
172
+
173
+ Choose the style that best fits your use case using the `--box-style` argument.
174
+
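+ Under the hood, the censor and bounding-box styles come down to a couple of OpenCV calls per detection. This is a trimmed-down sketch of what `draw_ad_boxes` in `main.py` does (the hitmarker style additionally draws a crosshair with `cv2.line`):
+
+ ```python
+ import cv2
+
+ def draw_detection(frame, box, label, style="censor"):
+     """box is [x1, y1, x2, y2] normalized to [0, 1]; frame is a BGR numpy array."""
+     h, w = frame.shape[:2]
+     x1, y1 = int(box[0] * w), int(box[1] * h)
+     x2, y2 = int(box[2] * w), int(box[3] * h)
+     if style == "censor":
+         cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)   # filled black box
+     elif style == "bounding-box":
+         cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)  # red outline (BGR)
+         cv2.putText(frame, label, (x1, y1 - 6), cv2.FONT_HERSHEY_SIMPLEX,
+                     0.7, (255, 255, 255), 2, cv2.LINE_AA)
+     return frame
+ ```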
175
+ ## Output
176
+
177
+ Processed videos will be saved in the `outputs` directory with the format:
178
+ `[style]_[object_type]_[original_filename].mp4`
179
+
180
+ For example:
181
+ - `censor_face_video.mp4`
182
+ - `bounding-box_person_video.mp4`
183
+ - `hitmarker_car_video.mp4`
184
+
185
+ The output videos will include:
186
+ - Original video content
187
+ - Selected visualization style for detected objects
188
+ - Web-compatible H.264 encoding
189
+
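+ The web-compatible encoding is produced by a final FFmpeg pass over the intermediate OpenCV output. It is roughly equivalent to the command below (file names are placeholders; the preset comes from `--preset`, and `main.py` uses CRF 23 with `+faststart` for smoother web playback):
+
+ ```bash
+ ffmpeg -y -i output_temp.mp4 -c:v libx264 -preset medium -crf 23 -movflags +faststart output.mp4
+ ```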
190
+ ## Notes
191
+
192
+ - Processing time depends on video length, grid size, and GPU availability
193
+ - GPU is strongly recommended for faster processing
194
+ - Requires sufficient disk space for temporary files
195
+ - Detection quality varies based on video quality and Moondream's ability to recognize the specified object
196
+ - Grid-based detection significantly increases processing time; use it only when needed
197
+ - Web interface shows progress updates and errors
198
+ - Choose visualization style based on your use case
199
+ - Moondream can detect almost anything you can describe in natural language
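+
+ As a back-of-the-envelope guide to the processing-time note above, the web UI estimates runtime as framerate × duration × grid rows × grid columns × roughly 0.12 seconds per detection (the 0.12 s figure was measured on a 4090-class GPU, so expect it to be higher on slower hardware):
+
+ ```python
+ def estimate_seconds(fps, duration_s, rows=1, cols=1, per_detection_s=0.12):
+     """Rough processing-time estimate; ignores decode/encode overhead."""
+     return fps * duration_s * rows * cols * per_detection_s
+
+ print(estimate_seconds(30, 3, rows=2, cols=2))  # 43.2
+ ```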
app.py ADDED
@@ -0,0 +1,214 @@
1
+ #!/usr/bin/env python3
2
+ import gradio as gr
3
+ import os
4
+ from main import load_moondream, process_video
5
+ import tempfile
6
+ import shutil
7
+ import torch
8
+
9
+ # import spaces
10
+
11
+ # Get absolute path to workspace root
12
+ WORKSPACE_ROOT = os.path.dirname(os.path.abspath(__file__))
13
+
14
+ # Check CUDA availability
15
+ print(f"Is CUDA available: {torch.cuda.is_available()}")  # we want this to be True for GPU inference
16
+ # Only query the device name when CUDA is actually available, so CPU-only hosts don't crash here
17
+ if torch.cuda.is_available():
18
+     print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
19
+
20
+ # Initialize model globally for reuse
21
+ print("Loading Moondream model...")
22
+ model, tokenizer = load_moondream()
23
+
24
+
25
+ # Uncomment for Hugging Face Spaces
26
+ # @spaces.GPU(duration=120)
27
+ def process_video_file(
28
+ video_file, detect_keyword, box_style, ffmpeg_preset, rows, cols, test_mode
29
+ ):
30
+ """Process a video file through the Gradio interface."""
31
+ try:
32
+ if not video_file:
33
+ raise gr.Error("Please upload a video file")
34
+
35
+ # Ensure input/output directories exist using absolute paths
36
+ inputs_dir = os.path.join(WORKSPACE_ROOT, "inputs")
37
+ outputs_dir = os.path.join(WORKSPACE_ROOT, "outputs")
38
+ os.makedirs(inputs_dir, exist_ok=True)
39
+ os.makedirs(outputs_dir, exist_ok=True)
40
+
41
+ # Copy uploaded video to inputs directory
42
+ video_filename = f"input_{os.path.basename(video_file)}"
43
+ input_video_path = os.path.join(inputs_dir, video_filename)
44
+ shutil.copy2(video_file, input_video_path)
45
+
46
+ try:
47
+ # Process the video
48
+ output_path = process_video(
49
+ input_video_path,
50
+ detect_keyword,
51
+ test_mode=test_mode,
52
+ ffmpeg_preset=ffmpeg_preset,
53
+ rows=rows,
54
+ cols=cols,
55
+ box_style=box_style,
56
+ )
57
+
58
+ # Verify output exists and is readable
59
+ if not output_path or not os.path.exists(output_path):
60
+ print(f"Warning: Output path {output_path} does not exist")
61
+ # Try to find the output based on expected naming convention
62
+ expected_output = os.path.join(
63
+ outputs_dir, f"{box_style}_{detect_keyword}_{video_filename}"
64
+ )
65
+ if os.path.exists(expected_output):
66
+ output_path = expected_output
67
+ else:
68
+ # Try searching in outputs directory for any matching file
69
+ matching_files = [
70
+ f
71
+ for f in os.listdir(outputs_dir)
72
+ if f.startswith(f"{box_style}_{detect_keyword}_")
73
+ ]
74
+ if matching_files:
75
+ output_path = os.path.join(outputs_dir, matching_files[0])
76
+ else:
77
+ raise gr.Error("Failed to locate output video")
78
+
79
+ # Convert output path to absolute path if it isn't already
80
+ if not os.path.isabs(output_path):
81
+ output_path = os.path.join(WORKSPACE_ROOT, output_path)
82
+
83
+ print(f"Returning output path: {output_path}")
84
+ return output_path
85
+
86
+ finally:
87
+ # Clean up input file
88
+ try:
89
+ if os.path.exists(input_video_path):
90
+ os.remove(input_video_path)
91
+ except OSError:  # best-effort cleanup; ignore failures
92
+ pass
93
+
94
+ except Exception as e:
95
+ print(f"Error in process_video_file: {str(e)}")
96
+ raise gr.Error(f"Error processing video: {str(e)}")
97
+
98
+
99
+ # Create the Gradio interface
100
+ with gr.Blocks(title="Promptable Video Redaction") as app:
101
+ gr.Markdown("# Promptable Video Redaction with Moondream")
102
+ gr.Markdown(
103
+ """
104
+ [Moondream 2B](https://github.com/vikhyat/moondream) is a lightweight vision model that detects and visualizes objects in videos. It can identify objects, people, text and more.
105
+
106
+ Upload a video and specify what to detect. The app will process each frame and apply your chosen visualization style. For help, join the [Moondream Discord](https://discord.com/invite/tRUdpjDQfH).
107
+ """
108
+ )
109
+
110
+ with gr.Row():
111
+ with gr.Column():
112
+ # Input components
113
+ video_input = gr.Video(label="Upload Video")
114
+
115
+ detect_input = gr.Textbox(
116
+ label="What to Detect",
117
+ placeholder="e.g. face, logo, text, person, car, dog, etc.",
118
+ value="face",
119
+ info="Moondream can detect anything that you can describe in natural language",
120
+ )
121
+
122
+ gr.Examples(
123
+ examples=[
124
+ ["examples/homealone.mp4", "face"],
125
+ ["examples/soccer.mp4", "ball"],
126
+ ["examples/rally.mp4", "license plate"],
127
+ ],
128
+ inputs=[video_input, detect_input],
129
+ label="Try these examples",
130
+ )
131
+
132
+ process_btn = gr.Button("Process Video", variant="primary")
133
+
134
+ with gr.Accordion("Advanced Settings", open=False):
135
+ box_style_input = gr.Radio(
136
+ choices=["censor", "bounding-box", "hitmarker"],
137
+ value="censor",
138
+ label="Visualization Style",
139
+ info="Choose how to display detections",
140
+ )
141
+ preset_input = gr.Dropdown(
142
+ choices=[
143
+ "ultrafast",
144
+ "superfast",
145
+ "veryfast",
146
+ "faster",
147
+ "fast",
148
+ "medium",
149
+ "slow",
150
+ "slower",
151
+ "veryslow",
152
+ ],
153
+ value="medium",
154
+ label="Processing Speed (faster = lower quality)",
155
+ )
156
+ with gr.Row():
157
+ rows_input = gr.Slider(
158
+ minimum=1, maximum=4, value=1, step=1, label="Grid Rows"
159
+ )
160
+ cols_input = gr.Slider(
161
+ minimum=1, maximum=4, value=1, step=1, label="Grid Columns"
162
+ )
163
+
164
+ test_mode_input = gr.Checkbox(
165
+ label="Test Mode (Process first 3 seconds only)",
166
+ value=True,
167
+ info="Enable to quickly test settings on a short clip before processing the full video (recommended)",
168
+ )
169
+
170
+ gr.Markdown(
171
+ """
172
+ Note: Test mode processes only the first 3 seconds of the video and is recommended for trying out settings before a full run.
173
+ """
174
+ )
175
+
176
+ gr.Markdown(
177
+ """
178
+ You can get a rough processing-time estimate by multiplying the video's framerate by its duration in seconds and by the number of grid rows and columns, assuming roughly 0.12 seconds per detection call.
179
+ For example, a 3-second video at 30 fps with a 2x2 grid works out to 3 * 30 * 2 * 2 * 0.12 = 43.2 seconds (measured on a 4090 GPU).
180
+ """
181
+ )
182
+
183
+ with gr.Column():
184
+ # Output components
185
+ video_output = gr.Video(label="Processed Video")
186
+
187
+ # About section under the video output
188
+ gr.Markdown(
189
+ """
190
+ ### Links:
191
+ - [GitHub Repository](https://github.com/vikhyat/moondream)
192
+ - [Hugging Face](https://huggingface.co/vikhyatk/moondream2)
193
+ - [Python Package](https://pypi.org/project/moondream/)
194
+ - [Moondream Recipes](https://docs.moondream.ai/recipes)
195
+ """
196
+ )
197
+
198
+ # Event handlers
199
+ process_btn.click(
200
+ fn=process_video_file,
201
+ inputs=[
202
+ video_input,
203
+ detect_input,
204
+ box_style_input,
205
+ preset_input,
206
+ rows_input,
207
+ cols_input,
208
+ test_mode_input,
209
+ ],
210
+ outputs=video_output,
211
+ )
212
+
213
+ if __name__ == "__main__":
214
+ app.launch(share=True)
main.py ADDED
@@ -0,0 +1,742 @@
1
+ #!/usr/bin/env python3
2
+ import cv2, os, subprocess, argparse
3
+ from PIL import Image
4
+ import torch
5
+ from transformers import AutoModelForCausalLM, AutoTokenizer
6
+ from tqdm import tqdm
7
+ import numpy as np
8
+ from datetime import datetime
9
+
10
+ # Constants
11
+ TEST_MODE_DURATION = 3 # Process only first 3 seconds in test mode
12
+ FFMPEG_PRESETS = [
13
+ "ultrafast",
14
+ "superfast",
15
+ "veryfast",
16
+ "faster",
17
+ "fast",
18
+ "medium",
19
+ "slow",
20
+ "slower",
21
+ "veryslow",
22
+ ]
23
+ FONT = cv2.FONT_HERSHEY_SIMPLEX # Font for bounding-box-style labels
24
+
25
+ # Detection parameters
26
+ IOU_THRESHOLD = 0.5 # IoU threshold for considering boxes related
27
+
28
+ # Hitmarker parameters
29
+ HITMARKER_SIZE = 20 # Size of the hitmarker in pixels
30
+ HITMARKER_GAP = 3 # Size of the empty space in the middle (reduced from 8)
31
+ HITMARKER_THICKNESS = 2 # Thickness of hitmarker lines
32
+ HITMARKER_COLOR = (255, 255, 255) # White color for hitmarker
33
+ HITMARKER_SHADOW_COLOR = (80, 80, 80) # Lighter gray for shadow effect
34
+ HITMARKER_SHADOW_OFFSET = 1 # Smaller shadow offset
35
+
36
+
37
+ def load_moondream():
38
+ """Load Moondream model and tokenizer."""
39
+ model = AutoModelForCausalLM.from_pretrained(
40
+ "vikhyatk/moondream2", trust_remote_code=True, device_map={"": "cuda" if torch.cuda.is_available() else "cpu"}
41
+ )
42
+ tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")
43
+ return model, tokenizer
44
+
45
+
46
+ def get_video_properties(video_path):
47
+ """Get basic video properties."""
48
+ video = cv2.VideoCapture(video_path)
49
+ fps = video.get(cv2.CAP_PROP_FPS)
50
+ frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
51
+ width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
52
+ height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
53
+ video.release()
54
+ return {"fps": fps, "frame_count": frame_count, "width": width, "height": height}
55
+
56
+
57
+ def is_valid_box(box):
58
+ """Check if box coordinates are reasonable."""
59
+ x1, y1, x2, y2 = box
60
+ width = x2 - x1
61
+ height = y2 - y1
62
+
63
+ # Reject boxes that are too large (over 90% of frame in both dimensions)
64
+ if width > 0.9 and height > 0.9:
65
+ return False
66
+
67
+ # Reject boxes that are too small (less than 1% of frame)
68
+ if width < 0.01 or height < 0.01:
69
+ return False
70
+
71
+ return True
72
+
73
+
74
+ def split_frame_into_tiles(frame, rows, cols):
75
+ """Split a frame into a grid of tiles."""
76
+ height, width = frame.shape[:2]
77
+ tile_height = height // rows
78
+ tile_width = width // cols
79
+ tiles = []
80
+ tile_positions = []
81
+
82
+ for i in range(rows):
83
+ for j in range(cols):
84
+ y1 = i * tile_height
85
+ y2 = (i + 1) * tile_height if i < rows - 1 else height
86
+ x1 = j * tile_width
87
+ x2 = (j + 1) * tile_width if j < cols - 1 else width
88
+
89
+ tile = frame[y1:y2, x1:x2]
90
+ tiles.append(tile)
91
+ tile_positions.append((x1, y1, x2, y2))
92
+
93
+ return tiles, tile_positions
94
+
95
+
96
+ def convert_tile_coords_to_frame(box, tile_pos, frame_shape):
97
+ """Convert coordinates from tile space to frame space."""
98
+ frame_height, frame_width = frame_shape[:2]
99
+ tile_x1, tile_y1, tile_x2, tile_y2 = tile_pos
100
+ tile_width = tile_x2 - tile_x1
101
+ tile_height = tile_y2 - tile_y1
102
+
103
+ x1_tile_abs = box[0] * tile_width
104
+ y1_tile_abs = box[1] * tile_height
105
+ x2_tile_abs = box[2] * tile_width
106
+ y2_tile_abs = box[3] * tile_height
107
+
108
+ x1_frame_abs = tile_x1 + x1_tile_abs
109
+ y1_frame_abs = tile_y1 + y1_tile_abs
110
+ x2_frame_abs = tile_x1 + x2_tile_abs
111
+ y2_frame_abs = tile_y1 + y2_tile_abs
112
+
113
+ x1_norm = x1_frame_abs / frame_width
114
+ y1_norm = y1_frame_abs / frame_height
115
+ x2_norm = x2_frame_abs / frame_width
116
+ y2_norm = y2_frame_abs / frame_height
117
+
118
+ x1_norm = max(0.0, min(1.0, x1_norm))
119
+ y1_norm = max(0.0, min(1.0, y1_norm))
120
+ x2_norm = max(0.0, min(1.0, x2_norm))
121
+ y2_norm = max(0.0, min(1.0, y2_norm))
122
+
123
+ return [x1_norm, y1_norm, x2_norm, y2_norm]
124
+
125
+
126
+ def merge_tile_detections(tile_detections, iou_threshold=0.5):
127
+ """Merge detections from different tiles using NMS-like approach."""
128
+ if not tile_detections:
129
+ return []
130
+
131
+ all_boxes = []
132
+ all_keywords = []
133
+
134
+ # Collect all boxes and their keywords
135
+ for detections in tile_detections:
136
+ for box, keyword in detections:
137
+ all_boxes.append(box)
138
+ all_keywords.append(keyword)
139
+
140
+ if not all_boxes:
141
+ return []
142
+
143
+ # Convert to numpy for easier processing
144
+ boxes = np.array(all_boxes)
145
+
146
+ # Calculate areas
147
+ x1 = boxes[:, 0]
148
+ y1 = boxes[:, 1]
149
+ x2 = boxes[:, 2]
150
+ y2 = boxes[:, 3]
151
+ areas = (x2 - x1) * (y2 - y1)
152
+
153
+ # Sort boxes by area
154
+ order = areas.argsort()[::-1]
155
+
156
+ keep = []
157
+ while order.size > 0:
158
+ i = order[0]
159
+ keep.append(i)
160
+
161
+ if order.size == 1:
162
+ break
163
+
164
+ # Calculate IoU with rest of boxes
165
+ xx1 = np.maximum(x1[i], x1[order[1:]])
166
+ yy1 = np.maximum(y1[i], y1[order[1:]])
167
+ xx2 = np.minimum(x2[i], x2[order[1:]])
168
+ yy2 = np.minimum(y2[i], y2[order[1:]])
169
+
170
+ w = np.maximum(0.0, xx2 - xx1)
171
+ h = np.maximum(0.0, yy2 - yy1)
172
+ inter = w * h
173
+
174
+ ovr = inter / (areas[i] + areas[order[1:]] - inter)
175
+
176
+ # Get indices of boxes with IoU less than threshold
177
+ inds = np.where(ovr <= iou_threshold)[0]
178
+ order = order[inds + 1]
179
+
180
+ return [(all_boxes[i], all_keywords[i]) for i in keep]
181
+
182
+
183
+ def detect_ads_in_frame(model, tokenizer, image, detect_keyword, rows=1, cols=1):
184
+ """Detect objects in a frame using grid-based detection."""
185
+ if rows == 1 and cols == 1:
186
+ return detect_ads_in_frame_single(model, tokenizer, image, detect_keyword)
187
+
188
+ # If we got a numpy BGR frame, convert it to RGB (tiles are turned into PIL Images below)
189
+ if not isinstance(image, Image.Image):
190
+ image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
191
+
192
+ # Split frame into tiles
193
+ tiles, tile_positions = split_frame_into_tiles(image, rows, cols)
194
+
195
+ # Process each tile
196
+ tile_detections = []
197
+ for tile, tile_pos in zip(tiles, tile_positions):
198
+ # Convert tile to PIL Image
199
+ tile_pil = Image.fromarray(tile)
200
+
201
+ # Detect objects in tile
202
+ response = model.detect(tile_pil, detect_keyword)
203
+
204
+ if response and "objects" in response and response["objects"]:
205
+ objects = response["objects"]
206
+ tile_objects = []
207
+
208
+ for obj in objects:
209
+ if all(k in obj for k in ["x_min", "y_min", "x_max", "y_max"]):
210
+ box = [obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"]]
211
+
212
+ if is_valid_box(box):
213
+ # Convert tile coordinates to frame coordinates
214
+ frame_box = convert_tile_coords_to_frame(
215
+ box, tile_pos, image.shape
216
+ )
217
+ tile_objects.append((frame_box, detect_keyword))
218
+
219
+ if tile_objects: # Only append if we found valid objects
220
+ tile_detections.append(tile_objects)
221
+
222
+ # Merge detections from all tiles
223
+ merged_detections = merge_tile_detections(tile_detections)
224
+ return merged_detections
225
+
226
+
227
+ def detect_ads_in_frame_single(model, tokenizer, image, detect_keyword):
228
+ """Single-frame detection function."""
229
+ detected_objects = []
230
+
231
+ # Convert numpy array to PIL Image if needed
232
+ if not isinstance(image, Image.Image):
233
+ image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
234
+
235
+ # Detect objects
236
+ response = model.detect(image, detect_keyword)
237
+
238
+ # Check if we have valid objects
239
+ if response and "objects" in response and response["objects"]:
240
+ objects = response["objects"]
241
+
242
+ for obj in objects:
243
+ if all(k in obj for k in ["x_min", "y_min", "x_max", "y_max"]):
244
+ box = [obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"]]
245
+ # If box is valid (not full-frame), add it
246
+ if is_valid_box(box):
247
+ detected_objects.append((box, detect_keyword))
248
+
249
+ return detected_objects
250
+
251
+
252
+ def draw_hitmarker(
253
+ frame, center_x, center_y, size=HITMARKER_SIZE, color=HITMARKER_COLOR, shadow=True
254
+ ):
255
+ """Draw a COD-style hitmarker cross with more space in the middle."""
256
+ half_size = size // 2
257
+
258
+ # Draw shadow first if enabled
259
+ if shadow:
260
+ # Top-left to center shadow
261
+ cv2.line(
262
+ frame,
263
+ (
264
+ center_x - half_size + HITMARKER_SHADOW_OFFSET,
265
+ center_y - half_size + HITMARKER_SHADOW_OFFSET,
266
+ ),
267
+ (
268
+ center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
269
+ center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
270
+ ),
271
+ HITMARKER_SHADOW_COLOR,
272
+ HITMARKER_THICKNESS,
273
+ )
274
+ # Top-right to center shadow
275
+ cv2.line(
276
+ frame,
277
+ (
278
+ center_x + half_size + HITMARKER_SHADOW_OFFSET,
279
+ center_y - half_size + HITMARKER_SHADOW_OFFSET,
280
+ ),
281
+ (
282
+ center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
283
+ center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
284
+ ),
285
+ HITMARKER_SHADOW_COLOR,
286
+ HITMARKER_THICKNESS,
287
+ )
288
+ # Bottom-left to center shadow
289
+ cv2.line(
290
+ frame,
291
+ (
292
+ center_x - half_size + HITMARKER_SHADOW_OFFSET,
293
+ center_y + half_size + HITMARKER_SHADOW_OFFSET,
294
+ ),
295
+ (
296
+ center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
297
+ center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
298
+ ),
299
+ HITMARKER_SHADOW_COLOR,
300
+ HITMARKER_THICKNESS,
301
+ )
302
+ # Bottom-right to center shadow
303
+ cv2.line(
304
+ frame,
305
+ (
306
+ center_x + half_size + HITMARKER_SHADOW_OFFSET,
307
+ center_y + half_size + HITMARKER_SHADOW_OFFSET,
308
+ ),
309
+ (
310
+ center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
311
+ center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET,
312
+ ),
313
+ HITMARKER_SHADOW_COLOR,
314
+ HITMARKER_THICKNESS,
315
+ )
316
+
317
+ # Draw main hitmarker
318
+ # Top-left to center
319
+ cv2.line(
320
+ frame,
321
+ (center_x - half_size, center_y - half_size),
322
+ (center_x - HITMARKER_GAP, center_y - HITMARKER_GAP),
323
+ color,
324
+ HITMARKER_THICKNESS,
325
+ )
326
+ # Top-right to center
327
+ cv2.line(
328
+ frame,
329
+ (center_x + half_size, center_y - half_size),
330
+ (center_x + HITMARKER_GAP, center_y - HITMARKER_GAP),
331
+ color,
332
+ HITMARKER_THICKNESS,
333
+ )
334
+ # Bottom-left to center
335
+ cv2.line(
336
+ frame,
337
+ (center_x - half_size, center_y + half_size),
338
+ (center_x - HITMARKER_GAP, center_y + HITMARKER_GAP),
339
+ color,
340
+ HITMARKER_THICKNESS,
341
+ )
342
+ # Bottom-right to center
343
+ cv2.line(
344
+ frame,
345
+ (center_x + half_size, center_y + half_size),
346
+ (center_x + HITMARKER_GAP, center_y + HITMARKER_GAP),
347
+ color,
348
+ HITMARKER_THICKNESS,
349
+ )
350
+
351
+
352
+ def draw_ad_boxes(frame, detected_objects, detect_keyword, box_style="censor"):
353
+ """Draw detection visualizations over detected objects.
354
+
355
+ Args:
356
+ frame: The video frame to draw on
357
+ detected_objects: List of (box, keyword) tuples
358
+ detect_keyword: The detection keyword
359
+ box_style: Visualization style ('censor', 'bounding-box', or 'hitmarker')
360
+ """
361
+ height, width = frame.shape[:2]
362
+
363
+ for box, keyword in detected_objects:
364
+ try:
365
+ # Convert normalized coordinates to pixel coordinates
366
+ x1 = int(box[0] * width)
367
+ y1 = int(box[1] * height)
368
+ x2 = int(box[2] * width)
369
+ y2 = int(box[3] * height)
370
+
371
+ # Ensure coordinates are within frame boundaries
372
+ x1 = max(0, min(x1, width - 1))
373
+ y1 = max(0, min(y1, height - 1))
374
+ x2 = max(0, min(x2, width - 1))
375
+ y2 = max(0, min(y2, height - 1))
376
+
377
+ # Only draw if box has reasonable size
378
+ if x2 > x1 and y2 > y1:
379
+ if box_style == "censor":
380
+ # Draw solid black rectangle
381
+ cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)
382
+ elif box_style == "bounding-box":
383
+ # Draw red rectangle with thicker line
384
+ cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)
385
+
386
+ # Add label with background
387
+ label = detect_keyword # Use exact capitalization
388
+ label_size = cv2.getTextSize(label, FONT, 0.7, 2)[0]
389
+ cv2.rectangle(
390
+ frame, (x1, y1 - 25), (x1 + label_size[0], y1), (0, 0, 255), -1
391
+ )
392
+ cv2.putText(
393
+ frame,
394
+ label,
395
+ (x1, y1 - 6),
396
+ FONT,
397
+ 0.7,
398
+ (255, 255, 255),
399
+ 2,
400
+ cv2.LINE_AA,
401
+ )
402
+ elif box_style == "hitmarker":
403
+ # Calculate center of the box
404
+ center_x = (x1 + x2) // 2
405
+ center_y = (y1 + y2) // 2
406
+
407
+ # Draw hitmarker at the center
408
+ draw_hitmarker(frame, center_x, center_y)
409
+
410
+ # Optional: Add small label above hitmarker
411
+ label = detect_keyword # Use exact capitalization
412
+ label_size = cv2.getTextSize(label, FONT, 0.5, 1)[0]
413
+ cv2.putText(
414
+ frame,
415
+ label,
416
+ (center_x - label_size[0] // 2, center_y - HITMARKER_SIZE - 5),
417
+ FONT,
418
+ 0.5,
419
+ HITMARKER_COLOR,
420
+ 1,
421
+ cv2.LINE_AA,
422
+ )
423
+ except Exception as e:
424
+ print(f"Error drawing {box_style} style box: {str(e)}")
425
+
426
+ return frame
427
+
428
+
429
+ def filter_temporal_outliers(detections_dict):
430
+ """Filter out extremely large detections that take up most of the frame.
431
+ Only keeps detections that are reasonable in size.
432
+
433
+ Args:
434
+ detections_dict: Dictionary of {frame_number: [(box, keyword), ...]}
435
+ """
436
+ filtered_detections = {}
437
+
438
+ for t, detections in detections_dict.items():
439
+ # Only keep detections that aren't too large
440
+ valid_detections = []
441
+ for box, keyword in detections:
442
+ # Calculate box size as percentage of frame
443
+ width = box[2] - box[0]
444
+ height = box[3] - box[1]
445
+ area = width * height
446
+
447
+ # If box is less than 90% of frame, keep it
448
+ if area < 0.9:
449
+ valid_detections.append((box, keyword))
450
+
451
+ if valid_detections:
452
+ filtered_detections[t] = valid_detections
453
+
454
+ return filtered_detections
455
+
456
+
457
+ def describe_frames(
458
+ video_path, model, tokenizer, detect_keyword, test_mode=False, rows=1, cols=1
459
+ ):
460
+ """Extract and detect objects in frames."""
461
+ props = get_video_properties(video_path)
462
+ fps = props["fps"]
463
+
464
+ # If in test mode, only process first 3 seconds
465
+ if test_mode:
466
+ frame_count = min(int(fps * TEST_MODE_DURATION), props["frame_count"])
467
+ else:
468
+ frame_count = props["frame_count"]
469
+
470
+ ad_detections = {} # Store detection results by frame number
471
+
472
+ print("Extracting frames and detecting objects...")
473
+ video = cv2.VideoCapture(video_path)
474
+
475
+ # Process every frame
476
+ frame_count_processed = 0
477
+ with tqdm(total=frame_count) as pbar:
478
+ while frame_count_processed < frame_count:
479
+ ret, frame = video.read()
480
+ if not ret:
481
+ break
482
+
483
+ # Detect objects in the frame
484
+ detected_objects = detect_ads_in_frame(
485
+ model, tokenizer, frame, detect_keyword, rows=rows, cols=cols
486
+ )
487
+
488
+ # Store results for every frame, even if empty
489
+ ad_detections[frame_count_processed] = detected_objects
490
+
491
+ frame_count_processed += 1
492
+ pbar.update(1)
493
+
494
+ video.release()
495
+
496
+ if frame_count_processed == 0:
497
+ print("No frames could be read from video")
498
+ return {}
499
+
500
+ # Filter out only extremely large detections
501
+ ad_detections = filter_temporal_outliers(ad_detections)
502
+ return ad_detections
503
+
504
+
505
+ def create_detection_video(
506
+ video_path,
507
+ ad_detections,
508
+ detect_keyword,
509
+ output_path=None,
510
+ ffmpeg_preset="medium",
511
+ test_mode=False,
512
+ box_style="censor",
513
+ ):
514
+ """Create video with detection boxes."""
515
+ if output_path is None:
516
+ # Create outputs directory if it doesn't exist
517
+ outputs_dir = os.path.join(
518
+ os.path.dirname(os.path.abspath(__file__)), "outputs"
519
+ )
520
+ os.makedirs(outputs_dir, exist_ok=True)
521
+
522
+ # Clean the detect_keyword for filename
523
+ safe_keyword = "".join(
524
+ x for x in detect_keyword if x.isalnum() or x in (" ", "_", "-")
525
+ )
526
+ safe_keyword = safe_keyword.replace(" ", "_")
527
+
528
+ # Create output filename
529
+ base_name = os.path.splitext(os.path.basename(video_path))[0]
530
+ output_path = os.path.join(
531
+ outputs_dir, f"{box_style}_{safe_keyword}_{base_name}.mp4"
532
+ )
533
+
534
+ print(f"Will save output to: {output_path}")
535
+
536
+ props = get_video_properties(video_path)
537
+ fps, width, height = props["fps"], props["width"], props["height"]
538
+
539
+ # If in test mode, only process first few seconds
540
+ if test_mode:
541
+ frame_count = min(int(fps * TEST_MODE_DURATION), props["frame_count"])
542
+ else:
543
+ frame_count = props["frame_count"]
544
+
545
+ video = cv2.VideoCapture(video_path)
546
+
547
+ # Create temp output path by adding _temp before the extension
548
+ base, ext = os.path.splitext(output_path)
549
+ temp_output = f"{base}_temp{ext}"
550
+
551
+ out = cv2.VideoWriter(
552
+ temp_output, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
553
+ )
554
+
555
+ print("Creating detection video...")
556
+ frame_count_processed = 0
557
+
558
+ with tqdm(total=frame_count) as pbar:
559
+ while frame_count_processed < frame_count:
560
+ ret, frame = video.read()
561
+ if not ret:
562
+ break
563
+
564
+ # Get detections for this exact frame
565
+ if frame_count_processed in ad_detections:
566
+ current_detections = ad_detections[frame_count_processed]
567
+ if current_detections:
568
+ frame = draw_ad_boxes(
569
+ frame, current_detections, detect_keyword, box_style=box_style
570
+ )
571
+
572
+ out.write(frame)
573
+ frame_count_processed += 1
574
+ pbar.update(1)
575
+
576
+ video.release()
577
+ out.release()
578
+
579
+ # Convert to web-compatible format more efficiently
580
+ try:
581
+ subprocess.run(
582
+ [
583
+ "ffmpeg",
584
+ "-y",
585
+ "-i",
586
+ temp_output,
587
+ "-c:v",
588
+ "libx264",
589
+ "-preset",
590
+ ffmpeg_preset,
591
+ "-crf",
592
+ "23",
593
+ "-movflags",
594
+ "+faststart", # Better web playback
595
+ "-loglevel",
596
+ "error",
597
+ output_path,
598
+ ],
599
+ check=True,
600
+ )
601
+
602
+ os.remove(temp_output) # Remove the temporary file
603
+
604
+ if not os.path.exists(output_path):
605
+ print(
606
+ f"Warning: FFmpeg completed but output file not found at {output_path}"
607
+ )
608
+ return None
609
+
610
+ return output_path
611
+
612
+ except subprocess.CalledProcessError as e:
613
+ print(f"Error running FFmpeg: {str(e)}")
614
+ if os.path.exists(temp_output):
615
+ os.remove(temp_output)
616
+ return None
617
+
618
+
619
+ def process_video(
620
+ video_path,
621
+ detect_keyword,
622
+ test_mode=False,
623
+ ffmpeg_preset="medium",
624
+ rows=1,
625
+ cols=1,
626
+ box_style="censor",
627
+ ):
628
+ """Process a single video file."""
629
+ print(f"\nProcessing: {video_path}")
630
+ print(f"Looking for: {detect_keyword}")
631
+
632
+ # Load model
633
+ print("Loading Moondream model...")
634
+ model, tokenizer = load_moondream()
635
+
636
+ # Process video - detect objects
637
+ ad_detections = describe_frames(
638
+ video_path, model, tokenizer, detect_keyword, test_mode, rows, cols
639
+ )
640
+
641
+ # Create video with detection boxes
642
+ output_path = create_detection_video(
643
+ video_path,
644
+ ad_detections,
645
+ detect_keyword,
646
+ ffmpeg_preset=ffmpeg_preset,
647
+ test_mode=test_mode,
648
+ box_style=box_style,
649
+ )
650
+
651
+ if output_path is None:
652
+ print("\nError: Failed to create output video")
653
+ return None
654
+
655
+ print(f"\nOutput saved to: {output_path}")
656
+ return output_path
657
+
658
+
659
+ def main():
660
+ """Process all videos in the inputs directory."""
661
+ parser = argparse.ArgumentParser(
662
+ description="Detect objects in videos using Moondream2"
663
+ )
664
+ parser.add_argument(
665
+ "--test", action="store_true", help="Process only first 3 seconds of each video"
666
+ )
667
+ parser.add_argument(
668
+ "--preset",
669
+ choices=FFMPEG_PRESETS,
670
+ default="medium",
671
+ help="FFmpeg encoding preset (default: medium). Faster presets = lower quality",
672
+ )
673
+ parser.add_argument(
674
+ "--detect",
675
+ type=str,
676
+ default="face",
677
+ help='Object to detect in the video (default: face, use --detect "thing to detect" to override)',
678
+ )
679
+ parser.add_argument(
680
+ "--rows",
681
+ type=int,
682
+ default=1,
683
+ help="Number of rows to split each frame into (default: 1)",
684
+ )
685
+ parser.add_argument(
686
+ "--cols",
687
+ type=int,
688
+ default=1,
689
+ help="Number of columns to split each frame into (default: 1)",
690
+ )
691
+ parser.add_argument(
692
+ "--box-style",
693
+ choices=["censor", "bounding-box", "hitmarker"],
694
+ default="censor",
695
+ help="Style of detection visualization (default: censor)",
696
+ )
697
+ args = parser.parse_args()
698
+
699
+ input_dir = "inputs"
700
+ os.makedirs(input_dir, exist_ok=True)
701
+ os.makedirs("outputs", exist_ok=True)
702
+
703
+ video_files = [
704
+ f
705
+ for f in os.listdir(input_dir)
706
+ if f.lower().endswith((".mp4", ".avi", ".mov", ".mkv", ".webm"))
707
+ ]
708
+
709
+ if not video_files:
710
+ print("No video files found in 'inputs' directory")
711
+ return
712
+
713
+ print(f"Found {len(video_files)} videos to process")
714
+ print(f"Will detect: {args.detect}")
715
+ if args.test:
716
+ print("Running in test mode - processing only first 3 seconds of each video")
717
+ print(f"Using FFmpeg preset: {args.preset}")
718
+ print(f"Grid size: {args.rows}x{args.cols}")
719
+ print(f"Box style: {args.box_style}")
720
+
721
+ success_count = 0
722
+ for video_file in video_files:
723
+ video_path = os.path.join(input_dir, video_file)
724
+ output_path = process_video(
725
+ video_path,
726
+ args.detect,
727
+ test_mode=args.test,
728
+ ffmpeg_preset=args.preset,
729
+ rows=args.rows,
730
+ cols=args.cols,
731
+ box_style=args.box_style,
732
+ )
733
+ if output_path:
734
+ success_count += 1
735
+
736
+ print(
737
+ f"\nProcessing complete. Successfully processed {success_count} out of {len(video_files)} videos."
738
+ )
739
+
740
+
741
+ if __name__ == "__main__":
742
+ main()
packages.txt ADDED
@@ -0,0 +1,2 @@
1
+ libvips
2
+ ffmpeg
requirements.txt ADDED
@@ -0,0 +1,15 @@
1
+ gradio>=4.0.0
2
+ torch
3
+ transformers
4
+ opencv-python
5
+ pillow
6
+ numpy
7
+ tqdm
8
+ ffmpeg-python
9
+ einops
10
+ pyvips
11
+ accelerate
12
+ # for spaces
13
+ --extra-index-url https://download.pytorch.org/whl/cu113
14
+ torch
15
+ spaces