Real-time custom advertising for event livestreams, designed to integrate with personalized recommendation systems.

Original broadcast footage sample
Core technical hurdles we need to solve for reliable, real-time ad replacement.
Supponor (NHL DED)
IR strips in dasherboards + AI keying. $1.28B ad revenue (2023-24). Requires proprietary hardware at every venue.
uniqFEED (AdApt)
Software-based, deployed at major tennis events. Requires trained operators and custom-trained CV models per sport.
Vizrt / Viz Arena
Camera tracking hardware (encoder heads). Real-time but hardware-dependent and operator-intensive.
Homography estimation — focused on soccer field registration (Nie et al. WACV 2021, Homayounfar CVPR 2017). Not applied to ad replacement.
SAM 2 in sports — used for ball tracking and player tracking. Not applied to advertisement board segmentation.
Virtual ad insertion research — soccer-only, predates foundation models, uses hand-crafted features.
Jittering
Overlay instability across frames — tracking drift causes visible shaking in replaced banners.
Occlusion
Players walking in front of banners cause artifacts — overlays render on top of or through players.
Speed
Processing speed below real-time thresholds, limiting viability for live broadcast scenarios.
No prior work combines foundation model segmentation with classical CV for end-to-end sports ad replacement.
Each component exists individually — the pipeline does not.
Foundation models (SAM 2) eliminate the need for hardware, custom training, and continuous operators for the segmentation problem. The remaining challenges — perspective geometry, compositing, occlusion — are solved with specific classical CV techniques.
Stable Tracking
Homography fitting combined with optical flow produces consistent overlays with minimal jitter across frames.
Occlusion Handling
Dedicated player segmentation pass generates per-pixel masks, allowing overlays to render correctly behind players.
Targeting 30 fps
Early results suggest real-time performance is achievable with GPU acceleration. Benchmarks in progress.
No prior work assembles this specific pipeline — zero-shot segmentation + classical CV geometry + per-pixel occlusion handling.
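The per-pixel occlusion step can be sketched in a few lines. This is a minimal illustration that assumes the overlay has already been warped into frame space and that both masks are boolean arrays. The function name and the hard paste are illustrative only; the pipeline's own compositor blends rather than pastes.

```python
import numpy as np

def composite_behind_players(frame, overlay, banner_mask, player_mask):
    """Paste `overlay` into the banner region, except where a player covers it.

    frame, overlay: (H, W, 3) uint8 images, overlay already warped into frame space.
    banner_mask, player_mask: (H, W) boolean masks.
    """
    # Replace only banner pixels that are not occluded by a player.
    visible = banner_mask & ~player_mask
    out = frame.copy()
    out[visible] = overlay[visible]
    return out

# Tiny synthetic example: banner on the top two rows, player in column 1.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
overlay = np.full((4, 4, 3), 255, dtype=np.uint8)
banner_mask = np.zeros((4, 4), dtype=bool)
banner_mask[:2, :] = True
player_mask = np.zeros((4, 4), dtype=bool)
player_mask[:, 1] = True

result = composite_behind_players(frame, overlay, banner_mask, player_mask)
```

Pixels where the player mask is set keep the original frame content, so the player stays in front of the replaced banner.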
How we got from the midterm-era manually-clicked corners to a pipeline that tracks the camera through a player walkover. Phase 2 failed; Phase 3 worked. The simpler answer won on visual review.
| System | Hardware | Training | Operators | Occlusion | Stability | Latency |
|---|---|---|---|---|---|---|
| Supponor (NHL) | IR strips | Proprietary | Central hub | No (clipping) | High | Real-time |
| uniqFEED | None | Custom CV | Trained ops | Partial (bbox) | High | Near real-time |
| Vizrt / Viz Arena | Camera HW | Required | Required | Partial (bbox) | High | Real-time |
| MELIC (Prior) | None | Custom | Manual | Limited | Jittering | Below real-time |
| This Pipeline | None | None (SAM 2) | 1 click | Per-pixel masks | Stable | 2.68 fps (H200)* |
Fully software-based · Zero training · Minimal operator input · Per-pixel occlusion · Stable tracking
* Measured end-to-end on H200, 767-frame demo clip; ~30 fps real-time target is future work
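To put the measured throughput next to the target, the arithmetic below uses only the figures quoted above. The one added assumption is that the demo clip plays back at 30 fps. Under that assumption the 767-frame clip lasts about 25.6 s, takes about 286 s to process, and needs roughly an 11x speedup to reach real time.

```python
measured_fps = 2.68      # end-to-end throughput reported on the H200
num_frames = 767         # length of the demo clip in frames
target_fps = 30.0        # stated real-time target

processing_seconds = num_frames / measured_fps
speedup_needed = target_fps / measured_fps
clip_seconds = num_frames / target_fps  # ASSUMPTION: the clip plays at 30 fps

print(f"processing time: {processing_seconds:.0f} s")
print(f"clip duration at 30 fps: {clip_seconds:.1f} s")
print(f"speedup needed: {speedup_needed:.1f}x")
```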
We use Meta's SAM 2 as the backbone for banner segmentation: it requires manual ROI selection on frame 0 (a single click, box, or mask prompt) and then tracks and segments the region across all subsequent frames.

SAM 2 Architecture — Meta AI, “SAM 2: Segment Anything in Images and Videos” (arXiv:2408.00714)
Detect and segment players on the court to ensure overlaid banners render behind them, preserving a natural viewing experience.
Detect advertising banners across different regions of the court — ground-level boards, back wall panels, net-mounted banners, umpire stand signage, and more.
SAM 3 introduces Promptable Concept Segmentation: given a text prompt or positive/negative exemplar prompts, it detects, segments, and tracks all instances of a concept across video frames, enabling automatic ROI detection and tracking.

SAM 3 Architecture — Meta AI, “SAM 3: Segment Anything with Concepts” (arXiv:2511.16719)
A light version that runs SAM 3 (Detector + Tracker) only when the scene actually changes, and reuses the previous masks (Tracker) otherwise. This brings throughput from ~1 fps (full SAM 3) to ~3–4 fps on the same A100-80GB hardware.
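The gating idea can be sketched as follows. This is a minimal illustration, not the project's implementation: the scene-change test (mean absolute pixel difference), the threshold value, and the `detect_fn` / `track_fn` callables are all assumptions standing in for the real SAM 3 Detector and Tracker calls.

```python
import numpy as np

def scene_changed(prev_gray, gray, threshold=12.0):
    """Return True when the mean absolute pixel difference exceeds `threshold`."""
    diff = np.abs(gray.astype(np.float32) - prev_gray.astype(np.float32))
    return float(diff.mean()) > threshold

def run_gated(frames, detect_fn, track_fn, threshold=12.0):
    """Call the expensive detector only on the first frame and on scene changes."""
    masks, calls = [], {"detect": 0, "track": 0}
    prev = None
    for gray in frames:
        if prev is None or scene_changed(prev, gray, threshold):
            current = detect_fn(gray)            # full detector + tracker pass
            calls["detect"] += 1
        else:
            current = track_fn(gray, masks[-1])  # cheap pass reusing the last masks
            calls["track"] += 1
        masks.append(current)
        prev = gray
    return masks, calls

# Hypothetical stand-ins; real code would call into SAM 3 here.
detect = lambda g: g > 128
track = lambda g, prev_mask: prev_mask

dark = np.zeros((8, 8), dtype=np.uint8)
bright = np.full((8, 8), 200, dtype=np.uint8)
frames = [dark, dark, dark, bright, bright]
masks, calls = run_gated(frames, detect, track)
print(calls)
```

In this toy sequence the detector runs on the first frame and on the cut from dark to bright; the other three frames take the cheap path.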
prompt “logo”
prompt “sponsor logo on fixed advertising board”
segmented objects 10
Understand the geometry and perspective of the banners from the camera's viewpoint so that overlays and modifications appear realistic to the viewer in the final livestream.





The vanishing point constrains how banner edges converge — it's not a true parallelogram but a perspective quadrilateral.
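A small numerical sketch makes the point concrete. It fits a homography from a flat ad rectangle to a perspective quadrilateral with the direct linear transform, then reads the vanishing point of the banner's horizontal edges off the first column of the matrix. The corner coordinates are made up for illustration, and the function names are not taken from the pipeline.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: 3x3 H such that dst ~ H @ src for 4+ point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def project(h, pts):
    """Apply H to 2D points and divide out the homogeneous coordinate."""
    pts_h = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    out = pts_h @ h.T
    return out[:, :2] / out[:, 2:3]

# Flat 400x100 ad artwork and an illustrative banner quad receding to the right.
src = [(0, 0), (400, 0), (400, 100), (0, 100)]
dst = [(100, 200), (500, 220), (500, 300), (100, 320)]

H = fit_homography(src, dst)
# The image of the direction (1, 0) is the first column of H.
vp = H[:2, 0] / H[2, 0]
print(vp)
```

The top and bottom edges of the quad are not parallel: extended, they meet at the vanishing point, which is why the warp has to be a full homography rather than an affine (parallelogram) map.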









Before the BallTrackerNet learned-keypoint detector, the per-frame court geometry ran on a classical line-based estimator: Hough lines on court markings → cluster by orientation → estimate the depth vanishing point → fit the homography.
Wall banners and side panels are fitted with rays projecting from the depth or width vanishing point, which gives geometrically correct perspective without solving for the full camera intrinsics.
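One common way to build such rays is sketched below, under the assumption that court-line segments for one orientation cluster are already available. The vanishing point is the least-squares intersection of the lines in homogeneous coordinates, and a banner edge is then a ray from that point through one known corner. The helper names and the example coordinates are hypothetical, not the VP-constrained fitter itself.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares intersection of line segments [(p, q), ...]."""
    lines = np.array([line_through(p, q) for p, q in segments])
    # Scale each line to a unit normal so all lines contribute comparably.
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # The point minimising the algebraic residual is the last right singular vector.
    _, _, vt = np.linalg.svd(lines)
    vp = vt[-1]
    return vp[:2] / vp[2]

def point_on_ray(vp, anchor, x):
    """Point at horizontal position x on the ray from vp through anchor."""
    anchor = np.asarray(anchor, dtype=float)
    t = (x - vp[0]) / (anchor[0] - vp[0])
    return vp + t * (anchor - vp)

# Three illustrative depth-direction segments that converge at (320, 100).
segments = [((0, 400), (160, 250)),
            ((640, 400), (480, 250)),
            ((320, 480), (320, 300))]
vp = vanishing_point(segments)

# Extend a banner edge from a known corner out to the left image border.
edge_end = point_on_ray(vp, anchor=(160, 250), x=0)
```

This sketch assumes a finite vanishing point; near-parallel image lines would need separate handling.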
The Hough estimator is noisy from frame to frame: even with smoothing, projected corners deviate 5–15 px between frames on a static camera. Phase 2 swept tolerance_px ∈ {2..30} under hybrid_lock; only the always-locked V68 baseline passed all gates.
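The deck does not spell out how hybrid_lock works, so the sketch below is only one plausible reading and should be treated as an assumption throughout: keep the locked corners while the per-frame estimate stays within tolerance_px of them, and follow the estimate once it moves further. It illustrates why the tolerance has to sit above the 5–15 px estimator noise.

```python
import numpy as np

def hybrid_lock(locked_corners, estimated_corners, tolerance_px):
    """Return locked corners unless the estimate has moved beyond tolerance_px.

    ASSUMPTION: one plausible reading of `hybrid_lock`, not the project's
    actual implementation.
    """
    locked = np.asarray(locked_corners, dtype=float)
    est = np.asarray(estimated_corners, dtype=float)
    max_dev = np.linalg.norm(est - locked, axis=1).max()
    if max_dev <= tolerance_px:
        return locked, False   # treat the deviation as estimator noise
    return est, True           # treat it as real camera motion

# Illustrative corners: ~7 px of jitter versus a 60 px camera pan.
locked = [(100, 200), (500, 220), (500, 300), (100, 320)]
noisy = [(x + 6, y - 4) for x, y in locked]
panned = [(x + 60, y) for x, y in locked]

_, moved_noise_30 = hybrid_lock(locked, noisy, tolerance_px=30)
_, moved_noise_2 = hybrid_lock(locked, noisy, tolerance_px=2)
_, moved_pan_30 = hybrid_lock(locked, panned, tolerance_px=30)
```

With a 2 px tolerance the jittery estimate unlocks the overlay; at 30 px it stays locked while the pan still releases it.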
A learned 14-keypoint detector (BallTrackerNet) replaces the Hough+VP estimator in the final pipeline. The VP-constrained fitter code lives on feat/court-geometry-stabilisation.

depth VP estimated from court markings



Real baked-in ads on the back banners + floor wordmark (Kia, YoPRO, Melbourne).
Top row = unmodified original broadcast (real baked-in ads, our quality bar). Bottom row = our composite.




3 distinct back-wall banner positions, same frame (f0350). Temporal SSIM 0.9999 · jitter 0.291, visually identical to V68 gold.
Auto-detected walkover window: frames 685–723 · 5 key frames span entry → contact → exit





The final-run results on one screen: what the eval framework measured on experiments/2026-05-05_18-38-39_hull_H200/.
one-liner
V68 manually-clicked corners + BallTrackerNet dynamic homography + hybrid_lock at 30-px tolerance + V68's LED-blend compositor, five placements, broadcast-stable, occluded correctly through a player walkover.
Every candidate gets scored at three layers before it can ship: deterministic numerical gates, a structured visual rubric, and direct side-by-side review against the original baked-in ads.
Per-region scorecards. Each region must pass all gated metrics; one warning-only set (ΔE, noise variance, edge sharpness) surfaces but doesn't gate.
Exit code 0 = pass · 2 = scorecard fail · 3 = regression vs gold reference
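A minimal sketch of how per-region scorecards could map to these exit codes. The scorecard layout, the function name, and the order in which the two failure cases are checked are assumptions for illustration. Only the three exit codes and the rule that warning-only metrics never gate come from the text above.

```python
EXIT_PASS, EXIT_SCORECARD_FAIL, EXIT_REGRESSION = 0, 2, 3

def exit_code(scorecards, regressed_vs_gold):
    """Map per-region scorecards to a process exit code.

    scorecards: {region: {"gated": {metric: passed}, "warnings": {metric: passed}}}
    Warning-only metrics are reported elsewhere and never affect the result.
    ASSUMPTION: scorecard failures take precedence over regressions.
    """
    for card in scorecards.values():
        if not all(card["gated"].values()):
            return EXIT_SCORECARD_FAIL
    if regressed_vs_gold:
        return EXIT_REGRESSION
    return EXIT_PASS

# Hypothetical scorecards for two regions; metric keys are illustrative.
scorecards = {
    "back_wall": {"gated": {"temporal_ssim": True, "jitter": True},
                  "warnings": {"delta_e": False}},
    "floor": {"gated": {"temporal_ssim": True, "jitter": True},
              "warnings": {"edge_sharpness": True}},
}
code = exit_code(scorecards, regressed_vs_gold=False)
# A CLI wrapper would finish with sys.exit(code).
```

The failed ΔE warning on the back wall does not change the result, which mirrors the "surfaces but doesn't gate" behaviour.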
Per-region 1–5 score across 13 dimensions. Inputs are paired top=original / bottom=composite crop strips so the original baked-in ad is always the comparison anchor.
Human review of the actual output video against the original broadcast. The accept/reject vote; tie-breaks the numerical and rubric layers when they disagree.
example tie-break
A rubric-favorite candidate (layered shadow synthesis + aggressive erase_text + tight banner padding) scored 5/5 on the v2 user-flagged dimensions and passed every numerical gate.
On direct viewing: the floor shadow read as a blob, the MELBOURNE wordmark erasure flattened the floor context, and the harder banner edges read as "pasted on".
→ Final pick: P3-A1, the simpler BTN port baseline, with V68's compositor unchanged.
takeaway
Numerical gates and the visual rubric are great regression detectors; the rubric is great for surfacing dimensions worth inspecting. But the final accept/reject decision needs a human looking at the actual video against the original broadcast: the metrics we ship are the ones we'd defend on screen, not just on paper.
Every experiment ran on Modal, a serverless GPU platform. Up to 10 concurrent GPUs let us run parallel waves of experiments: ~50 H200 runs across 14 waves of iteration in Phase 3. Configs are frozen per-run; outputs land in experiments/<timestamp>_<gpu>/.
10 concurrent H200 slots let us launch 5–8 parallel runs per wave. Phase 3 ran ~50 runs in <14 hours wall clock; most of that budget went to per-cycle evaluation + visual review, not GPU inference itself.