Real-time custom advertising for event livestreams, designed to integrate with personalized recommendation systems.

Original broadcast footage sample
Core technical hurdles we need to solve for reliable, real-time ad replacement.
Supponor (NHL DED)
IR strips in dasherboards + AI keying. $1.28B ad revenue (2023-24). Requires proprietary hardware at every venue.
uniqFEED (AdApt)
Software-based, deployed at major tennis events. Requires trained operators and custom-trained CV models per sport.
Vizrt / Viz Arena
Camera tracking hardware (encoder heads). Real-time but hardware-dependent and operator-intensive.
Homography estimation: focused on soccer field registration (Nie et al., WACV 2021; Homayounfar et al., CVPR 2017). Not applied to ad replacement.
SAM 2 in sports: used for ball tracking and player tracking. Not applied to advertisement board segmentation.
Virtual ad insertion research: soccer-only, predates foundation models, and relies on hand-crafted features.
Jittering
Overlay instability across frames — tracking drift causes visible shaking in replaced banners.
Occlusion
Players walking in front of banners cause artifacts — overlays render on top of or through players.
Speed
Processing speed below real-time thresholds, limiting viability for live broadcast scenarios.
No prior work combines foundation model segmentation with classical CV for end-to-end sports ad replacement.
Each component exists individually — the pipeline does not.
Foundation models (SAM 2) eliminate the need for hardware, custom training, and continuous operators for the segmentation problem. The remaining challenges — perspective geometry, compositing, occlusion — are solved with specific classical CV techniques.
Stable Tracking
Homography fitting combined with optical flow produces consistent overlays with minimal jitter across frames.
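One generic way to damp residual frame-to-frame jitter is to low-pass filter the four tracked banner corners before fitting the overlay. The sketch below uses an exponential moving average. This is an illustrative assumption, not the homography-plus-optical-flow method named above, and `CornerSmoother` and `alpha` are hypothetical names.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Quad = List[Point]  # four banner corners in a fixed order


class CornerSmoother:
    """Exponential moving average over the four banner corners (hypothetical helper)."""

    def __init__(self, alpha: float = 0.5) -> None:
        # alpha = 1.0 disables smoothing; smaller values smooth more but lag more.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state: Optional[Quad] = None

    def update(self, corners: Quad) -> Quad:
        """Blend the newly tracked corners into the running estimate."""
        if self.state is None:
            # First frame: nothing to blend with yet.
            self.state = list(corners)
        else:
            a = self.alpha
            self.state = [
                (a * nx + (1 - a) * px, a * ny + (1 - a) * py)
                for (nx, ny), (px, py) in zip(corners, self.state)
            ]
        return self.state
```

The trade-off is lag: heavier smoothing hides jitter but makes the overlay trail fast camera pans, so any value of `alpha` would need tuning on real footage.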
Occlusion Handling
Dedicated player segmentation pass generates per-pixel masks, allowing overlays to render correctly behind players.
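The compositing rule behind per-pixel occlusion can be sketched as follows: a pixel takes the ad only if it lies inside the banner mask and outside the player mask. This is a minimal sketch using nested lists and grayscale values for brevity. The function name and the assumption that the ad has already been warped to frame size are illustrative, and a real implementation would use array operations on the GPU.

```python
from typing import List

Image = List[List[int]]   # grayscale pixels, row-major (simplification)
Mask = List[List[bool]]   # True where the pixel belongs to the region


def composite(frame: Image, warped_ad: Image,
              banner_mask: Mask, player_mask: Mask) -> Image:
    """Draw the ad inside the banner, but never over a player pixel."""
    return [
        [
            ad_px if (in_banner and not in_player) else frame_px
            for frame_px, ad_px, in_banner, in_player
            in zip(f_row, a_row, b_row, p_row)
        ]
        for f_row, a_row, b_row, p_row
        in zip(frame, warped_ad, banner_mask, player_mask)
    ]
```

Because the decision is made per pixel rather than per bounding box, a player's outline is preserved exactly as the player mask describes it.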
Targeting 30 fps
Early results suggest real-time performance is achievable with GPU acceleration. Benchmarks in progress.
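A 30 fps target leaves 1000 / 30 ≈ 33.3 ms per frame. The helper below is a small sketch for checking measured stage timings against that budget. It assumes the stages run sequentially, and the function names are hypothetical. Any timings passed in would have to come from the benchmarks still in progress.

```python
from typing import Dict


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


def headroom_ms(stage_ms: Dict[str, float], fps: float = 30.0) -> float:
    """Budget minus total per-frame time, assuming sequential stages.

    A negative result means the frame-rate target is missed.
    """
    return frame_budget_ms(fps) - sum(stage_ms.values())
```

If stages are pipelined across frames instead, throughput is bounded by the slowest stage rather than the sum, at the cost of added latency.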
No prior work assembles this specific pipeline — zero-shot segmentation + classical CV geometry + per-pixel occlusion handling.
| System | Hardware | Training | Operators | Occlusion | Stability | Latency |
|---|---|---|---|---|---|---|
| Supponor (NHL) | IR strips | Proprietary | Central hub | No (clipping) | High | Real-time |
| uniqFEED | None | Custom CV | Trained ops | Partial (bbox) | High | Near real-time |
| Vizrt / Viz Arena | Camera HW | Required | Required | Partial (bbox) | High | Real-time |
| MEIL (Prior) | None | Custom | Manual | Limited | Jittering | Below real-time |
| This Pipeline | None | None (SAM 2) | 1 click | Per-pixel masks | Stable | ~30 fps* |
Fully software-based · Zero training · Minimal operator input · Per-pixel occlusion · Stable tracking
* Preliminary estimate — benchmarks in progress
We use Meta's SAM 2 as the backbone for banner segmentation. Given a single click prompt on the first frame, SAM 2 tracks and segments the banner across all subsequent frames.

SAM 2 architecture — Ravi et al., 2024
Detect and segment players on the court to ensure overlaid banners render behind them, preserving a natural viewing experience.
Detect advertising banners across different regions of the court — ground-level boards, back wall panels, net-mounted banners, umpire stand signage, and more.
Understand the geometry and perspective of the banners from the camera's viewpoint so that overlays and modifications appear realistic to the viewer in the final livestream.
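The perspective mapping from a flat ad image onto a banner seen by the camera is a homography, which four corner correspondences determine exactly. Below is a minimal pure-Python sketch of that four-point fit. The function names are hypothetical, and with many noisy correspondences a robust least-squares fit such as OpenCV's `cv2.findHomography` would be the more usual choice.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4_points(src: Sequence[Point], dst: Sequence[Point]) -> Matrix:
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping each src point to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project(h: Matrix, p: Point) -> Point:
    """Apply the homography to one point, including the perspective divide."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Mapping the unit square to the banner's four image corners gives a matrix that can then warp every ad pixel. The perspective divide in `project` is what makes the far end of the banner appear compressed.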

The vanishing point constrains how banner edges converge: in the image, a banner appears as a perspective quadrilateral, not a true parallelogram.
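As a sketch of this constraint, the vanishing point of a banner can be computed as the intersection of its top and bottom edges, using homogeneous coordinates. The function names are hypothetical, and the input is assumed to be the four corner points from an earlier stage.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p: Point, q: Point) -> Vec3:
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(top_edge: Tuple[Point, Point],
                    bottom_edge: Tuple[Point, Point]) -> Optional[Point]:
    """Intersect the two edge lines; None if they are parallel in the image."""
    x, y, w = cross(line_through(*top_edge), line_through(*bottom_edge))
    if abs(w) < 1e-9:
        return None  # parallel edges: the vanishing point is at infinity
    return (x / w, y / w)
```

A `None` result corresponds to a fronto-parallel view, the one case where the banner really does appear as a parallelogram or rectangle.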

To move to the next phase, we need a few things from the Mitsubishi team.
Sample footage from existing Mitsubishi models or internal datasets to benchmark and validate our pipeline against real-world broadcast conditions.
Access to GPU compute for training and real-time inference. Our pipeline targets 30 fps — knowing available hardware helps us tune model complexity.
Expected livestream format, resolution, and codec. Desired input/output interface — should we consume an RTMP stream and produce one, or work frame-by-frame?