Midterm Presentation

ROI Tracking in Sports Broadcasts

Automated region-of-interest detection and sponsor overlay analysis using computer vision

Raghav, Enrique, Giovanni, Martina

March 2026

Meet the Team

Raghav: MS CSE, Harvard

Enrique: MS CSE, Harvard

Giovanni: MS CSE, Polimi

Martina: MS CSE, Polimi

01

Problem Outline

Goal

Real-time custom advertising for event livestreams, to integrate with personalized recommendation systems.

Focused Scope

Develop a real-time ROI tracking system for tennis matches.
Detect advertisement boards and replace them with dynamic ads.

Kia logo

Original broadcast footage sample

02

Key Challenges

Core technical hurdles we need to solve for reliable, real-time ad replacement.

03

Existing Approaches & Landscape

COMMERCIAL SYSTEMS

Supponor (NHL DED)

IR strips in dasherboards + AI keying. $1.28B ad revenue (2023-24). Requires proprietary hardware at every venue.

uniqFEED (AdApt)

Software-based, deployed at major tennis events. Requires trained operators and custom-trained CV models per sport.

Vizrt / Viz Arena

Camera tracking hardware (encoder heads). Real-time but hardware-dependent and operator-intensive.

ACADEMIC WORK

Homography estimation: focused on soccer field registration (Nie et al., WACV 2021; Homayounfar et al., CVPR 2017). Not applied to ad replacement.

SAM 2 in sports: used for ball tracking and player tracking. Not applied to advertisement board segmentation.

Virtual ad insertion research: soccer-only, predates foundation models, and relies on hand-crafted features.

KNOWN CHALLENGES

Jittering

Overlay instability across frames — tracking drift causes visible shaking in replaced banners.

Occlusion

Players walking in front of banners cause artifacts — overlays render on top of or through players.

Speed

Processing speed below real-time thresholds, limiting viability for live broadcast scenarios.

No prior work combines foundation model segmentation with classical CV for end-to-end sports ad replacement.

Each component exists individually — the pipeline does not.

04

Why This Combination is Novel

Foundation models (SAM 2) eliminate the need for hardware, custom training, and continuous operators for the segmentation problem. The remaining challenges — perspective geometry, compositing, occlusion — are solved with specific classical CV techniques.

SAM 2 Segmentation

Homography Fitting

Optical Flow Detection

Inpainting & Shadow Match

Player Segmentation

Final Composite

FOUNDATION MODEL · CLASSICAL CV

Stable Tracking

Homography fitting combined with optical flow produces consistent overlays with minimal jitter across frames.
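The slides do not specify how homography fitting and optical flow are combined. As one illustration of jitter reduction only, and not the team's method, the sketch below smooths the four tracked banner corners over time with an exponential moving average. The function name and the `alpha` parameter are assumptions introduced here.

```python
def smooth_corners(prev, current, alpha=0.6):
    """Exponential moving average of quad corners across frames.

    prev:    smoothed corners from the previous frame, or None on the first frame.
    current: raw corners estimated for this frame, as (x, y) tuples.
    alpha:   weight of the new measurement; lower values are smoother but lag more.
    """
    if prev is None:
        return list(current)
    return [
        (alpha * cx + (1 - alpha) * px, alpha * cy + (1 - alpha) * py)
        for (px, py), (cx, cy) in zip(prev, current)
    ]
```

A lower `alpha` suppresses more frame-to-frame shake but makes the overlay trail the banner during fast camera motion, so the value would need tuning against real footage.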

Occlusion Handling

Dedicated player segmentation pass generates per-pixel masks, allowing overlays to render correctly behind players.

Targeting 30 fps

Early results suggest real-time performance is achievable with GPU acceleration. Benchmarks in progress.
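At 30 fps the whole pipeline has 1000 / 30 ≈ 33.3 ms per frame. The sketch below shows that arithmetic as a small budget check. It assumes stages run sequentially on each frame (a pipelined design would be bounded by the slowest stage instead), and the function names are illustrative.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def fits_budget(stage_ms: dict, target_fps: float = 30.0) -> bool:
    """True if the summed per-stage latencies fit inside one frame interval.

    stage_ms maps a stage name to its measured latency in milliseconds.
    Assumes the stages execute one after another for every frame.
    """
    return sum(stage_ms.values()) <= frame_budget_ms(target_fps)
```

Feeding measured per-stage timings into `fits_budget` once benchmarks are available would show which stages dominate the budget.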

No prior work assembles this specific pipeline — zero-shot segmentation + classical CV geometry + per-pixel occlusion handling.

05

How We Compare

System	Hardware	Training	Operators	Occlusion	Stability	Latency
Supponor (NHL)	IR strips	Proprietary	Central hub	No (clipping)	High	Real-time
uniqFEED	None	Custom CV	Trained ops	Partial (bbox)	High	Near real-time
Vizrt / Viz Arena	Camera HW	Required	Required	Partial (bbox)	High	Real-time
MEIL (Prior)	None	Custom	Manual	Limited	Jittering	Below real-time
This Pipeline	None	None (SAM 2)	1 click	Per-pixel masks	Stable	~30 fps*

Fully software-based · Zero training · Minimal operator input · Per-pixel occlusion · Stable tracking

* Preliminary estimate — benchmarks in progress

06

Full Pipeline Overview

07

Segment Anything Model 2

We use Meta's SAM 2 as the backbone for banner segmentation. Given a single click prompt on the first frame, SAM 2 segments the banner and tracks it across all subsequent frames.

Pre-trained on SA-1B (11M images, 1B+ masks) and the SA-V video dataset
Prompt with points, boxes, or masks
Memory bank for temporal consistency

SAM 2 architecture: image encoder, memory attention, mask decoder, memory bank (Ravi et al., 2024)

08

Player Segmentation

Detect and segment players on the court to ensure overlaid banners render behind them, preserving a natural viewing experience.
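Rendering a banner "behind" a player comes down to a per-pixel rule: show the overlay only where the banner mask is set and the player mask is not. A minimal sketch, assuming both masks are already available as boolean grids; it uses plain nested lists to stay dependency-free, whereas a real pipeline would operate on image arrays. The function name is illustrative.

```python
def composite_behind_players(frame, overlay, banner_mask, player_mask):
    """Per-pixel composite that keeps players in front of the new banner.

    frame, overlay:           H x W grids of pixel values (e.g. RGB tuples).
    banner_mask, player_mask: H x W grids of booleans.
    A pixel takes the overlay only if it is banner and not player.
    """
    out = []
    for f_row, o_row, b_row, p_row in zip(frame, overlay, banner_mask, player_mask):
        out.append([
            o if (b and not p) else f
            for f, o, b, p in zip(f_row, o_row, b_row, p_row)
        ])
    return out
```

Because the rule is evaluated per pixel, a player's outline is preserved exactly as the mask describes it, unlike a bounding-box cutout.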

09

Banner Segmentation & Tracking

Detect advertising banners across different regions of the court — ground-level boards, back wall panels, net-mounted banners, umpire stand signage, and more.

10

Banner Segmentation

Banners Stable Camera
Banners Moving Camera
Logos Stable Camera
Logos Moving Camera
Camera Cuts (experimental)

11

Homography & Perspective Geometry

Understand the geometry and perspective of the banners from the camera's viewpoint so that overlays and modifications appear realistic to the viewer in the final livestream.

Motivation

Original frame

Detect region

Rectify to flat

New logo (flat)

Warp overlay back
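The rectify and warp-back steps both rest on a 3x3 homography between the banner's four image corners and a flat logo rectangle. A minimal sketch, assuming the four corner correspondences are already known. It is written in dependency-free Python for illustration; a production pipeline would more likely use OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`. All function names below are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(src, dst):
    """3x3 homography (bottom-right entry fixed to 1) mapping src[i] to dst[i]."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one (x, y) point through h, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Calling `homography_from_4_points(banner_corners, flat_corners)` gives the rectifying transform, and swapping the two arguments gives the transform that warps the new flat logo back into the frame.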

11

Vanishing Point

The vanishing point constrains how banner edges converge: in the image, a banner appears as a perspective quadrilateral, not a true parallelogram.
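In homogeneous coordinates, the line through two points and the intersection of two lines are both cross products, so the vanishing point of a banner's top and bottom edges takes only a few lines. A minimal sketch, assuming each edge is given by two image points; the function names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(top_edge, bottom_edge, eps=1e-9):
    """Intersection of two edges, each a pair of points.

    Returns None when the edges are parallel in the image, which means
    the vanishing point lies at infinity.
    """
    x, y, w = cross(line_through(*top_edge), line_through(*bottom_edge))
    if abs(w) < eps:
        return None
    return (x / w, y / w)
```

For example, a top edge from (0, 0) to (4, 1) and a bottom edge from (0, 4) to (4, 3) converge at (8, 2).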

11

Quadrilateral Fitting

Original frame
SAM mask
Binary mask
Min-area rectangle
Split along axes
Fit edge lines
Intersect corners
Rectified view
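The last two fitting steps, "fit edge lines" and "intersect corners", can be sketched as follows. This assumes the mask boundary has already been split into four groups of points, one per side (the job of the min-area rectangle and axis-split steps). It uses a total-least-squares fit; the slides do not say which line-fitting method is actually used, and the function names are illustrative.

```python
import math


def fit_line(points):
    """Total-least-squares line through 2D points: (centroid, unit direction)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # principal axis angle
    return (cx, cy), (math.cos(theta), math.sin(theta))


def intersect(line_a, line_b):
    """Intersection of two lines given as (point, direction)."""
    (ax, ay), (adx, ady) = line_a
    (bx, by), (bdx, bdy) = line_b
    det = adx * bdy - ady * bdx
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel")
    t = ((bx - ax) * bdy - (by - ay) * bdx) / det
    return (ax + t * adx, ay + t * ady)


def quad_corners(top, right, bottom, left):
    """Corners (TL, TR, BR, BL) from four groups of boundary points."""
    t, r, b, l = (fit_line(side) for side in (top, right, bottom, left))
    return [intersect(t, l), intersect(t, r), intersect(b, r), intersect(b, l)]
```

Fitting lines to many boundary points and intersecting them places the corners with sub-pixel precision and is less sensitive to a ragged mask edge than picking extreme mask pixels directly.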

12

New Logo Overlay

Inpaint: Remove the original logo from the banner surface
Shadow / Tint Matching: Match luminosity and color tone of the surface
Border Blending: Smooth alpha edges for seamless compositing
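Two of these steps reduce to simple per-pixel arithmetic: a brightness gain for shadow matching and an alpha ramp for border blending. The sketch below is an assumption about how they could be done, not the team's implementation. In particular, the uniform gain and the linear feather are the simplest possible choices, and all names are illustrative. Inpainting is omitted.

```python
def luminance(rgb):
    """Approximate brightness of an RGB pixel (Rec. 601 luma weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def apply_gain(rgb, gain):
    """Scale a pixel's brightness, clipping to the 8-bit range."""
    return tuple(min(255.0, c * gain) for c in rgb)


def feather_alpha(dist_to_edge_px, feather_px=3.0):
    """Alpha ramps linearly from 0 at the mask edge to 1 at feather_px inside."""
    return max(0.0, min(1.0, dist_to_edge_px / feather_px))


def blend(overlay_rgb, frame_rgb, alpha):
    """Alpha-blend one pixel: alpha 1 keeps the overlay, 0 keeps the frame."""
    return tuple(alpha * o + (1.0 - alpha) * f
                 for o, f in zip(overlay_rgb, frame_rgb))
```

One hypothetical use: estimate `gain` as the ratio of the banner surface's current mean luminance to its luminance in a well-lit reference frame, apply it to the new logo, then blend with `feather_alpha` so the overlay fades out toward the mask border.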

13

Demo

Stable Camera · Camera Movement · Player Overlay (experimental)

Future Improvements

Visual Improvements

Ball tracking
Occlusion handling
Shadows & lighting correction

Camera Cuts

Detect and handle camera angle transitions gracefully

Real-Time Performance

Model fine-tuning for speed
Inference benchmarking at 30 fps target

Automatic Region Selection

Remove manual click-to-select step
Auto-detect ad regions from frame content

Other Sports

Generalize pipeline beyond tennis to football, basketball, etc.

Asks From Mitsubishi

To move to the next phase, we need a few things from the Mitsubishi team.

Sample Video

Sample footage from existing Mitsubishi models or internal datasets to benchmark and validate our pipeline against real-world broadcast conditions.

Resources

Access to GPU compute for training and real-time inference. Our pipeline targets 30 fps — knowing available hardware helps us tune model complexity.

Pipeline Requirements

Expected livestream format, resolution, and codec. Desired input/output interface — should we consume an RTMP stream and produce one, or work frame-by-frame?

Thank You

Raghav, Enrique, Giovanni, Martina