Final Presentation

ROI Tracking in Sports Broadcasts

Automated region-of-interest detection and virtual ad insertion in tennis broadcasts

Raghav, Enrique, Giovanni, Martina

May 2026

Meet the Team

Raghav · MS CSE, Harvard
Enrique · MS CSE, Harvard
Giovanni · MS CSE, Polimi
Martina · MS CSE, Polimi


01

Problem Outline

Goal

Real-time custom advertising for event livestreams, to integrate with personalized recommendation systems.

Focused Scope

Develop a real-time ROI tracking system for tennis matches.
Detect advertisement boards and replace them with dynamic ads.


Kia logo

Original broadcast footage sample

02

Key Challenges

Core technical hurdles we need to solve for reliable, real-time ad replacement.

03

Existing Approaches & Landscape

COMMERCIAL SYSTEMS

Supponor (NHL DED)

IR strips in dasherboards + AI keying. $1.28B ad revenue (2023-24). Requires proprietary hardware at every venue.

uniqFEED (AdApt)

Software-based, deployed at major tennis events. Requires trained operators and custom-trained CV models per sport.

Vizrt / Viz Arena

Camera tracking hardware (encoder heads). Real-time but hardware-dependent and operator-intensive.

ACADEMIC WORK

Homography estimation — focused on soccer field registration (Nie et al. WACV 2021, Homayounfar CVPR 2017). Not applied to ad replacement.

SAM 2 in sports — used for ball tracking and player tracking. Not applied to advertisement board segmentation.

Virtual ad insertion research — soccer-only, predates foundation models, uses hand-crafted features.

KNOWN CHALLENGES

Jittering

Overlay instability across frames — tracking drift causes visible shaking in replaced banners.

Occlusion

Players walking in front of banners cause artifacts — overlays render on top of or through players.

Speed

Processing speed below real-time thresholds, limiting viability for live broadcast scenarios.

No prior work combines foundation model segmentation with classical CV for end-to-end sports ad replacement.

Each component exists individually — the pipeline does not.

04

Why This Combination is Novel

Foundation models (SAM 2) eliminate the need for hardware, custom training, and continuous operators for the segmentation problem. The remaining challenges — perspective geometry, compositing, occlusion — are solved with specific classical CV techniques.

SAM 2 Segmentation

Homography Fitting

Optical Flow Detection

Inpainting & Shadow Match

Player Segmentation

Final Composite

FOUNDATION MODEL · CLASSICAL CV

Stable Tracking

Homography fitting combined with optical flow produces consistent overlays with minimal jitter across frames.

Occlusion Handling

Dedicated player segmentation pass generates per-pixel masks, allowing overlays to render correctly behind players.

Targeting 30 fps

Early results suggest real-time performance is achievable with GPU acceleration. Benchmarks in progress.

No prior work assembles this specific pipeline — zero-shot segmentation + classical CV geometry + per-pixel occlusion handling.

05

Project journey

How we got from the midterm-era manually-clicked corners to a pipeline that tracks the camera through a player walkover. Phase 2 failed; Phase 3 worked. The simpler answer won on visual review.

Phase 1 · Apr 30

V68 · clicked corners

Manually clicked court corners on a seed frame
Static homography across all 767 frames
5 placements live: 3 back banners + left + floor

Looks perfect when the camera is still, drifts off the court the moment it moves.

Phase 2 · May 4

hybrid_lock + line-based estimator

Per-frame Hough+RANSAC court detection
7 tolerance sweeps · 3 ramp speeds
Gated by hybrid_lock state machine

Every tolerance setting regresses floor SSIM monotonically. The line estimator is too noisy from frame to frame. Failed axis; gold remains V68.

Phase 3 · May 5–6

BallTrackerNet learned-keypoint port

14-channel CNN court-keypoint detector
RANSAC over 14 keypoints → homography
~50 H200 runs across 14 iteration waves

Stable enough to gate on with hybrid_lock@30. Per-frame estimates ramp in only when motion exceeds tolerance.
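The gating idea can be sketched as follows. This is a minimal illustration, not the project's code: the class name HybridLock, the ramp_step parameter, and the linear blend between seed and per-frame corners are all assumptions. Only the 30-px tolerance comes from the slides.

```python
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class HybridLock:
    """Hypothetical gate: hold the seed corners until motion exceeds a tolerance."""
    seed: list[Point]
    tolerance_px: float = 30.0  # "hybrid_lock@30" in the slides
    ramp_step: float = 0.25     # assumed ramp speed (blend-weight change per frame)
    weight: float = 0.0         # 0 = fully locked, 1 = fully per-frame

    def update(self, estimate: list[Point]) -> list[Point]:
        # Largest displacement of any per-frame corner from its seed position.
        motion = max(
            ((ex - sx) ** 2 + (ey - sy) ** 2) ** 0.5
            for (sx, sy), (ex, ey) in zip(self.seed, estimate)
        )
        # Ramp towards the per-frame estimate only while motion exceeds tolerance.
        step = self.ramp_step if motion > self.tolerance_px else -self.ramp_step
        self.weight = min(1.0, max(0.0, self.weight + step))
        w = self.weight
        return [
            ((1 - w) * sx + w * ex, (1 - w) * sy + w * ey)
            for (sx, sy), (ex, ey) in zip(self.seed, estimate)
        ]
```

With this shape, small estimator noise on a static camera never reaches the output, while a real pan blends in over a few frames instead of snapping.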

Final · May 6

P3-A1 · what we ship

V68 manually-clicked corners (seed)
BTN dynamic homography · hybrid_lock@30
V68 compositor unchanged, none of the experimental tweaks

All 4 region scorecards pass · walkover_occlusion_iou = 0.985 · temporal SSIM ≥ 0.99 every region.

06

How We Compare

System	Hardware	Training	Operators	Occlusion	Stability	Latency
Supponor (NHL)	IR strips	Proprietary	Central hub	No (clipping)	High	Real-time
uniqFEED	None	Custom CV	Trained ops	Partial (bbox)	High	Near real-time
Vizrt / Viz Arena	Camera HW	Required	Required	Partial (bbox)	High	Real-time
MELIC (Prior)	None	Custom	Manual	Limited	Jittering	Below real-time
This Pipeline	None	None (SAM 2)	1 click	Per-pixel masks	Stable	2.68 fps (H200)*

Fully software-based · Zero training · Minimal operator input · Per-pixel occlusion · Stable tracking

* Measured end-to-end on H200, 767-frame demo clip; ~30 fps real-time target is future work

07

Full Pipeline Overview

08

SAM 2 model

We use Meta's SAM 2 as the backbone for banner segmentation. It requires manual ROI selection on frame 0 (a single click, box, or mask prompt), then tracks and segments the region across all subsequent frames.

Pre-trained on SA-1B dataset (11M images, 1B+ masks), trained on SA-V video dataset (50.9K videos, 35.5M masks)
Prompt with points, boxes, or masks
Memory bank for temporal consistency

SAM 2 architecture — image encoder, memory attention, mask decoder, memory bank

SAM 2 Architecture — Meta AI, “SAM 2: Segment Anything in Images and Videos” (arXiv:2408.00714)

09

Player Segmentation

Detect and segment players on the court to ensure overlaid banners render behind them, preserving a natural viewing experience.

10

Banner Segmentation & Tracking

Detect advertising banners across different regions of the court — ground-level boards, back wall panels, net-mounted banners, umpire stand signage, and more.

11

SAM 2 - Banner Segmentation

Banners Stable Camera
Banners Moving Camera
Logos Stable Camera
Logos Moving Camera
Camera Cuts (experiment)

12

SAM 3 model

SAM 3 introduces Promptable Concept Segmentation: given a text prompt or positive/negative exemplar prompts, it detects, segments, and tracks all instances of a concept across video frames, enabling automatic ROI detection and tracking.

Text-prompted open-vocabulary detection
New Detector head + SAM 2 Tracker + Memory Bank
Detects new instances mid-video, no per-frame clicks

SAM 3 architecture — text encoder, image encoder, detector, tracker, memory bank

SAM 3 Architecture — Meta AI, “SAM 3: Segment Anything with Concepts” (arXiv:2511.16719)

\[
\begin{aligned}
\hat{M}_t &= \mathrm{propagate}(M_{t-1}) \\
O_t &= \mathrm{detect}(I_t, P) \\
M_t &= \mathrm{match\_and\_update}(\hat{M}_t, O_t)
\end{aligned}
\]
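To make the third step of the update concrete, here is a toy sketch of a match-and-update stage. The propagate and detect steps belong to the model and are not reproduced. The greedy IoU matching, the 0.5 threshold, and the function names are assumptions for illustration, not SAM 3's actual association rule.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def match_and_update(propagated: dict[int, np.ndarray],
                     detections: list[np.ndarray],
                     iou_thresh: float = 0.5) -> dict[int, np.ndarray]:
    """Greedy association: a detection refreshes the best-overlapping track,
    otherwise it starts a new track (an instance that appeared mid-video)."""
    tracks = dict(propagated)
    next_id = max(tracks, default=-1) + 1
    unmatched = set(propagated)
    for det in detections:
        best = max(unmatched, key=lambda k: iou(propagated[k], det), default=None)
        if best is not None and iou(propagated[best], det) >= iou_thresh:
            tracks[best] = det
            unmatched.discard(best)
        else:
            tracks[next_id] = det
            next_id += 1
    return tracks
```

Tracks with no matching detection keep their propagated mask, which is one simple way to survive a missed detection on a single frame.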

13

SAM 3-Light

A light version that runs SAM 3 (Detector + Tracker) only when the scene actually changes, and reuses the previous masks (Tracker) otherwise. This brings throughput from ~1 fps (full SAM 3) to ~3–4 fps on the same A100-80GB hardware.
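A minimal sketch of the scene-change gate, under stated assumptions: the similarity measure (cosine similarity of mean-centred grayscale frames), the keyframe policy, and the callables full_pass and track_only are hypothetical stand-ins, not the SAM 3 API or the project's implementation. The threshold plays the role of the sim values used in the experiments.

```python
import numpy as np


def frame_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of mean-centred grayscale frames (an assumed metric)."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 1.0


def run_light(frames, full_pass, track_only, sim_threshold=0.95):
    """Run the expensive pass only when a frame differs from the last keyframe."""
    masks, keyframe, log = None, None, []
    for frame in frames:
        if keyframe is None or frame_similarity(keyframe, frame) < sim_threshold:
            masks = full_pass(frame)           # stands in for Detector + Tracker
            keyframe = frame
            log.append("full")
        else:
            masks = track_only(frame, masks)   # stands in for Tracker-only reuse
            log.append("light")
    return log
```

A higher threshold triggers the full pass more often, trading throughput for faster reaction to zooms and camera changes.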

14

SAM 3 - Automatic detection

SAM 3

Static
Zoom + camera change P1
Zoom + camera change P2

SAM 3-Light

Static
Zoom + change (sim=0.85)
Zoom + change (sim=0.95)
Zoom + change (sim=0.97)

SAM 3 · Static · 1.001 fps

prompt “logo”

15

Full Pipeline Experiments

SAM 3
SAM 3-Light
SAM 3-Light (experiment)

SAM 3 · 1.83 fps

prompt “sponsor logo on fixed advertising board”

segmented objects: 10

16

Homography & Perspective Geometry

Understand the geometry and perspective of the banners from the camera's viewpoint so that overlays and modifications appear realistic to the viewer in the final livestream.

Motivation

Original frame

Detect region

Rectify to flat

New logo (flat)

Warp overlay back
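The rectify and warp-back steps both come down to one 3×3 homography and its inverse. Below is a NumPy-only sketch using the direct linear transform; the function names and the corner coordinates are made up for illustration, and a production pipeline would more likely call a library routine.

```python
import numpy as np


def fit_homography(src, dst):
    """Direct linear transform: 3x3 H with dst ~ H @ src, from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraints.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def apply_homography(h, pts):
    """Map 2D points through H, including the perspective divide."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]


# Illustrative values: a flat 400x100 logo and a banner quad seen in perspective.
logo_corners = [(0, 0), (400, 0), (400, 100), (0, 100)]
banner_quad = [(120, 80), (520, 95), (515, 190), (118, 160)]

H = fit_homography(logo_corners, banner_quad)  # flat -> frame ("warp overlay back")
H_inv = np.linalg.inv(H)                       # frame -> flat ("rectify to flat")
```

Fitting in the flat-to-frame direction means the new logo is authored once as a rectangle and the same H places it on every pixel of the banner.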

16

Vanishing Point

The vanishing point constrains how banner edges converge: the projected banner is a perspective quadrilateral, not a true parallelogram.

16

Quadrilateral Fitting

Original frame
SAM mask
Binary mask
Min-area rectangle
Split along axes
Fit edge lines
Intersect corners
Rectified view
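The last two fitting steps (fit edge lines, intersect corners) can be sketched with NumPy alone. This assumes the mask boundary points have already been split into four edge groups; the function names and the sample points are illustrative, not taken from the project.

```python
import numpy as np


def fit_line(points):
    """Total-least-squares line a*x + b*y + c = 0 through 2D points, as (a, b, c)."""
    pts = np.asarray(points, dtype=np.float64)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]  # direction of least variance is the line normal
    return np.array([normal[0], normal[1], -normal @ centroid])


def intersect(l1, l2):
    """Intersection of two homogeneous lines via the cross product."""
    x, y, w = np.cross(l1, l2)
    return np.array([x / w, y / w])


# Illustrative boundary samples for a banner's top edge and left edge.
top_edge = [(0, 10), (100, 20), (200, 30)]
left_edge = [(20, 0), (20, 50), (20, 100)]
top_left_corner = intersect(fit_line(top_edge), fit_line(left_edge))
```

Intersecting fitted lines gives sub-pixel corners even when the mask itself is ragged or a corner is hidden, which is the usual reason to prefer it over picking extreme mask pixels.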

17

Single vanishing point

Before the BallTrackerNet learned-keypoint detector, the per-frame court geometry ran on a classical line-based estimator: Hough lines on court markings → cluster by orientation → estimate the depth vanishing point → fit the homography.

VP-constrained fitters
Wall banners and side panels are fitted with rays projecting from the depth or width vanishing point, which gives geometrically correct perspective without solving for full camera intrinsics.
Why it didn't make the final
The Hough estimator is noisy from frame to frame: even with smoothing, projected corners deviate 5–15 px between frames on a static camera. Phase 2 swept tolerance_px ∈ {2..30} under hybrid_lock; only the always-locked V68 baseline passed all gates.
Path forward
A learned 14-keypoint detector (BallTrackerNet) replaces the Hough+VP estimator in the final pipeline. The VP-constrained fitter code lives on feat/court-geometry-stabilisation.

Vanishing point construction on the court geometry

depth VP estimated from court markings
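One common way to estimate a vanishing point from a cluster of lines is a least-squares intersection in homogeneous coordinates. The sketch below assumes the Hough and clustering steps have already produced line segments; the function names and coordinates are illustrative and do not come from the project code.

```python
import numpy as np


def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])


def vanishing_point(lines):
    """Least-squares common intersection of homogeneous lines (L @ v ~ 0)."""
    L = np.asarray(lines, dtype=np.float64)
    L = L / np.linalg.norm(L[:, :2], axis=1, keepdims=True)  # unit normals
    _, _, vt = np.linalg.svd(L)
    v = vt[-1]
    return v[:2] / v[2]  # assumes a finite vanishing point (v[2] != 0)


# Illustrative court lines that all converge to (640, -200).
true_vp = np.array([640.0, -200.0])
bases = [np.array(b, dtype=float) for b in [(200, 700), (640, 700), (1080, 700)]]
segments = [(b, b + 0.5 * (true_vp - b)) for b in bases]
vp = vanishing_point([line_through(p, q) for p, q in segments])
```

With noisy Hough lines the same solve returns the point closest to all lines, which is also why per-frame noise in the lines moves the vanishing point and everything fitted from it.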

18

New logo overlay

Inpaint: Remove the original logo from the surface (median_fill, temporal)
LED-blend brightness re-bake: Match local surface luminance so the logo reads as painted, not pasted
Person-mask occlusion: Alpha-matte the player silhouette so logos hide behind feet, legs, racket
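The brightness and occlusion stages above can be sketched as a single blend. Inpainting is not shown, the multiplicative luminance model is an assumption about how a brightness re-bake could work, and the function and argument names are hypothetical.

```python
import numpy as np


def composite(frame, logo, logo_alpha, player_mask, surface_luma):
    """Blend a warped logo into a frame, lit like the surface and hidden behind players.

    frame, logo:   float arrays in [0, 1], shape (H, W, 3)
    logo_alpha:    logo coverage in [0, 1], shape (H, W)
    player_mask:   soft player matte in [0, 1], shape (H, W); 1 = player
    surface_luma:  local surface brightness in [0, 1], shape (H, W)
    """
    lit_logo = logo * surface_luma[..., None]               # brightness re-bake
    alpha = (logo_alpha * (1.0 - player_mask))[..., None]   # players occlude the logo
    return alpha * lit_logo + (1.0 - alpha) * frame
```

Because the player matte is soft, pixels on the silhouette edge mix logo and player, which avoids a hard cut-out outline around feet and racket.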

stage

01 · Original broadcast
02 · Final composite
03 · Logo on the ground

01 · Original broadcast

Real baked-in ads on the back banners + floor wordmark (Kia, YoPRO, Melbourne).

19

Final result, region by region

Top row = unmodified original broadcast (real baked-in ads, our quality bar). Bottom row = our composite.

Back banners · Left side banner · Court floor logo · Full frame

Back banners, paired original vs composite crop strip

Left side banner, paired original vs composite crop strip

Full frame, paired original vs composite crop strip

3 crops · objects: obj_1 · obj_2 · obj_5 · surface: banner

3 distinct back-wall banner positions, same frame (f0350). Temporal SSIM 0.9999 · jitter 0.291, visually identical to V68 gold.

20

Walkover sequence

Auto-detected walkover window: frames 685–723 · 5 key frames span entry → contact → exit

Entry f685 · Pre-contact f694 · Contact f704 · Post-contact f713 · Exit f723

Walkover forensic sheet frame 685, Entry

Walkover forensic sheet frame 694, Pre-contact

Walkover forensic sheet frame 704, Contact

Walkover forensic sheet frame 713, Post-contact

1 original broadcast · 2 clean court (no logo) · 3 our composite · 4 original − clean (Δ) · 5 survival heatmap · 6 leak overlay (red)

f685 · ENTRY: Player begins entering the floor-logo region.

21

Demo

Original broadcast · Final composite · Side-by-side vs V68 gold

BEFORE

Original broadcast: 767 frames @ 60 fps from the Melbourne broadcast, the input to the pipeline.


22

Headline numbers

The final-run results in one screen: what the eval framework measured on experiments/2026-05-05_18-38-39_hull_H200/.

5

Simultaneous virtual placements: 3 back banners · 1 left side · 1 court-floor walkover logo

0.9999

Temporal SSIM · back banners: Visually identical to V68 gold (locked frames)

0.985

Walkover occlusion IoU: Player on the floor logo, frames 685–723. Gate is >0.80.

767 frames

Demo clip · 60 fps: 13 seconds from the Melbourne broadcast

~50

H200 GPU runs · 14 iteration waves: Phase 3 experiment cycles, < 14 h wall clock

13 × 5

Rubric dimensions × regions scored: Visual rubric across realism · color · geometry · temporal

one-liner
V68 manually-clicked corners + BallTrackerNet dynamic homography + hybrid_lock at 30-px tolerance + V68's LED-blend compositor: five placements, broadcast-stable, occluded correctly through a player walkover.

23

Evaluation, three layers

Every candidate gets scored at three layers before it can ship: deterministic numerical gates, a structured visual rubric, and direct side-by-side review against the original baked-in ads.

layer 1

Numerical metrics

Per-region scorecards. Each region must pass all gated metrics; one warning-only set (ΔE, noise variance, edge sharpness) surfaces but doesn't gate.

corner_max_jump_px < 2.0
corner_accel_p95_px < 1.0
quad_area_cv < 0.05
roi_jitter_ratio ≤ 1.05
roi_temporal_ssim_mean > 0.95
walkover_logo_visible_pct > 0.10
walkover_occlusion_iou > 0.80

Exit code 0 = pass · 2 = scorecard fail · 3 = regression vs gold reference
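A sketch of how such gates could be checked in code. The thresholds and exit codes are copied from this slide; the function names, the rule that a metric absent from a region is skipped, and the precedence of exit code 2 over 3 are assumptions.

```python
import operator

# Thresholds copied from the slide; the evaluation logic itself is an assumption.
GATES = {
    "corner_max_jump_px":        (operator.lt, 2.0),
    "corner_accel_p95_px":       (operator.lt, 1.0),
    "quad_area_cv":              (operator.lt, 0.05),
    "roi_jitter_ratio":          (operator.le, 1.05),
    "roi_temporal_ssim_mean":    (operator.gt, 0.95),
    "walkover_logo_visible_pct": (operator.gt, 0.10),
    "walkover_occlusion_iou":    (operator.gt, 0.80),
}


def failed_gates(metrics: dict[str, float]) -> list[str]:
    """Names of gated metrics present in `metrics` that miss their threshold."""
    return [name for name, (passes, limit) in GATES.items()
            if name in metrics and not passes(metrics[name], limit)]


def exit_code(metrics: dict[str, float], regressed_vs_gold: bool = False) -> int:
    """0 = pass, 2 = scorecard fail, 3 = regression vs gold reference."""
    if failed_gates(metrics):
        return 2
    return 3 if regressed_vs_gold else 0
```

Keeping the gates in one table makes the pass criteria reviewable at a glance and keeps warning-only metrics out of the exit code.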

layer 2

Visual rubric

Per-region 1–5 score across 13 dimensions. Inputs are paired top=original / bottom=composite crop strips so the original baked-in ad is always the comparison anchor.

Realism

painted_on vs pasted_on
edge_seam_visibility
texture_match
halo_presence
edge_reflex

Color

hue_match
brightness_match
saturation_match

Geometry

perspective_plausibility
size_plausibility

Temporal (floor / walkover)

occlusion_realism
jitter_visible
player_contact_shadow

layer 3

Direct visual review

final

Human review of the actual output video against the original broadcast. The accept/reject vote; tie-breaks the numerical and rubric layers when they disagree.

example tie-break

A rubric-favorite candidate (layered shadow synthesis + aggressive erase_text + tight banner padding) scored 5/5 on the v2 user-flagged dimensions and passed every numerical gate.

On direct viewing: floor shadow read as blob, MELBOURNE wordmark erasure flattened the floor context, harder banner edges read as "pasted on".

→ Final pick: P3-A1, the simpler BTN port baseline, with V68's compositor unchanged.

takeaway
Numerical gates and the visual rubric are great regression detectors, and the rubric is great for surfacing dimensions worth inspecting. But the final accept/reject decision needs a human looking at the actual video against the original broadcast: the metrics we ship are the ones we'd defend on screen, not just on paper.

24

Modal + Speed Benchmarking

Every experiment ran on Modal, a serverless GPU platform. Up to 10 concurrent GPUs let us run parallel waves of experiments: ~50 H200 runs across 14 waves of iteration in Phase 3. Configs are frozen per-run; outputs land in experiments/<timestamp>_<gpu>/.

Modal GPU matrix

GPU	VRAM	$/hr
T4	16 GB	$0.59
L4	24 GB	$0.80
A10G	24 GB	$1.10
L40S	48 GB	$1.95
A100	40 GB	$2.10
A100-80GB	80 GB	$2.50
H100	80 GB	$3.95
H200 (final)	141 GB	$4.54
B200	192 GB	$6.25

Final run, P3-A1

Frames: 767
Input fps: 59.0
End-to-end: 286 s
Output fps: 2.68
GPU: H200
VRAM used: 139.8 GB
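The headline figures follow directly from the numbers above. The cost line is a rough estimate only: it assumes billing for the measured 286 s at the listed H200 rate and ignores container start-up and idle time.

```python
frames = 767
end_to_end_s = 286.0
input_fps = 59.0
h200_usd_per_hr = 4.54  # from the GPU matrix above

output_fps = frames / end_to_end_s                     # ~2.68 fps
clip_seconds = frames / input_fps                      # 13 s of footage
slowdown = input_fps / output_fps                      # 22x slower than real time
speedup_for_30fps = 30 / output_fps                    # ~11.2x needed for the 30 fps target
run_cost_usd = end_to_end_s / 3600 * h200_usd_per_hr   # ~$0.36 per run (estimate)

print(f"{output_fps:.2f} fps, {slowdown:.0f}x slower than real time, "
      f"{speedup_for_30fps:.1f}x speed-up needed, ~${run_cost_usd:.2f} per run")
```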

Parallelism unlock

10 concurrent H200 slots let us launch 5–8 parallel runs per wave. Phase 3 ran ~50 runs in <14 hours wall clock; most of that budget went on per-cycle evaluation + visual review, not GPU inference itself.

Future Improvements

Texture

Texture-match ceiling

Smoothed inpaint micro-grain visible vs gritty court paint at close zoom
Needs real texture transfer (noise injection / GAN-based inpaint)

Smoothing

Adaptive vp_smoothing

Code shipped (P3-A2) but the parameter sweep didn't conclude
Motion-aware EMA could lift walkover-window stability further
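One possible shape for a motion-aware EMA, shown on a single scalar such as one vanishing-point coordinate. This is not the shipped P3-A2 code: the function name, the two smoothing factors, and the motion threshold are illustrative assumptions.

```python
def adaptive_ema(values, alpha_still=0.1, alpha_moving=0.8, motion_px=2.0):
    """EMA whose smoothing factor rises when the input moves more than
    `motion_px` between consecutive frames."""
    smoothed = [values[0]]
    for prev_raw, raw in zip(values, values[1:]):
        # Heavy smoothing while still (suppress jitter), light while moving (limit lag).
        alpha = alpha_moving if abs(raw - prev_raw) > motion_px else alpha_still
        smoothed.append(alpha * raw + (1 - alpha) * smoothed[-1])
    return smoothed
```

For example, adaptive_ema([100, 100.5, 99.5, 100]) stays within 0.1 px of 100, while adaptive_ema([100, 110, 120]) follows a pan closely instead of lagging behind it.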

Speed

Real-time performance

End-to-end ~2.7 fps on H200 today
Target: 30 fps for broadcast viability; needs model + pipeline optimisation

Auto

Refine pipeline for Automatic ROI detection

Best prompt so far: "sponsor logo on fixed advertising board" (~3.95 fps on A100-80GB)

Cuts

Camera cuts + zoom

Detect angle transitions; reseed homography on cuts
Validate on data/zoom-clip-melbourne.mov + multi-clip eval

Sports

Other sports

Pipeline supports any clip via configs/eval/reference.yaml
Generalize beyond tennis (football, basketball)

Thank You

Raghav, Enrique, Giovanni, Martina