OmniPro

A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Anonymous Authors

Figure 1. OmniPro: 9 sub-tasks × 3 cognitive levels × 6 capabilities, with 84% audio dependency.

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models.

We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis.

We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to decide autonomously when to respond as the input streams in.
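To make the protocol concrete, below is a minimal Python sketch of the two modes. Every interface it touches (model.query, model.step, sample.stream_until, sample.chunk, the Trigger fields) is a hypothetical stand-in, not the released evaluation harness.

from dataclasses import dataclass

@dataclass
class Trigger:
    time: float    # ground-truth trigger timestamp in seconds
    response: str  # ground-truth proactive response text

def probe_mode(model, sample, delta=2.0):
    """Query the model shortly before and after each ground-truth trigger.

    Before the trigger the model should stay silent; after it, the
    probed answer is scored against the ground-truth response.
    """
    results = []
    for trig in sample.triggers:
        pre = model.query(sample.stream_until(trig.time - delta), sample.instruction)
        post = model.query(sample.stream_until(trig.time + delta), sample.instruction)
        results.append((pre, post, trig.response))
    return results

def online_mode(model, sample, step=1.0):
    """Feed the stream chunk by chunk and let the model decide when to speak."""
    emitted = []  # (timestamp, text) pairs the model chose to emit
    t = 0.0
    while t < sample.duration:
        out = model.step(sample.chunk(t, t + step))  # None means stay silent
        if out is not None:
            emitted.append((t + step, out))
        t += step
    return emitted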

Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

Key Features

Omni-Modal

84% audio dependency with modality-isolation labels for fine-grained analysis.

Proactive Responding

Models decide when to speak, with multi-trigger support and over-trigger penalties (see the scoring sketch below).

Diverse Tasks

9 sub-tasks covering Alert, Monitoring, Grounding, Counting, Narration, and Prediction.
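As a rough illustration of the proactive scoring above, the sketch below matches predicted triggers to ground-truth triggers within a tolerance window and penalizes every unmatched (over-)trigger. The window size and penalty weight are assumptions, not OmniPro's official metric definition.

def proactive_score(pred, gt, window=5.0, penalty=0.1):
    """pred and gt are lists of (time_seconds, text) pairs.

    Each ground-truth trigger can be matched by at most one prediction
    within `window` seconds; every unmatched prediction counts as an
    over-trigger and subtracts `penalty` from the score.
    """
    matched, used = 0, set()
    for g_time, _ in gt:
        for i, (p_time, _) in enumerate(pred):
            if i not in used and abs(p_time - g_time) <= window:
                used.add(i)
                matched += 1
                break
    recall = matched / len(gt) if gt else 0.0
    over_triggers = len(pred) - len(used)
    return max(0.0, recall - penalty * over_triggers)

A full metric would also score the content of each matched response; this sketch covers only the timing side.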

Sample Visualization

Each sample pairs a video with a user instruction and its ground-truth proactive responses; a representative record is sketched below.
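The field names and values here are illustrative, not the released annotation format.

sample = {
    "video": "clips/example_0042.mp4",
    "sub_task": "Alert",
    "cognitive_level": "perception",
    "instruction": "Tell me as soon as the kettle starts boiling.",
    "triggers": [
        {
            "time": 73.5,                # seconds into the stream
            "response": "The kettle is boiling now.",
            "modality": "visual+sound",  # which signals reveal the event
        }
    ],
    "audio_dependent": True,             # modality-isolation label
}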

Dataset Statistics

Figure panels: Audio Dependency, Trigger Modality, Trigger Word Cloud, and Trigger Time.

Experimental Analysis

Figure panels: Temporal Position Ablation and Trigger Modality Radar.

Key Findings

  • Proprietary vs. open-source gap: Gemini-3-Flash achieves 40.4% accuracy, nearly double that of the best open-source model (22.6%), with the largest gap on reasoning-level tasks.
  • Divergent modality utilization: Audio and video provide complementary cues (A+V gains range from +2.4 to +11.1), yet models exhibit divergent patterns; some rely on audio, others on vision.
  • Long-horizon degradation: All models struggle to perceive events occurring late in videos, retaining only 37% of early-segment performance beyond 180 s.
  • Non-speech sound is the weakest link: All models perform worst on visual+sound triggers (15.3–22.3%), revealing a shared bottleneck.

BibTeX

@inproceedings{omnipro2026,
  title={OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding},
  author={Anonymous},
  year={2026}
}