OmniPro

A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Anonymous Authors

Figure 1. OmniPro: 9 sub-tasks × 3 cognitive levels × 6 capabilities, with 84% audio dependency.

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models.

We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis.

We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to decide autonomously when to respond as the input streams in.
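To make the protocol concrete, below is a minimal Python sketch of the two modes. Every interface it touches (model.query, model.step, sample.stream_until, sample.chunk, the Trigger fields) is a hypothetical stand-in, not the released evaluation harness.

from dataclasses import dataclass

@dataclass
class Trigger:
    time: float    # ground-truth trigger timestamp in seconds
    response: str  # ground-truth proactive response text

def probe_mode(model, sample, delta=2.0):
    """Query the model shortly before and after each ground-truth trigger.

    Before the trigger the model should stay silent; after it, the
    probed answer is scored against the ground-truth response.
    """
    results = []
    for trig in sample.triggers:
        pre = model.query(sample.stream_until(trig.time - delta), sample.instruction)
        post = model.query(sample.stream_until(trig.time + delta), sample.instruction)
        results.append((pre, post, trig.response))
    return results

def online_mode(model, sample, step=1.0):
    """Feed the stream chunk by chunk and let the model decide when to speak."""
    emitted = []  # (timestamp, text) pairs the model chose to emit
    t = 0.0
    while t < sample.duration:
        out = model.step(sample.chunk(t, t + step))  # None means stay silent
        if out is not None:
            emitted.append((t + step, out))
        t += step
    return emitted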

Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

Key Features

Omni-Modal

84% audio dependency with modality-isolation labels for fine-grained analysis.

Proactive Responding

Models decide when to speak, with multi-trigger support and over-trigger penalties (see the scoring sketch below).

Diverse Tasks

9 sub-tasks covering Alert, Monitoring, Grounding, Counting, Narration, and Prediction.
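As a rough illustration of the proactive scoring above, the sketch below matches predicted triggers to ground-truth triggers within a tolerance window and penalizes every unmatched (over-)trigger. The window size and penalty weight are assumptions, not OmniPro's official metric definition.

def proactive_score(pred, gt, window=5.0, penalty=0.1):
    """pred and gt are lists of (time_seconds, text) pairs.

    Each ground-truth trigger can be matched by at most one prediction
    within `window` seconds; every unmatched prediction counts as an
    over-trigger and subtracts `penalty` from the score.
    """
    matched, used = 0, set()
    for g_time, _ in gt:
        for i, (p_time, _) in enumerate(pred):
            if i not in used and abs(p_time - g_time) <= window:
                used.add(i)
                matched += 1
                break
    recall = matched / len(gt) if gt else 0.0
    over_triggers = len(pred) - len(used)
    return max(0.0, recall - penalty * over_triggers)

A full metric would also score the content of each matched response; this sketch covers only the timing side.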

Sample Visualization

Each sample pairs a video with a user instruction and its ground-truth proactive responses; a representative record is sketched below.
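The field names and values here are illustrative, not the released annotation format.

sample = {
    "video": "clips/example_0042.mp4",
    "sub_task": "Alert",
    "cognitive_level": "perception",
    "instruction": "Tell me as soon as the kettle starts boiling.",
    "triggers": [
        {
            "time": 73.5,                # seconds into the stream
            "response": "The kettle is boiling now.",
            "modality": "visual+sound",  # which signals reveal the event
        }
    ],
    "audio_dependent": True,             # modality-isolation label
}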

Dataset Statistics

Figure panels: Audio Dependency, Trigger Modality, Trigger Word Cloud, and Trigger Time.

Experimental Analysis

Figure panels: Temporal Position Ablation and Trigger Modality Radar.

Key Findings

  • Proprietary vs. open-source gap: Gemini-3-Flash achieves 40.4% accuracy, nearly double that of the best open-source model (22.6%), with the largest gap on reasoning-level tasks.
  • Divergent modality utilization: Audio and video provide complementary cues (A+V gains range from +2.4 to +11.1), yet models exhibit divergent patterns; some rely on audio, others on vision.
  • Long-horizon degradation: All models struggle to perceive events occurring late in videos, retaining only 37% of early-segment performance beyond 180 s.
  • Non-speech sound is the weakest link: All models perform worst on visual+sound triggers (15.3–22.3%), revealing a shared bottleneck.

BibTeX

@inproceedings{omnipro2026,
  title={OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding},
  author={Anonymous},
  year={2026}
}