Video Semantic Role Labeling Challenge
Part of the ActivityNet Challenges
CVPR 2021
Videos often depict complex real-life situations. While the vision community continues to make progress on building blocks like activity recognition, temporal localization, and tracking, the goal of this challenge is to strive for more complete video understanding than action labels and object detections alone can afford. Specifically, this challenge evaluates the ability of vision algorithms to understand complex, related events in a video. Each event is described by a verb corresponding to the most salient action in a video segment, together with its semantic roles. VidSRL involves 3 sub-tasks: (1) predicting a verb-sense describing the most salient action; (2) predicting the semantic roles for a given verb; and (3) predicting event relations given the verbs and semantic roles for two events.
While producing accurate and descriptive captions for videos would be a remarkable display of algorithmic video understanding, VidSRL provides a more complete and structured representation of related events, and a more thorough three-part evaluation.
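To make this structured representation concrete, the sketch below shows what predictions for two related events might look like. The verb senses, role labels, and relation label are illustrative assumptions in the spirit of PropBank-style annotation, not the dataset's exact schema.

```python
# Illustrative only: a VidSRL-style structured output for two events.
# Verb senses, role labels, and the relation label are hypothetical
# examples, not the dataset's exact schema.
event_a = {
    "verb_sense": "chase (follow rapidly)",    # sub-task 1: verb-sense
    "roles": {                                 # sub-task 2: semantic roles
        "Arg0 (chaser)": "a police officer",
        "Arg1 (chased)": "a man in a hoodie",
        "ArgM-Loc": "down a crowded street",
    },
}
event_b = {
    "verb_sense": "jump (leap over)",
    "roles": {
        "Arg0 (jumper)": "the man in a hoodie",
        "ArgM-Dir": "over a fence",
    },
}

# Sub-task 3: predict how the two events relate.
relation = {"pair": ("event_a", "event_b"), "label": "Enabled By"}
```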
The challenge is based on the VidSitu dataset, which consists of over 80 hours of richly annotated movie clips. VidSitu contains 29.2K 10-second videos annotated with events every 2 seconds; we refer to each 2-second segment as a clip. The train, validation, and test splits are described below. Note that the train and validation annotations are available for download, whereas the test annotations are only accessible through the online evaluation server.
| Split | Description |
|---|---|
| Train | 23626 videos, with each of the five 2-second clips within a video annotated with a verb and corresponding semantic roles |
| Validation | 1326 videos annotated in the same manner as the train set |
| Test-Vb | 1353 videos for evaluating verb-sense prediction. Each clip is annotated with 3 verbs to account for variability in the selection of the salient activity |
| Test-SRL | 1598 videos for evaluating semantic role prediction for a chosen verb. Each clip is annotated with 3 sets of semantic roles for the same verb to account for variability in referring expressions |
| Test-ER | 1317 videos for evaluating event-relation prediction. Each clip is annotated with its relation to the central event (the 4-6 second clip). We provide 3 annotations per clip to handle relation ambiguity |
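For reference, each 10-second video decomposes into five non-overlapping 2-second clips, with the central clip spanning the 4-6 second interval. A minimal sketch of this indexing (the 0-based clip indices are our convention, not necessarily the dataset's):

```python
def clip_bounds(video_length_sec=10, clip_length_sec=2):
    """Return (start, end) intervals in seconds for the non-overlapping clips."""
    return [(s, s + clip_length_sec)
            for s in range(0, video_length_sec, clip_length_sec)]

bounds = clip_bounds()          # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
central = bounds.index((4, 6))  # 2: the central clip used for event relations
```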
The VidSRL evaluation has three parts, each with its own leaderboard. For the purposes of the challenge, we will announce winners for each leaderboard. Each leaderboard requires submitting a single .pkl file of predictions for the corresponding test split. Detailed submission instructions, including the prediction formats required for evaluation, are available on the respective leaderboards. Below, a video refers to a 10-second segment consisting of the 5 non-overlapping 2-second clips, and the central clip refers to the segment spanning the 4-6 second interval within the video.
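As a starting point, here is a minimal sketch of serializing predictions into a single .pkl file with Python's pickle module. The record fields below are placeholders, since the exact format for each track is specified on its leaderboard.

```python
import pickle

# Placeholder records; the exact fields required for each track are given
# in the submission instructions on the corresponding leaderboard.
predictions = [
    {"clip_uid": "video_0001_clip_2",      # hypothetical identifier
     "pred_verb": "run (move fast)"},      # e.g. a Test-Vb style prediction
    # ... one record per test clip ...
]

with open("submission.pkl", "wb") as f:
    pickle.dump(predictions, f)
```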
| When | What |
|---|---|
| April 5 | Evaluation servers and leaderboards online |
| June 4 | Submission deadline for the challenge |
| June 19-25 | Winners announced during the ActivityNet Workshop @ CVPR 2021 |
Please reach out to Arka Sadhu (asadhu@usc.edu) for any queries.