VidSitu
Towards understanding situations in videos

VidSitu is a large-scale dataset of diverse 10-second movie clips depicting complex situations (a situation is a collection of related events). Events in each video are richly annotated at 2-second intervals with verbs, semantic roles, entity co-references, and event relations.



Dataset Statistics

Large-Scale

3K Movies, 29K 10-second Movie Clips, 145K Events
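The event count is consistent with the annotation granularity: a 10-second clip annotated at 2-second intervals yields 5 events. A minimal sketch of that arithmetic, assuming the rounded figures quoted above (the exact dataset counts may differ slightly):

```python
# Rounded figures taken from the statistics above; treat them as approximate.
CLIP_SECONDS = 10      # length of each movie clip
EVENT_SECONDS = 2      # annotation interval
NUM_CLIPS = 29_000     # "29K" clips, rounded

# Each clip is split into non-overlapping 2-second events.
events_per_clip = CLIP_SECONDS // EVENT_SECONDS
total_events = NUM_CLIPS * events_per_clip

print(events_per_clip, total_events)
```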

Diverse Videos

Videos in VidSitu are diverse: 224 verbs appear in at least 100 events, and 336 distinct nouns appear in at least 100 videos.

Complex Videos

Videos in VidSitu are complex. More than 80% of the videos have at least 4 unique verbs and 70% of the videos have at least 6 unique entities.

Rich Annotations

Each video in VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity co-references, and event relations.


VidSRL Task

Annotations in VidSitu support the Video Semantic Role Labeling (VidSRL) task, which consists of three subtasks:

Given a 2-second clip, predict a verb-sense describing the most salient action.

Given a verb sense, generate the semantic roles for each 2-second interval. Entities within and across time-steps should be co-referenced.

Given the verbs and semantic roles for two events, predict how the events are related to each other by classifying among 4 event-relation types.
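To make the structure behind these subtasks concrete, here is a minimal sketch of how one clip's annotations could be represented in memory. The `Event` class, its field names, the role names (`Arg0`, `Arg1`, `ArgM-location`), and the example verbs and phrases are all hypothetical and are not the dataset's actual file format. The sketch also assumes that co-referenced entities are marked by reusing the same phrase across events; consult the accompanying repository for the real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CLIP_SECONDS = 10   # clip length from the dataset description
EVENT_SECONDS = 2   # annotation interval from the dataset description


@dataclass
class Event:
    """One annotated 2-second interval (hypothetical layout)."""
    start: int
    end: int
    verb_sense: str                                      # what subtask 1 predicts
    roles: Dict[str, str] = field(default_factory=dict)  # what subtask 2 generates


def event_windows(clip_seconds: int = CLIP_SECONDS,
                  event_seconds: int = EVENT_SECONDS) -> List[Tuple[int, int]]:
    """Split a clip into consecutive (start, end) intervals, in seconds."""
    return [(s, s + event_seconds) for s in range(0, clip_seconds, event_seconds)]


def coreferent_entities(events: List[Event]) -> Dict[str, List[int]]:
    """Return entities that occur in more than one event.

    Assumption: two role values refer to the same entity exactly when
    their phrases are identical strings.
    """
    mentions: Dict[str, List[int]] = {}
    for idx, event in enumerate(events):
        # Use a set so an entity filling two roles in one event counts once.
        for entity in set(event.roles.values()):
            mentions.setdefault(entity, []).append(idx)
    return {entity: idxs for entity, idxs in mentions.items() if len(idxs) > 1}


# Illustrative (made-up) annotations for the first two events of a clip.
windows = event_windows()
events = [
    Event(*windows[0], verb_sense="run",
          roles={"Arg0": "man in blue shirt"}),
    Event(*windows[1], verb_sense="fall",
          roles={"Arg1": "man in blue shirt", "ArgM-location": "street"}),
]
shared = coreferent_entities(events)
print(windows)
print(shared)
```

The third subtask would take a pair of such `Event` objects and return one of the four event-relation labels. The sketch leaves those labels out because this page does not name them.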


Paper

If you find our work helpful, please cite the following paper:

@InProceedings{Sadhu_2021_CVPR,
  author    = {Sadhu, Arka and Gupta, Tanmay and Yatskar, Mark and Nevatia, Ram and Kembhavi, Aniruddha},
  title     = {Visual Semantic Role Labeling for Video Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2021}
}
        


Download VidSitu

Detailed instructions for downloading VidSitu are provided in the accompanying GitHub repository, which provides:

YouTube Video Ids

Links and download scripts to set up the dataset (train, validation and test sets)

Annotations

Annotations for train and validation sets

Video Features

Video features extracted using pretrained I3D and SlowFast models


Code

All code is available in the accompanying GitHub repository. We provide code for:

Data Setup

Download Scripts and Data Loaders

Baselines

Baseline Models with Config Files

Evaluation

Evaluation scripts and leaderboard instructions


Contact


Please reach out to Arka Sadhu (asadhu@usc.edu) for any queries.