Towards understanding situations in videos
VidSitu is a large-scale dataset containing diverse 10-second videos from movies, each depicting a complex situation (a collection of related events). Events in each video are richly annotated at 2-second intervals with verbs, semantic roles, entity co-references, and event relations.
3K Movies, 29K 10-second Movie Clips, 145K Events
Videos in VidSitu are diverse: 224 verbs appear in at least 100 events, and 336 distinct nouns appear in at least 100 videos.
Videos in VidSitu are complex: more than 80% of the videos have at least 4 unique verbs, and 70% have at least 6 unique entities.
Each video in VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity co-references, and event relations.
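To make the structure concrete, here is a hypothetical Python sketch of what one 2-second event annotation could look like; the field names, verb-sense format, and relation label are illustrative assumptions, not the released JSON schema:

# Hypothetical sketch of a single 2-second event annotation.
# Field names and label formats are assumptions for illustration;
# consult the released annotation files for the actual schema.
event_annotation = {
    "clip_id": "v_abc123_seg_10_20",       # assumed clip identifier
    "event_interval_sec": [2, 4],          # 2-second window within the clip
    "verb": "deflect (block, parry)",      # sense-disambiguated verb
    "semantic_roles": {
        "Arg0 (deflector)": "woman with shield",
        "Arg1 (thing deflected)": "arrow",
        "Scene": "battlefield",
    },
    # Entities are co-referenced across events: the same phrase denotes
    # the same entity wherever it appears in the clip's events.
    "relation_to_central_event": "Enabled By",  # illustrative relation label
}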
Annotations in VidSitu support the Video Semantic Role Labeling (VidSRL) task, which consists of three subtasks: verb prediction, semantic role prediction, and event relation prediction.
If you find our work helpful, please cite the following paper:
@InProceedings{Sadhu_2021_CVPR,
  author    = {Sadhu, Arka and Gupta, Tanmay and Yatskar, Mark and Nevatia, Ram and Kembhavi, Aniruddha},
  title     = {Visual Semantic Role Labeling for Video Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2021}
}
Detailed instructions for downloading VidSitu are provided in the accompanying GitHub repo, which includes:
Links and download scripts to set up the dataset (train, validation and test sets)
Annotations for train and validation sets
Video features extracted using pretrained I3D and SlowFast models (a loading sketch follows this list)
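As a rough sketch of consuming these precomputed features (the directory layout, file name, and array shape below are assumptions, not the repo's documented format):

import numpy as np

# Hypothetical example: load precomputed SlowFast features for one clip.
# Path, naming, and shape are assumptions; the repo's download scripts
# document the actual layout.
feats = np.load("vidsitu_feats/slowfast/v_abc123_seg_10_20.npy")
print(feats.shape)  # e.g. (5, 2304): one feature vector per 2-second event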
All code is available in the accompanying GitHub repo. We provide code for:
Download Scripts and Data Loaders (a minimal loader sketch follows this list)
Baseline Models with Config Files
Evaluation Scripts and Leaderboard Instructions
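For orientation only, here is a minimal sketch of a PyTorch-style loader pairing clip features with annotations; the file layout and field names are assumed, and the repo's actual data loaders differ in detail:

import json
import numpy as np
from torch.utils.data import Dataset

class VidSituClips(Dataset):
    """Minimal hypothetical loader; not the repo's actual implementation."""

    def __init__(self, ann_file, feat_dir):
        # ann_file: JSON list of per-clip annotations (assumed layout)
        with open(ann_file) as f:
            self.anns = json.load(f)
        self.feat_dir = feat_dir

    def __len__(self):
        return len(self.anns)

    def __getitem__(self, idx):
        ann = self.anns[idx]
        # One precomputed feature array per clip (assumed naming scheme)
        feats = np.load(f"{self.feat_dir}/{ann['clip_id']}.npy")
        return feats, ann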
Please reach out to Arka Sadhu (asadhu@usc.edu) with any queries.