Towards understanding situations in videos
VidSitu is a large-scale dataset containing diverse 10-second videos from movies, each depicting a complex situation (a collection of related events). Events in each video are richly annotated at 2-second intervals with verbs, semantic roles, entity co-references, and event relations.
3K Movies, 29K 10-second Movie Clips, 145K Events
Videos in VidSitu are diverse: 224 verbs appear in at least 100 events, and 336 distinct nouns appear in at least 100 videos.
Videos in VidSitu are complex: more than 80% of the videos have at least 4 unique verbs, and 70% have at least 6 unique entities.
Each video in VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity co-references, and event relations.
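To make the structure concrete, here is a hypothetical Python sketch of what one 2-second event annotation could look like; the field names, verb-sense format, and relation label are illustrative assumptions, not the released JSON schema:

# Hypothetical sketch of a single 2-second event annotation.
# Field names and label formats are assumptions for illustration;
# consult the released annotation files for the actual schema.
event_annotation = {
    "clip_id": "v_abc123_seg_10_20",       # assumed clip identifier
    "event_interval_sec": [2, 4],          # 2-second window within the clip
    "verb": "deflect (block, parry)",      # sense-disambiguated verb
    "semantic_roles": {
        "Arg0 (deflector)": "woman with shield",
        "Arg1 (thing deflected)": "arrow",
        "Scene": "battlefield",
    },
    # Entities are co-referenced across events: the same phrase denotes
    # the same entity wherever it appears in the clip's events.
    "relation_to_central_event": "Enabled By",  # illustrative relation label
}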
Annotations in VidSitu support the Video Semantic Role Labeling (VidSRL) task, which consists of three subtasks: verb prediction, semantic role prediction, and event relation prediction.
If you find our work helpful, please cite the following paper:
@InProceedings{Sadhu_2021_CVPR,
  author    = {Sadhu, Arka and Gupta, Tanmay and Yatskar, Mark and Nevatia, Ram and Kembhavi, Aniruddha},
  title     = {Visual Semantic Role Labeling for Video Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2021}
}
Detailed instructions for downloading VidSitu are provided in the accompanying GitHub repo, which includes:
Links and download scripts to set up the dataset (train, validation and test sets)
Annotations for train and validation sets
Video features extracted using pretrained I3D and SlowFast models (a loading sketch follows this list)
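As a rough sketch of consuming these precomputed features (the directory layout, file name, and array shape below are assumptions, not the repo's documented format):

import numpy as np

# Hypothetical example: load precomputed SlowFast features for one clip.
# Path, naming, and shape are assumptions; the repo's download scripts
# document the actual layout.
feats = np.load("vidsitu_feats/slowfast/v_abc123_seg_10_20.npy")
print(feats.shape)  # e.g. (5, 2304): one feature vector per 2-second event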
All code is available in the accompanying GitHub repo. We provide code for:
Download Scripts and Data Loaders (a minimal loader sketch follows this list)
Baseline Models with Config Files
Evaluation Scripts and Leaderboard Instructions
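For orientation only, here is a minimal sketch of a PyTorch-style loader pairing clip features with annotations; the file layout and field names are assumed, and the repo's actual data loaders differ in detail:

import json
import numpy as np
from torch.utils.data import Dataset

class VidSituClips(Dataset):
    """Minimal hypothetical loader; not the repo's actual implementation."""

    def __init__(self, ann_file, feat_dir):
        # ann_file: JSON list of per-clip annotations (assumed layout)
        with open(ann_file) as f:
            self.anns = json.load(f)
        self.feat_dir = feat_dir

    def __len__(self):
        return len(self.anns)

    def __getitem__(self, idx):
        ann = self.anns[idx]
        # One precomputed feature array per clip (assumed naming scheme)
        feats = np.load(f"{self.feat_dir}/{ann['clip_id']}.npy")
        return feats, ann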
Please reach out to Arka Sadhu (asadhu@usc.edu) with any queries.