What is it about?

An introduction to self-supervised learning for videos and a summary of the current research landscape. The main areas are: 1) pretext task learning, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. Beyond vision-only self-supervised learning for video, we also cover multimodal approaches that use additional modalities such as audio and text. More information is available at our GitHub project link: https://bit.ly/3Oimc7Q
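Of the four areas, contrastive learning is perhaps the simplest to illustrate in code. The sketch below is a minimal NumPy version of an InfoNCE-style contrastive objective, where two embeddings of the same video (e.g., two augmented clips) form a positive pair and all other videos in the batch serve as negatives. The batch size, embedding dimension, and temperature here are illustrative assumptions, not values from the survey.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    z1, z2: (N, D) arrays of L2-normalized embeddings of two views
    (e.g., two augmented clips) of the same N videos. Row i of z1 and
    row i of z2 form the positive pair; all other rows are negatives.
    """
    # Cosine similarity between every view-1 / view-2 pair, scaled by
    # a temperature (0.1 here is an illustrative choice).
    logits = z1 @ z2.T / temperature                                   # (N, N)
    # Log-softmax over each row; positives lie on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximize the diagonal (positive-pair) log-probability.
    return -np.mean(np.diag(log_probs))

# Toy usage with random unit embeddings standing in for clip features.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)

matched_loss = info_nce_loss(z, z)                      # aligned positive pairs
shuffled_loss = info_nce_loss(z, np.roll(z, 1, axis=0)) # misaligned pairs
```

Aligned pairs yield a lower loss than shuffled ones, which is exactly the signal a contrastive video model trains on without any labels.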

Why is it important?

Self-supervised learning reduces the need for dense annotation during training and yields generalizable foundation models that can be applied to downstream tasks or exhibit emergent behaviors.

Perspectives

I hope this article serves as a gentle introduction to self-supervised learning for videos for researchers new to the field, and that it helps guide future research.

Madeline Chantry
University of Central Florida

Read the Original

This page is a summary of: Self-Supervised Learning for Videos: A Survey, ACM Computing Surveys, July 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3577925.
