What is it about?

Existing methods for video answer localization (VAL) in instructional videos focus predominantly on coarse-grained themes, failing to address the detailed step content and inter-step relations that are crucial for effective comprehension. Current datasets, such as MedVidQA, primarily capture video content but lack annotations for step structure and inter-step relations. To address this gap, we introduce InstructStep, a newly proposed VAL task designed specifically for Instructional Video Step Content and Relation Localization; it extends the original VAL task to step-centric content and relations. Accordingly, we create an InstructStep Dataset with fine-grained step-content and step-relation QA pairs. To tackle the challenges of this task, we propose a Step-Centric Multi-Level Knowledge Distillation (SC-MLKD) approach with two key components: (1) a two-stage training strategy that generates step-specific summaries in the first stage and introduces a step branch in the second stage to learn step relations; the step branch is used only during training, so it adds no inference time. (2) Multi-level knowledge distillation, comprising feature, step, and response distillation across the visual, text, and step branches, to capture fine-grained, step-centric features.
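To make the multi-level distillation idea concrete, the combined objective could be sketched as below. This is a minimal illustration, not the paper's exact formulation: the branch names (`feat`, `step`, `resp`), the loss choices (MSE for intermediate features, temperature-scaled KL for step and response logits), and the per-level weights are all illustrative assumptions.

```python
import numpy as np

def softmax(z, temp=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mse(a, b):
    # Feature-level distillation: match intermediate representations.
    return float(np.mean((a - b) ** 2))

def kl(p, q, eps=1e-8):
    # KL(p || q), averaged over the batch; used for logit-level distillation.
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)))

def multilevel_kd_loss(teacher, student, temp=2.0, weights=(1.0, 1.0, 1.0)):
    """Combine feature-, step-, and response-level distillation terms.

    `teacher` / `student` are dicts with (hypothetical) keys:
      'feat' - intermediate features, 'step' - step-branch logits,
      'resp' - answer-localization logits.
    """
    w_f, w_s, w_r = weights
    l_feat = mse(teacher['feat'], student['feat'])
    l_step = kl(softmax(teacher['step'], temp), softmax(student['step'], temp))
    l_resp = kl(softmax(teacher['resp'], temp), softmax(student['resp'], temp))
    return w_f * l_feat + w_s * l_step + w_r * l_resp
```

In this sketch, a perfectly matched student incurs zero loss, and any mismatch at the feature, step, or response level contributes a positive penalty; since the step branch only appears in the loss, dropping it at inference time leaves the student's forward pass unchanged.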

Why is it important?

Localization in instructional videos can improve the efficiency with which people acquire knowledge, and the most crucial element of an instructional video lies in its instructional steps. Our method achieves improvements in precisely this aspect.

Perspectives

Writing this article has been a very pleasant experience. As my first research endeavor, it taught me a great deal, from writing techniques to experimental methods, and improved both my writing and experimental abilities.

Wangsheng He
Beijing Jiaotong University

Read the Original

This page is a summary of: InstructStep: Fine-Grained Localization of Step Content and Relation in Instructional Video, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746027.3754999.
