What is it about?
Existing methods for video answer localization (VAL) in instructional videos focus predominantly on coarse-grained themes, failing to address the detailed step content and inter-step relations crucial for effective comprehension. Current datasets, such as MedVidQA, primarily capture video content but lack annotations for step structure and inter-step relations. To address this gap, we introduce InstructStep, a newly proposed VAL task specifically designed for Instructional Video Step Content and Relation Localization. It extends the original VAL task to step-centric content and relations. Accordingly, we create an InstructStep dataset with fine-grained step content and relation QA pairs. To tackle the challenges of this task, we propose a Step-Centric Multi-Level Knowledge Distillation (SC-MLKD) approach comprising: (1) a two-stage training strategy that generates step-specific summaries in the first stage and introduces a step branch in the second stage to learn step relations; this branch is used only during training, so it adds no inference time; and (2) multi-level knowledge distillation, spanning feature, step, and response distillation across the visual, text, and step branches, to capture fine-grained, step-centric features.
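The multi-level distillation above combines losses at the feature, step, and response levels. The following is only a minimal sketch of such a combined training objective; the function names, loss weights, and the dict layout for the branches are illustrative assumptions, not the paper's actual implementation:

```python
import math

def mse(a, b):
    """Feature- or step-level distillation: mean squared error between vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_div(p, q):
    """Response-level distillation: KL divergence between output distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_level_kd_loss(teacher, student, w_feat=1.0, w_step=1.0, w_resp=1.0):
    """Combine feature, step, and response distillation into one training loss.

    `teacher` and `student` are dicts with 'feat', 'step', and 'resp' entries,
    one per distillation level (a hypothetical layout; the paper's branches
    and weightings may differ).
    """
    return (w_feat * mse(student["feat"], teacher["feat"])
            + w_step * mse(student["step"], teacher["step"])
            + w_resp * kl_div(teacher["resp"], student["resp"]))
```

Since all three terms are non-negative, the loss is zero only when the student matches the teacher at every level, which is the usual motivation for summing per-level distillation terms.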
Why is it important?
Localization in instructional videos helps people acquire knowledge more efficiently, and the most crucial element of an instructional video is its instructional steps. Our method improves localization precisely at this step level.
Perspectives
Writing this article has been a very pleasant experience. As my first research endeavor, it taught me a great deal, from writing skills to experimental methods, and improved both my writing and my experimental abilities.
Wangsheng He
Beijing Jiaotong University
This page is a summary of: InstructStep: Fine-Grained Localization of Step Content and Relation in Instructional Video, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746027.3754999.