All Stories

  1. Cross-modal Token Selection for Video Understanding