VideoChat-Flash

Hierarchical Compression for Long-Context Video Modeling


¹OpenGVLab, Shanghai AI Laboratory  ²Nanjing University  ³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
*Equal contribution  †Corresponding authors

Abstract

Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form content with implicit memorization. Despite recent advances, handling extremely long videos remains challenging due to the difficulty of maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation, and a practical context modeling system, VideoChat-Flash, tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress the long video context from the clip level to the video level, significantly reducing computation while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded "Needle-In-A-video-Haystack" (NIAH) benchmark for evaluating context capacities. In extensive experiments, VideoChat-Flash shows leading performance on both mainstream long and short video benchmarks at the 7B model scale. It is the first open-source model to reach 99.1% accuracy on NIAH over 10,000 frames.
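
For intuition, the sketch below shows one way a two-stage, clip-to-video token compression could look. The pooling compressor, the clip length, and the 0.9 similarity threshold are illustrative stand-ins for HiCo's actual learned operators, which the abstract does not spell out:

```python
import torch
import torch.nn.functional as F

def hico_compress(frame_tokens: torch.Tensor,
                  clip_len: int = 4,
                  tokens_per_frame: int = 16) -> torch.Tensor:
    """Hierarchical compression sketch (assumes num_frames >= clip_len).

    frame_tokens: (num_frames, tokens_in, dim) visual tokens from a ViT.
    Stage 1 (clip level): pool each frame's tokens down to a small budget,
    a stand-in for the paper's clip-level compressor.
    Stage 2 (video level): drop tokens that are near-duplicates of the
    corresponding token in the previous clip, a stand-in for the paper's
    video-level redundancy reduction.
    """
    n, t_in, d = frame_tokens.shape
    # Stage 1: adaptive-pool each frame's token sequence to the budget.
    pooled = F.adaptive_avg_pool1d(
        frame_tokens.transpose(1, 2), tokens_per_frame
    ).transpose(1, 2)                      # (n, tokens_per_frame, d)

    # Group frames into clips and flatten tokens within each clip.
    n_clips = n // clip_len
    clips = pooled[: n_clips * clip_len].reshape(
        n_clips, clip_len * tokens_per_frame, d
    )

    # Stage 2: keep a token only if it differs enough from its temporal
    # counterpart in the previous clip (0.9 is a hypothetical threshold).
    kept = [clips[0]]
    for i in range(1, n_clips):
        sim = F.cosine_similarity(clips[i], clips[i - 1], dim=-1)
        kept.append(clips[i][sim < 0.9])
    return torch.cat(kept, dim=0)          # (num_kept_tokens, d)

# Example: 64 frames of 196 ViT tokens each, 1024-dim features.
compressed = hico_compress(torch.randn(64, 196, 1024))
```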

New Model: VideoChat-Flash

🚀State-of-the-art performance in short and long video understanding, with temporal localization capabilities comparable to expert models.

🔭Supports ultra-long video inputs, processing videos up to three hours long and achieving a groundbreaking needle-in-a-haystack evaluation accuracy of 99.1% on 10,000 frames.

⚡Highly efficient model architecture with exceptional inference speed, encoding each video frame into just 16 tokens and running 5–10 times faster than previous models (see the budget sketch below).
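
A back-of-the-envelope calculation makes the token budget concrete. The 1 fps sampling rate and the 196-token-per-frame baseline are assumptions chosen for comparison; only the 16 tokens per frame figure comes from the project page:

```python
# Token budget for a 3-hour video, assuming 1 frame sampled per second.
BASELINE_TOKENS_PER_FRAME = 196   # e.g., a 14x14 ViT patch grid (assumed)
FLASH_TOKENS_PER_FRAME = 16       # stated by the project page

frames = 3 * 3600                 # 10,800 frames at 1 fps
baseline = frames * BASELINE_TOKENS_PER_FRAME
flash = frames * FLASH_TOKENS_PER_FRAME
print(f"{frames} frames: {baseline:,} -> {flash:,} tokens "
      f"({baseline / flash:.1f}x fewer)")
# 10800 frames: 2,116,800 -> 172,800 tokens (12.2x fewer)
```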

New Benchmarks: Multi-hop NIAH

We propose a new benchmark, termed multi-hop needle in a video haystack. It rigorously evaluates a model's ability to process extremely long contexts by requiring it to locate a sequence of interconnected indicating images embedded within a long video.
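The sketch below illustrates how such a haystack might be assembled: a chain of indicating images is spliced into a long frame sequence in temporal order, and the model must follow the chain hop by hop. The insertion scheme and the `build_multihop_haystack` helper are hypothetical, not the benchmark's published recipe:

```python
import random

def build_multihop_haystack(video_frames, needle_images, seed=0):
    """Insert an ordered chain of needle images into a frame sequence.

    Each indicating image is assumed to carry a cue (e.g., rendered text)
    pointing to the next one, so answering requires following every hop.
    Cue rendering and question generation are omitted here.
    """
    rng = random.Random(seed)
    frames = list(video_frames)
    # One insertion slot per needle, sorted so hop i precedes hop i+1.
    slots = sorted(rng.sample(range(len(frames) + 1), len(needle_images)))
    for offset, (slot, needle) in enumerate(zip(slots, needle_images)):
        frames.insert(slot + offset, needle)  # offset corrects for growth
    positions = [s + i for i, s in enumerate(slots)]
    return frames, positions

# Example: hide a 3-hop chain in a 10,000-frame haystack.
haystack, where = build_multihop_haystack(range(10_000), ["n1", "n2", "n3"])
```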

Citation

@article{li2024videochat,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}