InternVideo: Video Foundation Models for Multimodal Understanding

InternVideo Team
OpenGVLab, Shanghai AI Laboratory


Video Foundation Model Family

  • 2022/12/06
    InternVideo: General Video Foundation Models via Generative and Discriminative Learning
    InternVideo efficiently explores masked video modeling and video-language contrastive learning as complementary pretraining objectives, and selectively coordinates the video representations of these two frameworks in a learnable manner to boost a broad range of video applications. It obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively; a minimal sketch of this coordination follows this list.

  • 2023/07/13
    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
    A large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. InternVid contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.

  • 2024/03/22
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
    A new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue (over 60 video and audio tasks). Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
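
The learnable coordination of the two pretraining streams can be pictured with a short sketch. The snippet below is a minimal PyTorch illustration under assumed shapes, not the released implementation: the CoordinatedFusion module and its zero-initialized gate are hypothetical, and it simply mixes cross-attended masked-video-modeling features into the video-text contrastive branch.

# Minimal sketch (PyTorch); module and variable names are illustrative, not InternVideo's code.
import torch
import torch.nn as nn

class CoordinatedFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # cross-attention lets the contrastive stream query the masked-modeling stream
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # a learnable, zero-initialized gate controls how much attended signal is mixed in
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, feats_contrastive, feats_masked):
        # feats_*: (batch, tokens, dim) token features from the two (frozen) backbones
        attended, _ = self.cross_attn(feats_contrastive, feats_masked, feats_masked)
        # training starts from the pure contrastive features and learns how much to add
        return feats_contrastive + torch.tanh(self.gate) * attended

# usage with random stand-in features
fusion = CoordinatedFusion(dim=768)
v_clip = torch.randn(2, 196, 768)   # video-text contrastive branch tokens
v_mae  = torch.randn(2, 196, 768)   # masked-video-modeling branch tokens
fused  = fusion(v_clip, v_mae)      # (2, 196, 768) coordinated representation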

Multimodal Large Language Model

  • 2023/03/10
    VideoChat: Chat-centric Video Understanding
    The first attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference; a minimal sketch of such an interface follows this list.

  • 2023/11/28
    MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
    A comprehensive multi-modal video understanding benchmark, MVBench, covering 20 challenging video tasks that cannot be effectively solved with a single frame. We further develop a robust video MLLM baseline, VideoChat2, via progressive multi-modal training with diverse instruction-tuning data.

  • 2024/12/31
    VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
    An efficient video MLLM that achieves leading performance on mainstream long- and short-video benchmarks at the 2B and 7B model scales. It is the first open-source model to reach 99.1% accuracy on the needle-in-a-haystack (NIAH) test over 10,000 frames.
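
The learnable neural interface mentioned for VideoChat can be sketched in a similar spirit. The snippet below is a minimal PyTorch illustration under assumed dimensions, not the released architecture: the VideoLLMInterface module is hypothetical and simply compresses frozen video-encoder tokens with a small set of learnable queries before projecting them into the LLM's embedding space as a prefix for the text tokens.

# Minimal sketch (PyTorch); module and dimensions are illustrative, not VideoChat's code.
import torch
import torch.nn as nn

class VideoLLMInterface(nn.Module):
    def __init__(self, video_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # a small set of learnable queries summarizes the video tokens
        self.queries = nn.Parameter(torch.randn(num_queries, video_dim) * 0.02)
        self.attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        # linear projection into the LLM's token-embedding space
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_tokens):
        # video_tokens: (batch, num_video_tokens, video_dim) from a frozen video encoder
        b = video_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        summary, _ = self.attn(q, video_tokens, video_tokens)
        return self.proj(summary)  # (batch, num_queries, llm_dim) "video prefix" tokens

# usage with random stand-in features
interface = VideoLLMInterface()
video_tokens = torch.randn(2, 256, 1024)  # e.g. patch/frame tokens from a frozen encoder
prefix = interface(video_tokens)          # prepended to the LLM's text embeddings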

Citation


@article{wang2024internvideo2,
title={InternVideo2: Scaling video foundation models for multimodal video understanding},
author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
journal={arXiv preprint arXiv:2403.15377},
year={2024}
}
            
@article{wang2022internvideo,
title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
journal={arXiv preprint arXiv:2212.03191},
year={2022}
}
                

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca and Vicuna.