
This AI Paper from China Introduces Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Feb 22, 2024

There has been a recent uptick in the development of general-purpose multimodal AI assistants capable of following visual and written instructions, thanks to the remarkable success of Large Language Models (LLMs). By combining the strong reasoning capabilities of LLMs with the information found in huge alignment corpora (such as image-text pairs), these systems show immense potential for understanding and creating visual content. Despite their success with image-text data, however, their adaptation to the video modality remains underexplored. Video is a more natural fit with human visual perception than still images because of its dynamic nature, so learning effectively from video is essential for improving AI's ability to understand the real world.

A new study by Peking University and Kuaishou Technology addresses the shortcomings of video-language pretraining by investigating an efficient video representation that decomposes each video into keyframes and temporal motions. The work is motivated by inherent properties of video data: most videos consist of multiple shots, and the frames within each shot are highly redundant, so feeding every frame into the generative pretraining of LLMs as tokens is unnecessary.
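This kind of decomposition can be prototyped with standard tooling. Below is a minimal, illustrative sketch (not the paper's pipeline): the first frame of each fixed-length clip stands in for the keyframe, and dense optical flow stands in for the extracted motion vectors; the clip length and flow parameters are arbitrary assumptions.

```python
import cv2
import numpy as np

def decompose_clip(path, clip_len=24):
    """Split a video into (keyframe, motion) pairs.

    Illustrative stand-in for the decomposition described above: the first
    frame of each fixed-length clip serves as the keyframe, and dense optical
    flow between consecutive frames approximates the extracted motion vectors.
    """
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        ok, frame = cap.read()
    cap.release()

    pairs = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]
        keyframe = clip[0]
        motions = [
            cv2.calcOpticalFlowFarneback(clip[i], clip[i + 1], None,
                                         0.5, 3, 15, 3, 5, 1.2, 0)
            for i in range(len(clip) - 1)
        ]  # each item: an H x W x 2 displacement field
        pairs.append((keyframe, np.stack(motions)))
    return pairs
```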

Keyframes carry the main visual semantics, while motion vectors capture how the content of each keyframe evolves over time; this observation motivates decomposing every video into these two alternating components. Such a decomposed representation has multiple advantages: 

  1. Pairing motion vectors with a single keyframe is more efficient for large-scale pretraining than processing consecutive video frames with 3D encoders, because far fewer tokens are needed to express a video's temporal dynamics (a back-of-the-envelope comparison follows this list). 
  2. Instead of learning temporal modeling from scratch, the model can reuse the visual knowledge already acquired by a pretrained image-only LLM. 
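
To make the first advantage concrete, here is a rough token-count comparison; every number below is a hypothetical assumption, not a figure from the paper:

```python
# Back-of-the-envelope comparison (hypothetical numbers): tokenizing every
# frame into patch tokens vs. one keyframe plus a compact motion code per clip.
FRAMES_PER_CLIP = 24
TOKENS_PER_FRAME = 256        # e.g. a 16x16 patch grid per frame
MOTION_TOKENS_PER_CLIP = 64   # assumed size of the discretized motion code

dense = FRAMES_PER_CLIP * TOKENS_PER_FRAME               # 6144 tokens
decomposed = TOKENS_PER_FRAME + MOTION_TOKENS_PER_CLIP   # 320 tokens
print(f"dense: {dense}, decomposed: {decomposed}, "
      f"reduction: {dense / decomposed:.1f}x")
```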

For these reasons, the team has introduced Video-LaVIT (Language-VIsion Transformer), a novel multimodal pretraining method that equips LLMs to understand and produce video material within a cohesive framework. Video-LaVIT has two main components for handling the video modality: a tokenizer and a detokenizer. The video tokenizer converts continuous video data into a sequence of compact discrete tokens, akin to a foreign language: keyframes are processed by an established image tokenizer, while spatiotemporal motions are transformed into a corresponding discrete representation. Capturing the time-varying contextual information in the extracted motion vectors greatly improves LLMs' capacity to understand complex video actions. The video detokenizer maps the discretized video tokens produced by the LLM back to the original continuous pixel space. 
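A rough sketch of what the tokenizer side could look like is given below. The motion encoder, codebook size, token counts, and the `image_tokenizer` interface are all assumptions standing in for the components described above, not the paper's actual implementation; the detokenizer would invert this mapping back to the pixel space.

```python
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Sketch of the two-part tokenizer described above (all names assumed).

    `image_tokenizer` stands in for a pretrained image tokenizer that
    discretizes keyframes; a small motion encoder plus vector quantization
    discretizes the motion vectors. Both produce integer token ids.
    """
    def __init__(self, image_tokenizer, motion_dim=2, codebook_size=1024, d=256):
        super().__init__()
        self.image_tokenizer = image_tokenizer
        self.motion_encoder = nn.Sequential(
            nn.Conv3d(motion_dim, d, kernel_size=4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool3d((4, 4, 4)), nn.Flatten(2),
        )
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, keyframe, motion):
        # keyframe: (B, 3, H, W); motion: (B, 2, T, H, W)
        key_tokens = self.image_tokenizer(keyframe)        # (B, Nk) int ids
        z = self.motion_encoder(motion).transpose(1, 2)    # (B, 64, d)
        # nearest-codebook-entry lookup = the vector quantization step
        cb = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        motion_tokens = torch.cdist(z, cb).argmin(dim=-1)  # (B, 64) int ids
        return key_tokens, motion_tokens

# Usage sketch with a dummy image tokenizer (placeholder for a real one):
if __name__ == "__main__":
    dummy_img_tok = lambda imgs: torch.randint(0, 1024, (imgs.size(0), 256))
    tok = VideoTokenizer(dummy_img_tok)
    key = torch.randn(2, 3, 256, 256)
    mot = torch.randn(2, 2, 24, 64, 64)
    k_ids, m_ids = tok(key, mot)
    print(k_ids.shape, m_ids.shape)  # torch.Size([2, 256]) torch.Size([2, 64])
```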

Because each video becomes an alternating sequence of discrete visual and motion tokens, it can be optimized during training with the same next-token prediction objective across modalities. This unified autoregressive pretraining helps the model grasp the sequential relationships between video clips, which matters because video is inherently a time series. 
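The shared objective can be sketched as follows, assuming a causal LM `lm` that maps token ids to vocabulary logits and hypothetical special tokens (`BOV`, `BOM`) marking where the keyframe and motion spans begin:

```python
import torch
import torch.nn.functional as F

BOV, BOM = 0, 1  # assumed special ids delimiting visual and motion spans

def build_sequence(key_tokens, motion_tokens):
    """Interleave keyframe and motion tokens of a clip into one token stream."""
    b = key_tokens.size(0)
    bov = key_tokens.new_full((b, 1), BOV)
    bom = key_tokens.new_full((b, 1), BOM)
    return torch.cat([bov, key_tokens, bom, motion_tokens], dim=1)

def next_token_loss(lm, key_tokens, motion_tokens):
    seq = build_sequence(key_tokens, motion_tokens)   # (B, L) int ids
    logits = lm(seq)                                  # (B, L, V)
    # predict token t+1 from tokens <= t; the same loss covers both modalities
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))

# usage with a dummy "LM" that only matches shapes (stand-in for a causal LM)
vocab = 16384
dummy_lm = torch.nn.Sequential(torch.nn.Embedding(vocab, 128),
                               torch.nn.Linear(128, vocab))
key = torch.randint(2, vocab, (2, 256))
mot = torch.randint(2, vocab, (2, 64))
print(next_token_loss(dummy_lm, key, mot))
```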

As a multimodal generalist, Video-LaVIT shows promise on both understanding and generation tasks even without additional tuning. Extensive quantitative and qualitative evaluations show that Video-LaVIT outperforms competing methods across a range of tasks, including text-to-video and image-to-video generation as well as video and image understanding. 


Check out the Paper. All credit for this research goes to the researchers of this project.
