Artificial intelligence is now applied across nearly every sphere of life, and video generation is no exception. Video generation, however, has long been a challenging problem in AI research, particularly when it comes to producing smooth and coherent large motions. Current leading models struggle to generate such motions without exhibiting noticeable artifacts.
To address this, researchers at Google AI have introduced VideoPoet, a large language model capable of a wide range of video generation tasks. What sets it apart from other video generation systems is that it integrates multiple capabilities within a single model rather than relying on separately trained components. This unified approach allows VideoPoet to handle text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio generation.
VideoPoet relies on tokenizers to represent each modality as discrete tokens: the MAGVIT V2 tokenizer handles video and images, while SoundStream handles audio. The model learns over these token streams across text, image, audio, and video modalities, and the tokenizer decoders convert the tokens the model produces for a given context back into viewable video or audible sound. Training a single autoregressive language model across text, image, audio, and video is the key component of VideoPoet's design.
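To make that tokenize-then-decode flow concrete, here is a minimal, purely illustrative Python sketch of how such a pipeline fits together. It is not the official VideoPoet code or API; every class and function below is a hypothetical placeholder standing in for the real tokenizers and transformer.

```python
# Illustrative sketch only -- not the official VideoPoet implementation.
# All names below are hypothetical placeholders for the real components
# (text tokenizer, MAGVIT V2 video tokenizer, autoregressive transformer).

from dataclasses import dataclass
from typing import List

@dataclass
class TokenStream:
    modality: str       # "text", "image", "video", or "audio"
    tokens: List[int]   # discrete token ids produced by a tokenizer

def tokenize_text(prompt: str) -> TokenStream:
    # Stand-in for a text tokenizer; real systems use a learned vocabulary.
    return TokenStream("text", [ord(c) % 256 for c in prompt])

def autoregressive_generate(prefix: List[int], num_new_tokens: int) -> List[int]:
    # Stand-in for the transformer: predicts one token at a time,
    # each prediction conditioned on everything generated so far.
    generated = list(prefix)
    for _ in range(num_new_tokens):
        next_token = (sum(generated[-4:]) * 31 + 7) % 8192  # dummy prediction
        generated.append(next_token)
    return generated[len(prefix):]

def decode_video(tokens: List[int]) -> str:
    # Stand-in for the MAGVIT V2 decoder, which maps video tokens to pixels.
    return f"<video decoded from {len(tokens)} tokens>"

# Text prompt -> text tokens -> predicted video tokens -> decoded video.
prompt = tokenize_text("a dog surfing a wave at sunset")
video_tokens = autoregressive_generate(prompt.tokens, num_new_tokens=1024)
print(decode_video(video_tokens))
```

The essential point is that the language model only ever sees and emits discrete token ids; the tokenizer decoders are what turn those ids back into pixels and waveforms.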
For text-to-video, users provide a text prompt as input and VideoPoet generates a corresponding video sequence. The output videos can be of variable length and display a broad range of motions and styles, depending on the prompt's content. For video stylization, the model first predicts optical flow and depth information from the source video and then incorporates additional input text into the generation process. Remarkably, VideoPoet can also generate audio, integrating video and audio generation within a single model. By default, VideoPoet produces videos in portrait orientation, catering to the demand for short-form content.
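As a rough illustration of that stylization flow, the sketch below shows structure signals (optical flow and depth) being estimated from a source clip and combined with a text prompt as conditioning, so the original motion is preserved while the text guides the new appearance. The function names are hypothetical; this is a sketch of the idea, not VideoPoet's implementation.

```python
# Illustrative sketch only -- not VideoPoet's actual stylization code.
# All names are hypothetical placeholders.

from typing import Dict, List

def estimate_structure(video_frames: List[list]) -> Dict[str, list]:
    # Stand-in for off-the-shelf optical-flow and depth estimators.
    return {"optical_flow": [0.0] * len(video_frames),
            "depth": [1.0] * len(video_frames)}

def stylize(video_frames: List[list], style_prompt: str) -> str:
    structure = estimate_structure(video_frames)
    # A model would condition on the structure signals plus the text prompt
    # and regenerate appearance while keeping the original motion.
    return (f"<stylized {len(video_frames)} frames as '{style_prompt}' "
            f"using {list(structure)} conditioning>")

print(stylize([[0], [0], [0]], "watercolor painting of a city at dusk"))
```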
To showcase these capabilities, the researchers produced a short movie composed of clips generated by VideoPoet, demonstrating its ability to extend videos while preserving object appearance over several iterations. Users can also provide an image as input along with a prompt describing how they want the picture animated, and VideoPoet produces a visually appealing animation. VideoPoet also includes built-in audio generation: after generating two-second video clips, the model predicts the accompanying audio on its own, enabling end-to-end video and audio creation from a single model.
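The video-extension behavior described above is, at heart, an autoregressive loop: the tail of what has already been generated becomes the conditioning context for the next segment. The sketch below illustrates that loop with hypothetical helper functions; it is not VideoPoet's actual code.

```python
# Illustrative sketch only -- not VideoPoet's actual code. The helpers are
# hypothetical placeholders; the point is the loop: keep feeding the tail of
# the generated video back in as the context for the next chunk.

from typing import List

def generate_next_chunk(context: List[int], chunk_len: int) -> List[int]:
    # Stand-in for one round of autoregressive decoding conditioned on the
    # most recent tokens of the video generated so far.
    tail = context[-chunk_len:]
    return [(t * 17 + i + 1) % 8192 for i, t in enumerate(tail)]

def extend_video(initial_tokens: List[int], chunk_len: int, num_chunks: int) -> List[int]:
    video = list(initial_tokens)
    for _ in range(num_chunks):
        video.extend(generate_next_chunk(video, chunk_len))
    return video

# Start from the tokens of a short clip and extend it several times.
clip_tokens = list(range(256))
extended = extend_video(clip_tokens, chunk_len=256, num_chunks=4)
print(f"extended from {len(clip_tokens)} to {len(extended)} tokens")
```

Because each new chunk is conditioned on the most recent one, object appearance can carry over from step to step, which is what allows the extended clips to stay coherent.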
The researchers evaluated VideoPoet on various benchmarks and found that it delivers highly competitive quality across them, emphasizing its potential in the video generation field. The model's ability to produce captivating, high-quality motion suggests a promising future for LLMs in video generation.
In conclusion, VideoPoet's ability to handle tasks such as text-to-video, video-to-audio, and video captioning within a single model is significant. These capabilities establish VideoPoet as a leader in the continuing development of AI-driven video generation and open up new opportunities in multimedia content production. It represents a significant leap forward in video generation technology, offering vast potential for multimedia artists and researchers.
Check out the Google Research article. All credit for this research goes to the researchers of this project.