Google Research has proposed an AI that can generate videos using image and audio clips.
Google Research has proposed an artificial intelligence that can generate short videos from a single still photograph. The system can currently produce a realistic video of a person speaking, moving, and gesturing, all created from one still photo. The technology, called VLOGGER, is built on advanced machine-learning models that can create lifelike videos of people. The proposal opens new possibilities for applications of this technology and could increase the productivity of video creators.
As described in the paper “VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis,” the AI model takes a photo of a person and an audio clip as input and produces a video that matches the audio, so that the person appears to speak those exact words with corresponding expressions. The generated videos still have glitches and are not perfect, but the ability to animate still images marks a significant development.
How Does This Technology Work?
The system was announced on the Google Research blog. According to the paper, the main goal is to create photorealistic videos of humans whose length matches the duration of the input audio clip. VLOGGER is built as a staged pipeline based on stochastic diffusion models, which can capture the one-to-many mapping from speech to video. The first stage takes the audio as input and generates motion controls for gaze, facial expressions, and body poses, producing this data for the full target video length.
The second stage is a temporal image-to-image translation model that uses these predicted body controls to generate the corresponding video frames. To preserve identity, the system is conditioned on a reference image of the person.
In practice, the user uploads a picture of the person and an audio clip. From these inputs, the system predicts facial expressions and gestures and generates a video clip that looks real.
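The two-stage flow described above can be sketched in code. This is a minimal illustrative sketch, not VLOGGER's actual implementation: the function names, array shapes, and placeholder logic are assumptions, and in the real system each stage is a diffusion network rather than the simple deterministic stand-ins used here.

```python
import numpy as np

# Hypothetical sketch of the two-stage pipeline. Stage 1 turns audio into
# per-frame motion controls; stage 2 renders frames from those controls
# plus a reference photo. All names and shapes are illustrative.

def audio_to_motion(audio_samples, sample_rate=16000, fps=25, control_dim=64):
    """Stage 1: map the audio clip to one motion-control vector per video
    frame (gaze, facial expression, body pose). The number of frames is
    tied to the audio duration, so video length matches the audio."""
    n_frames = int(len(audio_samples) / sample_rate * fps)
    # Placeholder: pool the audio into n_frames windows instead of
    # sampling controls from a diffusion model.
    windows = np.array_split(audio_samples, n_frames)
    return np.stack([np.full(control_dim, w.mean()) for w in windows])

def render_frames(reference_image, controls):
    """Stage 2: temporal image-to-image translation conditioned on the
    reference photo and the per-frame controls. Placeholder: repeat the
    reference image once per control vector."""
    return np.stack([reference_image for _ in controls])

sample_rate = 16000
audio = np.random.randn(sample_rate * 2)        # 2-second audio clip
reference = np.zeros((64, 64, 3))               # single still photo
controls = audio_to_motion(audio, sample_rate)  # 50 frames: 25 fps x 2 s
video = render_frames(reference, controls)      # (50, 64, 64, 3)
```

The key property the sketch preserves is that the output video length is derived from the audio duration, as the paper describes.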
The system can also edit an existing video by changing its expressions, motion, and gestures. It does this by inpainting only the parts of the image that need to change, leaving the remaining pixels untouched, which helps the edited video look realistic. VLOGGER can also generate videos in different languages: it can translate the original audio and adjust the lip movements, expressions, and gestures to match the tone and dialect of the target language. The underlying diffusion model has recently shown remarkable results in generating highly realistic images from text descriptions; applying it to video generation, together with a new dataset, gives the system the ability to produce highly realistic video from images.
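The inpainting-based editing idea can be illustrated with a short sketch. This is a hypothetical toy example of mask-based blending, assuming a per-pixel mask marks the region to re-synthesize; it is not VLOGGER's actual editing code.

```python
import numpy as np

# Hypothetical illustration of inpainting-based editing: only masked
# pixels (e.g. the mouth region) are replaced by generated content,
# so the rest of the frame stays pixel-identical to the original.

def inpaint_edit(original_frame, generated_frame, mask):
    """Take generated pixels where mask == 1, original pixels elsewhere.
    The mask is broadcast across the color channels."""
    keep_generated = mask[..., None].astype(bool)
    return np.where(keep_generated, generated_frame, original_frame)

original = np.zeros((4, 4, 3))    # stand-in for an original video frame
generated = np.ones((4, 4, 3))    # stand-in for newly generated content
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                # region whose expression should change

edited = inpaint_edit(original, generated, mask)
```

Because unmasked pixels are copied directly from the original frame, the background and unedited regions remain exactly as they were, which is why the edited video keeps its realistic look.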
That dataset, called MENTOR, contains over 800,000 diverse identities and 2,200 hours of video, far more than was previously available. It helps VLOGGER learn to generate videos of people with varied ethnicities, ages, clothing, poses, and surroundings.
The authors of the research paper that proposed VLOGGER said: “In contrast to previous work, our method does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate.”
The proposal comes from a Google Research team led by Enric Corona. Alongside the excitement, there are concerns that the technology could escalate the problem of deepfake videos and fake news if it falls into the wrong hands. Google, for its part, considers VLOGGER a step toward “embodied conversational agents” that can engage with humans using natural speech, gestures, and eye contact. As the paper states, “VLOGGER can be used as a stand-alone solution for presentations, education, narration, low-bandwidth online communication, and as an interface for text-only human-computer interaction.” The same capabilities that raise the risk of deepfakes could also boost the productivity of creators who want to make good content, reducing the time needed to produce a video and letting them cover more material. But misused, it could create a ruckus on the Internet by spreading misinformation.