Baidu's Hallo3 - Voice-driven portrait image animation

Baidu recently released Hallo3.

Effect Demonstration

Method Demonstration

Given a reference image, an audio sequence, and a text prompt, this method generates a dynamic avatar in frontal or other viewing angles while maintaining identity consistency over long durations. It also handles dynamic foreground and background elements while preserving temporal consistency and high visual fidelity.

Method Overview

This method takes a reference image, an audio sequence, and a text prompt as input and generates video with temporal consistency and high visual fidelity. Hallo3 uses a Causal 3D VAE, T5, and Wav2Vec to process visual, textual, and audio features, respectively. The identity reference network extracts identity features from the input reference image and text prompt, enabling controllable animation while keeping the subject's appearance consistent. The audio encoder produces motion information synchronized with lip movements, while the facial encoder extracts facial features to keep facial expressions consistent. The 3D full attention module and the audio attention module combine identity and motion data in the denoising network to generate high-fidelity, temporally consistent, and controllable dynamic video.

Audio Conditioning Strategy

Self-Attention
Adaptive Normalization
Cross-Attention
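
Cross-attention, the last of the three strategies above, can be pictured as visual tokens querying the audio features. The following is a minimal sketch in plain Python, not Hallo3's actual implementation: the function names (`cross_attention`, `softmax`, `dot`) are hypothetical, the learned query/key/value projections are omitted (treated as identity), and a single attention head with a residual connection is assumed.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual_tokens, audio_tokens):
    """Each visual token (query) attends over all audio tokens (keys/values).

    Assumption: queries, keys, and values share one dimension and the
    learned projection matrices are left out to keep the sketch short.
    """
    d = len(audio_tokens[0])
    out = []
    for q in visual_tokens:
        # Scaled dot-product score between this query and every audio token.
        scores = [dot(q, k) / math.sqrt(d) for k in audio_tokens]
        weights = softmax(scores)
        # Weighted sum of the audio tokens, one output dimension at a time.
        attended = [
            sum(w * v[i] for w, v in zip(weights, audio_tokens))
            for i in range(d)
        ]
        # Residual connection: audio information is added to the visual token.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]             # two toy visual tokens
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy audio tokens
    print(cross_attention(visual, audio))
```

In this form the audio sequence can have any length relative to the visual sequence, since the audio tokens only appear as keys and values.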

Identity Conditioning Strategy

FE refers to the Facial Encoder. The cross-attention strategy performs best:

Facial Attention
Facial Adaptive Normalization
Identity Reference Network
Facial Attention and Identity Reference Network
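
To make the adaptive normalization variant above more concrete, here is a small sketch under stated assumptions. The article does not give the formula, so this follows the common adaptive layer norm pattern, in which a conditioning vector (here a toy facial embedding) predicts a per-channel scale and shift applied after normalization. The names `adaptive_norm`, `w_scale`, and `w_shift` are hypothetical and are not taken from the Hallo3 code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_norm(hidden, cond, w_scale, w_shift):
    """Normalize `hidden`, then modulate it with a scale and shift that
    are linear functions of the conditioning vector `cond`.

    Assumption: y = LN(x) * (1 + scale) + shift, with one row of
    `w_scale` / `w_shift` per hidden channel.
    """
    scale = [dot(row, cond) for row in w_scale]
    shift = [dot(row, cond) for row in w_shift]
    normed = layer_norm(hidden)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]   # toy hidden features of one token
    face = [0.3, -0.7]              # toy facial embedding
    w_scale = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
    w_shift = [[0.0, 0.2], [0.2, 0.0], [0.0, 0.0], [0.2, 0.2]]
    print(adaptive_norm(hidden, face, w_scale, w_shift))
```

Unlike attention, this conditioning applies one global modulation per token rather than letting the token select among several conditioning features.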

Scene