Baidu's Hallo3 - Voice-driven portrait image animation

Baidu recently released Hallo3, the latest model in its Hallo series.

Effect Demonstration

Method Demonstration

Given a reference image, an audio sequence, and a text prompt, the method generates an animated portrait from frontal or varied viewpoints while maintaining identity consistency over long durations. It also incorporates dynamic foreground and background elements, preserving temporal consistency and high visual fidelity.

Method Overview

Taking a reference image, an audio sequence, and a text prompt as input, Hallo3 generates video output with temporal consistency and high visual fidelity. It uses a Causal 3D VAE, T5, and Wav2Vec to encode the visual, textual, and audio features respectively. The identity reference network extracts identity features from the reference image and text prompt, enabling controllable animation while keeping the subject's appearance consistent. The audio encoder produces motion information synchronized with lip movements, while the facial encoder extracts facial features to keep expressions consistent. A 3D full-attention module and an audio attention module fuse the identity and motion information inside the denoising network, producing high-fidelity, temporally consistent, and controllable dynamic videos.
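The data flow above can be sketched schematically. This is a minimal, illustrative sketch in numpy, not Hallo3's actual implementation: all dimensions, the single-head attention, and the `denoise_step` function are assumptions standing in for the learned encoders and transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative only, not Hallo3's real sizes).
T, D = 16, 64          # video latent tokens, hidden size
A, TXT, ID = 32, 8, 4  # audio tokens, text tokens, identity tokens

def cross_attention(q, kv):
    """Single-head scaled dot-product attention without learned weights."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Stand-ins for the encoder outputs described above:
video_latents = rng.standard_normal((T, D))    # Causal 3D VAE
text_feats    = rng.standard_normal((TXT, D))  # T5
audio_feats   = rng.standard_normal((A, D))    # Wav2Vec
id_feats      = rng.standard_normal((ID, D))   # identity reference network

def denoise_step(x):
    # 3D full attention: latents attend jointly over identity and text tokens.
    ctx = np.concatenate([id_feats, text_feats, x], axis=0)
    x = cross_attention(x, ctx)
    # Audio attention module: inject lip-synchronized motion information.
    x = x + cross_attention(x, audio_feats)
    return x

out = denoise_step(video_latents)
print(out.shape)  # (16, 64)
```

The key point the sketch captures is that identity/text conditioning and audio conditioning enter the denoising network through separate attention paths before being combined.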

Audio Conditioning Strategy

  1. Self-Attention
  2. Adaptive Normalization
  3. Cross-Attention
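The three audio conditioning strategies above differ in where the audio signal enters the network. A minimal sketch of each, assuming toy dimensions and untrained single-head attention (the function names and the `tanh`/linear maps in the adaptive-normalization branch are illustrative stand-ins for learned projections):

```python
import numpy as np

rng = np.random.default_rng(1)
T, A, D = 12, 24, 32                 # latent tokens, audio tokens, hidden size
x = rng.standard_normal((T, D))      # video latent tokens
audio = rng.standard_normal((A, D))  # Wav2Vec audio features

def attend(q, kv):
    s = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ kv

# 1. Self-attention: audio tokens are concatenated into the token sequence,
#    and everything attends to everything; latent rows are kept.
def self_attn_cond(x, audio):
    seq = np.concatenate([x, audio], axis=0)
    return attend(seq, seq)[: len(x)]

# 2. Adaptive normalization: pooled audio predicts per-channel scale/shift
#    applied after normalizing the latents (stand-in for learned MLPs).
def ada_norm_cond(x, audio):
    pooled = audio.mean(axis=0)
    scale, shift = np.tanh(pooled), 0.1 * pooled
    normed = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    return normed * (1 + scale) + shift

# 3. Cross-attention: latents directly query the audio tokens.
def cross_attn_cond(x, audio):
    return x + attend(x, audio)

for cond in (self_attn_cond, ada_norm_cond, cross_attn_cond):
    assert cond(x, audio).shape == x.shape
```

Cross-attention keeps the audio sequence as a separate key/value bank, which lets each latent token pull from the audio frames most relevant to it rather than receiving a single pooled signal.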

Identity Conditioning Strategy

FE refers to the Facial Encoder. The cross-attention strategy performs best:
  1. Facial Attention
  2. Facial Adaptive Normalization
  3. Identity Reference Network
  4. Facial Attention and Identity Reference Network
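Strategy 4 above combines two mechanisms. A minimal sketch, assuming toy dimensions and untrained single-head attention; the function `condition_identity` and all shapes are illustrative, not Hallo3's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
T, R, F, D = 10, 4, 6, 32             # video, reference, facial tokens; hidden size
video = rng.standard_normal((T, D))   # video latent tokens
id_ref = rng.standard_normal((R, D))  # identity reference network features
face = rng.standard_normal((F, D))    # facial encoder (FE) features

def attend(q, kv):
    s = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ kv

# Facial attention + identity reference network: reference tokens join the
# attention context so each denoising step can re-read the subject's
# appearance, while a separate facial-attention pass injects expression
# features from the FE.
def condition_identity(video, id_ref, face):
    x = attend(video, np.concatenate([id_ref, video], axis=0))
    return x + attend(x, face)

out = condition_identity(video, id_ref, face)
assert out.shape == video.shape
```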

Demonstrated Scenarios

  • Dynamic Scene
  • Diverse Head Poses
  • Portraits with Headwear
  • Portraits Interacting with Objects