
ByteDance's latest open-source lip-sync video generation: LatentSync

LatentSync is an end-to-end lip-sync framework based on an audio-conditioned latent diffusion model. Unlike previous diffusion-based lip-sync methods that rely on pixel-space diffusion or two-stage generation, LatentSync requires no intermediate motion representation. The framework leverages the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. The team also found that diffusion-based lip-sync methods suffer from poor temporal consistency because the diffusion process differs across frames. To address this, the LatentSync team proposed Temporal REPresentation Alignment (TREPA), which improves temporal consistency while maintaining lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the real ones. A minimal sketch of such a loss is shown below.
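For illustration, here is a minimal sketch of what a TREPA-style loss could look like, assuming a PyTorch setup where a frozen `video_encoder` stands in for the large-scale self-supervised video model; the names, shapes, and distance choice are placeholders, not the exact LatentSync implementation.

```python
# Sketch of a TREPA-style loss: align temporal representations of generated
# and real frame sequences extracted by a frozen self-supervised video encoder.
# `video_encoder` is a placeholder for a large-scale self-supervised video model.
import torch
import torch.nn.functional as F


def trepa_loss(video_encoder: torch.nn.Module,
               generated_frames: torch.Tensor,   # (B, T, C, H, W)
               real_frames: torch.Tensor) -> torch.Tensor:
    """Distance between temporal representations of generated vs. real clips."""
    with torch.no_grad():
        target_repr = video_encoder(real_frames)      # target features, no gradient
    generated_repr = video_encoder(generated_frames)  # gradients flow back to the generator
    # Align the two temporal representations (here via a simple L2 distance).
    return F.mse_loss(generated_repr, target_repr)
```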

Demo

[Demo videos: Original vs. Generated]

Technical Framework

LatentSync uses Whisper to convert mel spectrograms into audio embeddings, which are integrated into the U-Net via cross-attention layers. The reference frames and masked frames are concatenated with the noisy latents along the channel dimension to form the U-Net input. During training, the team uses a one-step method to obtain an estimated clean latent from the predicted noise, then decodes it to get the estimated clean frames. The TREPA, LPIPS, and SyncNet losses are applied in pixel space.
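As a rough sketch of this training step, the snippet below shows the channel-wise input assembly, the one-step clean-latent estimate for an epsilon-predicting diffusion model, and the pixel-space loss sum. The `unet`/`vae` interfaces, tensor shapes, and loss callables are assumptions for illustration, not the actual LatentSync code.

```python
# Sketch of the U-Net input assembly, one-step clean-latent estimate,
# and pixel-space losses described above (illustrative only).
import torch


def build_unet_input(noisy_latent, masked_latent, reference_latent):
    # Reference frame, masked frame, and noisy latents are concatenated
    # along the channel dimension: (B, C, H, W) -> (B, 3C, H, W).
    return torch.cat([noisy_latent, masked_latent, reference_latent], dim=1)


def one_step_clean_estimate(noisy_latent, predicted_noise, alpha_bar_t):
    # For an epsilon-predicting model, x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps, so
    #   x0_hat = (x_t - sqrt(1 - a_t) * eps_hat) / sqrt(a_t).
    # `alpha_bar_t` is assumed to be a tensor broadcastable to `noisy_latent`.
    return (noisy_latent - (1 - alpha_bar_t).sqrt() * predicted_noise) / alpha_bar_t.sqrt()


def pixel_space_losses(unet, vae, noisy_latent, masked_latent, reference_latent,
                       audio_embeds, alpha_bar_t, real_frames,
                       trepa_fn, lpips_fn, syncnet_fn):
    unet_in = build_unet_input(noisy_latent, masked_latent, reference_latent)
    # Audio embeddings (from Whisper) condition the U-Net via cross-attention.
    eps_hat = unet(unet_in, encoder_hidden_states=audio_embeds)
    x0_hat = one_step_clean_estimate(noisy_latent, eps_hat, alpha_bar_t)
    frames_hat = vae.decode(x0_hat)  # estimated clean frames in pixel space
    return (trepa_fn(frames_hat, real_frames)
            + lpips_fn(frames_hat, real_frames)
            + syncnet_fn(frames_hat, audio_embeds))
```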

Links 🔗

  • Code - https://github.com/bytedance/LatentSync 

  • HuggingFace Space - https://huggingface.co/spaces/fffiloni/LatentSync 

  • Model - https://huggingface.co/chunyu-li/LatentSync

  • Colab - https://colab.research.google.com/drive/1HoXxM6MIFXw3NPDM2URIxToGBbhLQXMQ

  • Replicate - https://replicate.com/bytedance/latentsync