Animate Anyone 2: High-Fidelity Human Image Animation and Environmental Interaction

Existing character image animation methods generate animations from motion signals extracted from a source video. However, these methods fail to reasonably associate characters with their environment. To address this, Alibaba launched Animate Anyone 2, which aims to generate character animations that interact with their environment.

In addition to extracting motion signals from the source video, Animate Anyone 2 also captures environmental representations and uses them as conditional input. The environment is defined as the region excluding the character, and the model generates a character that remains consistent with this environment. The team proposed a shape-agnostic masking strategy that more effectively characterizes the relationship between the character and the environment. To improve the fidelity of object interactions, the team used an object guider to extract features of objects that interact with the character and injected these features into the denoising process through spatial blending. Alibaba also introduced a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superiority of this method.
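To make the masking idea concrete, here is a minimal sketch of one plausible shape-agnostic masking scheme; the function name, the grid-based coarsening, and the block size are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def shape_agnostic_mask(char_mask: np.ndarray, block: int = 32) -> np.ndarray:
    """Coarsen an exact character segmentation mask so it no longer
    reveals fine body-shape details (illustrative sketch only).

    char_mask: (H, W) binary array, 1 where the character is.
    block:     side length of the coarse grid cells.
    """
    h, w = char_mask.shape
    coarse = np.zeros_like(char_mask)
    for y in range(0, h, block):
        for x in range(0, w, block):
            # Mark the whole cell if any character pixel falls inside it,
            # erasing silhouette information at the mask boundary.
            if char_mask[y:y + block, x:x + block].any():
                coarse[y:y + block, x:x + block] = 1
    return coarse

# The environment condition would then be the frame with the coarse
# region removed, e.g.: env = frame * (1 - coarse[..., None])
```

One way to read the strategy is that coarsening tells the model roughly where the character sits without dictating its exact silhouette, so the character-environment boundary must be learned rather than copied from the mask.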

Motion

Animate Anyone 2 differs from previous methods, which generate character animations from motion signals alone. It additionally extracts environmental representations from the driving video, allowing the animated character to interact with its environment.

Method

The figure above shows the framework of Animate Anyone 2. Environmental information is captured from the source video: the environment is defined as the region excluding the character and is fed to the model as input, enabling end-to-end learning of character-environment fusion. To preserve object interactions, the team also injects features of objects that interact with the character; these features are extracted by a lightweight object guider and injected into the denoising process through spatial blending. To handle more diverse motions, the team proposed a pose modulation method that better represents the spatial relationships among body limbs.
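As a rough illustration of how a lightweight object guider and spatial blending could fit together (the module architecture, tensor shapes, and the masked residual blend below are assumptions for illustration, not the released model):

```python
import torch
import torch.nn as nn

class ObjectGuider(nn.Module):
    """Lightweight conv encoder that turns masked object pixels into
    feature maps (illustrative sketch, not the paper's architecture)."""
    def __init__(self, in_ch: int = 3, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, object_rgb: torch.Tensor) -> torch.Tensor:
        return self.encoder(object_rgb)  # (B, dim, H/4, W/4)

def spatial_blend(latent: torch.Tensor,
                  obj_feat: torch.Tensor,
                  obj_mask: torch.Tensor) -> torch.Tensor:
    """Inject object features into a denoising latent only where the
    interacting object is located (a simple masked residual blend)."""
    # Resize mask and features to the latent's spatial resolution.
    obj_mask = nn.functional.interpolate(obj_mask, size=latent.shape[-2:])
    obj_feat = nn.functional.interpolate(obj_feat, size=latent.shape[-2:])
    return latent + obj_mask * obj_feat

# Usage sketch with dummy tensors:
guider = ObjectGuider()
obj_rgb = torch.randn(1, 3, 256, 256)       # masked object pixels
mask = torch.rand(1, 1, 256, 256).round()   # binary object mask
latent = torch.randn(1, 64, 32, 32)         # UNet latent at some block
blended = spatial_blend(latent, guider(obj_rgb), mask)
```

Restricting the blend to the object mask keeps the injected features local to the interaction region, which matches the article's description of injecting object features into the denoising process.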

Results

  • environmental interaction
Animate Anyone 2 generates characters whose interactions are consistent with the surrounding context, integrating them seamlessly into scenes and producing convincing character-object interactions.

  • dynamic motion
    Animate Anyone 2 has shown strong capabilities in handling complex motions while ensuring character consistency and maintaining reasonable interactions with the environment.

  • character interaction
Animate Anyone 2 can generate interactions between multiple characters, keeping their movements plausible and consistent with the surrounding environment.

Comparison

  • Comparison with Viggle
Viggle can replace the character in a video with a provided character image, a use case similar to that of Animate Anyone 2. The comparison was run against the latest Viggle V3. Viggle's outputs show rough character-environment fusion, unnatural motion, and missing character-environment interactions; Animate Anyone 2's results demonstrate higher fidelity.

  • Comparison with MIMO
MIMO, the method most closely related to Animate Anyone 2's task setting, decomposes characters, backgrounds, and occlusions in a video and recombines these elements to generate character videos. Animate Anyone 2 outperforms MIMO in robustness and detail preservation.