SkyReels V1 from Kunlun Tech is the latest open-source Text-to-Video and Image-to-Video release; the model and inference code deliver an unprecedented video generation experience.
It includes three components:

SkyReels-V1 Text-to-Video model
SkyReels-V1 Image-to-Video model
SkyReels-A1, a portrait image animation framework
Introduction to SkyReels V1
SkyReels V1 is currently the most advanced open-source human-centric video foundation model in the industry. Fine-tuned from HunyuanVideo on tens of millions of film-grade data entries, it achieves three core breakthroughs:
Open-source leadership: The text-to-video model delivers the best performance in the open-source field, comparable to the top closed-source models Kling and Hailuo.
Extraordinary facial animation: Accurately captures 33 facial expressions and over 400 natural motion combinations, faithfully conveying subtle human emotions.
Cinematic lighting aesthetics: Training data originates from Hollywood-grade film material, with every frame showcasing exceptional composition, character positioning, and camera sense.
🔑 Key Features
1. Self-developed data cleaning and labeling system
Facial expression classification: Accurately distinguishes 33 types of facial expressions.
Spatial role perception: Uses 3D human body reconstruction to understand spatial relationships among multiple people.
Action recognition: Defines more than 400 action semantic units to precisely analyze human behavior.
Scene understanding: Cross-modal analysis of the relationships between clothing, scenes, and plots.
2. Multi-stage Pre-training for Image-to-Video Conversion
Phase one, domain-transfer pre-training: Transfer the text-to-video model to the human-centric video domain using tens of millions of film and television samples.
Phase two, image-to-video pre-training: Convert the phase-one model into an image-to-video model and continue pre-training.
Phase three, high-quality fine-tuning: Fine-tune the model on a carefully selected high-quality subset to ensure superior output.
📊 Outstanding benchmark performance

In VBench comparisons, SkyReels V1 scores best among open-source text-to-video (T2V) models with an overall score of 82.43, surpassing VideoCrafter-2.0 with VEnhancer (82.24) and CogVideoX1.5-5B (82.17), and it excels particularly in dynamic presentation and multi-object handling.
🚀 SkyReelsInfer Inference Framework
SkyReelsInfer is an efficient video inference framework that significantly accelerates the generation of high-quality videos.
Multi-GPU support: Context parallelism, CFG parallelism, and VAE parallelism.
Consumer-grade GPU deployment: Model quantization and parameter offloading significantly reduce VRAM requirements.
Excellent inference performance: Inference is 58.3% faster than HunyuanVideo XDiT, setting a new industry benchmark.
Excellent usability: Built on the open-source Diffusers framework, it provides a non-intrusive parallel implementation that is simple and user-friendly (see the sketch below).
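As a rough illustration of the Diffusers-based usability mentioned above, the sketch below shows how such a pipeline is typically loaded with parameter offloading on a single consumer GPU. The repository ID, frame count, and sampling settings are assumptions for illustration; the official SkyReelsInfer scripts expose the multi-GPU parallel options.

```python
# Minimal sketch: loading a Diffusers-based text-to-video pipeline with
# parameter offloading for a single consumer GPU. The repository ID,
# frame count, and sampling settings below are assumptions; the official
# SkyReelsInfer scripts expose the multi-GPU parallel options.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "Skywork/SkyReels-V1-Hunyuan-T2V",   # assumed repository ID
    torch_dtype=torch.bfloat16,
)
# Parameter offloading keeps only the active sub-module on the GPU,
# trading some speed for a much smaller VRAM footprint.
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A close-up portrait, cinematic lighting, a subtle smile",
    num_frames=97,                        # assumed clip length
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "skyreels_sample.mp4", fps=24)
```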
Trial
https://www.skyreels.ai/
Step-Video-T2V was developed by StepFun (Step Stars, StepFun Technology Co., Ltd.), a Chinese AI company founded in April 2023 by Dr. Jiang Daxin, a former Microsoft Global Vice President. The company focuses on large-model research and development and is dedicated to advancing artificial intelligence toward Artificial General Intelligence (AGI).
Introduction
Step-Video-T2V is an industry-leading Text-to-Video generation model with up to 30 billion parameters, capable of generating videos of up to 204 frames. To improve training and inference efficiency, we developed a deeply compressed Video-VAE (Variational Autoencoder) that achieves 16x16 spatial and 8x temporal compression. In the final training stage, Step-Video-T2V applies Direct Preference Optimization (DPO) to further enhance visual quality. On the purpose-built video generation benchmark Step-Video-T2V-Eval, it performs at the forefront of text-to-video generation compared with other open-source and commercial models.
Model Overview
The core of Step-Video-T2V consists of three main components:
1. Video Variational Autoencoder (Video-VAE)
We have designed an efficient video VAE model that can achieve 16x16 spatial compression and 8x temporal compression, greatly improving the speed of training and inference while preserving video reconstruction quality. This compression method is also well-suited for the compact representation form used by diffusion models.
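For intuition, the sketch below works through what 16x16 spatial and 8x temporal compression mean for tensor sizes. The 204-frame count comes from the description above; the 544x992 resolution is an assumed example.

```python
# Illustrative arithmetic only: how 16x16 spatial and 8x temporal
# compression shrink a video before the diffusion transformer sees it.
# The 204-frame count comes from the text; the 544x992 resolution is an
# assumed example, and exact rounding depends on the VAE implementation.
frames, height, width = 204, 544, 992

latent_t = frames // 8      # 8x temporal compression
latent_h = height // 16     # 16x spatial compression
latent_w = width // 16

print(latent_t, latent_h, latent_w)        # 25 34 62
print(latent_t * latent_h * latent_w)      # ~52,700 latent positions
print(frames * height * width)             # ~110,000,000 pixels per channel
```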
2. Diffusion Transformer (DiT) with 3D Full Attention Mechanism
Step-Video-T2V is built on the DiT architecture, consisting of 48 layers, each containing 48 attention heads with a dimension of 128 per head. The model integrates time-step conditions using AdaLN-Single and ensures training stability through QK-Norm in the self-attention mechanism. Additionally, we adopted 3D Rotary Position Embedding (3D RoPE), which effectively handles video sequences of varying lengths and resolutions.
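As a back-of-the-envelope check, the configuration above (48 layers, 48 heads of dimension 128) roughly accounts for the stated 30B parameters. The 4x FFN expansion and a cross-attention block per layer are assumptions here; embeddings and norms are ignored.

```python
# Back-of-the-envelope parameter count for the DiT configuration above.
# The 4x FFN expansion and one cross-attention block per layer are
# assumptions; embeddings, AdaLN, and norms are ignored.
layers, heads, head_dim = 48, 48, 128
d = heads * head_dim                      # model width = 6144

self_attn = 4 * d * d                     # Q, K, V, O projections
cross_attn = 4 * d * d                    # text cross-attention (assumed)
ffn = 2 * d * (4 * d)                     # two linear layers, 4x expansion

total = layers * (self_attn + cross_attn + ffn)
print(f"{total / 1e9:.1f}B")              # ~29.0B, consistent with ~30B overall
```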
3. Video Direct Preference Optimization (Video-DPO)
We introduced a direct preference optimization method based on video, fine-tuning the model with human feedback data to ensure that the generated video content better aligns with human intuition and aesthetic standards. DPO plays a key role in reducing visual artifacts and enhancing the continuity and realism of videos.
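The exact objective is not spelled out here; the sketch below shows a generic Diffusion-DPO-style preference loss over paired preferred/rejected videos, which is one common way such human feedback is applied. It is an illustration, not necessarily Step-Video-T2V's actual Video-DPO formulation.

```python
# Sketch of a Diffusion-DPO-style preference loss over a preferred ("win")
# and rejected ("lose") video pair; an illustration of the idea, not
# necessarily Step-Video-T2V's exact Video-DPO objective.
import torch.nn.functional as F

def video_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=2000.0):
    """Each argument is a per-sample denoising error ||eps - eps_hat||^2,
    computed by the trainable model (err_*) or a frozen reference model
    (err_*_ref) on the preferred (w) or rejected (l) video."""
    # Reward lowering the error on preferred videos relative to the
    # reference; penalize doing so on rejected videos.
    diff = (err_w - err_w_ref) - (err_l - err_l_ref)
    return -F.logsigmoid(-beta * diff).mean()
```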
Through these design innovations, Step-Video-T2V sets a new benchmark for text-driven video generation and advances the development and application of video content generation technology.
Trial
https://yuewen.cn/videos
Wan2.1 is the next-generation video generation model developed by Alibaba's Tongyi Wanxiang team.
Key features of Wan2.1
Leading performance: Wan2.1 outperforms existing open-source models and leading commercial solutions across multiple mainstream benchmarks, demonstrating excellent generation capability and stability.
Supports consumer-grade GPUs: The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs. On an RTX 4090 it can generate a 5-second 480P video in about 4 minutes without additional optimizations such as quantization, and its performance can even rival some closed-source models.
Full multi-task support: Beyond basic Text-to-Video generation, Wan2.1 also supports Image-to-Video, video editing, Text-to-Image, and Video-to-Audio tasks, comprehensively advancing the video generation field.
Powerful visual text generation: As the first video model able to render both Chinese and English text within generated videos, Wan2.1 delivers excellent visual text quality, expanding its practical application scenarios.
Efficient Video VAE: Wan-VAE stands out for its performance and efficiency, encoding and decoding 1080P videos of arbitrary length while fully preserving temporal information, making it an ideal foundation for video and image generation tasks.
Technological innovation and architecture design
Wan2.1 adopts the mainstream Diffusion Transformer (DiT) architecture and combines it with a series of innovations that significantly enhance generation performance, including:
3D Causal Variational Autoencoder (Wan-VAE): An innovative 3D causal structure that improves spatiotemporal compression efficiency and reduces memory usage while ensuring causality and continuity in video generation (a minimal sketch of a causal 3D convolution follows after this list).
Video Diffusion DiT Architecture: A DiT developed under the Flow Matching framework that uses a T5 encoder to process multilingual text inputs and injects the text through the cross-attention mechanism of the Transformer blocks. Time-step embeddings are modulated via shared MLP layers, further improving generation quality (a sketch of the flow-matching objective is also shown below).
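For the causal structure mentioned in the Wan-VAE item above, the following is a minimal PyTorch sketch of a causal 3D convolution: temporal padding is applied only toward the past, so no output frame depends on future frames. The kernel size and channel counts are illustrative, not Wan-VAE's actual configuration.

```python
# Minimal sketch of a causal 3D convolution: temporal padding is applied
# only toward the past, so an output frame never depends on future frames.
# Kernel size and channel counts are illustrative, not Wan-VAE's actual
# configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                          # pad past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x):                               # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_l, W_r, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

video = torch.randn(1, 3, 17, 64, 64)                   # batch, channels, frames, H, W
print(CausalConv3d(3, 16)(video).shape)                  # torch.Size([1, 16, 17, 64, 64])
```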
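And for the Flow Matching framework mentioned in the DiT item, here is a condensed sketch of one training step: the model learns to predict the velocity that carries noise to data along a straight path. The `dit` and `text_emb` names are placeholders, and the text-conditioning interface is simplified.

```python
# Condensed sketch of one Flow Matching training step for a text-conditioned
# video DiT: the model predicts the velocity that carries noise to data along
# a straight path. `dit` and `text_emb` are placeholders for the real model
# and T5 text embeddings; this is not Wan2.1's training code.
import torch
import torch.nn.functional as F

def flow_matching_step(dit, latents, text_emb):
    # latents: clean video latents from the VAE, shape (B, C, T, H, W)
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.rand(b, device=latents.device).view(b, 1, 1, 1, 1)

    x_t = (1.0 - t) * latents + t * noise    # straight-line interpolation
    target_velocity = noise - latents        # d x_t / d t along that path

    pred = dit(x_t, t.flatten(), text_emb)   # text enters via cross-attention
    return F.mse_loss(pred, target_velocity)
```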
Data Construction and Processing
Wan2.1 established a large-scale, high-quality image and video dataset through a rigorous data screening and deduplication process. A four-step data cleaning procedure ensured that the basic dimensions, visual quality, and motion quality reached an ideal level, significantly enhancing the model training effect.
Comparison with Existing Leading Models
To evaluate the performance of Wan2.1, the team designed a test set of 1,035 internal prompts covering 14 main dimensions and 26 sub-dimensions. After weighting the scores by human preference, Wan2.1 performed excellently on multiple key indicators, demonstrating capabilities that surpass existing open-source and closed-source models.
The release of Wan2.1 injects new momentum into the field of video generation. Its openness and technical strength give developers and researchers more possibilities and contribute to a richer visual future.
Trial
https://huggingface.co/spaces/Wan-AI/Wan2.1
https://modelscope.cn/studios/Wan-AI/Wan-2.1
First, generate a video from an image, using the prompt "Elon Musk goes to Mars":
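To approximate the same run outside the hosted demo, an image-to-video pipeline can be driven through Diffusers roughly as follows. The repository ID, input image file, frame count, and fps are assumptions; consult the Wan-AI model cards for the supported Diffusers-format checkpoints and recommended settings.

```python
# Hedged sketch: approximating the demo run locally through Diffusers.
# The repository ID, input image file, frame count, and fps are assumptions;
# check the Wan-AI model cards for Diffusers-format checkpoints and
# recommended settings.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",  # assumed Diffusers-format repo
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()              # fit consumer-grade VRAM

image = load_image("musk.png")               # hypothetical reference still
frames = pipe(
    image=image,
    prompt="Elon Musk goes to Mars",
    num_frames=81,                           # ~5 s at 16 fps (assumed)
    num_inference_steps=40,
).frames[0]
export_to_video(frames, "musk_mars.mp4", fps=16)
```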
