
"OnBoard!" Deep Interpretation of OpenAI Sora (Part 1) Podcast - Notes 1

Every time the podcast 《OnBoard!》 updates episodes related to AI, I listen to them at least once. Recently, the podcast shared content about Sora, divided into two episodes: the first one focusing on technical perspectives and the second from a venture capital perspective. The first episode contained more practical information for me, so I took notes. If you'd like to experience the original version, you can listen to the original podcast.

  • Part 1: Technological innovation and limitations, multimodal integration, and world models as seen by an AI researcher in Silicon Valley. Listen on Castbox: https://castbox.fm/vd/674774452

  • Part 2: A new landscape of AI applications as seen by frontline investors and entrepreneurs. Listen on Castbox: https://castbox.fm/vd/675169827

Introduction of two guests

  • Lijun Yu from Google VideoPoet, personal homepage: https://me.lj-y.com/
  • Yao Fu from the University of Edinburgh, personal homepage: https://franxyao.github.io/

Lijun Yu's self-introduction

A Ph.D. student at CMU, with long-term internships at Google.

Overview of the research journey:

  • Focused on research in the field of video understanding.
  • Later shifted to the field of video generation research.

In the early stages of video generation research, the application of discrete Tokens and Transformer technology was explored.

In 2022, participated in proposing the MAGVIT (Masked Generative Video Transformer) framework, an innovative Transformer framework for video generation. In 2023, research directions included the "Language Model Beats Diffusion" line of work and latent representation techniques for videos.

At Google, participated in the development of the VideoPoet project. This is a framework based on autoregressive language models that goes beyond a single modality and can take multimodal inputs and generate multimodal outputs:

  • Text-to-video
  • Image-to-video
  • Video-to-audio
  • Video editing

In addition to work in the multimodal field, many Scaling experiments were conducted in the VideoPoet project.

Also participated in the Diffusion Transformer work W.A.L.T and in diffusion training in the latent space of MAGVIT-v2.

One of the few individuals with research experience in multiple areas of video processing based on Transformers, including Mask Transformer, Autoregressive Transformer, and Diffusion Transformer.

Yao Fu's self-introduction

A PhD student at the University of Edinburgh.

The main research focus is on large language models.

In the early stages, the research work focused on model scaling (Scale Up), including enhancing reasoning capabilities and developing long-context processing techniques. As language models continued to expand, they gradually evolved into multimodal models such as GPT-4 and Gemini. These models can handle various types of input, not just text.

Some discussion topics in the podcast

Comparison of Google VideoPoet and OpenAI Sora

Taking the "Streets of Japan" video as an example:

Sora excels in terms of coherence, maintaining consistent facial features of characters during scene transitions, and generating backgrounds that remain highly coherent, making the video appear more natural and realistic.

In comparison to Sora, VideoPoet has limitations in resolution and video duration. Although it supports scaling from 128 resolution up to 512, it still falls short compared to Sora's 1920 resolution. VideoPoet primarily focuses on generating video clips lasting 2 to 5 seconds, while Sora can produce videos up to 60 seconds long, demonstrating a significant advantage in duration.

Despite its limitations in resolution and duration, VideoPoet performs well in handling semantic coherence within videos, as well as the separation of foreground and background elements, maintaining good consistency.

Sora's most important innovation

  • Latent Diffusion Transformer: Sora adopts the Latent Diffusion Transformer model, which combines auto-encoder and Transformer technologies. It achieves compression in both space and time by converting between pixel space and latent space, similar to the Stable Diffusion model.
  • Pure Transformer backbone: Sora uses a pure Transformer model, unlike UNet-based image generation models that rely on convolutional neural networks, which makes it relatively novel in the field of video generation.
  • Variable resolution, aspect ratio, and duration: Sora can train on videos with different resolutions, aspect ratios, and durations, enhancing the flexibility and adaptability of the model.
  • Scale: Sora's large model size and the substantial training compute it requires enable it to generate high-quality videos.
  • High-quality data: Training with high-quality datasets is crucial for generating realistic videos.
  • Video-native structures: Using structures suited to video, such as 3D CNNs, to optimize video data processing.
  • Continuous encoding: Transforming discrete encoding into continuous encoding; VideoPoet attempted something similar but could not achieve it due to resource limitations.
  • Possible consistency model: Sora may include an additional consistency model to enhance the coherence and consistency of videos.
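
As a rough illustration of the "latent + spacetime patches + pure Transformer" idea in the bullets above, here is a minimal PyTorch sketch. All shapes and the toy backbone are made up for illustration, and noise/timestep conditioning is omitted; this is not Sora's or VideoPoet's actual code.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 1, 8, 4, 32, 32            # toy latent video: 8 frames, 4 channels, 32x32
pt, ph, pw = 2, 4, 4                        # patch size in time and space
latent = torch.randn(B, T, C, H, W)         # stand-in for a video auto-encoder's latent output

# Cut the latent video into spacetime patches and flatten each patch into one token.
patches = latent.unfold(1, pt, pt).unfold(3, ph, ph).unfold(4, pw, pw)
patches = patches.permute(0, 1, 3, 4, 2, 5, 6, 7).contiguous()   # (B, T', H', W', C, pt, ph, pw)
tokens = patches.view(B, -1, C * pt * ph * pw)                    # (B, num_tokens, token_dim)

# A plain Transformer encoder stands in for the Diffusion Transformer backbone.
dim = tokens.shape[-1]                                            # 4 * 2 * 4 * 4 = 128
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
out = backbone(tokens)
print(tokens.shape, out.shape)   # torch.Size([1, 256, 128]) for both
```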

The surprising aspects of Sora

  • Direct high-resolution generation: Sora can directly generate high-resolution videos without requiring additional super-resolution or up-sampling models. This means that Sora can produce high-quality video content directly from the model without any extra post-processing steps. Although this may result in lower generation efficiency and longer generation times, it is already quite advanced technically.

  • High compression efficiency: Sora demonstrates efficient compression while maintaining video quality. Compared to previous models that might require 1,000 tokens to represent 1-2 seconds of 128-resolution video, Sora can generate up to 10 seconds of 1080p video at a similar or even higher compression ratio. This high compression efficiency means that Sora can handle longer video sequences while preserving the quality and detail of the video content.

  • Long-sequence handling: Sora's architecture must cope with sequences on the order of one million (1M) tokens, which poses a significant challenge in video generation. Processing such long sequences requires efficient attention mechanisms and model architectures to keep the computation feasible. As with large language models, distributed attention mechanisms and special designs for long context windows are key technologies here.
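
A back-of-the-envelope calculation of why the ~1M-token sequences mentioned above are demanding for vanilla self-attention (the fp16 assumption is mine, purely for illustration):

```python
# Why sequences of ~1M tokens are hard for naive (quadratic) self-attention.
seq_len = 1_000_000
pairs = seq_len ** 2                          # full attention scores every token pair
fp16_bytes = pairs * 2                        # one fp16 score per pair, per head, per layer
print(f"{pairs:.0e} attention scores")        # 1e+12
print(f"~{fp16_bytes / 1e12:.0f} TB to materialize them naively")   # ~2 TB
```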

Why is compression important?

  • Processing efficiency at scale: The Transformer model can handle an extremely large amount of data, up to 1 million tokens. By improving the compression ratio, the latent encoder can process large-scale datasets more effectively. This means the model can process more information at a lower computational cost, thereby enhancing overall data processing efficiency. This is particularly important for tasks that require handling large amounts of data, such as video generation and natural language processing.
  • Fidelity of latent-to-pixel decoding: In the Sora model, the decoding process from the latent space back to pixel space is responsible for converting the high-level semantic information learned by the model into visual output. This conversion is key to generating high-quality results. Making this conversion bridge sufficiently broad reduces information loss along the way, so the model can more accurately turn its internal semantic representation into visual output and generate more realistic, higher-quality video content.
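
A quick sanity check of why the compression ratio matters, using the 10-second 1080p clip and the ~1M-token budget mentioned above (the 30 fps frame rate is an assumed value):

```python
# Back-of-the-envelope illustration of the encoder's compression burden.
frames = 10 * 30                      # 10 seconds at an assumed 30 fps
pixels = frames * 1920 * 1080         # raw pixels in a 1080p clip
token_budget = 1_000_000              # sequence length the Transformer can handle
print(f"{pixels:,} raw pixels")                               # 622,080,000
print(f"~{pixels / token_budget:.0f} pixels per latent token")  # ~622
```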

What are the differences in design and functionality between Transformers used in language models and multimodal models?

  1. Approach and modality
  • Language models: Typically focus on text data, using autoregressive methods to generate text and coherently predict the next word or sentence. These models, such as the GPT series, are primarily designed to handle single-modality (text) input. Although the latest models attempt to process multimodal data (such as images + text), their main strength still lies in language understanding and generation.

  • Sora and video models: Focus on video content generation, requiring the processing and understanding of various types of data, including text, images, and audio, to produce videos. This means that Sora must be able to convert different modalities of input into video output, which is a more complex task than handling text alone.

  2. Output content
  • Language models: Mainly text; even when processing multimodal input, the output is usually in text form, such as generating text to describe images.

  • Video models: The output of video models like Sora can be multimodal, not limited to text or images but also including video itself. These models differ in design from LLMs and may:

    • Use external decoders: For example, an independent image decoder, or mapping data into a specific latent space as in Stable Diffusion, to enhance video generation capabilities.
    • Generate natively in the Transformer: Models like VideoPoet implement video generation directly within the Transformer framework, representing video frames as discrete tokens and using the Transformer to generate video content.
    • Decode via diffusion: Sora is trained with a diffusion denoising method, and its decoding process is based on the diffusion process rather than autoregressive prediction.
  3. Model architecture and training methods
  • Language models: Rely on large amounts of text data for pre-training, learning language patterns through an autoregressive approach, and may then be fine-tuned for specific tasks.

  • Sora: Likely combines a Transformer backbone with techniques specific to video generation, such as diffusion models and latent-space mapping. Its training method may also differ from standard LLMs, focusing more on video content generation.

How to view the emergent capabilities of Sora

In the field of large-scale models, emergent capabilities refer to those abilities that are not observed in smaller-scale models but suddenly appear after scaling up. These capabilities are usually performance improvements brought about by increasing model size, and such improvements are sudden and unpredictable. However, there is controversy surrounding emergent capabilities: some studies suggest that their appearance may be related to the choice of evaluation methods. When non-linear or discontinuous evaluation methods are used, the model appears to exhibit emergent capabilities; however, if a linear measurement method is adopted instead, this capability may become less apparent or disappear altogether.

As for Sora's emergent capabilities, as the model scale increases, it can generate richer and more complex content, such as better understanding and expressing complex concepts, distinguishing between foreground and background in generated scenes, and even differentiating various parts of the background. Previously, VideoPoet made significant progress in processing and integrating different modalities, such as converting text to video, then from video to audio, and adding appropriate sound effects to the generated videos, as well as corresponding music for instruments in the video, demonstrating the model’s ability to understand video.

The application of Diffusion Models in video understanding tasks is still in its exploratory phase. Unlike autoregressive models that directly predict the value of the next pixel or frame, how to effectively apply Diffusion technology to video understanding and generation, including technical details and best practices, remains an open research question.

Is it possible to combine Diffusion Transformer models with autoregressive (Auto-regressive, AR) Transformer models?

Combining Diffusion Transformer models with autoregressive (Auto-regressive, AR) Transformer models is a cutting-edge and promising research area that may offer new solutions and insights into handling complex high-dimensional data such as images and videos. This combination is not only feasible but also likely to positively impact the model's predictive capabilities and its ability to process multimodal data.

Possible Combination Advantages

  • Flexibility of Model Architecture: By adopting frameworks such as Mixture of Experts (MoE), multiple prediction experts can be integrated into a single model. This design allows some experts to focus on next-word prediction, which is characteristic of autoregressive models, while others concentrate on denoising prediction in Diffusion models. This flexible architectural design enhances the model's ability to handle various tasks and enables it to dynamically adjust according to different needs.
  • Advantages of Parallel Prediction: Combining Diffusion and AR models may enable more efficient parallel prediction mechanisms. Diffusion models can consider overall global information during their generation process, whereas AR models rely on previously generated contexts to predict each new unit (such as words or pixels). This combination is expected to improve efficiency and accuracy when processing high-dimensional data.
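
As a toy illustration of the Mixture-of-Experts idea in the first bullet above, here is a minimal PyTorch sketch in which a learned router blends a stand-in "autoregressive" expert and a stand-in "denoising" expert per token. It is purely illustrative, not a published architecture.

```python
import torch
import torch.nn as nn

dim, num_tokens = 64, 16
tokens = torch.randn(1, num_tokens, dim)

router = nn.Linear(dim, 2)                # scores for 2 experts
ar_expert = nn.Linear(dim, dim)           # stand-in for a next-token prediction head
diffusion_expert = nn.Linear(dim, dim)    # stand-in for a denoising prediction head

weights = router(tokens).softmax(dim=-1)  # (1, num_tokens, 2) per-token routing weights
out = (weights[..., :1] * ar_expert(tokens)
       + weights[..., 1:] * diffusion_expert(tokens))
print(out.shape)                          # torch.Size([1, 16, 64])
```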

Enhanced Multimodal Data Processing

  • This combination approach also helps enhance the model's ability to process and generate multimodal data (such as text, images, and videos). By understanding and generating different types of data within the same framework, the model can better grasp the intrinsic connections between data, achieving richer and more accurate multimodal outputs.

The key to deep understanding

  • As Richard Feynman said, "What I cannot create, I do not understand." Combining different generative techniques allows models to gain a deeper understanding of the essence of data during the content creation process. This deep understanding is the key to building foundational models that can truly understand multimodal data.

Does the generative capability of AI models represent understanding?

AI models respect certain physical laws, which may have been observed during the training process. However, the physical laws followed by the model are not presented in a way that aligns with human understanding. Language follows human logic, while videos reflect the model's own understanding. The model might summarize known rules to humans or even discover unknown rules. How do we know it has summarized new rules? This requires verification through language—it must be able to communicate effectively with humans.

From the perspective of scaling laws, what are the differences between autoregressive (AR) models and Diffusion models in handling data and learning tasks?

Difference in objective functions

  • Autoregressive (AR) model: The AR model minimizes prediction loss by predicting the next word or pixel in a sequence, making it suitable for lossless compression. An increase in model size typically implies higher prediction accuracy and better data compression efficiency.
  • Diffusion model: Unlike AR models, the goal of Diffusion models is to generate data by introducing and gradually removing noise, which goes beyond just data compression. This method focuses more on generation quality and proximity to the data distribution.
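
For reference, the two objectives contrasted above are usually written as follows; these are standard textbook forms, not anything specific to Sora or VideoPoet. The AR model minimizes next-token negative log-likelihood, while a DDPM-style diffusion model minimizes a noise-prediction error:

$$
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{x_0,\,\epsilon \sim \mathcal{N}(0, I),\,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big],
\quad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
$$

The AR loss is literally a code length, which is why scaling AR models is naturally tied to lossless compression; the diffusion loss instead measures how well the model denoises samples drawn from the data distribution.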

Adaptability to data forms
  • Autoregressive (AR) model: It is particularly suited for handling discrete data types, such as text. In sequential tasks like text generation or music composition, AR models can predict the next most likely output step by step.
  • Diffusion Models: Better suited for handling continuous data, such as images and videos. This type of model generates continuous data close to the real distribution by controlling and adjusting Gaussian noise, making it perform well in image and video generation tasks.

The relationship between model scale and performance
  • Learning difficulty and expressive power: Diffusion models may require a relatively smaller model scale to achieve good results when processing continuous data because they operate directly in the continuous space. In contrast, AR models may need a larger model scale to capture complex sequence dependencies when processing discrete data.
  • Model scaling effects: Scaling up AR models usually directly improves prediction accuracy and data compression efficiency. For Diffusion models, scaling focuses more on enhancing the authenticity and diversity of generated data.

What are World Models

A world model is a type of computational model designed to simulate and understand the dynamics and patterns of the real world. It analyzes past and present data in an attempt to predict future states, thereby providing a basis for decision-making. World models typically predict various possible future scenarios based on probability distributions and transition probabilities rather than trying to enumerate all possibilities. This allows the model to make reasonable predictions even in the presence of uncertainty.

Physical Laws and Simulators

World Models aim to understand the world with less reliance on human-input physical laws. While traditional simulators depend on explicitly coded physical laws by humans, world models "understand" these laws by learning patterns from data. For example, in autonomous driving or robotics, video prediction models can learn behavioral patterns rather than simply replicating rules input by humans.

The History of Physics and AI Applications

The development of physics—from Newtonian mechanics to relativity and quantum mechanics—demonstrates the deepening of human understanding of natural laws. Similarly, in the field of AI, as models grow larger and more data becomes available, they are able to learn more complex and refined patterns. This applies not only to language generation, such as drafting legal documents, but also extends to generating video content, such as the physical behavior of a race car turning around a corner.

The essence of model learning

The goal of world models and other AI models is to learn the rules for generating data, not just memorizing it. This means that the model weights reflect an understanding of the rules of the world. Such understanding allows AI to demonstrate deep insights into complex phenomena within specific domains, such as language generation or video content creation.

Accuracy of rule understanding

As model scale increases and more data is encountered, the model can understand more complex and specialized rules. This deeper understanding extends beyond linguistic expertise to more precise simulations of the physical world. For example, the model can learn specific rules about object behavior under certain conditions, such as how race cars behave under specific circumstances.

Applications of Sora and models

Advanced AI models like Sora demonstrate that through large-scale data training, AI can simulate and understand complex world rules on multiple levels. These models, by integrating existing AI technologies such as autoregressive models and Diffusion models, are capable of generating high-quality language and visual content, reflecting a deep understanding of the real world.

What is the predicted size of the Sora model? Is further expansion still needed?

  • VideoPoet: It has achieved 8B (8 billion) parameters, which is a relatively large model capable of generating high-quality video content.
  • Diffusion Transformer's DiT-XL: It has approximately 675M (0.675 billion) parameters, indicating that even much smaller models can achieve effective learning and generation tasks.
  • Sora: It is estimated to have around 10B (10 billion) parameters, although some predictions suggest it might be 3B (3 billion). This indicates there may be different strategies in terms of model design and computational power investment for training.

Smaller models, larger data: This is a potential trend that means improving model performance by increasing the scale of data while maintaining or even reducing the size of the model. This approach can reduce inference costs and make AI applications more cost-effective.

Consideration of inference costs: Smaller models are cheaper for inference, which is especially important for applications requiring frequent or real-time reasoning. This drives researchers and developers to seek more efficient model architectures and training methods to achieve optimal performance with limited resources.

The case of video content generation shows that even with large amounts of video data, relatively smaller models can achieve satisfactory generation quality through carefully designed model structures and training strategies.

Computational resource investment: The training and inference costs of models are closely related to available computational resources. In situations where computational power is limited, developing smaller and more efficient models becomes an important goal.

Is it necessary to continue expanding the model:

As the scale of the model increases, performance improvements are typically observed, including better understanding, more accurate predictions, and enhanced capabilities for handling complex tasks. If the current model does not perform ideally on specific tasks, scaling up the model size may be a viable solution.

If application requirements include processing and generating various types of data (such as text, images, videos, etc.), larger models may be more necessary. This is because large models can store and process a wider variety of information, thereby better understanding and integrating multimodal data.

What is the computational power estimation required for training Sora?

Training cost estimation

  • Llama 70B model: Trained for one month using 2000 NVIDIA A100 GPUs. This reflects the enormous computational power and time investment required to train large language models.
  • VideoPoet: Trained for two weeks using hundreds of NVIDIA H100 GPUs. Although the model size of VideoPoet may not be as large as Llama, the complexity of processing video data may lead to a significant demand for computing power.
  • Sora: May require thousands of NVIDIA H100 GPUs for one month of training. Considering the scale of the Sora model and the complexity of handling high-resolution video data, its computational power requirements could be even higher.
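
A rough GPU-hour comparison using the figures quoted above; the exact counts behind "hundreds" and "thousands" of H100s are assumptions (about 500 and 4,000 here):

```python
# Rough GPU-hour totals from the quoted training setups (all approximate).
hours_per_month = 30 * 24
llama_70b = 2000 * hours_per_month        # 2000 A100s for one month
videopoet = 500 * 14 * 24                 # ~500 H100s for two weeks (assumed count)
sora      = 4000 * hours_per_month        # ~4000 H100s for one month (assumed count)
for name, gpu_hours in [("Llama 70B (A100)", llama_70b),
                        ("VideoPoet (H100)", videopoet),
                        ("Sora (H100)", sora)]:
    print(f"{name}: ~{gpu_hours:,} GPU-hours")
```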

Characteristics of models and data

  • Model scale and sequence length: Although the scale of Sora's model may not be large, the sequence length required when processing video data is much longer than that for text data, and the information density of video data is usually lower than language, which increases the difficulty of training and the demand for computing power.
  • GPU optimization and architecture: Existing GPU infrastructure is well optimized for Transformer models, but the encoders and decoders used to go from latent representations back to pixel-level data, as well as convolution-based architectures, may require further hardware support and optimization.
  • Video preprocessing: Video data can be preprocessed to reduce the burden during training and inference, but this requires a carefully designed data processing pipeline.

How about the estimation of Sora's inference cost?

  • Inference cost: For models like Sora, each step in the denoising process may require as much computational power as the entire process of an autoregressive (AR) model, leading to video generation taking up to 20 minutes. This reflects the high cost of video generation models during inference.

  • Computational and memory constraints: The time cost of AR models is dominated by memory access rather than computation, whereas diffusion models are compute-bound. So although diffusion inference is expensive, it can make fuller use of the hardware's compute than memory-bound AR decoding.
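
A simple way to see the cost claim above: if each denoising step costs roughly as much compute as one full AR generation pass, total diffusion inference cost grows linearly with the sampler's step count (the 50-step figure is an assumption, not a quoted number):

```python
# Illustrative scaling of diffusion inference cost with sampler step count.
ar_pass_cost = 1.0            # normalize one full autoregressive pass to 1 unit
denoising_steps = 50          # assumed number of diffusion sampling steps
print(f"~{denoising_steps * ar_pass_cost:.0f}x the compute of a single AR pass")
```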

How to improve inference speed?

  1. Hardware and computing power improvement
  • Hardware advancement: With the performance improvement of GPUs and other dedicated hardware (such as TPUs), we can expect faster data processing and computation. Transformer inference places heavy demands on hardware, especially memory bandwidth, so hardware improvements will directly raise inference speed.
  • Computing power improvement: Stronger computing power not only means faster calculation but also lets models handle large amounts of data more effectively, which matters particularly for data-intensive video tasks.
  2. Engineering optimization
  • Better batching: Optimizing data batching strategies allows more data to be processed simultaneously, reducing I/O wait time and improving GPU utilization.
  • LLM (Large Language Model) optimization: Engineering optimizations for large models, such as pruning and quantization, can reduce a model's computational demands and thereby accelerate inference.
  3. Algorithmic improvements
  • Optimization of diffusion models: There is still considerable room for improvement, such as reducing the number of decoding steps or developing more efficient sampling strategies to speed up generation.
  • Algorithm efficiency: New algorithms or improvements to existing ones, such as better attention mechanisms and more efficient data encoding and decoding techniques, can also significantly increase inference speed.

The future of inference speed: with continued advances in these areas, the inference time for generating 10 seconds of video could drop to within one minute.

Can scaling laws be applied to the compression rates of encoders and decoders?

  • Compression ratio vs. model scale: Theoretically, if the scale of the encoder and decoder is expanded from 1B parameters to 100B parameters, a higher compression ratio can be expected, reducing the sequence length from 1M to 1K. This is because larger models have stronger learning and representation capabilities, enabling them to more effectively capture and encode the complexity and details in the data.
  • Model scale and the need for intermediate models: When the scale of the encoder and decoder increases to a certain extent, the reliance on intermediate models (such as the Transformer backbone) can in theory be reduced, because powerful encoders and decoders can more directly process and generate high-quality data representations.

Practical trade-offs in applications:

  • Efficiency and bottlenecks: During scaling, there is a tension between wanting the intermediate model to be highly efficient and wanting the encoder and decoder to stay small enough not to become bottlenecks. In practice, optimization means finding a balance among the components to achieve overall efficiency and performance.
  • Special cases in vision models: For vision models, if the decoder is very powerful, additional processing layers may in theory be unnecessary; ideally a strong decoder alone could recover high-quality image or video content from the encoded representation. However, this requires the decoder to have extremely strong generative capabilities.

Compression ratio and information density:

  • Control of information density: When processing video and language data, the goal is to keep the information density after encoding from varying too much. This requires controlling the compression strength when designing the encoder so that key information in the data is effectively retained.
  • Selection of compression ratio: Choosing an appropriate fixed compression ratio requires balancing input resolution with the potential capabilities of the encoder. Since different parts of the data may have varying importance, dynamically allocating attention resources is also a direction for optimizing model performance.

Will supporting multi-modal inputs affect the quality of generation?

Under a fixed resource budget, if 100% of the resources are used to learn one modality (such as text), the model's performance in that modality may reach its optimal level. However, if some resources are allocated to learning other modalities, the performance of the single modality may decrease somewhat, as the model needs to distribute its attention and learning ability across multiple tasks.

Ideally, if resources can be increased so that even when learning multi-modal data, each modality can still receive sufficient learning resources, the model’s performance in each modality may not be significantly affected, and might even improve overall performance due to the complementary learning between modalities.

In the short term, the combination of models like ChatGPT and Sora is unlikely to happen immediately, mainly because the learning requirements and resource allocation strategies for different modalities need to be carefully designed and adjusted. Moreover, multi-modal learning requires the model to handle and understand complex relationships between different types of data, which is itself a challenge.

Can multimodal models introduce a function similar to RAG (Retrieval-Augmented Generation) in language models for data not included in the training of video models, such as newly released games or user-private games?

  • Flexibility and Denoising: Diffusion models have significant advantages in generating high-quality images and videos, especially their denoising process can produce visually realistic content. Combining with AR models can increase control over sequential data, making the generated text or language content more accurate and coherent.
  • Long Sequence Processing: The long sequence processing capability of Transformer models enables them to understand and generate complex multimodal content, including long videos and detailed game descriptions.
  • Example-driven Learning: Providing the model with examples related to new games or specific content can help the model better understand and generate data in these fields. These examples can be retrieved via the RAG mechanism when needed, thus assisting the generation process.

Is the current model's extension an optimization within the diffusion framework or on the Transformer framework?

Transformer-side optimization is engineering-oriented: it focuses on improving computational efficiency and reducing model size so that long sequences can be processed with lower inference time. Diffusion-side optimization is theory-driven: it mainly focuses on algorithmic improvements that enhance the quality and efficiency of generated data.

What kind of challenge does the emergence of Sora pose for start-ups like Pika?

Pika previously relied mainly on a Latent Diffusion model with a U-Net structure, which lags behind Sora's current architecture in several respects.

Although the previous training data can still be utilized, the demand for computing power has significantly increased, bringing more infrastructure and cost pressures.

Apart from the brute-force aesthetics of Sora’s large models and high computing power requirements, what other important factors are there?

Apart from the scale of the model and its high requirements for computing power, Sora's success also depends on several key factors, mainly data screening and organization and the technical details of the model. Selecting high-quality data that conforms to physical laws and contains little special-effects editing is crucial for model training.

What are the experiences in training data from the perspective of VideoPoet?

Although YouTube data is diverse, directly using it is not the best choice because it contains a large amount of low-quality content, such as monotonous game live streams and repetitive music videos. Therefore, selecting high-quality data becomes particularly important.

In the field of data engineering, the involvement of human intelligence plays a vital role. The best data often comes from high-level human creations. For example, in the training of large language models, textbooks are excellent data sources: they compile all the information we need, and their authors are usually domain experts who invest significant effort, making the content rich, illustrated, and thoroughly explained. Such data is often created by professionals with doctoral degrees.

As for high-quality sources of video data, they are mainly purchased stock libraries, including authorized stock videos, news, and media material libraries, which are usually labeled. Although movies and TV works can also serve as data sources, they may involve copyright issues.

Generally speaking, the better educated the creator of the data is, the higher the quality of the data tends to be, but this also means that copyright protection measures will be stricter.

Does Videopoet use synthetic data?

VideoPoet does not use synthetic data. However, if data generated by game engines counts as synthetic data, then Sora might use such data. VideoPoet focuses more on model innovation, so the choice of data depends on specific needs. Generating data is itself a huge task, often relying on existing AAA game titles to obtain unique datasets. The advantage of such data is that it follows physical rules.

If the purpose of generating models is to learn the rules of the world, there may be some rules that are not fully covered by videos. Game engines have already very precisely written physical rules. Combining these rules with those learned from the real world can help improve model performance.

Combining LLMs with coding can enhance their reasoning abilities. Would adding physical rules help video generation models?

Agree with the point of view:

The logic of natural language is vague and has many gray areas. On the other hand, programming languages are strict formal languages that strictly follow logical rules and have no ambiguity. Therefore, programming languages can make up for the deficiencies in logical reasoning that exist in natural languages.

For example, when a cup breaks, the physics engine simulates it as shattering. However, in reality, if it's tempered glass, the situation could be entirely different. The data generated by game engines can handle such complex scenarios.

Oppose the point of view:

The use of game engines deviates from the original idea of forming physical laws through observation of the world. Now, humans have summarized physical laws and input these laws into machines. However, if a model could spontaneously discover new laws, relying solely on existing physical laws would become a limitation.

Naturally generated videos, aside from possibly being processed with human special effects, also follow physical laws. Providing more solutions that meet physical conditions can enable more effective predictions.

VideoPoet trains not only on text-image paired data but also on large amounts of video and image data. Image-to-video training, especially with labeled videos, also helps improve the effectiveness of text-to-video training.

What shifts in thinking have occurred since the emergence of Sora?

With the advent of Sora, people began to realize that LLMs (Large Language Models) are scalable, and Diffusion Transformers are also scalable. This architecture itself is scalable, with different learning curves, but all can be expanded. In the future, people may focus more on MoE (Mixture of Experts).

The most common misconceptions, underestimations, and overestimations about Sora

Underestimation:

  • Multi-resolution design is underestimated. This design approach is becoming popular; training at a fixed resolution can have adverse effects on the data, and applying multi-resolution training to generative models may have a significant impact in the future. It makes more effective use of data, and generation and training efficiency could potentially be tripled.

Overestimation:

  • Overestimation of world models. Although there may be an internal world model, it might not be usable for visual understanding tasks. Despite talk of a "GPT moment" in the visual domain, that level has not yet been reached.

  • Overestimation of output video quality, including diversity and success rate. A good model should be able to produce many different videos from the same prompt and ensure that each one is excellent.

How long does it take to replicate Sora?

Replicating Sora involves considerations of infrastructure, computing resources, and data. The model itself is relatively less important, so rough estimates can be made. For example, large companies like Google or Meta might be able to replicate Sora within half a year, while smaller companies may need more time, possibly up to one year.

Gemini has already surpassed GPT-4, and its open-source version even exceeds GPT-3.5. (Guest opinion, some people may disagree)

Companies with abundant GPU resources and talent can, to a certain extent, easily handle challenges similar to these models; it’s mainly an issue of time.

For situations with scarce GPU resources, it is difficult to replicate Sora using limited computational resources, which tests the capabilities of talent even more.

The performance improvement of small models is obvious.

In image generation, many small companies have accumulated large amounts of data, for example Midjourney and open-source projects such as Stable Diffusion. OpenAI did not fully commit to developing DALL-E 3 because it was also working on Sora at the same time. Large companies may catch up, but for smaller companies the cost of catching up might not be worth it.

In 2023, the Diffusion Transformer paper was rejected by a top academic conference but went on to achieve great success in industry.

Diffusion Transformer is a design focused on benchmarking and scalability rather than on a novel algorithm. Although initially rejected in academia for a perceived lack of innovation, it has made significant progress in industry. This approach emphasizes data filtering and organization, with relatively less input from human intelligence.

The Diffusion Transformer paper was initially rejected by CVPR 2023 for a perceived lack of novelty but was later accepted by ICCV 2023. The initial rejection highlights academia's emphasis on novelty over practical effectiveness. Despite that setback, the Diffusion Transformer has achieved considerable success in industry. It is recognized for its simplicity and scalability, which are crucial for handling large datasets, and its adoption in industry underscores a trend where practicality and performance may outweigh theoretical innovation.

Researchers may find this discouraging: a carefully designed model can end up performing worse than simply having smart people curate and annotate the data.

What are you most looking forward to this year?

What people look forward to the most is the launch of new hardware with an order-of-magnitude increase in computing power, while both production and usage costs decrease.

The TPUs used by VideoPoet have good scalability and high efficiency but poor flexibility: they only support TensorFlow and JAX, and only static graphs.

TPU compute is comparable in scale to same-generation GPUs but comes at a lower cost, even though it offers no significant raw-compute advantage.

He also looks forward to the integration of the Diffusion Transformer and the Auto-Regressive Transformer.

Dr. Fu believes people expect a single model to accomplish all tasks. As models scale up, the granularity of rule understanding will also increase, including for world models and video models. For example, the vision model paired with PaLM is 22B parameters. Video models should be at least as large as language models, or even larger: perhaps one ~1T-parameter part handles language while another 1-2T handles the other modalities.

People hope that generated content becomes longer, clearer, conforms to physical laws, shows emergent abilities, and has cross-modal capabilities. For example, a model lecturing to white-collar workers could prefer to answer with an image or a GIF where that communicates better. This kind of multimodal teaching could bring a revolution in education.

Dr. Yu believes that for the integration of ChatGPT and Sora, there may be scientific research this year, but it won't yet lead to practical product advances.

Since many academic terms are unfamiliar to me, there may be many errors in the note-taking process. I hope everyone still listens to the original podcast to ensure the accuracy of the information.