
OnBoard! Podcast Notes - EP 62: OpenAI o1 and the New Paradigm of LLM + Reinforcement Learning (1)

I recently listened to last month's OnBoard! podcast episode about o1. As a technical newcomer I couldn't follow everything, but I still took careful notes. I strongly recommend listening to the original episode directly: limited by my own ability, the notes below may contain omissions or errors and are for reference only. I'm organizing a first small part today and will slowly work through the rest.

Podcast link: https://castbox.fm/episode/id5557944-id743751924

Hosts:

  • : USD-fund VC investor; formerly on AWS's Silicon Valley team and at an AI startup; runs the WeChat Official Account M小姐研习录 (ID: MissMStudy) | Jike: Monica Classmate

  • , formerly a data scientist at ByteDance, now a researcher at Shixiang Technology and a contributor to the WeChat Official Account "Overseas Unicorns"

Guests:

  • , Research engineer @Google DeepMind. He first worked with reinforcement learning during his studies at Stanford, and from robotics to today's large language models he has built a very systematic understanding of how reinforcement-learning theory and its applications have evolved.

  • (Returning guest), Research scientist @Google Cloud, PhD @Caltech. Many people speculate that o1 applies Monte Carlo Tree Search (MCTS) to LLMs, one of the important routes to better logical reasoning. Eric has published multiple papers on combining LLMs with MCTS, making him a genuine expert in this area.

  • , Former WeChat AI researcher, currently head of large-model projects at a top Chinese internet company.

【Topic】The most interesting project or paper I've seen recently.

Eric:

Paper: the combination of language models (LMs) and Monte Carlo Tree Search (MCTS), especially integrating planning into the reasoning process of language models.

MCTS (Monte Carlo Tree Search) is a classic search algorithm that became widely known through its use in Google DeepMind's Go program AlphaGo.

In the reasoning tasks of LMs, MCTS is mainly used in two aspects:

  • One is to generate high-quality synthetic reasoning data;

  • The other is to incorporate planning into the reasoning steps at inference time.

For example, MCTS can be used to optimize reasoning paths and reward signals to improve the quality of reasoning. I think both directions are extremely worth exploring.
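To make this concrete, below is a minimal, self-contained sketch of the four-phase MCTS loop (selection, expansion, simulation, backpropagation) applied to step-by-step search. It is not the method of any paper discussed in the episode: the toy task (reach a target number with +1/+2 moves) and the helpers `propose_steps` and `score_terminal` are hypothetical stand-ins for an LM proposing candidate reasoning steps and a verifier scoring the final answer.

```python
import math
import random

TARGET = 10  # toy goal: reach exactly 10 by repeatedly adding 1 or 2

def propose_steps(state):
    # Stand-in for an LM proposing candidate next reasoning steps.
    return [state + 1, state + 2]

def is_terminal(state):
    return state >= TARGET

def score_terminal(state):
    # Stand-in for a verifier checking the final answer.
    return 1.0 if state == TARGET else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Upper Confidence bound for Trees: exploit mean value, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def rollout(state):
    # Simulation phase: random playout to a terminal state, scored by the verifier.
    while not is_terminal(state):
        state = random.choice(propose_steps(state))
    return score_terminal(state)

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend along maximal-UCT children.
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: grow the tree at a non-terminal leaf.
        if not is_terminal(node.state):
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        # 3. Simulation.
        reward = rollout(node.state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state  # most-visited first step

print(mcts(0))  # prints 1 or 2; both first moves can still reach TARGET exactly
```

In an LM setting, the same loop would run with partial chains of thought as states and the rollout reward coming from checking the final answer.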

Our team recently published a paper on using MCTS to help generate process-supervised data. During a large model's reasoning, some steps may contain errors, and having humans label the correctness of every reasoning step is extremely resource-intensive. To address this, we combined MCTS with Monte Carlo estimation to design a method that generates feedback and labels entirely with AI, without human intervention.
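As a rough sketch of the Monte Carlo estimation idea (not the actual pipeline of the paper mentioned above), each prefix of a solution can be labeled by how often sampled completions from that prefix reach a correct final answer. `sample_completion` below is a hypothetical stand-in for the LM plus an answer checker:

```python
import random

def sample_completion(prefix):
    # Hypothetical stand-in for "LM finishes the solution, verifier checks it".
    # Once an erroneous step is in the prefix, the final answer can no longer
    # come out right; otherwise the model usually finishes correctly.
    if "bad_step" in prefix:
        return False
    return random.random() < 0.8

def mc_step_labels(steps, rollouts=64):
    """Soft label for each step: empirical success rate of completions from that point."""
    labels = []
    for i in range(1, len(steps) + 1):
        wins = sum(sample_completion(steps[:i]) for _ in range(rollouts))
        labels.append(wins / rollouts)
    return labels

# Step 3 introduces an error, so the labels should drop sharply from that index on.
print(mc_step_labels(["step1", "step2", "bad_step", "step4"]))  # e.g. [0.78, 0.81, 0.0, 0.0]
```

No human ever labels an individual step; the per-step labels fall out of rollout statistics.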

When further enhancing reasoning capabilities, multi-step reasoning data becomes particularly important. It plays a crucial role in post-training, especially during the reinforcement learning (RL) phase. If only the classic RLHF (Reinforcement Learning from Human Feedback) method is used, the model usually only judges correctness at the final answer, making it difficult to locate issues within the reasoning process. With process supervision data added, the model can learn the value function more accurately and make denser judgments about the correctness of each reasoning step, significantly improving the efficiency of RL training.
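Schematically, the difference in learning signal looks like this; the numbers below are invented for illustration and no specific RLHF implementation is implied:

```python
# Outcome-only supervision: one scalar at the very end of the trajectory.
# Every step shares the same credit or blame, so the model cannot tell which step failed.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]   # final answer judged wrong

# Process supervision: a soft label per step (e.g. the Monte Carlo estimates above).
# The drop at index 2 points directly at the faulty step.
process_rewards = [0.80, 0.78, 0.0, 0.0]

def returns(rewards, gamma=1.0):
    """Discounted return-to-go per step, usable as dense value-function targets."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

print(returns(outcome_rewards))  # [0.0, 0.0, 0.0, 0.0]: no signal about where it failed
print(returns(process_rewards))  # [1.58, 0.78, 0.0, 0.0]: per-step value targets
```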

Whether MCTS is used in training language models, including in models like o1, is also a hot topic of discussion at present.

Kimi:

This paper was published by OpenAI around 2022. Although it is also a paper on scaling laws, it differs from traditional scaling-law research in that it focuses on

Through a chat-style interface, Cursor can quickly start a project from scratch, without any existing files, which is a capability Copilot currently lacks. After switching to Cursor, I even uninstalled VS Code outright.

In short, Cursor is an IDE built on a fork of VS Code (VS Code itself is an open-source project). Cursor integrates various large models, such as Claude 3.5, OpenAI's GPT-4o, and the latest o1. Compared to Copilot, Cursor's biggest advantage is model choice: the model behind Copilot is likely a smaller model fine-tuned by Microsoft within its own AI ecosystem (based on OpenAI models). Although Copilot later integrated GPT-4o, high costs have kept it from offering the most powerful models, whereas Cursor can aggregate the best available large models, such as Claude 3.5.

Cursor can also quickly generate project scaffolding, which is very friendly for people like me who haven't done backend development in years. For instance, if I want to quickly build a simple Chrome extension, Cursor lets me finish it within one to two hours, something that would have been almost impossible before.

All in all, it is a truly innovative product.

The Cursor article I introduced earlier:

Su Hui:

Allen-Zhu's research, a series of highly valuable works. From last year through recently, although his research is not directly tied to reinforcement learning (RL), it includes many solid experiments on reasoning and draws meaningful conclusions. His work covers not only reasoning itself but also its relationship to current methods such as Chain-of-Thought (CoT) reasoning, and how RL can further enhance reasoning ability. These research ideas are very instructive for newcomers, and I'd like to recommend his work to everyone here.

Many current research conclusions lack sufficient rigor, partly because the research environment is not controllable enough. For example, some studies are based on specific versions of large models (such as GPT-4) or on particular datasets and theoretical frameworks. But the training processes and data composition of these models are often black boxes, so researchers cannot clearly determine whether the data contains spurious correlations. In such cases, the conclusions drawn may not be robust and may even be limited by the constraints of the data itself.

Allen-Zhu's research addresses these issues by constructing a fully controllable experimental environment. He designs everything himself, from the data to the model architecture, for example synthesizing training data so that its difficulty and logic stay controllable. In such an environment, experimental results reflect the impact of design and data directly, without interference from unknown variables. Moreover, although his experiments are not especially large in scale due to limited compute, the experimental design approach is highly scalable: a team with more abundant resources could expand the experiments to verify how general the conclusions are, or even propose new theories.
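As a toy illustration of this fully controllable style (emphatically not Allen-Zhu's actual setup), one can synthesize arithmetic reasoning chains where a single knob controls difficulty and the label is produced by the generator itself, so it is correct by construction and no spurious correlation can enter the data:

```python
import random

def make_example(n_steps, rng):
    # n_steps is the single difficulty knob: longer chains are harder.
    value = rng.randint(1, 9)
    chain = [f"start with {value}"]
    for _ in range(n_steps):
        op, x = rng.choice("+*"), rng.randint(2, 9)
        value = value + x if op == "+" else value * x
        chain.append(f"{op} {x}")
    return " ".join(chain), value  # (problem text, ground-truth answer)

rng = random.Random(0)
easy = [make_example(2, rng) for _ in range(3)]  # short, easy chains
hard = [make_example(8, rng) for _ in range(3)]  # long, hard chains
print(easy[0])
```

Because difficulty, logic, and labels are all fixed by the generator, any change in model behavior can be attributed to the design choices under study.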

Cage:

(Language is primarily a communication tool, not a thinking tool.) This article explores an interesting perspective: language may not directly endow humans with the ability to think and reason; rather, it serves more as an external reflection of thought and a medium for cultural transmission. An extreme example: individuals with aphasia, despite losing their language abilities, still retain full logical reasoning capability.

Projecting this viewpoint onto the o1 and RL technical routes we are discussing today, one implication is that the extent to which language models (LMs) can reflect, compress, or even simulate human thinking and reasoning may be a critical factor in determining the future direction of RL techniques and the ceiling of language-model capability.

If language is not the optimal medium for reasoning, which is quite possible, then what we currently see, such as o1's Chain-of-Thought (CoT) reasoning paths being predominantly in English, might be just the starting point. In the future, there could emerge formalized logical languages invented by AI itself that are more efficient for training and reasoning. Such a language might break away entirely from the framework of existing human languages while enabling more efficient reasoning and communication within AI systems.

From this perspective, the way AIs communicate with each other could surpass human language in efficiency and expressiveness, which might have profound implications for AI's future capabilities.

This paper was introduced in an article I mentioned earlier: