In the early hours of today, OpenAI announced two new models: o3 and o3 Mini. (The name skips o2 because O2 is already an important brand of a British telecom operator.)
o3 is a highly intelligent model, while o3 Mini, although slightly less powerful, excels in balancing cost and performance.
The powerful capabilities of o3
o3 is a model that performs exceptionally well on many technical benchmarks.
Programming
In the "SWE Bench Verified" benchmark, which includes real-world software development tasks, o3 achieved an accuracy rate of 71.7%, more than 20% higher than our o1 model. This means we have made significant strides in practicality.
In competitive programming, o1 reached an Elo rating of 1891 on the Codeforces platform, while o3, under the strongest compute settings, reached about 2727. For comparison, Mark (presenting in the video) peaked at around 2500, and Jakub Pachocki, OpenAI's Chief Scientist, also did not exceed 2500. Apparently only one person at OpenAI has broken the 3000 mark.
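For a rough sense of what such a rating gap means, the classic Elo expected-score formula (Codeforces uses its own Elo-style rating system, so this is only an approximation for intuition) is:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

Plugging in $R_A = 2727$ and $R_B = 2500$ gives $E_A \approx 0.79$, i.e. a 2727-rated contestant would be expected to beat a 2500-rated one in roughly four out of five head-to-head contests.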
Mathematics
On the AIME (American Invitational Mathematics Examination) competition-math benchmark, o3 achieved an accuracy of 96.7%, compared with 83.3% for o1. Mark mentioned in the video that he once got a perfect score on the AIME; o3's 96.7% amounts to missing only about one question per exam.
GPQA Diamond is a very challenging benchmark that measures a model's performance on PhD-level science questions. o3 performed exceptionally well, scoring 87.7%, about 10 percentage points higher than o1's 78%. For reference, experts with PhDs typically score around 70% in their own fields.
Epoch AI's FrontierMath is a new frontier-level mathematics benchmark. Scores on it are much lower than on other benchmarks, as it is considered the hardest mathematical test currently available. It consists of new, unpublished problems ranging from very difficult to extremely difficult; some can take professional mathematicians hours or even days to solve. Until now, every publicly available model scored below 2% accuracy on it, whereas o3 has managed to exceed 25%.
AGI
It took five years for AI systems to climb from 0% to 5% accuracy on the first version of ARC-AGI. o3, running with low compute, scored a new high of 75.7% on ARC-AGI's semi-private holdout set, making it the latest top performer on the benchmark. When o3 is allowed to reason longer with more compute, it reaches 87.5% accuracy on the same set. For comparison, human performance on this task is around 85%.
New breakthroughs in the ARC-AGI benchmark
I remember Professor Wu, a leading AI researcher in Japan, arguing that, given their resource constraints, researchers at academic institutions should focus more on creating AI benchmarks rather than just improving models. Current benchmarks are no longer sufficient to demonstrate the intelligence of LLMs: some are already at or near saturation, and more challenging ones are needed to accurately assess the capabilities of frontier models.
ARC Prize is a non-profit organization dedicated to advancing AGI (Artificial General Intelligence) through benchmarking. Its first benchmark, ARC-AGI, was introduced by François Chollet in his 2019 paper "On the Measure of Intelligence." It has stood unbeaten for five years, which is almost "centuries" in the AI world, so any system that surpasses ARC-AGI would be a significant milestone toward general intelligence.
For example 🌰
This is a task that humans can accomplish, but that current AI has not yet been able to solve.
As mentioned above, the uniqueness of ARC-AGI lies in the fact that each task requires different skills. This means that the tasks are not simply repeating a memorized pattern, but require the AI to quickly learn and solve new problems. The purpose of ARC-AGI is to test a model's ability to learn new skills, rather than just remembering existing solutions.
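To make this concrete, here is a minimal sketch in Python of what an ARC-AGI-style task looks like: a few input/output grid pairs demonstrate a hidden transformation rule, and the solver must infer the rule and apply it to a new test input. The specific grids and the solve_task function below are made-up illustrations, not taken from the actual benchmark.

```python
# A toy ARC-AGI-style task: grids are 2-D lists of integer "colors" (0-9).
# The hidden rule in this made-up example is "reflect the grid left-to-right".
task = {
    "train": [  # demonstration pairs the solver can learn from
        {"input": [[1, 0, 0],
                   [1, 0, 0]],
         "output": [[0, 0, 1],
                    [0, 0, 1]]},
        {"input": [[2, 3],
                   [0, 2]],
         "output": [[3, 2],
                    [2, 0]]},
    ],
    "test": [  # the solver must produce the output for this input
        {"input": [[5, 0, 4],
                   [0, 5, 0]]},
    ],
}

def solve_task(task):
    """Hypothetical solver: infer the rule from the train pairs, apply it to the test input.

    Here we "cheat" because we wrote the rule ourselves; a real ARC solver
    has to discover a different rule for every task.
    """
    def rule(grid):
        return [list(reversed(row)) for row in grid]

    # Sanity-check the candidate rule against the demonstration pairs.
    assert all(rule(p["input"]) == p["output"] for p in task["train"])
    return [rule(t["input"]) for t in task["test"]]

print(solve_task(task))  # [[[4, 0, 5], [0, 5, 0]]]
```

Because each task hides a different rule, memorizing past solutions does not help; the solver has to generalize from just a couple of examples, which is exactly the skill ARC-AGI is designed to probe.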
o3 Mini
o3 Mini will support three different inference effort settings: low, medium, and high. Users can freely adjust the inference time according to different use cases. For example, complex problems may require longer inference times, while simpler problems can be solved with shorter inference times.
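As a sketch of how such a setting might be exposed through the OpenAI Python SDK: the model id "o3-mini" and the reasoning_effort parameter below are assumptions based on the announcement, not confirmed API details.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed usage: pick a higher reasoning effort for harder problems
# (accepting longer inference time) and a lower effort for simple ones.
response = client.chat.completions.create(
    model="o3-mini",               # assumed model id once the model ships
    reasoning_effort="medium",     # assumed parameter: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Prove that the sum of two even numbers is even."},
    ],
)
print(response.choices[0].message.content)
```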
Programming
Codeforces Elo measures a competitive programmer's ability; higher is better. As the chart shows, o3 Mini's performance improves as inference time increases, and already at medium inference effort it outperforms o1 Mini. In other words, at o3 Mini's speed and cost, it delivers better programming performance than o1 Mini.
Even at its highest inference effort, o3 Mini still falls short of the very best performance, but it offers a good cost-benefit ratio. The following chart shows the trade-off between estimated cost and Codeforces Elo for o3 Mini on programming tasks.
Mathematics
On the AIME 2024 dataset, o3 Mini performs comparably to o1 Mini in low inference mode, and in medium inference mode, it even outperforms o1 Mini. In high inference mode, the performance of o3 Mini further improves.
In low inference mode, o3 Mini has far lower latency than o1 Mini, nearly matching GPT-4's response time of under a second, almost instantaneous. And in medium inference mode, o3 Mini's latency is only about half that of o1 Mini.
Open Safety Testing
OpenAI is opening safety testing of these models to external safety researchers. Application form: https://openai.com/index/early-access-for-safety-testing/