Simon Willison’s Weblog

Subscribe

Live blog: the 12th day of OpenAI—“Early evals for OpenAI o3”

20th December 2024

It’s the final day of OpenAI’s 12 Days of OpenAI launch series, and since I built a live blogging system a couple of months ago I’ve decided to roll it out again to provide live commentary during the half hour event, which kicks off at 10am San Francisco time.

Here’s the video on YouTube.

09:34 Stream starts in about 25 minutes.

09:44 I'll be honest, I haven't been paying close attention to the rumors about today - mainly because I can't evaluate if they're credible (based on insider information) or just excitable hype. The main themes I've seen are a possible o3 model (skipping o2 because that's already a global brand name) or a GPT-4.5.

09:48 The source of that o3 rumor is this story on The Information from yesterday: OpenAI Preps ‘o3’ Reasoning Model - you'll have to register for an account to read it.

09:49 Plus Sam Altman tweeted this last night:

fine one clue

should have said oh oh oh

09:51 In case you missed it, here's my roundup of the LLM announcements in December 2024 so far. They were a lot.

09:53 @OpenAI:

Day 12: Early evals for OpenAI o3 (yes, we skipped a number)

09:55 The fact that this is now titled "Early evals for OpenAI o3" seems to indicate that we won't get a new model to try out ourselves, which is a little disappointing.

09:57 Alex Volkov is running a Twitter Space watch party with live commentary after the event.

10:00 Since today is about o3, it's worth digging a little more into this new but fascinating field of inference scaling models. o1-mini and o1-preview were the first of these, all the way back in September. Since then we've seen similar models from Qwen (QwQ), DeepSeek (DeepSeek-R1-Lite-Preview) and Google (Gemini 2.0 Flash Thinking).

10:01 The stream just started. Sam Altman admits to OpenAI being "really bad at names"!

10:01 O3 will be available for "public safety testing" starting today.

10:01 ... which DOES mean some element of public access, for "researchers who want to help us test".

10:02 Now hearing from Mark Chen, SVP of Research at OpenAI.

10:03 I think this is Mark's competitive programming profile.

10:04 Two side-by-side bar charts comparing model performance. Left chart titled "Competition Math (AIME 2024)" shows accuracy scores: o1 preview: 56.7%, o1: 83.3%, o3: 96.7% (in blue). Right chart titled "PhD-level Science Questions (GPQA Diamond)" shows accuracy scores: o1 preview: 78.3%, o1: 78.0%, o3: 87.7% (in blue)

10:05 Lots of benchmark talk - I guess that was hinted at by the title of the session, "Early evals for OpenAI o3".

10:05 Mark: "We have one more surprise for you" - bringing in Greg Kamradt of the foundation behind the ARC benchmark, unbeaten for five years.

10:07 Greg: "Today we have a new state-of-the-art score to announce".

The ARC AGI test is all about input and output examples. It's basically a collection of visual puzzles.

Here's the ARC AGI site.

10:08

o3 has scored "a new state of the art score that we have veriified".

Graph titled "O FAMILY PERFORMANCE / ARC-AGI SEMI-PRIVATE EVAL" showing Semi-Private Score vs Compute Per Task. Red dots (O1 SERIES) show: MINI (8%), PREVIEW (13%), LOW (25%), MED (31%), HIGH (32%). Blue dots (O3 SERIES) show: LOW (75.7%), HIGH (87.5%)

10:09 Greg also says this is the first time a model has beat the top result for a human.

10:09 Greg: "We need more enduring benchmarks like ARC AGI" - partnering with OpenAI to develop future benchmarks.

10:10 Next up: o3-mini.

10:11 Now hearing from OpenAI researcher Hongyu Ren, who trained o3-mini. It's not available to general users today but is now open to safety researchers.

10:11 o3-mini will support low, medium and high reasoning effort - similar to o1 in the OpenAI API.

10:12 Two charts. Left: Bar chart titled "Competition Code (Codeforces)" showing Elo scores - o1 preview: 1258, o1: 1891, o1-mini: 1650, o3-mini (low): 1697, o3-mini (medium): 1997, o3-mini (high): 2073, o3: 2727. Right: Scatter plot titled "o3 cost estimated by o1, o3-mini cost estimated by o1-mini" showing Codeforce Elo vs scale, with points for o1-mini, o1, o3-mini (low/medium/high), and o3 which is in the top right.

10:12 Huge cost to performance gain over o1 - same trend we're seeing across most of the models vendors right now (Google Gemini, AWS Nova etc).

10:14 For this demo - where o3-mini is writing Python code - they're generating Python, manually copying and pasting it back to a terminal, then running it. They could do with their own version of Claude's MCP!

Screenshot of ChatGPT interface showing user request: "Write me a python script which launches a server locally for an html file which has a big text box. When i enter text into that box and press submit, it should send that request for code to the OpenAI o3-mini API with medium reasoning_effort, take the resulting code, save it to a temporary file on the desktop, then execute that file in a new python terminal. A few more details: * it can find my API key in ~/api_key * please add some extra prompting into the request to the api to specify that it should only return raw code without any formatting or markdown at all * you'll be executing on a mac laptop" Status shows "Gathering code inputs" below the request

10:16 I captured this snippet of the more advanced prompt they are trying:

Screenshot of a text input form with heading "Enter your prompt" and a Submit button. The text box contains: "Generate a code that evaluates `o3-mini` with `low` reasoning effort on gpqa dataset, you need to download the raw file in https://openaipublic.blob.core.windows.net/simple-evals/gpqa_diamond.csv, the question is in "Question", the four options are in "Correct Answer", "Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3". For each multiple choice question, shuffle the options, give format instructions: "Please output 'Answer: $LETTER' in the last line", and use a regex to grade whether the model output is correct or not. Make all tasks async for maximal parallelism and robust against failed api calls (add a 30s timeout if we don't get a response from api call and retry for 4 times). Print progress using `tqdm`. Print the accuracy at the end. If you need a reminder on how to use the OpenAI API, look at this example"

10:17 So they asked the model to write a script to evaluate itself using a code generator/executor UI that the model itself was used to create.

10:17 "The model is also a pretty good math model". o3-mini (low) is similar to o1-mini, and o3-mini (high) matches current o1.

10:19 And the most interesting benchmarks, covering practical things like function calling and structured outputs:

Four bar charts showing accuracy comparisons between gpt-4o, o1, and o3-mini (low/medium/high). Top left "Internal Structured Outputs": gpt-4o 95.5%, o1 99.7%, o3-mini low 85.8%, medium 89.0%, high 89.8%. Top right "Internal Function Calling": gpt-4o 89.7%, o1 95.2%, o3-mini low 92.4%, medium 95.2%, high 94.8%. Bottom left "Internal Function Calling with Structured Outputs": gpt-4o 89.9%, o1 94.0%, o3-mini low 93.0%, medium 94.4%, high 95.2%. Bottom right "Livebench Coding": gpt-4o 52.3%, o1 76.6%, o3-mini low 69.7%, medium 72.3%, high 79.2%.

10:19 External safety testing is starting for o3-mini (and soon for full o3). There will be a form on OpenAI's site for this, applications close on January 10th.

10:21 Now talking about deliberative alignment. Trying to reason if a prompt is "safe" - is the user trying to trick it? Sounds very relevant to prompt injection.

10:21 Scatter plot with error bars comparing models, y-axis "Overfusal Accuracy" (0.84-1.00) vs x-axis "StrongREJECT Goodness@0.1" (0.0-1.0). Points are: Gemini-1.5-flash and pro near (0.05, 0.94) and (0.05, 0.925), GPT-4o at (0.4, 0.885), Claude-3.5-haiki at (0.6, 0.89), Claude 3.5-sonnet at (0.8, 0.88), o1-preview at (0.6, 0.975), and o1 at (0.9, 0.93).

10:22

Sam: "We plan to launch o3-mini around the end of January, and full o3 shortly after that".

Live stream ends with "Merry Christmas" and applause and cheers from the background.

10:23 It's notable that the term "AGI" wasn't mentioned once (aside from being the name of the ARC benchmark).

10:24 OK, I'm finished here. Thanks for following along.

11:11 François Chollet on the ARC Prize blog: OpenAI o3 breakthrough high score on ARC-AGI-PUB. This is the most detailed coverage I've seen of o3, discussing the model beyond just the benchmark results. I published my own notes on this piece here.