OpenAI o3-mini, now available in LLM
31st January 2025
OpenAI’s o3-mini is out today. As with other o-series models it’s a slightly difficult one to evaluate—we now need to decide if a prompt is best run using GPT-4o, o1, o3-mini or (if we have access) o1 Pro.
Confusing matters further, the benchmarks in the o3-mini system card (PDF) aren’t a universal win: o3-mini generally scores higher than GPT-4o and o1, but not in every category.
The biggest win for o3-mini is on the Codeforces ELO competitive programming benchmark, which I think is described by this 2nd January 2025 paper, with the following scores:
- o3-mini (high) 2130
- o3-mini (medium) 2036
- o1 1891
- o3-mini (low) 1831
- o1-mini 1650
- o1-preview 1258
- GPT-4o 900
Weirdly, that GPT-4o score appeared in an older copy of the System Card PDF, which has since been replaced by an updated document that doesn’t mention Codeforces ELO scores at all.
One note from the System Card that stood out for me concerns OpenAI’s own intended applications of o3-mini:
> We also plan to allow users to use o3-mini to search the internet and summarize the results in ChatGPT. We expect o3-mini to be a useful and safe model for doing this, especially given its performance on the jailbreak and instruction hierarchy evals detailed in Section 4 below.
This is notable because the existing o1 models in ChatGPT have not yet had access to ChatGPT’s web search tool, despite the mixture of search and “reasoning” models having very clear benefits.
o3-mini does not and will not support vision. We will have to wait for future OpenAI reasoning models for that.
I released LLM 0.21 with support for the new model, plus its `-o reasoning_effort high` (or `medium` or `low`) option for tweaking the reasoning effort. Details in this issue.
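Here’s what that looks like from the command line (a quick sketch, with an arbitrary prompt):

```bash
# The default reasoning effort is medium; dial it down or up per prompt
llm -m o3-mini 'Explain the Byzantine Generals problem'
llm -m o3-mini -o reasoning_effort low 'Explain the Byzantine Generals problem'
llm -m o3-mini -o reasoning_effort high 'Explain the Byzantine Generals problem'
```

Higher effort spends more (billed) reasoning tokens before answering, so it trades cost and latency for answer quality.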
Note that the new model is currently only available for Tier 3 and higher users, which requires you to have spent at least $100 on the API.
o3-mini is priced at $1.10/million input tokens, $4.40/million output tokens—less than half the price of GPT-4o (currently $2.50/$10) and massively cheaper than o1 ($15/$60). The GPT-4o comparison isn’t quite as simple as that though, as o3-mini’s invisible reasoning tokens still count towards the output tokens you get charged for.
I tried using it to summarize this conversation about o3-mini on Hacker News, using my hn-summary.sh script.
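The script is essentially a thin wrapper around curl, jq and llm. Here’s a simplified sketch of the pattern, where the Algolia endpoint is real but the jq filter and prompt wording are representative rather than verbatim:

```bash
#!/bin/bash
# Fetch the full Hacker News thread from the Algolia API, flatten
# every comment's text, and pipe the lot to llm for summarizing.
# Usage: hn-summary.sh <item-id> [extra llm arguments]
id="$1"
shift
curl -s "https://hn.algolia.com/api/v1/items/$id" \
  | jq -r 'recurse(.children[]) | .text // empty' \
  | llm "$@" -s 'Summarize the themes of the opinions expressed here.'
```

Any extra arguments are passed straight through to llm, including the model selection: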
```bash
hn-summary.sh 42890627 -m o3-mini
```
Here’s the result—it used 18,936 input tokens and 2,905 output tokens for a total cost of 3.3612 cents.
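That lines up with the published prices. A quick sanity check:

```bash
# input tokens at $1.10/M plus output tokens at $4.40/M;
# dividing by 1,000,000 gives dollars, multiplying by 100 gives cents,
# so dividing by 10,000 gives the cost in cents directly
echo "scale=5; (18936 * 1.10 + 2905 * 4.40) / 10000" | bc
# 3.36116
```

Which rounds to 3.3612 cents.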
o3-mini (and o1-mini) are text-only models: they don’t accept image inputs. The full o1 API model can accept images in the same way as GPT-4o.
Another characteristic worth noting is o3-mini’s token output limit—the measure of how much text it can output in one go. That’s 100,000 tokens, compared to 16,000 for GPT-4o and just 8,000 for both DeepSeek R1 and Claude 3.5.
Invisible “reasoning tokens” come out of the same budget, so it’s likely not possible to have it output the full 100,000.
The model accepts up to 200,000 tokens of input, an improvement on GPT-4o’s 128,000.
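One API detail worth knowing if you call the model directly rather than through a wrapper like LLM: the o-series models take a `max_completion_tokens` parameter in place of the older `max_tokens`, and that cap covers the invisible reasoning tokens as well as the visible output. A minimal sketch using curl (the prompt and the cap here are arbitrary):

```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "reasoning_effort": "medium",
    "max_completion_tokens": 5000,
    "messages": [{"role": "user", "content": "Outline the rules of Go in 200 words"}]
  }'
```

The usage block in the response reports reasoning tokens separately, as completion_tokens_details.reasoning_tokens, which is the easiest way to see how much of that budget went on thinking.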
An application where output limits really matter is translation between human languages, where the output can realistically be expected to be of similar length to the input. It will be interesting to see how well o3-mini works for that, especially given its low price.
Update: Here’s a fascinating comment on this by professional translator Tom Gally on Hacker News:
> I just did a test in which both R1 and o3-mini got worse at translation in the latter half of a long text. [...]
>
> An initial comparison of the output suggested that, while R1 didn’t seem bad, o3-mini produced a writing style closer to what I asked for in the prompt—smoother and more natural English. But then I noticed that the output length was 5,855 characters for R1, 9,052 characters for o3-mini, and 11,021 characters for my own polished version. Comparing the three translations side-by-side with the original Japanese, I discovered that R1 had omitted entire paragraphs toward the end of the speech, and that o3-mini had switched to a strange abbreviated style (using slashes instead of “and” between noun phrases, for example) toward the end as well. The vanilla versions of ChatGPT, Claude, and Gemini that I ran the same prompt and text through a month ago had had none of those problems.