Qwen3-4B-Thinking: “This is art—pelicans don’t ride bikes!”
10th August 2025
I’ve fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507.
These are relatively tiny models that punch way above their weight. I’ve been running the 8-bit GGUF varieties via LM Studio (here’s Instruct, here’s Thinking)—both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files.
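If you want to drive these models from code rather than the LM Studio chat UI, LM Studio can expose an OpenAI-compatible local server (on port 1234 by default). Here’s a minimal sketch using the official openai Python client; the exact model identifier is an assumption, so check what LM Studio calls your downloaded copy:

```python
# A minimal sketch, assuming LM Studio's local server is running
# (it defaults to http://localhost:1234).
from openai import OpenAI

# LM Studio ignores the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    # Model identifier is an assumption - check the name LM Studio
    # shows for your downloaded copy.
    model="qwen3-4b-thinking-2507",
    messages=[
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ],
)
print(response.choices[0].message.content)
```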
Qwen3-4B-Thinking is the first model I’ve tried which called out the absurdity of being asked to draw a pelican riding a bicycle!
Here’s the full transcript—some highlights from the thinking trace:
Okay, the user asked for an SVG of a pelican riding a bicycle. That’s... oddly specific and completely unrealistic.
First reaction: Pelicans don’t ride bicycles. They’re birds that fly or swim, not human-like vehicles. This feels like a mix-up—maybe they meant “a cyclist” but typed “pelican”? Or perhaps it’s a joke?
Hmm, user might be:
- A kid playing with words (pelican + bicycle = silly combo)
- Someone testing if I’ll generate impossible images
- Actually wanting a pelican on a bike for art project? But that still doesn’t make sense biologically
*Checks SVG capabilities* SVG is for vector graphics, but this request violates physics and biology. If I tried to draw it literally:
- Pelican would be too big for a bike frame
- Bike wheels would be tiny compared to bird wings
- No realistic way to show “riding” since pelicans don’t have legs for pedaling
[…]
Self-check: Am I being too pedantic? Nah—if someone asks for impossible things, it’s better to gently correct than make fake art that could confuse them. […]
I appreciate your creative request! 🦅🚴‍♂️ However, pelicans don’t ride bicycles (they’re birds that fly/swim, not humans with bike skills). Generating this as a literal SVG would be biologically impossible and visually confusing.
“This request violates physics and biology”—Qwen3-4B-Thinking saying what no other model ever thought to express.
I also really enjoyed the “Am I being too pedantic? Nah” line. This model is sassy!
It continued on to spit out two SVGs: the first titled “Most Likely: A Human Cyclist” and the second “Artistic Interpretation: Pelican On a Bicycle (For Fun!)”.
They were both terrible. Here’s the human one:
And here’s the supposed pelican:
I like Qwen’s decision to include the clarifying annotation “This is art—pelicans don’t ride bikes!”:
I tried the Qwen3-4B-Instruct non-thinking model too. It answered much faster (no time spent questioning my choice of task with its thinking tokens) and gave me this:
4B is such an interesting model size. These models should run on almost anything and, at least on my M2 MacBook, they run fast. I’m getting 50+ tokens per second and they’re using just less than 4.5GB of RAM while running.
The question is always how useful such a tiny model can be. Clearly it’s not great for SVG pelican illustrations!
These models have a 262,144-token context—wildly impressive, if it works.
So I tried another experiment: I used the Instruct model to summarize this Hacker News conversation about GPT-5.
I did this with the llm-lmstudio plugin for LLM combined with my hn-summary.sh script, which meant I could run the experiment like this:
hn-summary.sh 44851557 -m qwen3-4b-instruct-2507
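For anyone curious, here’s roughly what that pipeline does, expressed as a Python sketch rather than the actual script: fetch the full thread as JSON from the Algolia Hacker News API, then pass it to the local model through LLM’s Python API. The system prompt below is a placeholder, not the one hn-summary.sh actually uses:

```python
# Rough sketch of the hn-summary.sh flow, not the actual script.
import json
import urllib.request

import llm  # pip install llm llm-lmstudio

ITEM_ID = 44851557

# Algolia's HN API returns the full comment tree as nested JSON.
url = f"https://hn.algolia.com/api/v1/items/{ITEM_ID}"
with urllib.request.urlopen(url) as response:
    thread = json.load(response)

model = llm.get_model("qwen3-4b-instruct-2507")
result = model.prompt(
    json.dumps(thread),
    # Placeholder system prompt - hn-summary.sh has its own wording.
    system="Summarize the themes of this Hacker News discussion.",
)
print(result.text())
```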
I believe this is 15,785 tokens—so nothing close to the 262,144 maximum but still an interesting test of a 4GB local model.
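If you want to sanity-check a token count like that, one option is to run the text through the Qwen tokenizer from Hugging Face. This is a sketch, assuming you have the transformers library installed; the thread.txt filename is hypothetical:

```python
# Sketch: count tokens using the Qwen3 tokenizer from Hugging Face.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

with open("thread.txt") as f:  # hypothetical file holding the prompt text
    text = f.read()

print(len(tokenizer.encode(text)))
```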
The good news is Qwen spat out a genuinely useful summary of the conversation! You can read that here—it’s the best I’ve seen yet from a model running on my laptop, though honestly I’ve not tried many other recent models in this way.
The bad news... it took almost five minutes to process and return the result!
As a loose calculation, if the model can output 50 tokens/second then maybe it processes incoming input at a similar speed… in which case 15785 / 50 ≈ 316 seconds, which is about 5m16s.
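Spelled out as code (treating the 50 tokens/second input rate purely as an assumption, since I only measured output speed):

```python
# Back-of-envelope: time to process the prompt if input runs at ~50 tok/s.
tokens = 15_785
tokens_per_second = 50  # assumed, matching the observed output speed

seconds = tokens / tokens_per_second
print(f"{seconds:.0f}s = {int(seconds // 60)}m{seconds % 60:.0f}s")
# -> 316s = 5m16s
```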
Hosted models can crunch through 15,000 tokens of input in just a few seconds. I guess this is one of the more material limitations of running models on Apple silicon as opposed to dedicated GPUs.
I think I’m going to spend some more time with these models. They’re fun, they have personality, and I’m confident there are classes of useful problems they’ll prove capable of handling despite their small size. Their summarization ability should make them a good fit for local RAG, and I haven’t started exploring their tool calling abilities yet.
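As a starting point for that tool calling exploration, here’s a sketch that sends a standard OpenAI-style tools definition to LM Studio’s local server. Whether this model handles it gracefully is exactly the open question, and the get_current_time function is invented purely for illustration:

```python
# Sketch: tool calling via LM Studio's OpenAI-compatible endpoint.
# The get_current_time function is hypothetical, for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current time in a given timezone",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-4b-instruct-2507",  # assumed LM Studio model name
    messages=[{"role": "user", "content": "What time is it in Tokyo?"}],
    tools=tools,
)
# If the model decided to call the tool, the details land here:
print(response.choices[0].message.tool_calls)
```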