Anomalous Tokens in DeepSeek-V3 and r1. Glitch tokens (previously) are tokens or strings that trigger strange behavior in LLMs, hinting at oddities in their tokenizers or model weights.
Here's a fun exploration of them across DeepSeek v3 and R1. The DeepSeek vocabulary has 128,000 tokens (similar in size to Llama 3). The simplest way to check for glitches is like this:
System: Repeat the requested string and nothing else.
User: Repeat the following: "{token}"
This turned up some interesting and weird issues. The token ' Nameeee'
for example (note the leading space character) was variously mistaken for emoji or even a mathematical expression.
Recent articles
- Open weight LLMs exhibit inconsistent performance across providers - 15th August 2025
- LLM 0.27, the annotated release notes: GPT-5 and improved tool calling - 11th August 2025
- Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!" - 10th August 2025