The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
For system prompt (SP) development you:
- Write a test set of messages where the model fails, i.e. where the default behavior isn't what you want
- Find an SP that causes those tests to pass
- Find messages the SP is misapplied to and fix the SP
- Expand your test set & repeat
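As a rough illustration, the loop above can be sketched in Python. Everything here is an assumption made for the example: `PromptTest`, `run_tests`, `first_passing_prompt` and `toy_model` are hypothetical names, and `toy_model` is a deterministic stand-in for a real LLM call so the sketch runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical type: a "model" is any function (system_prompt, message) -> reply.
Model = Callable[[str, str], str]


@dataclass
class PromptTest:
    message: str                  # user message sent to the model
    check: Callable[[str], bool]  # returns True if the reply is acceptable


def run_tests(model: Model, system_prompt: str, tests: List[PromptTest]) -> List[PromptTest]:
    """Return the tests that fail under the given system prompt."""
    return [t for t in tests if not t.check(model(system_prompt, t.message))]


def first_passing_prompt(model: Model, candidates: List[str],
                         tests: List[PromptTest]) -> Optional[str]:
    """Return the first candidate prompt that passes every test, or None."""
    for sp in candidates:
        if not run_tests(model, sp, tests):
            return sp
    return None


def toy_model(system_prompt: str, message: str) -> str:
    """Toy stand-in for an LLM (assumption): echoes the message and
    reacts to two phrases in the system prompt."""
    sp = system_prompt.lower()
    reply = f"echo: {message}"
    wants_upper = "uppercase" in sp
    spares_code = "except code" in sp and message.startswith("code:")
    if wants_upper and not spares_code:
        reply = reply.upper()
    return reply


candidates = [
    "Be helpful.",
    "Reply in uppercase.",
    "Reply in uppercase, except code.",
]

# Step 1: a message where the default behavior is not what we want.
wants_shouting = PromptTest("hello there", lambda r: r.isupper())
# Step 2: find a prompt that makes it pass.
first = first_passing_prompt(toy_model, candidates, [wants_shouting])

# Step 3: a message the prompt is misapplied to (code should stay untouched).
keeps_code = PromptTest("code: print(x)", lambda r: "print(x)" in r)
# Step 4: expand the test set and search again.
revised = first_passing_prompt(toy_model, candidates, [wants_shouting, keeps_code])
```

With a real model, the `check` functions would likely be fuzzier (for example, a grader model or a regex), and the candidate prompts would be written by hand between runs rather than picked from a fixed list.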