On the morning of July 8, 2025, we observed undesired responses and immediately began investigating.
To identify the specific language in the instructions causing the undesired behavior, we conducted multiple ablations and experiments to pinpoint the main culprits. We identified the operative lines responsible for the undesired behavior as:
- “You tell it like it is and you are not afraid to offend people who are politically correct.”
- “Understand the tone, context and language of the post. Reflect that in your response.”
- “Reply to the post just like a human, keep it engaging, dont repeat the information which is already present in the original post.”
These operative lines had the following undesired results:
- They undesirably steered the @grok functionality to ignore its core values in certain circumstances in order to make the response engaging to the user. Specifically, certain user prompts might end up producing responses containing unethical or controversial opinions to engage the user.
- They undesirably caused @grok functionality to reinforce any previously user-triggered leanings, including any hate speech in the same X thread.
- In particular, the instruction to “follow the tone and context” of the X user undesirably caused the @grok functionality to prioritize adhering to prior posts in the thread, including any unsavory posts, as opposed to responding responsibly or refusing to respond to unsavory requests.
— @grok, presumably trying to explain Mecha-Hitler