Simon Willison’s Weblog


7 posts tagged “interpretability”

2025

Visual Features Across Modalities: SVG and ASCII Art Reveal Cross-Modal Understanding (via) New model interpretability research from Anthropic, this time focused on SVG and ASCII art generation.

We found that the same feature that activates over the eyes in an ASCII face also activates for eyes across diverse text-based modalities, including SVG code and prose in various languages. This is not limited to eyes – we found a number of cross-modal features that recognize specific concepts: from small components like mouths and ears within ASCII or SVG faces, to full visual depictions like dogs and cats. [...]

These features depend on the surrounding context within the visual depiction. For instance, an SVG circle element activates “eye” features only when positioned within a larger structure that activates “face” features.

And really, I can't not link to this one given the bonus they tagged on at the end!

As a bonus, we also inspected features for an SVG of a pelican riding a bicycle, first popularized by Simon Willison as a way to test a model's artistic capabilities. We find features representing concepts including "bike", "wheels", "feet", "tail", "eyes", and "mouth" activating over the corresponding parts of the SVG code.

Diagram showing a pelican riding a bicycle illustration alongside its SVG source code. The left side displays two versions: a completed color illustration at top with a white pelican with yellow beak on a red bicycle with blue wheels (labeled "Bike" and "Wheels"), and a line drawing sketch below with labels "Fur/Wool", "Eyes", "Mouth", "Tail", and "Bird". The right side shows the corresponding SVG XML code with viewBox, rect, ellipse, circle, and path elements defining the illustration's geometry and styling.

Now that they can identify model features associated with visual concepts in SVG images, can they use those features for steering?

It turns out they can! Starting with a smiley SVG (provided as XML with no indication of what it depicted), applying a negative score to the "smile" feature produced a frown instead, and the same trick worked on ASCII art as well.

They could also boost features like unicorn, cat, owl, or lion and get new SVG smileys clearly attempting to depict those creatures.
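The mechanics of this kind of feature steering are worth spelling out: take the model's hidden state at some layer and add (or subtract) a multiple of the learned feature's direction before letting the forward pass continue. Here's a minimal sketch of that idea in Python - every name, shape, and number is illustrative, not Anthropic's actual code or API:

```python
import numpy as np

def steer_activation(residual: np.ndarray,
                     feature_direction: np.ndarray,
                     scale: float) -> np.ndarray:
    """Nudge a hidden state along a feature direction.

    A positive scale boosts the concept (e.g. "smile"); a negative
    scale suppresses it, which is how a smiley can be flipped into a
    frown. `residual` stands in for one token position's activation
    and `feature_direction` for a learned feature vector - both are
    stand-ins for illustration only.
    """
    unit = feature_direction / np.linalg.norm(feature_direction)
    return residual + scale * unit

# Hypothetical usage: suppress a "smile" feature while the model writes SVG.
hidden_size = 4096                                   # illustrative dimension
rng = np.random.default_rng(0)
residual = rng.standard_normal(hidden_size)          # stand-in activation
smile_direction = rng.standard_normal(hidden_size)   # stand-in feature vector
steered = steer_activation(residual, smile_direction, scale=-8.0)
```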

Diagram showing a yellow smiley face in the center with bidirectional arrows connecting to six different circular faces arranged around it, with text above asking "What can this face be steered into?" The surrounding faces are labeled clockwise from top left: "Unicorn" (pink circle with yellow triangle horn and diamond earrings), "Cat" (gray circle with triangular ears and small nose), "Wrinkles" (beige circle with eyelashes and wrinkle lines), "Owl" (brown circle with large round eyes and small beak), "Lion" (orange circle with yellow inner face), and "Eye" (white circle with large black pupil and highlight).

I'd love to see how this behaves if you jack up the feature for the Golden Gate Bridge.

# 25th October 2025, 3:08 am / svg, ai, generative-ai, llms, anthropic, interpretability, pelican-riding-a-bicycle

Tracing the thoughts of a large language model. In a follow-up to the research that brought us the delightful Golden Gate Claude last year, Anthropic have published two new papers about LLM interpretability: "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model".

To my own personal delight, neither of these papers are published as PDFs. They're both presented as glorious mobile friendly HTML pages with linkable sections and even some inline interactive diagrams. More of this please!

Screenshot of a multilingual language model visualization showing antonym prediction across three languages. Left panel shows the English prompt "The opposite of 'small' is" predicting "large". Middle panel shows the Chinese prompt "'小'的反义词是" predicting "大 (zh: big)". Right panel shows the French prompt "Le contraire de 'petit' est" predicting "grand (fr: big)". Above shows activation analysis with token predictions and highlighted instances of "contraire" in French text.

# 27th March 2025, 9:51 pm / pdf, ai, generative-ai, llms, anthropic, claude, interpretability

2024

Extracting Concepts from GPT-4. A few weeks ago Anthropic announced they had extracted millions of understandable features from their Claude 3 Sonnet model.

Today OpenAI are announcing a similar result against GPT-4:

We used new scalable methods to decompose GPT-4’s internal representations into 16 million oft-interpretable patterns.

These features are "patterns of activity that we hope are human interpretable". The release includes code and a paper, Scaling and evaluating sparse autoencoders (PDF), which credits nine authors, two of whom - Ilya Sutskever and Jan Leike - are high profile figures who left OpenAI within the past month.
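The method behind those 16 million patterns is a sparse autoencoder trained on internal activations; the paper's variant keeps only the k largest latent values per example. Here's a minimal forward-pass sketch, assuming toy dimensions and random weights - nothing here is OpenAI's actual code:

```python
import numpy as np

def topk_sae_forward(activation, W_enc, b_enc, W_dec, b_dec, k=32):
    """One forward pass of a TopK sparse autoencoder (illustrative sketch).

    activation: (d_model,) hidden state from the model being studied.
    W_enc: (n_features, d_model), W_dec: (d_model, n_features).
    Only the k largest latent values survive; each surviving latent is a
    "feature" whose decoder column gives its direction in activation space.
    """
    latents = W_enc @ (activation - b_dec) + b_enc      # encode
    threshold = np.partition(latents, -k)[-k]           # k-th largest value
    sparse = np.where(latents >= threshold, latents, 0.0)  # enforce sparsity
    reconstruction = W_dec @ sparse + b_dec              # decode
    return sparse, reconstruction

# Illustrative sizes (the real autoencoder has up to 16 million latents).
d_model, n_features = 768, 4096
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
W_enc = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_model, n_features)) / np.sqrt(n_features)
codes, x_hat = topk_sae_forward(x, W_enc, np.zeros(n_features),
                                W_dec, np.zeros(d_model))
```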

The most fun part of this release is the interactive tool for exploring features. This highlights some interesting features on the homepage, or you can hit the "I'm feeling lucky" button to bounce to a random feature. The most interesting I've found so far is feature 5140 which seems to combine God's approval, telling your doctor about your prescriptions and information passed to the Admiralty.

This note shown on the explorer is interesting:

Only 65536 features available. Activations shown on The Pile (uncopyrighted) instead of our internal training dataset.

Here's the full Pile Uncopyrighted, which I hadn't seen before. It's the standard Pile but with everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2 subsets removed.

# 6th June 2024, 8:54 pm / ai, openai, generative-ai, gpt-4, llms, interpretability, training-data

Golden Gate Claude. This is absurdly fun and weird. Anthropic's recent LLM interpretability research gave them the ability to locate features within the opaque blob of their Sonnet model and boost the weight of those features during inference.

For a limited time only they're serving a "Golden Gate Claude" model which has the feature for the Golden Gate Bridge boosted. No matter what question you ask it, the Golden Gate Bridge is likely to be involved in the answer in some way. Click the little bridge icon in the Claude UI to give it a go.

I asked for names for a pet pelican and the first one it offered was this:

Golden Gate - This iconic bridge name would be a fitting moniker for the pelican with its striking orange color and beautiful suspension cables.

And from a recipe for chocolate covered pretzels:

Gently wipe any fog away and pour the warm chocolate mixture over the bridge/brick combination. Allow to air dry, and the bridge will remain accessible for pedestrians to walk along it.

UPDATE: I think the experimental model is no longer available, approximately 24 hours after release. We'll miss you, Golden Gate Claude.

# 24th May 2024, 8:17 am / ai, generative-ai, llms, anthropic, claude, interpretability, llm-release

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (via) Big advances in the field of LLM interpretability from Anthropic, who managed to extract millions of understandable features from their production Claude 3 Sonnet model (the mid-point between the inexpensive Haiku and the GPT-4-class Opus).

Some delightful snippets in here such as this one:

We also find a variety of features related to sycophancy, such as an empathy / “yeah, me too” feature 34M/19922975, a sycophantic praise feature 1M/847723, and a sarcastic praise feature 34M/19415708.

# 21st May 2024, 6:25 pm / ai, generative-ai, llms, anthropic, claude, interpretability

ColBERT query-passage scoring interpretability (via) Neat interactive visualization tool for understanding what the ColBERT embedding model does—this works by loading around 50MB of model files directly into your browser and running them with WebAssembly.
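ColBERT's "late interaction" scoring, which the tool visualizes, is easy to sketch: every query token embedding is matched against its most similar passage token embedding, and those per-token maxima are summed into the relevance score. A minimal version with stand-in random embeddings (the 128-dimensional vectors match ColBERT's usual setup; everything else is illustrative):

```python
import numpy as np

def colbert_score(query_embs: np.ndarray, passage_embs: np.ndarray) -> float:
    """ColBERT MaxSim scoring sketch.

    query_embs:   (num_query_tokens, dim) per-token query embeddings.
    passage_embs: (num_passage_tokens, dim) per-token passage embeddings.
    Each query token is matched to its most similar passage token, and
    the per-token maxima are summed into a single relevance score.
    """
    # cosine similarity via normalised dot products
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sim = q @ p.T           # (num_query_tokens, num_passage_tokens)
    return float(sim.max(axis=1).sum())

# Illustrative usage with random stand-in embeddings.
rng = np.random.default_rng(0)
score = colbert_score(rng.standard_normal((8, 128)),
                      rng.standard_normal((120, 128)))
```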

# 28th January 2024, 4:49 pm / ai, webassembly, embeddings, interpretability

2023

Decomposing Language Models Into Understandable Components. Anthropic appear to have made a major breakthrough with respect to the interpretability of Large Language Models:

“[...] we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand”
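Put concretely, "linear combinations of neuron activations" means each feature is a direction in activation space, and an observed activation vector is approximated as a sparse weighted sum of those directions. A toy sketch of that view, with every number illustrative:

```python
import numpy as np

# Toy dictionary-learning view: each column of `features` is a direction
# in neuron-activation space; an activation vector is approximated as a
# sparse weighted sum of those directions. All values are illustrative.
rng = np.random.default_rng(0)
n_neurons, n_features = 512, 2048
features = rng.standard_normal((n_neurons, n_features))

coeffs = np.zeros(n_features)
coeffs[[3, 1057, 1999]] = [2.1, 0.7, 1.3]   # only a few features "fire"

activation = features @ coeffs               # reconstructed neuron activations
```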

# 8th October 2023, 3:43 pm / ai, generative-ai, llms, anthropic, interpretability