Decomposing Language Models Into Understandable Components. Anthropic appear to have made a major breakthrough with respect to the interpretability of Large Language Models:
“[...] we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand”
Recent articles
- Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac - 12th November 2024
- Visualizing local election results with Datasette, Observable and MapLibre GL - 9th November 2024
- Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5 - 7th November 2024