19th April 2023 - Link Blog
Inside the secret list of websites that make AI chatbots sound smart. Washington Post story digging into the C4 dataset—Colossal Clean Crawled Corpus, a filtered version of Common Crawl that’s often used for training large language models. They include a neat interactive tool for searching a domain to see if it’s included—TIL that simonwillison.net is the 106,649th ranked site in C4 by number of tokens, 189,767 total—0.0001% of the total token volume in C4.
Recent articles
- Live blog: Code w/ Claude 2026 - 6th May 2026
- Vibe coding and agentic engineering are getting closer than I'd like - 6th May 2026
- LLM 0.32a0 is a major backwards-compatible refactor - 29th April 2026