Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson
26th November 2025
I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison.
I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I’m using what it produced almost verbatim here—I tidied it up a tiny bit and added a bunch of supporting links.
-
What is data journalism and why it’s the most interesting application of data analytics [02:03]
“There’s this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It’s effectively data analytics, but applied to the world of news gathering. And I think it’s fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist.”
-
The origin story of Django at a small Kansas newspaper [02:31]
"We had a year’s paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty. And at the time we thought we were building a content management system."
-
Building the “Downloads Page”—a dynamic radio player of local bands [03:24]
"Adrian built a feature of the site called the Downloads Page. And what it did is it said, okay, who are the bands playing at venues this week? And then we’ll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week."
-
Working at The Guardian on data-driven reporting projects [04:44]
“I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you’re presenting, I just feel that’s a great way of building more credibility in the reporting process.”
-
Washington Post’s opioid crisis data project and sharing with local newspapers [05:22]
"Something the Washington Post did that I thought was extremely forward thinking is that they shared [the opioid files] with other newspapers. They said, ’Okay, we’re a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?’"
-
NICAR conference and the collaborative, non-competitive nature of data journalism [07:00]
“It’s all about trying to figure out what is the most value we can get out of this technology as an industry as a whole.”
-
ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02]
"The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it’s a very healthy newsroom. They do amazing data reporting... And I believe they’re almost breaking even on subscription revenue [correction, not yet], which is astonishing."
-
The “shower revelation” that led to Datasette—SQLite on serverless hosting [10:31]
“It was literally a shower revelation. I was in the shower thinking about serverless and I thought, ’hang on a second. So you can’t use Postgres on serverless hosting, but if it’s a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?’”
-
Datasette’s plugin ecosystem and the vision of solving data publishing [12:36]
“In the past I’ve thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who’s going to solve data like publishing tables full of data on the internet? So that was my original goal.”
-
Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59]
“Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York.”
-
Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40]
“It turns out the Russian FSB, their secret police, have an office that’s not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I’m like, ’Wow, that’s going to get me thrown out of a window.’”
Bellingcat: Food Delivery Leak Unmasks Russian Security Agents
-
The frustration of open source: no feedback on how people use your software [16:14]
“An endless frustration in open source is that you really don’t get the feedback on what people are actually doing with it.”
-
Open office hours on Fridays to learn how people use Datasette [16:49]
"I have an open office hours Calendly, where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that’s been a revelation. I’ve had hundreds of conversations in the past few years with people."
-
Data cleaning as the universal complaint—95% of time spent cleaning [17:34]
“I know every single person I talk to in data complains about the cleaning that everyone says, ’I spend 95% of my time cleaning the data and I hate it.’”
-
Version control problems in data teams—Python scripts on laptops without Git [17:43]
“I used to work for a large company that had a whole separate data division and I learned at one point that they weren’t using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly.”
-
The Carpentries organization teaching scientists Git and software fundamentals [18:12]
"There’s an organization called The Carpentries. Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that."
-
Data documentation as an API contract problem [21:11]
“A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren’t doing it.”
-
The importance of “view source” on business reports [23:21]
“If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I’m thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%.”
-
Fact-checking process for data reporting [24:16]
“Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it’s the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they’re confident enough to publish them.”
-
Queries as first-class citizens with version history and comments [27:16]
“I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that’s changed over time and be able to post comments on it.”
-
Two types of documentation: official docs vs. temporal/timestamped notes [29:46]
“There’s another type of documentation which I call temporal documentation where effectively it’s stuff where you say, ’Okay, it’s Friday, the 31st of October and this worked.’ But the timestamp is very prominent and if somebody looks that in six months time, there’s no promise that it’s still going to be valid to them.”
-
Starting an internal blog without permission—instant credibility [30:24]
“The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it.”
-
Building a search engine across seven documentation systems [31:35]
“It turns out, once you get a search engine over the top, it’s good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company.”
-
The TIL (Today I Learned) blog approach—celebrating learning basics [33:05]
"I’ve done TILs about ’for loops’ in Bash, right? Because okay, everyone else knows how to do that. I didn’t... It’s a value statement where I’m saying that if you’ve been a professional software engineer for 25 years, you still don’t know everything. You should still celebrate figuring out how to learn ’for loops’ in Bash."
-
Coding agents like Claude Code and their unexpected general-purpose power [34:53]
“They pretend to be programming tools but actually they’re basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything.”
-
Skills for Claude—markdown files for census data, visualization, newsroom standards [36:16]
“Imagine a markdown file for census data. Here’s where to get census data from. Here’s what all of the columns mean. Here’s how to derive useful things from that. And then you have another skill for here’s how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this.”
-
The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22]
“The terminal is now accessible to people who never learned the terminal before ’cause you don’t have to remember all the commands because the LLM knows the commands for you. But isn’t that fascinating that the cutting edge software right now is it’s like 1980s style— I love that. It’s not going to last. That’s a current absurdity for 2025.”
-
Cursor for data? Generic agent loops vs. data-specific IDEs [38:18]
“More of a notebook interface makes a lot more sense than a Claude Code style terminal ’cause a Jupyter Notebook is effectively a terminal, it’s just in your browser and it can show you charts.”
-
Future of BI tools: prompt-driven, instant dashboard creation [39:54]
“You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we’ll do a bar chart over time and these numbers feel big so we’ll put those in a big green box.”
-
Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06]
“LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff.”
-
LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36]
“You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, ’here’s a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,’ and they just do it.”
-
Data enrichment: running cheap models in loops against thousands of records [44:36]
“There’s something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well.”
-
Multimodal LLMs for images, audio transcription, and video processing [45:42]
“At one point I calculated that using Google’s least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive.”
Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents
-
First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54]
“I hated C++ ’cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something.”
-
Biggest production bug: crashing The Guardian’s MPs expenses site with a progress bar [47:46]
“I tweeted a screenshot of that progress bar and said, ’Hey, look, we have a progress bar.’ And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar.”
-
Favorite test dataset: San Francisco’s tree list, updated several times a week [48:44]
"There’s 195,000 trees in this CSV file and it’s got latitude and longitude and species and age when it was planted... and get this, it’s updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can’t figure out who."
-
Showrunning TV shows as a management model—transferring vision to lieutenants [50:07]
“Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I’m like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them.”
The Eleven Laws of Showrunning by Javier Grillo-Marxuach
-
Hot take: all executable code with business value must be in version control [52:21]
“I think it’s inexcusable to have executable code that has business value that is not in version control somewhere.”
-
Hacker News automation: GitHub Actions scraping for notifications [52:45]
"I’ve got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire."
-
Dream project: whale detection camera with Gemini AI [53:47]
“I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there’s a whale.”
-
Favorite podcast: Mark Steel’s in Town (hyperlocal British comedy) [54:23]
“Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing.”
Mark Steel’s in Town available episodes
-
Favorite fiction genre: British wizards caught up in bureaucracy [55:06]
“My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings.”
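The read-only SQLite idea from the Datasette origin story above [10:31] can be sketched in a few lines of Python. This is an illustration only, not how Datasette itself is implemented: the `trees.db` file name, the `trees` table, its three rows and the helper function names are all made up for the example. It assumes a build step that bakes the data into one file and a serving step that opens that file with SQLite's `mode=ro` URI parameter so writes are rejected.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_database(path: Path) -> None:
    """Build step: bake the data into a single SQLite file (hypothetical schema)."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute("CREATE TABLE trees (id INTEGER PRIMARY KEY, species TEXT)")
        conn.executemany(
            "INSERT INTO trees (species) VALUES (?)",
            # Illustrative rows, not real data
            [("Coast Live Oak",), ("London Plane",), ("Coast Live Oak",)],
        )
    conn.close()


def open_read_only(path: Path) -> sqlite3.Connection:
    """Serve step: open the shipped file read-only via a SQLite URI."""
    uri = path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def count_by_species(conn: sqlite3.Connection) -> list:
    """Run an ordinary SELECT against the read-only connection."""
    sql = "SELECT species, COUNT(*) FROM trees GROUP BY species ORDER BY species"
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "trees.db"
        build_database(db_path)
        conn = open_read_only(db_path)
        print(count_by_species(conn))
        try:
            conn.execute("INSERT INTO trees (species) VALUES ('Ginkgo')")
        except sqlite3.OperationalError as error:
            # The database was opened read-only, so the write is refused
            print("write rejected:", error)
        conn.close()
```

Because the file never changes after the build step, it can be shipped alongside the application code like any other static asset.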
Colophon
I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included <span data-timestamp="425"> elements. The project uses the following custom instructions:
You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript—quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes—long quotes are fine.
I then added a follow-up prompt saying:
Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end
Then suggest a very comprehensive list of supporting links I could find
Here’s the full Claude transcript of the analysis.
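For anyone who would rather compute the mm:ss markers than ask a model for them, here is a minimal sketch using only the Python standard library. It assumes that each `data-timestamp` value is an offset in whole seconds from the start of the episode, which the transcript HTML is not confirmed to guarantee. The `TimestampCollector` class and `to_mmss` function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


def to_mmss(seconds: int) -> str:
    """Format an offset in seconds as mm:ss (assumes the value is in seconds)."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


class TimestampCollector(HTMLParser):
    """Collect the data-timestamp attribute from every <span> in the HTML."""

    def __init__(self):
        super().__init__()
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        value = dict(attrs).get("data-timestamp")
        # Skip spans without the attribute or with a non-numeric value
        if value is not None and value.isdigit():
            self.timestamps.append(int(value))


if __name__ == "__main__":
    collector = TimestampCollector()
    collector.feed('<p><span data-timestamp="425">Example text</span></p>')
    for seconds in collector.timestamps:
        print(f"[{to_mmss(seconds)}]")
```

Under that assumption, the `425` in the example element corresponds to `[07:05]`.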