226 posts tagged “ai-assisted-programming”
Using AI tools such as Large Language Models to help write code. Vibe coding is the less responsible subset of this. See Here’s how I use LLMs to help me write code for a description of my process.
2025
If you've been experimenting with OpenAI's Codex CLI and have been frustrated that it's not possible to select text and copy it to the clipboard - at least when running in the Mac terminal (I genuinely didn't know it was possible to build a terminal app that disabled copy and paste) - you should know that they fixed that in this issue last week.
The new 0.20.0 version from three days ago also completely removes the old TypeScript codebase in favor of Rust. Even installations via NPM now get the Rust version.
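If you installed via npm instead, updating should just be a matter of re-installing the package - I believe it's published as @openai/codex:
npm install -g @openai/codex@latest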
I originally installed Codex via Homebrew, so I had to run this command to get the updated version:
brew upgrade codex
Another Codex tip: to use GPT-5 (or any other specific OpenAI model) you can run it like this:
export OPENAI_DEFAULT_MODEL="gpt-5"
codex
This no longer works, see update below.
I've been using a codex-5 script on my PATH containing this, because sometimes I like to live dangerously!
#!/usr/bin/env zsh
# Usage: codex-5 [additional args passed to `codex`]
export OPENAI_DEFAULT_MODEL="gpt-5"
exec codex --dangerously-bypass-approvals-and-sandbox "$@"
Update: It looks like GPT-5 is the default model in v0.20.0 already.
Also the environment variable I was using no longer does anything - it was removed in this commit (I used Codex Web to help figure that out). You can use the -m model_id command-line option instead.
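My codex-5 script above presumably just needs to pass that flag instead:
#!/usr/bin/env zsh
# Updated codex-5: the OPENAI_DEFAULT_MODEL variable is gone, so pass the model with -m instead
exec codex -m gpt-5 --dangerously-bypass-approvals-and-sandbox "$@"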
You know what else we noticed in the interviews? Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about increasing ambition. We believe that means that we should update how we talk about (and measure) success when using these tools, and we should expect that after the initial efficiency gains our focus will be on raising the ceiling of the work and outcomes we can accomplish, which is a very different way of interpreting tool investments.
— Thomas Dohmke, CEO, GitHub
Jules, our asynchronous coding agent, is now available for everyone (via) I wrote about the Jules beta back in May. Google's version of the OpenAI Codex PR-submitting hosted coding tool graduated from beta today.
I'm mainly linking to this now because I like the new term they are using in this blog entry: Asynchronous coding agent. I like it so much I gave it a tag.
I continue to avoid the term "agent" as infuriatingly vague, but I can grudgingly accept it when accompanied by a prefix that clarifies the type of agent we are talking about. "Asynchronous coding agent" feels just about obvious enough to me to be useful.
... I just ran a Google search for "asynchronous coding agent" -jules and came up with a few more notable examples of this name being used elsewhere:
- Introducing Open SWE: An Open-Source Asynchronous Coding Agent is an announcement from LangChain just this morning of their take on this pattern. They provide a hosted version (bring your own API keys) or you can run it yourself with their MIT licensed code.
- The press release for GitHub's own version of this GitHub Introduces Coding Agent For GitHub Copilot states that "GitHub Copilot now includes an asynchronous coding agent".
No, AI is not Making Engineers 10x as Productive (via) Colton Voege on "curing your AI 10x engineer imposter syndrome".
There's a lot of rhetoric out there suggesting that if you can't 10x your productivity through tricks like running a dozen Claude Code instances at once you're falling behind. Colton's piece here is a pretty thoughtful exploration of why that likely isn't true. I found myself agreeing with quite a lot of this article.
I'm a pretty huge proponent for AI-assisted development, but I've never found those 10x claims convincing. I've estimated that LLMs make me 2-5x more productive on the parts of my job which involve typing code into a computer, which is itself a small portion of what I do as a software engineer.
That's not too far from this article's assumptions. From the article:
I wouldn't be surprised to learn AI helps many engineers do certain tasks 20-50% faster, but the nature of software bottlenecks mean this doesn't translate to a 20% productivity increase and certainly not a 10x increase.
I think that's an under-estimation - I suspect engineers that really know how to use this stuff effectively will get more than a 0.2x increase - but I do think all of the other stuff involved in building software makes the 10x thing unrealistic in most cases.
The old timers who built the early web are coding with AI like it's 1995.
Think about it: They gave blockchain the sniff test and walked away. Ignored crypto (and yeah, we're not rich now). NFTs got a collective eye roll.
But AI? Different story. The same folks who hand-coded HTML while listening to dial-up modems sing are now vibe-coding with the kids. Building things. Breaking things. Giddy about it.
We Gen X'ers have seen enough gold rushes to know the real thing. This one's got all the usual crap—bad actors, inflated claims, VCs throwing money at anything with "AI" in the pitch deck. Gross behavior all around. Normal for a paradigm shift, but still gross.
The people who helped wire up the internet recognize what's happening. When the folks who've been through every tech cycle since gopher start acting like excited newbies again, that tells you something.
Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM
Qwen just released their sixth model(!) of this July called Qwen3-Coder-30B-A3B-Instruct—listed as Qwen3-Coder-Flash in their chat.qwen.ai interface.
[... 1,390 words]
When you vibe code, you are incurring tech debt as fast as the LLM can spit it out. Which is why vibe coding is perfect for prototypes and throwaway projects: It's only legacy code if you have to maintain it! [...]
The worst possible situation is to have a non-programmer vibe code a large project that they intend to maintain. This would be the equivalent of giving a credit card to a child without first explaining the concept of debt. [...]
If you don't understand the code, your only recourse is to ask AI to fix it for you, which is like paying off credit card debt with another credit card.
— Steve Krouse, Vibe code is legacy code
My 2.5 year old laptop can write Space Invaders in JavaScript now, using GLM-4.5 Air and MLX
I wrote about the new GLM-4.5 model family yesterday—new open weight (MIT licensed) models from Z.ai in China which their benchmarks claim score highly in coding even against models such as Claude Sonnet 4.
[... 685 words]
Using GitHub Spark to reverse engineer GitHub Spark
GitHub Spark was released in public preview yesterday. It’s GitHub’s implementation of the prompt-to-app pattern also seen in products like Claude Artifacts, Lovable, Vercel v0, Val Town Townie and Fly.io’s Phoenix.new. In this post I reverse engineer Spark and explore its fascinating system prompt in detail.
[... 3,900 words]
Introducing OSS Rebuild: Open Source, Rebuilt to Last (via) Major news on the Reproducible Builds front: the Google Security team have announced OSS Rebuild, their project to provide build attestations for open source packages released through the NPM, PyPI and Crates ecosystems (and more to come).
They currently run builds against the "most popular" packages from those ecosystems:
Through automation and heuristics, we determine a prospective build definition for a target package and rebuild it. We semantically compare the result with the existing upstream artifact, normalizing each one to remove instabilities that cause bit-for-bit comparisons to fail (e.g. archive compression). Once we reproduce the package, we publish the build definition and outcome via SLSA Provenance. This attestation allows consumers to reliably verify a package's origin within the source history, understand and repeat its build process, and customize the build from a known-functional baseline
The only way to interact with the Rebuild data right now is through their Go CLI tool. I reverse-engineered it using Gemini 2.5 Pro and derived this command to get a list of all of their built packages:
gsutil ls -r 'gs://google-rebuild-attestations/**'
There are 9,513 total lines - here's a Gist. I used Claude Code to count them across the different ecosystems (discounting duplicates for different versions of the same package):
- pypi: 5,028 packages
- cratesio: 2,437 packages
- npm: 2,048 packages
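For anyone who wants to reproduce that count without Claude Code, here's a rough shell sketch. The path layout - gs://google-rebuild-attestations/<ecosystem>/<package>/<version>/... - is my assumption about how the bucket is organized, not something I've verified:
# Extract <ecosystem>/<package> from each object path, collapse duplicate
# versions of the same package, then count unique packages per ecosystem.
gsutil ls -r 'gs://google-rebuild-attestations/**' |
  awk -F/ 'NF >= 5 {print $4 "/" $5}' | sort -u |
  cut -d/ -f1 | sort | uniq -c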
Then I got a bit ambitious... since the files themselves are hosted in a Google Cloud Bucket, could I run my own web app somewhere on storage.googleapis.com that could use fetch() to retrieve that data, working around the lack of open CORS headers?
I got Claude Code to try that for me (I didn't want to have to figure out how to create a bucket and configure it for web access just for this one experiment) and it built and then deployed https://storage.googleapis.com/rebuild-ui/index.html, which did indeed work!
It lets you search against that list of packages from the Gist and then select one to view the pretty-printed newline-delimited JSON that was stored for that package.
The output isn't as interesting as I was expecting, but it was fun demonstrating that it's possible to build and deploy web apps to Google Cloud that can then make fetch() requests to other public buckets.
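If you'd rather do that bucket setup by hand, the gsutil version of what Claude Code automated is roughly this (the bucket name is a placeholder - rebuild-ui is already taken):
# Create a bucket, upload the static app and make its objects publicly readable.
gsutil mb gs://your-bucket-name
gsutil cp index.html gs://your-bucket-name/
gsutil iam ch allUsers:objectViewer gs://your-bucket-name
# The page is then served from https://storage.googleapis.com/your-bucket-name/index.html -
# the same origin as other public buckets, so fetch() calls against them avoid CORS entirely.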
Hopefully the OSS Rebuild team will add a web UI to their project at some point in the future.
Qwen3-Coder: Agentic Coding in the World (via) It turns out that as I was typing up my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger:
Today, we’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters which supports the context length of 256K tokens natively and 1M tokens with extrapolation methods, offering exceptional performance in both coding and agentic tasks.
This is another Apache 2.0 licensed open weights model, available as Qwen3-Coder-480B-A35B-Instruct and Qwen3-Coder-480B-A35B-Instruct-FP8 on Hugging Face.
I used qwen3-coder-480b-a35b-instruct on the Hyperbolic playground to run my "Generate an SVG of a pelican riding a bicycle" test prompt:
I actually slightly prefer the one I got from qwen3-235b-a22b-07-25.
It's also available as qwen3-coder on OpenRouter.
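If you want to try it from the terminal, the llm-openrouter plugin for LLM should be able to talk to it - a sketch, where the exact model slug is my best guess rather than a confirmed identifier:
llm install llm-openrouter
llm keys set openrouter
# Model ID below is an assumption - confirm it with: llm models | grep -i qwen
llm -m openrouter/qwen/qwen3-coder 'Generate an SVG of a pelican riding a bicycle'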
In addition to the new model, Qwen released their own take on an agentic terminal coding assistant called qwen-code, which they describe in their blog post as being "Forked from Gemini Code" (they mean gemini-cli) - which is Apache 2.0 so a fork is in keeping with the license.
They focused really hard on code performance for this release, including generating synthetic data tested using 20,000 parallel environments on Alibaba Cloud:
In the post-training phase of Qwen3-Coder, we introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, we built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. The infrastructure provides the necessary feedback for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.
To further burnish their coding credentials, the announcement includes instructions for running their new model using both Claude Code and Cline using custom API base URLs that point to Qwen's own compatibility proxies.
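I haven't tried it myself, but the Claude Code route presumably works by pointing the standard Anthropic environment variables at their proxy - something like this, with placeholders standing in for the real URL and key from their announcement:
# Both values below are placeholders - use the proxy URL and API key from Qwen's instructions.
export ANTHROPIC_BASE_URL="https://example-qwen-compatibility-proxy"
export ANTHROPIC_AUTH_TOKEN="your-api-key"
claude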
Pricing for Qwen's own hosted models (through Alibaba Cloud) looks competitive. This is the first model I've seen that sets different prices for four different sizes of input:
This kind of pricing reflects how inference against longer inputs is more expensive to process. Gemini 2.5 Pro has two different prices for above or below 200,000 tokens.
Awni Hannun reports running a 4-bit quantized MLX version on a 512GB M3 Ultra Mac Studio at 24 tokens/second using 272GB of RAM, getting great results for "write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square".
Coding with LLMs in the summer of 2025 (an update) (via) Salvatore Sanfilippo describes his current AI-assisted development workflow. He's all-in on LLMs for code review, exploratory prototyping, pair-design and writing "part of the code under your clear specifications", but warns against leaning too hard on pure vibe coding:
But while LLMs can write part of a code base with success (under your strict supervision, see later), and produce a very sensible speedup in development (or, the ability to develop more/better in the same time used in the past — which is what I do), when left alone with nontrivial goals they tend to produce fragile code bases that are larger than needed, complex, full of local minima choices, suboptimal in many ways. Moreover they just fail completely when the task at hand is more complex than a given level.
There are plenty of useful tips in there, especially around carefully managing your context:
When your goal is to reason with an LLM about implementing or fixing some code, you need to provide extensive information to the LLM: papers, big parts of the target code base (all the code base if possible, unless this is going to make the context window so large than the LLM performances will be impaired). And a brain dump of all your understanding of what should be done.
Salvatore warns against relying too hard on tools which hide the context for you, like editors with integrated coding agents. He prefers pasting exactly what's needed into the LLM web interface - I share his preference there.
His conclusions here match my experience:
You will be able to do things that are otherwise at the borders of your knowledge / expertise while learning much in the process (yes, you can learn from LLMs, as you can learn from books or colleagues: it is one of the forms of education possible, a new one). Yet, everything produced will follow your idea of code and product, and will be of high quality and will not random fail because of errors and shortcomings introduced by the LLM. You will also retain a strong understanding of all the code written and its design.
Every day someone becomes a programmer because they figured out how to make ChatGPT build something. Lucky for us: in many of those cases the AI picks Python. We should treat this as an opportunity and anticipate an expansion in the kinds of people who might want to attend a Python conference. Yet many of these new programmers are not even aware that programming communities and conferences exist. It’s in the Python community’s interest to find ways to pull them in.
There’s a bigger opportunity in computer science and programming (academically conveyed or self-taught) now than ever before, by far, in my opinion. The move to AI is like replacing shovels with bulldozers. Every business will benefit from this and they’ll need people to do it.
— Tim Sweeney, Epic Games
Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone
This morning, working entirely on my phone, I scraped a conference website and vibe coded up an alternative UI for interacting with the schedule using a combination of OpenAI Codex and Claude Artifacts.
[... 2,189 words]
Application development without programmers (via) This book by James Martin, published in 1982, includes the following in the preface:
Applications development did not change much for 20 years, but now a new wave is crashing in. A rich diversity of nonprocedural techniques and languages are emerging. As these languages improve, they promise to change the entire fabric of DP development.
This means a major change for many of the personnel involved in DP, from the DP manager to the junior programmer. DP personnel have always welcomed new hardware and software, but it is not as easy to accept fundamental changes in the nature of one's job. Many DP professionals and, not surprisingly, programmers will instinctively resist some of the methods described in this book.
(I had to look up DP - it stands for Data Processing, and was a common acronym for general IT work up until the 1980s.)
I enjoy the way this echoes today's fears about the impact of AI-assisted programming on developer careers!
The early 80s were a wild time for computing:
Unfortunately, the winds of change are sometimes irreversible. The continuing drop in cost of computers has now passed the point at which computers have become cheaper than people. The number of programmers available per computer is shrinking so fast that most computers in the future will have to work at least in part without programmers.
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (via) METR - for Model Evaluation & Threat Research - are a non-profit research institute founded by Beth Barnes, a former alignment researcher at OpenAI (see Wikipedia). They've previously contributed to system cards for OpenAI and Anthropic, but this new research represents a slightly different direction for them:
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.
The full paper (PDF) has a lot of details that are missing from the linked summary.
METR recruited 16 experienced open source developers for their study, with varying levels of exposure to LLM tools. They then assigned them tasks from their own open source projects, randomly assigning whether AI was allowed or not allowed for each of those tasks.
They found a surprising difference between developer estimates and actual completion times:
After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down.
I shared my initial intuition about this paper on Hacker News the other day:
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance; the other three quarters saw reduced performance.
One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
I got an insightful reply there from Nate Rush, one of the authors of the study, which included these notes:
- Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
- Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, this was the only concern that most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
- Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (but not because AI was better, but just because with AI is much worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
- We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
- As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
Here's their table of the most likely factors:
I think Nate's right that jumping straight to a conclusion about a single factor is a shallow and unproductive way to think about this report.
That said, I can't resist the temptation to do exactly that! The factor that stands out most to me is that these developers were all working in repositories they have a deep understanding of already, presumably on non-trivial issues since any trivial issues are likely to have been resolved in the past.
I think this is a really interesting paper. Measuring developer productivity is notoriously difficult. I hope this paper inspires more work with a similar level of detail to analyzing how professional programmers spend their time:
To compare how developers spend their time with and without AI assistance, we manually label a subset of 128 screen recordings with fine-grained activity labels, totaling 143 hours of video.
I Shipped a macOS App Built Entirely by Claude Code (via) Indragie Karunaratne has "been building software for the Mac since 2008", but recently decided to try Claude Code to build a side project: Context, a native Mac app for debugging MCP servers:
There is still skill and iteration involved in helping Claude build software, but of the 20,000 lines of code in this project, I estimate that I wrote less than 1,000 lines by hand.
It's a good looking native app:
This is a useful, detailed write-up. A few notes on things I picked up:
- Claude is great at SwiftUI and mostly good at Swift, but gets confused by the newer Swift Concurrency mechanisms.
- Claude occasionally triggers “The compiler is unable to type-check this expression in reasonable time” errors, but is able to recover by refactoring view bodies into smaller expressions.
- Telling Claude to make native macOS interfaces “more beautiful/elegant/usable” works surprisingly well. I’ve seen the same with web frontend code.
- Claude Code’s build/test/debug agentic coding loop works great for Swift apps, but there isn’t a good equivalent to Playwright yet so you need to manually take over to interact with the UI and drop in screenshots of any problems.
- Claude is great at creating mock data:
The first screenshots of the app that I shared with friends as I dialed in the UI were backed by mock data, but it looked real enough that you could get a good sense of how the app would look when rendering data from real MCP servers.
Indragie’s focus throughout this piece is on using LLM tools to help close that last 20% of a side project that usually prevents it from being shipped.
The most exciting thing about this entire journey for me is not the app I built, but that I am now able to scratch my coding itch and ship polished side projects again. It's like I found an extra 5 hours every day, and all it cost me was $200 a month.
Cursor: Clarifying Our Pricing. Cursor changed their pricing plan on June 16th, introducing a new $200/month Ultra plan with "20x more usage than Pro" and switching their $20/month Pro plan from "request limits to compute limits".
This confused a lot of people. Here's Cursor's attempt at clarifying things:
Cursor uses a combination of our custom models, as well as models from providers like OpenAI, Anthropic, Google, and xAI. For external models, we previously charged based on the number of requests made. There was a limit of 500 requests per month, with Sonnet models costing two requests.
New models can spend more tokens per request on longer-horizon tasks. Though most users' costs have stayed fairly constant, the hardest requests cost an order of magnitude more than simple ones. API-based pricing is the best way to reflect that.
I think I understand what they're saying there. They used to allow you 500 requests per month, but those requests could be made against any model and, crucially, a single request could trigger a variable amount of token spend.
Modern LLMs can have dramatically different prices, so one of those 500 requests with a large context query against an expensive model could cost a great deal more than a single request with a shorter context against something less expensive.
I imagine they were losing money on some of their more savvy users, who may have been using prompting techniques that sent a larger volume of tokens through each one of those precious 500 requests.
The new billing switched to passing on the expense of those tokens directly, with a $20 included budget followed by overage charges for tokens beyond that.
It sounds like a lot of people, used to the previous model where their access would be cut off after 500 requests, got caught out by this and racked up a substantial bill!
To Cursor's credit, they're offering usage refunds to "those with unexpected usage between June 16 and July 4."
I think this highlights a few interesting trends.
Firstly, the era of VC-subsidized tokens may be coming to an end, especially for products like Cursor which are way past demonstrating product-market fit.
Secondly, that $200/month plan for 20x the usage of the $20/month plan is an emerging pattern: Anthropic offers the exact same deal for Claude Code, with the same 10x price for 20x usage multiplier.
Professional software engineers may be able to justify one $200/month subscription, but I expect most will be unable to justify two. The pricing here becomes a significant form of lock-in - once you've picked your $200/month coding assistant you are less likely to evaluate the alternatives.
The more time I spend using LLMs for code, the less I worry for my career - even as their coding capabilities continue to improve.
Using LLMs as part of my process helps me understand how much of my job isn't just bashing out code.
My job is to identify problems that can be solved with code, then solve them, then verify that the solution works and has actually addressed the problem.
A more advanced LLM may eventually be able to completely handle the middle piece. It can help with the first and last pieces, but only when operated by someone who understands both the problems to be solved and how to interact with the LLM to help solve them.
No matter how good these things get, they will still need someone to find problems for them to solve, define those problems and confirm that they are solved. That's a job - one that other humans will be happy to outsource to an expert practitioner.
It's also about 80% of what I do as a software developer already.
I think that a lot of resistance to AI coding tools comes from the same place: fear of losing something that has defined you for so long. People are reacting against overblown hype, and there is overblown hype. I get that, but I also think there’s something deeper going on here. When you’ve worked hard to build your skills, when coding is part of your identity and where you get your worth, the idea of a tool that might replace some of that is very threatening.
— Adam Gordon Bell, When AI Codes, What’s Left for me?
Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.
The challenge then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
My x86 assembly experiment with Claude Code was the thing that made this click for me.
Quitting programming as a career right now because of LLMs would be like quitting carpentry as a career thanks to the invention of the table saw.
Mandelbrot in x86 assembly by Claude. Inspired by a tweet asking if Claude knew x86 assembly, I decided to run a bit of an experiment.
I prompted Claude Sonnet 4:
Write me an ascii art mandelbrot fractal generator in x86 assembly
And got back code that looked... like assembly code I guess?
So I copied some jargon out of that response and asked:
I have some code written for x86-64 assembly using NASM syntax, targeting Linux (using system calls for output).
How can I run that on my Mac?
That gave me a Dockerfile.
I tried running it on my Mac and... it failed to compile.
So I fired up Claude Code (with the --dangerously-skip-permissions option) in that directory and told it what to run:
Run this: docker build -t myasm .
It started crunching. It read the errors, inspected the assembly code, made changes, tried running it again in a loop, added more comments...
...and eventually produced an ASCII art image of a fractal! Claude Code proudly declared:
Success! The Docker container ran and generated a Mandelbrot set visualization using ASCII characters. The output shows all @ symbols, which indicates the algorithm is working but all points are inside the set (high iteration count). The program compiled and executed successfully on x86_64 architecture via emulation.
Yeah... I was not satisfied with an image that was entirely @ symbols! I said:
That is a bad result - it is just @ signs, I cannot see the fractal
Claude made a few more changes and gave me the result I wanted:
Here's the finished assembly code, the Dockerfile to run it on a Mac and the full transcript of the Claude Code session that got it there.
TIL: Using Playwright MCP with Claude Code. Inspired by Armin ("I personally use only one MCP - I only use Playwright") I decided to figure out how to use the official Playwright MCP server with Claude Code.
It turns out it's easy:
claude mcp add playwright npx '@playwright/mcp@latest'
claude
The claude mcp add command only affects the current directory by default - it gets persisted in the ~/.claude.json file.
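To double-check what ended up registered, claude mcp list should show the configured servers:
# List the MCP servers Claude Code currently knows about
claude mcp list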
Now Claude can use Playwright to automate a Chrome browser! Tell it to "Use playwright mcp to open a browser to example.com" and watch it go - it can navigate pages, submit forms, execute custom JavaScript and take screenshots to feed back into the LLM.
The browser window stays visible which means you can interact with it too, including signing into websites so Claude can act on your behalf.
One of the best examples of LLM developer tooling I've heard is from a team that supports software from the 80s-90s. Their only source of documentation is video interviews with retired employees. So they feed them into transcription software and get summarized searchable notes out the other end.
— Kevin Webb, a couple million lines of Smalltalk
Using LLMs for code archaeology is pretty fun.
I stumbled across this blog entry from 2003 today, in which I had gotten briefly excited about ColdFusion and implemented an experimental PHP template engine that used XML tags to achieve a similar effect:
<h1>%title%</h1>
<sql id="recent">
  select title from entries
  order by added desc
  limit 0, %limit%
</sql>
<ul>
  <output sql="recent">
    <li>%title%</li>
  </output>
</ul>
I'd completely forgotten about this, and in scanning through the PHP it looked like it had extra features that I hadn't described in the post.
So... I fed my 22-year-old TemplateParser.class.php file into Claude and prompted:
Write detailed markdown documentation for this template language
Here's the resulting documentation. It's pretty good, but the highlight was the Claude transcript which concluded:
This appears to be a custom template system from the mid-2000s era, designed to separate presentation logic from PHP code while maintaining database connectivity for dynamic content generation.
Mid-2000s era indeed!
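You can run the same kind of archaeology from the terminal by piping a file into the llm CLI - a sketch, assuming you've configured a Claude model as your default (for example via the llm-anthropic plugin):
# Pipe the old class into the default model along with the documentation prompt
cat TemplateParser.class.php | \
  llm 'Write detailed markdown documentation for this template language'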
To misuse a woodworking metaphor, I think we’re experiencing a shift from hand tools to power tools.
You still need someone who understands the basics to get the good results out of the tools, but they’re not chiseling fine furniture by hand anymore, they’re throwing heaps of wood through the tablesaw instead. More productive, but more likely to lose a finger if you’re not careful.
— mrmincent, Hacker News comment on Claude Code
Using Claude Code to build a GitHub Actions workflow. I wanted to add a small feature to one of my GitHub repos - an automatically updated README index listing other files in the repo - so I decided to use Descript to record my process using Claude Code. Here's a 7 minute video showing what I did.
I've been wanting to start producing more video content for a while - this felt like a good low-stakes opportunity to put in some reps.
microsoft/vscode-copilot-chat (via) As promised at Build 2025 in May, Microsoft have released the GitHub Copilot Chat client for VS Code under an open source (MIT) license.
So far this is just the extension that provides the chat component of Copilot, but the launch announcement promises that Copilot autocomplete will be coming in the near future:
Next, we will carefully refactor the relevant components of the extension into VS Code core. The original GitHub Copilot extension that provides inline completions remains closed source -- but in the following months we plan to have that functionality be provided by the open sourced GitHub Copilot Chat extension.
I've started spelunking around looking for the all-important prompts. So far the most interesting I've found are in prompts/node/agent/agentInstructions.tsx, with a <Tag name='instructions'> block that starts like this:
You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks. The user will ask a question, or ask you to perform a task, and it may require lots of research to answer correctly. There is a selection of tools that let you perform actions or retrieve helpful context to answer the user's question.
There are tool use instructions - some edited highlights from those:
When using the ReadFile tool, prefer reading a large section over calling the ReadFile tool many times in sequence. You can also think of all the pieces you may be interested in and read them in parallel. Read large enough context to ensure you get what you need.
You can use the FindTextInFiles to get an overview of a file by searching for a string within that one file, instead of using ReadFile many times.
Don't call the RunInTerminal tool multiple times in parallel. Instead, run one command and wait for the output before running the next command.
After you have performed the user's task, if the user corrected something you did, expressed a coding preference, or communicated a fact that you need to remember, use the UpdateUserPreferences tool to save their preferences.
NEVER try to edit a file by running terminal commands unless the user specifically asks for it.
Use the ReplaceString tool to replace a string in a file, but only if you are sure that the string is unique enough to not cause any issues. You can use this tool multiple times per file.
That file also has separate CodesearchModeInstructions, as well as a SweBenchAgentPrompt class with a comment saying that it is "used for some evals with swebench".
Elsewhere in the code, prompt/node/summarizer.ts illustrates one of their approaches to Context Summarization, with a prompt that looks like this:
You are an expert at summarizing chat conversations.
You will be provided:
- A series of user/assistant message pairs in chronological order
- A final user message indicating the user's intent.
[...]
Structure your summary using the following format:
TITLE: A brief title for the summary
USER INTENT: The user's goal or intent for the conversation
TASK DESCRIPTION: Main technical goals and user requirements
EXISTING: What has already been accomplished. Include file paths and other direct references.
PENDING: What still needs to be done. Include file paths and other direct references.
CODE STATE: A list of all files discussed or modified. Provide code snippets or diffs that illustrate important context.
RELEVANT CODE/DOCUMENTATION SNIPPETS: Key code or documentation snippets from referenced files or discussions.
OTHER NOTES: Any additional context or information that may be relevant.
prompts/node/panel/terminalQuickFix.tsx looks interesting too, with prompts to help users fix problems they are having in the terminal:
You are a programmer who specializes in using the command line. Your task is to help the user fix a command that was run in the terminal by providing a list of fixed command suggestions. Carefully consider the command line, output and current working directory in your response. [...]
That file also has a PythonModuleError prompt:
Follow these guidelines for python:
- NEVER recommend using "pip install" directly, always recommend "python -m pip install"
- The following are pypi modules: ruff, pylint, black, autopep8, etc
- If the error is module not found, recommend installing the module using "python -m pip install" command.
- If activate is not available create an environment using "python -m venv .venv".
There's so much more to explore in here. xtab/common/promptCrafting.ts looks like it may be part of the code that's intended to replace Copilot autocomplete, for example.
The way it handles evals is really interesting too. The code for that lives in the test/ directory. There's a lot of it, so I engaged Gemini 2.5 Pro to help figure out how it worked:
git clone https://github.com/microsoft/vscode-copilot-chat
cd vscode-copilot-chat/chat
files-to-prompt -e ts -c . | llm -m gemini-2.5-pro -s \
'Output detailed markdown architectural documentation explaining how this test suite works, with a focus on how it tests LLM prompts'
Here's the resulting generated documentation, which even includes a Mermaid chart (I had to save the Markdown in a regular GitHub repository to get that to render - Gists still don't handle Mermaid.)
The neatest trick is the way it uses a SQLite-based caching mechanism to cache the results of prompts from the LLM, which allows the test suite to be run deterministically even though LLMs themselves are famously non-deterministic.
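Their implementation is all TypeScript, but the underlying trick is simple enough to sketch in a few lines of shell using the llm CLI - purely illustrative, not how their code actually works:
# Illustrative only: cache LLM responses in SQLite keyed on a hash of the prompt,
# so repeated test runs get identical responses without calling the model again.
PROMPT="Summarize this conversation"
sqlite3 cache.db 'CREATE TABLE IF NOT EXISTS cache (prompt_hash TEXT PRIMARY KEY, response TEXT)'
hash=$(printf '%s' "$PROMPT" | shasum -a 256 | cut -d' ' -f1)
response=$(sqlite3 cache.db "SELECT response FROM cache WHERE prompt_hash = '$hash'")
if [ -z "$response" ]; then
  response=$(llm "$PROMPT")
  sqlite3 cache.db "INSERT INTO cache VALUES ('$hash', '$(printf '%s' "$response" | sed "s/'/''/g")')"
fi
printf '%s\n' "$response"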