Help scraping: track changes to CLI tools by recording their --help using Git
2nd February 2022
I’ve been experimenting with a new variant of Git scraping this week which I’m calling Help scraping. The key idea is to track changes made to CLI tools over time by recording the output of their --help
commands in a Git repository.
My new help-scraper GitHub repository is my first implementation of this pattern.
It uses this GitHub Actions workflow to record the --help
output for the Amazon Web Services aws
CLI tool, and also for the flyctl
tool maintained by the Fly.io hosting platform.
The workflow runs once a day. It loops through every available AWS command (using this script) and records the output of that command’s CLI help option to a .txt
file in the repository—then commits the result at the end.
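The core of the pattern is simple enough to sketch in a few lines of shell. This is not the actual script from the repository: it assumes a hard-coded list of services, where the real workflow enumerates every available command, but it shows the shape of the loop:

#!/bin/bash
# Sketch of the help-scraping loop (illustrative, not the real script).
# AWS_PAGER="" stops newer versions of the AWS CLI from piping help through a pager.
set -euo pipefail
export AWS_PAGER=""

mkdir -p aws
for service in s3 ec2 lambda iam; do   # a hand-picked subset, for illustration
    aws "$service" help > "aws/${service}.txt" 2>&1
done

git add aws
git commit -m "Latest AWS CLI help output" || echo "No changes to commit"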
The result is a version history of changes made to those help files. It’s essentially a much more detailed version of a changelog—capturing all sorts of details that might not be reflected in the official release notes for the tool.
Here’s an example. This morning, AWS released version 1.22.47 of their CLI tool. They release new versions on an almost daily basis.
Here are the official release notes—12 bullet points, spanning 12 different AWS services.
My help scraper caught the details of the release in this commit—89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what’s changed in a whole lot more detail.
The AWS CLI tool is enormous. Running find aws -name '*.txt' | wc -l
in that repository counts help pages for 11,401 individual commands—or 11,390 if you check out the previous version, showing that 11 commands were added in this morning’s release alone.
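If you want both numbers without checking anything out, git ls-tree can list the files in any commit directly. Assuming the previous release is the parent commit:

git ls-tree -r HEAD --name-only | grep -c '^aws/.*\.txt$'
git ls-tree -r HEAD~1 --name-only | grep -c '^aws/.*\.txt$'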
There are plenty of other ways of tracking changes made to AWS. I’ve previously kept an eye on the botocore GitHub history, which exposes changes to the underlying JSON—and there are projects like awschanges.info which try to turn those sources of data into something more readable.
But I think there’s something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes with the level of detail I like to see!
I implemented this for flyctl
first, because I wanted to see what changes were being made that might impact my datasette-publish-fly plugin which shells out to that tool. Then I realized it could be applied to AWS as well.
Help scraping my own projects
I got the initial idea for this technique from a change I made to my Datasette and sqlite-utils projects a few weeks ago.
Both tools offer CLI commands with --help
output—but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.
So, I added documentation pages that list the output of --help
for each of the CLI commands, generated using the Cog file generation tool (invocation sketched after the list):
- sqlite-utils CLI reference (39 commands!)
- datasette CLI reference
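Mechanically, Cog executes small blocks of Python embedded in comments in the documentation source and writes their output back into the file. Regenerating a page boils down to an invocation along these lines (the file path here is illustrative):

pip install cogapp
cog -r docs/cli-reference.rst   # -r rewrites the file in place

Newer versions of Cog also provide a --check mode, which is handy in CI for catching stale output.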
Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the --help
output—here’s that history for sqlite-utils.
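That history is just regular Git history for the generated file, so something like this (path illustrative) shows every help change alongside its diff:

git log -p -- docs/cli-reference.rst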
It was a short jump from that to the idea of combining it with Git scraping to generate history for other tools.
Bonus trick: GraphQL schema scraping
I’ve started making selective use of the Fly.io GraphQL API as part of my plugin for publishing Datasette instances to that platform.
Their GraphQL API is openly available, but it’s not extensively documented—presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: Using the undocumented Fly GraphQL API.
This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?
It turns out I can! There’s an NPM package called get-graphql-schema which can extract the GraphQL schema from any GraphQL server and write it out to disk:
npx get-graphql-schema https://api.fly.io/graphql > /tmp/fly.graphql
I’ve added that to my help-scraper
repository too—so now I have a commit history of the changes they are making there as well. Here’s an example from this morning.
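The workflow step for that is the same pattern as the CLI scraping. A sketch, with an illustrative file path and commit message:

npx get-graphql-schema https://api.fly.io/graphql > fly/fly.graphql
git add fly/fly.graphql
git commit -m "Latest Fly GraphQL schema" || echo "Schema unchanged"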
Other weeknotes
I’ve decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I’m trying to make at least one commit every day that takes me closer to that milestone.
This week I did a bunch of work adding a Link: https://...; rel="alternate"; type="application/datasette+json"
HTTP header to various pages in the Datasette interface, to support discovery of the JSON version of a page given the URL to its human-readable version.
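You can check for the header with curl; the URL here is just an example, but any page served by a Datasette that includes this feature should return it:

curl -sI https://latest.datasette.io/fixtures | grep -i '^link:'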
(I had originally planned to also support Accept: application/json
request headers for this, but I’ve been put off that idea by the discovery that Cloudflare deliberately ignores the Vary: Accept
header.)
Unrelated to Datasette: I also started a new Twitter thread, gathering behind-the-scenes material from the movie The Mitchells vs. the Machines. There’s been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season—and I’ve been enjoying trying to tie it all together in a thread.
The last time I did this was for Into the Spider-Verse (from the same studio) and that thread ended up running for more than a year!
TIL this week