Simon Willison’s Weblog

Subscribe

February 2019

Feb. 1, 2019

Datasette 0.27 (via) The latest release of Datasette introduces an option to output tables and SQL query results as newline-delimited JSON—plus a new “datasette plugins” command for listing available plugins.

# 4:39 am / projects, json, datasette

The Datasette Ecosystem. I’ve written a page of documentation that introduces the wider Datasette Ecosystem: csvs-to-sqlite, sqlite-utils, db-to-sqlite, dbf-to-sqlite, markdown-to-sqlite and a full collection of Datasette plugins.

# 4:41 am / datasette, sqlite, sqlite-utils

Feb. 6, 2019

Questions for a new technology. Kellan poses 8 questions which should be asked of any technology that is being proposed for inclusion in an existing tech stack. I’m particularly fond of “Will this solution kill and eat the solution that it replaces?”. My rule of thumb these days is that new technology either needs to make something possible that isn’t possible at all with the existing stack, or it needs to represent at least a 3X productivity improvement in order to compensate for the switching and retraining costs across a large team.

# 4:10 am / kellan-elliott-mccrea

Feb. 8, 2019

db-to-sqlite (via) I just released version 0.2 of a tiny CLI utility I’ve been working on. It builds on top of SQLAlchemy and lets you connect to any SQLAlchemy-supported database and convert the data from it to a local SQLite database file. The new --all option will mirror all available tables (including foreign key relationships), or you can use --sql to save the results of custom SQL queries.

# 6:08 am / projects, sqlite

socrata2sql (via) Phenomenal new open source tool released by Andrew Chavez at the Dallas Morning News. Socrata is the open data portal software used by huge numbers of local governments worldwide. socrata2sql is a tool that interacts with the standard Socrata API and can use it to suck down a dataset and save it as a SQLite, PostgreSQL, MySQL or other SQLAlchemy-supported database. I just tried this and it took a single command to create a SQLite database of every police arrest in Dallas in the past five years.

# 3:27 pm / datasette, data-journalism, sqlite

Feb. 12, 2019

Private blockchains are completely uninteresting. (By this, I mean systems that use the blockchain data structure but don't have the above three elements.) In general, they have some external limitation on who can interact with the blockchain and its features. These are not anything new; they're distributed append-only data structures with a list of individuals authorized to add to it. Consensus protocols have been studied in distributed systems for more than 60 years. Append-only data structures have been similarly well covered. They're blockchains in name only, and -- as far as I can tell -- the only reason to operate one is to ride on the blockchain hype.

Bruce Schneier

# 7:14 pm / blockchain, bruce-schneier

Feb. 13, 2019

django-zombodb (via) The hardest part of working with an external search engine like Elasticsearch is always keeping that index synchronized with your relational database. ZomboDB is a PostgreSQL extension which lets you create a new type of index backed by an external Elasticsearch cluster. Updated rows will be pushed to the index automatically, and custom SQL syntax can then be used to execute searches. django-zombodb is a brand new library by Flávio Juvenal which integrates ZomboDB directly into the Django ORM, letting you add Elasticsearch-backed functionality with just a few lines of extra configuration. It even includes custom Django migrations for enabling the extension in PostgreSQL!

# 10:14 pm / postgresql, django, elasticsearch

Feb. 14, 2019

Operations engineering does not consist of firefighting your shitty software, it is the science of delivering value to users.

Charity Majors

# 1:12 am / ops, charity-majors

Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look.

# 5:35 am / youtube, sharding, mysql, scaling, slack, vitess

Feb. 15, 2019

Data science is different now (via) Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work.

# 3:36 pm / data-science

If you want the fastest website despite implementation difficulty, the answer is: SSR behind a CDN with assets in best compression formats (webp, Brotli, woff2) served over http2 (or 3) from same origin with JS as enhancement only

Mike Sherov

# 7:12 pm / cdn, http2, web-performance, javascript

Feb. 16, 2019

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista provides some insider thoughts on how they lost against Google.

# 6:57 pm / search, internet-history, google, computer-history

Feb. 18, 2019

This paper introduces Mesh, a plug-in replacement for malloc that, for the first time, eliminates fragmentation in unmodified C/C++ applications. Mesh combines novel randomized algorithms with widely-supported virtual memory operations to provably reduce fragmentation, breaking the classical Robson bounds with high probability. Mesh generally matches the runtime performance of state-of-the-art memory allocators while reducing memory consumption; in particular, it reduces the memory of consumption of Firefox by 16% and Redis by 39%.

Mesh: Compacting Memory Management for C/C++ Applications

# 3:26 pm / memory, redis, firefox

Feb. 19, 2019

The Eleven Laws of Showrunning (via) Fascinating essay on how to run a modern TV show by Javier Grillo-Marxuach. Being a showrunner basically involves running a 100+ person startup with a 7 digit budget, almost immovable deadlines, high maintenance activist investors and you're still expected to write some of the scripts! So many useful lessons here about management, creativity and delegation: almost everything in here is relevant to product management, startup founding and engineering management as well.

This one is the "nice" version - a not so nice version is available as well.

# 7:27 pm / management, showrunning

parameterized. I love the @parametrize decorator in pytest, which lets you run the same test multiple times against multiple parameters. The only catch is that the decorator in pytest doesn’t work for old-style unittest TestCase tests, which means you can’t easily add it to test suites that were built using the older model. I just found out about parameterized which works with unittest tests whether or not you are running them using the pytest test runner.

# 9:05 pm / testing, python, pytest

Lessons from 6 software rewrite stories (via) Herb Caudill takes on the classic idea that rewriting from scratch is “the single worst strategic mistake that any software company can make” and investigates it through the lens of six well-chosen examples: Netscape 6, Basecamp Classic/2/3, Visual Studio/VS Code, Gmail/Inbox, FogBugz/Wasabi/Trello, and finally FreshBooks/BillSpring. Each story has details I had never heard before, and the lessons and conclusions are deeply insightful.

# 9:55 pm / rewrites, product-management

Feb. 22, 2019

String length—Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here's a fascinating illustration of how that problem can be attached dozens of different programming languages. From that page: the string "J̲o̲s̲é̲" ("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}") has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.

# 3:27 pm / unicode, strings, programming-languages

Seeking the Productive Life: Some Details of My Personal Infrastructure (via) Stephen Wolfram’s 15,000 word epic about his personal approach to productivity, developed over the past thirty years. This is a fascinating document—I found myself thinking “surely there can’t be more information than this” and then spotting that the scrollbar wasn’t even a third done yet. Very hard to summarize: it turns out if you’re the work-from-home CEO of your own privately held 800 person company you can construct some very opinionated habits.

# 9:46 pm / productivity

Feb. 25, 2019

sqlite-utils: a Python library and CLI tool for building SQLite databases

sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.

[... 1,237 words]

In January, Facebook distributes a policy update stating that moderators should take into account recent romantic upheaval when evaluating posts that express hatred toward a gender. “I hate all men” has always violated the policy. But “I just broke up with my boyfriend, and I hate all men” no longer does.

Casey Newton

# 2:09 pm / facebook, moderation

My Twitter thread collecting behind the scenes content about Spider-Man: Into the Spider-Verse. I absolutely loved Spider-Verse, and I’ve been delighted to discover that many of the artists who created the movie are active on Twitter and have been posting all kinds of fascinating material about their creative process. I’ve been collecting examples in this Twitter thread for a couple of months now. They definitely deserved that Oscar.

# 2:57 pm / movies, twitter, spiderverse

huey. Charles Leifer’s “little task queue for Python”. Similar to Celery, but it’s designed to work with Redis, SQLite or in the parent process using background greenlets. Worth checking out for the really neat design. The project is new to me, but it’s been under active development since 2011 and has a very healthy looking rate of releases.

# 7:49 pm / sqlite, charles-leifer, python, queues, redis

Metrics are lossily compressed logs. Traces are logs with parent child relationships between entries. The only reason we have three terms is because getting value from them has required different compromises to make them cost effective.

Clint Sharp

# 10:15 pm / observability, logs

Feb. 26, 2019

Experiments, growth engineering, and exposing company secrets through your API (via) This is fun: Jon Luca observes that many companies that run A/B tests have private JSON APIs that list all of their ongoing experiments, and uses them to explore tests from Lyft, Airbnb, Pinterest, Amazon and more. Facebook and Instagram use SSL Stapling which makes it harder to spy on their mobile app traffic.

# 4:49 am / ab-testing, security

2019 » February

MTWTFSS
    123
45678910
11121314151617
18192021222324
25262728