Simon Willison’s Weblog

Subscribe

February 2018

Feb. 1, 2018

Observable notebook: San Francisco trees from Datasette. I used an Observable notebook to rebuild my San Francisco tree search demo against a Datasette API of a CSV of trees published by the SF Department of Public Works. The map updates live as you type a query, and every cell can be toggled to view the underlying source code.

# 12:37 am / observable, datasette

What we need to do is come up with a way to help people understand that there are ways to never be lost again, and to listen to any music you want, and to video chat with someone on the other side of the world, without them having to feel disquieted about it. That it's not OK that you're made to feel weirded out. That it's possible for there to be alternatives. That having to feel someone rooting around in your life is not a price you should have to pay.

Stuart Langridge

# 2:03 pm / stuart-langridge, privacy

Building a Full-Text Search App Using Docker and Elasticsearch. Deep, comprehensive tutorial from Patrick Triest showing how to use docker-compose to run three containers (Node API, nginx static content, elasticsearch) and then use that to build a neat Vue.js web search UI against 100 books from Project Gutenberg.

# 3:41 pm / docker, nodejs, elasticsearch

How the Citus distributed database rebalances your data. Citus is a fascinating implementation of database sharding built on top of PostgreSQL primitives. PostgreSQL 10 introduced extremely flexible logical replication—in this post Craig Kerstiens explains how Citus use this new ability to re-balance shards (e.g. when you move from two to four physical PostgreSQL nodes) without downtime.

# 10:50 pm / postgresql, architecture, sharding, craig-kerstiens, zero-downtime

Feb. 2, 2018

Using setup.py in Your (Django) Project. Includes this neat trick: if you list manage.py in the setup(scripts=) argument you can call it from e.g. cron using the full path to manage.py within your virtual environment and it will execute in the correct context without needing to explicitly activate the environment first.

# 12:33 pm / django, python

Family fun with deepfakes. Or how I got my wife onto the Tonight Show. deepfakes is dystopian nightmare technology: take a few thousand photos of two different people with similar shaped faces and you can produce an extremely realistic video where you swap one person’s face for the other. Unsurprisingly it’s being used for porn. This is a pleasantly SFW explanation of how it works, complete with a demo where Sven Charleer swaps his wife Elke for Anne Hathaway on the Tonight Show.

# 4:06 pm / computer-vision

Channels 2.0. Andrew just shipped Channels 2.0—a major rewrite and redesign of the Channels project he started back in 2014. Channels brings async to Django, providing a logical, standardized way of supporting things like WebSockets and asynchronous execution on top of a Django application. Previously it required you to run a separate Twisted server and redis/RabbitMQ queue, but thanks to Python 3 async everything can now be deployed as a single process. And the new ASGI spec means its turtles all the way down! Everything from URL routing to view functions to middleware can be composed together using the same ASGI interface.

# 6:19 pm / async, python3, django, python, andrew-godwin, websockets

asgiref: AsyncToSync and SyncToAsync (via) Andrew’s classes in asgiref that can turn a synchronous callable into an awaitable (that runs in a thread pool) or an awaitable callable into a synchronous callable, using Python 3 futures and asyncio.

# 7:06 pm / async, andrew-godwin, python3

Feb. 3, 2018

Just switched to {window.localStorage.getItem('debug') && <pre>{JSON.stringify(this.state, null, 2)}</pre>} - now I can ship to production and turn on debugging in my console with localStorage.setItem('debug', 1)

@simonw

# 5:23 am / debugging, react

How I made a Who’s On First subset database. Inspired by Paul Ford on Twitter, I tried out a new trick with SQLite: connect to a database containing JSON, attach a brand new empty database file using “attach database”, then populate it using INSERT INTO ... SELECT plus the json_extract() function to extract out a subset of the JSON properties into a new table in the new database.

# 5:25 am / whosonfirst, sqlite, json, paul-ford

Imagine a Simon Says style game where I present an article found on the web on a projector. Students research for two to three minutes, then respond by standing or staying seated to signal if they believe the article is true or fake. My students absolutely loved the game. Some refused to go to recess until I gave them another chance to figure out the next article I had queued.

Scott Bedley

# 1:29 pm / teaching

Conditional aggregation in Django 2.0 (via) I hadn’t realised how clever this new Django ORM feature by Tom Forbes is. It lets you build an aggregation against a subset of rows, e.g. Client.objects.aggregate(regular=Count(’pk’, filter=Q(account_type=Client.REGULAR)))—then if you are using PostgreSQL it translates it into a fast FILTER WHERE clause, while other databases emulate the same behaviour using a CASE statement.

# 9:38 pm / postgresql, django

Feb. 4, 2018

Owls Near Me. Back in 2010 Natalie and I shipped owlsnearyou.com—a website for finding your nearest owls, using data from the sadly deceased WildlifeNearYou (RIP). To celebrate #SuperbOwl Sunday we rebuilt the same concept on top of the excellent iNaturalist API. Search for a place to see which owls have been spotted there, or click the magic button to geolocate your device and see which owls have been spotted in your nearby area!

# 10:26 pm / wildlifenearyou, inaturalist, natalie-downe, projects

owlsnearme source code on GitHub. Here’s the source code for our new owlsnearme.com project. It’s a single-page React application that pulls all of its data from the iNaturalist API. We built it this weekend with the SuperbOwl kick-off as a hard deadline so it’s not the most beautiful React code, but it’s a nice demonstration of how React (and create-react-app in particular) can be used for rapid development.

# 10:33 pm / react, natalie-downe, javascript, projects, inaturalist, github

Feb. 7, 2018

Googlebot’s Javascript random() function is deterministic. random() as executed by Googlebot returns the same predicable sequence. More interestingly, Googlebot runs a much faster timer for setTimeout and setInterval—as Tom Anthony points out, “Why actually wait 5 seconds when you are a bot?”

# 2:41 am / seo, crawling, google

The whole story is basically that Facebook gets so much traffic that they started convincing publishers to post things on Facebook. For a long time, that was fine. People posted things on Facebook, then you would click those links and go to their websites. But then, gradually, Facebook started exerting more and more control of what was being seen, to the point that they, not our website, essentially became the main publishers of everyone’s content. Today, there’s no reason to go to a comedy website that has a video if that video is just right on Facebook. And that would be fine if Facebook compensated those companies for the ad revenue that was generated from those videos, but because Facebook does not pay publishers, there quickly became no money in making high-quality content for the internet.

Matt Klinman

# 3:51 pm / facebook

Feb. 20, 2018

Python & Async Simplified. Andrew Godwin: “Python’s async framework is actually relatively simple when you treat it at face value, but a lot of tutorials and documentation discuss it in minute implementation detail, so I wanted to make a higher-level overview that deliberately ignores some of the small facts and focuses on the practicalities of writing projects that mix both kinds of code.” ‪This is really useful: clearly explains the two separate worlds of Python (sync and async functions) and describes Andrew’s clever sync_to_async and async_to_sync decorators as well.‬

# 12:30 am / async, andrew-godwin, python, python3

Moving a large and old codebase to Python3 (via) Really interesting case study full of good ideas. The codebase in this case was 240,000 lines of Python and Django written over the course of 15 years. The team used Python-Modernize to aid their transition to a six-compatible codebase first.

# 2:39 pm / python, python3

Feb. 21, 2018

Photos from our tour of the amazing bone collection of Ray Bandar. Ray Bandar (1927-2017) was an artist, scientist, naturalist and an incredibly prolific collector of bones. His collection is in the process of moving to the California Academy of Sciences but Natalie managed to land us a private tour lead by his great nephew. The collection is truly awe-inspiring, and a testament to an extraordinary life lived following a very particular passion.

# 4:58 am / photos

Andrew Godwin’s www-router Docker container (via) Really clever Docker trick: a container that runs Nginx and uses it to route traffic to other containers based on the hostname—but the hostnames to be routed are configured using environment variables which the run-nginx.py CMD script uses to dynamically construct an nginx config when the container starts.

# 5:04 am / docker, andrew-godwin, nginx

A Promenade of PyTorch. Useful overview of the PyTorch machine learning library from Facebook AI Research described as “a Python library enabling GPU-accelerated tensor computation”. Similar to TensorFlow, but where TensorFlow requires you to explicitly construct an execution graph PyTorch instead lets you write regular Python code (if statements, for loops etc) which PyTorch then uses to construct the execution graph for you.

# 5:31 am / machine-learning, pytorch, tensorflow, python

s3monkey: A Python library that allows you to interact with Amazon S3 Buckets as if they are your local filesystem. (via) A particularly devious hack by Kenneth Reitz—provides a context manager within which various Python filesystem APIs such as open() and os.listdir() are monkeypatched to operate against an S3 bucket instead. Kenneth built it to make it easier to work with files from apps running on Heroku. Under the hood it uses pyfakefs, a filesystem mocking library originally released by Google.

# 5:54 pm / s3, monkeypatch, python, heroku

Feb. 22, 2018

I’ve Just Launched “Pwned Passwords” V2 With Half a Billion Passwords for Download (via) Troy Hunt has collected 501,636,842 passwords from a wide collection of major breaches. He suggests using the to build a password strength checker that can say “your password has been used by 53,274 other people”. The full collection is available as a list of SHA1 codes (brute-force reversible but at least slightly obfuscated) in an 8GB file or as an API. Where things get really clever is the API design: you send just the first 5 characters of the SHA1 hash of the user’s password and the API responds with the full list of several hundred hashes that match that prefix. This lets you build a checking feature without sharing full passwords with a remote service, if you don’t want to host the full 8GB of data yourself.

# 7:24 pm / passwords, security

Feb. 23, 2018

GitHub: Weak cryptographic standards removal notice. GitHub deprecated TLSv1 and TLSv1.1 yesterday. I like how they handled the deprecation: they disabled the protocols for one hour on February 8th in order to (hopefully) warm people by triggering errors in automated processes, then disabled them completely a couple of weeks later.

# 3:41 pm / security, github

I am pleased to inform all of you that there is a notorious black market maple syrup seller and he looks exactly like the image you get in your head when someone first says the phrase “a notorious black market maple syrup seller” to you.

Brian Grubb

# 3:56 pm / crime, brian-grubb

github-trending-repos (via) This is a really clever hack: Vitaliy Potapov built a system for subscribing to a weekly digest of trending GitHub repos in your favourite languages entirely on top of the existing GitHub issues notification system. Find the issue for your particular language and hit “subscribe” and you’ll get an email (or push notification depending on how you get your issue notifications) once a week with the latest trends. The implementation is a 220 line Node.js script which runs on a daily and weekly schedule using Circle CI, so Vitaliy doesn’t even have to host or pay for any of the underlying infrastructure. It’s brilliant.

# 5:36 pm / nodejs, github

Feb. 25, 2018

Publishing history has various examples of advertising-only business models. But they are very much the exception. They mainly exist when there are near monopoly barriers to entry into the market which allow publishers to command and defend robust ad rates.

Josh Marshall

# 4:03 pm / advertising

kennethreitz/requests-html: HTML Parsing for Humans™ (via) Neat and tiny wrapper around requests, lxml and html2text that provides a Kenneth Reitz grade API design for intuitively fetching and scraping web pages. The inclusion of html2text means you can use a CSS selector to select a specific HTML element and then convert that to the equivalent markdown in a one-liner.

# 4:49 pm / scraping, html, python, requests

r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax. (via) Really neat API design: parse() behaves almost exactly in the opposite way to Python’s built-in format(), so you can use format strings as an alternative to regular expressions for extracting specific data from a string.

# 4:58 pm / regular-expressions, python

Feb. 26, 2018

By far the most important lesson I took out of this game is that whenever there's behavior that needs to be repeated around to multiple types of entities, it's better to default to copypasting it than to abstracting/generalizing it too early.

This is a very very hard thing to do in practice. As programmers we're sort of wired to see repetition and want to get rid of it as fast as possible, but I've found that that impulse generally creates more problems than it solves. The main problem it creates is that early generalizations are often wrong, and when a generalization is wrong it ossifies the structure of the code around it in a way that is harder to fix and change than if it wasn't there in the first place.

SSYGEN

# 5:23 am / programming

2018 » February

MTWTFSS
   1234
567891011
12131415161718
19202122232425
262728