Simon Willison’s Weblog

Using “import refs” to iteratively import data into Django

4th November 2017

I’ve been writing a few scripts to backfill my blog with content I originally posted elsewhere. So far I’ve imported answers I posted on Quora (background), answers I posted on Ask MetaFilter and content I recovered from the Internet Archive.

I started out writing custom import scripts (like this Quora one), but I’ve now built a generalized mechanism for this which I thought was worth writing up.

Each of my content imports now takes the form of a JSON document, which looks something like this:

[
  {
    "body": "<p><em>My answer to ...</em></p>",
    "tags": [
      "backpacks",
      "laptops",
      "style",
      "accessories",
      "bags"
    ],
    "title": "I need a new backpack",
    "datetime": "2005-01-16T14:08:00",
    "import_ref": "askmetafilter:14075",
    "type": "entry",
    "slug": "i-need-a-new-backpack"
  }
]

Two larger examples: the missing content I extracted from the Internet Archive, and the answers I scraped from Ask MetaFilter.

The type property can be set to entry, quotation or blogmark and specifies which type of content should be imported. The datetime, slug and tags fields are common across all three types—the other fields differ for each type.
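As a rough illustration of that shape, here is a minimal sketch of a pre-import check. The check_item name is hypothetical, and treating the three common fields as required is an assumption of this sketch rather than something the import command is known to enforce.

```python
from datetime import datetime

# The three content types named in the post.
VALID_TYPES = {"entry", "quotation", "blogmark"}
# Fields described as common to all three types (assumed required here).
COMMON_FIELDS = ("datetime", "slug", "tags")


def check_item(item):
    """Return a list of problems found in one import item (empty if none)."""
    problems = []
    if item.get("type") not in VALID_TYPES:
        problems.append("unknown type: %r" % (item.get("type"),))
    for field in COMMON_FIELDS:
        if field not in item:
            problems.append("missing field: %s" % field)
    if "datetime" in item:
        try:
            # Accepts ISO 8601 strings such as "2005-01-16T14:08:00".
            datetime.fromisoformat(item["datetime"])
        except (TypeError, ValueError):
            problems.append("bad datetime: %r" % (item["datetime"],))
    if not isinstance(item.get("tags", []), list):
        problems.append("tags should be a list")
    return problems
```

Running this over the list before importing gives a quick way to spot malformed items while a scraper is still changing.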

The most interesting field here is import_ref. This is optional, but if provided it forms a unique reference associated with that item of content. I then use that reference in a call to Django’s update_or_create() method. This means I can run the same import multiple times: the first run will create objects, while subsequent runs update objects in place.

The end result is that I can incrementally improve the scrapers I am writing, re-importing the resulting JSON to update previously imported records in-place. In addition to hacking on my blog, I’ve been using this pattern for some API integrations at work recently and it’s worked out very well.
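The update-or-create step described above can be sketched roughly as follows. This is not the actual management command: import_item is a hypothetical helper, and it assumes the JSON keys match the model’s field names, which may not hold for a real model. The model is passed in as a parameter. The only Django calls used are the manager methods create() and update_or_create(defaults=..., **lookup), the latter returning an (object, created) pair.

```python
def import_item(model, item):
    """Create or update one record, keyed on its import_ref when present.

    `model` is expected to be a Django model class (or anything exposing the
    same `objects.create` / `objects.update_or_create` manager methods).
    Assumes the remaining JSON keys match the model's field names.
    """
    # Work on a copy so the caller's dictionary is left untouched.
    fields = dict(item)
    # "type" selects the model and "tags" is a relation, so neither is
    # passed through as a plain field value; tags are returned to the caller.
    fields.pop("type", None)
    tags = fields.pop("tags", [])
    import_ref = fields.pop("import_ref", None)

    if import_ref is None:
        # No stable reference: every run creates a new record.
        obj, created = model.objects.create(**fields), True
    else:
        # Look up by import_ref; `defaults` is applied on create and on update.
        obj, created = model.objects.update_or_create(
            import_ref=import_ref, defaults=fields
        )
    return obj, created, tags
```

Importing the same item twice should report created as True the first time and False the second, with the second run overwriting the stored fields.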

import_ref is defined on my models as a unique, nullable text field:

    import_ref = models.TextField(max_length=64, null=True, unique=True)

Since the Django admin doesn’t handle nullable fields well by default, I added import_ref to my readonly_fields property in my admin configuration to avoid accidentally setting it to a blank string when editing through the admin interface.
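One likely reason a stray blank string matters is the unique constraint: databases such as SQLite and PostgreSQL allow any number of NULLs in a unique column, but only one empty string. The sketch below shows this using Python’s built-in sqlite3 module. The table and function names are made up for the demonstration and are not my real schema.

```python
import sqlite3


def demo_unique_nullable():
    """Show how a UNIQUE column treats repeated NULLs vs. repeated ''."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id INTEGER PRIMARY KEY, import_ref TEXT UNIQUE)"
    )
    # Two rows without an import_ref: NULLs do not collide with each other.
    conn.execute("INSERT INTO entry (import_ref) VALUES (NULL)")
    conn.execute("INSERT INTO entry (import_ref) VALUES (NULL)")
    # One blank string is accepted...
    conn.execute("INSERT INTO entry (import_ref) VALUES ('')")
    try:
        # ...but a second one violates the unique constraint.
        conn.execute("INSERT INTO entry (import_ref) VALUES ('')")
        second_blank_ok = True
    except sqlite3.IntegrityError:
        second_blank_ok = False
    null_rows = conn.execute(
        "SELECT COUNT(*) FROM entry WHERE import_ref IS NULL"
    ).fetchone()[0]
    conn.close()
    return null_rows, second_blank_ok
```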

Here’s my completed import_blog_json management command.

My workflow for importing data is now pretty streamlined. I write the scrapers in a Jupyter notebook and use that to generate a list of importable items as Python dictionaries. I run open('/tmp/items.json', 'w').write(json.dumps(items, indent=2)) to dump the items to a JSON file. Then I can run ./manage.py import_blog_json /tmp/items.json to import them into my local development environment. Thanks to the import_ref I can do this as many times as I like until I’m pleased with the result.
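A slightly more careful version of the dump step might look like this. Note that open() defaults to read mode, so the file has to be opened with "w" for the write to work. The dump_items helper is a hypothetical name for this sketch.

```python
import json


def dump_items(items, path):
    """Write a list of importable items to `path` as indented JSON."""
    # "w" opens the file for writing; the with-block closes it afterwards,
    # which makes sure everything is flushed to disk.
    with open(path, "w") as fp:
        json.dump(items, fp, indent=2)
```

In a notebook this would be called as dump_items(items, '/tmp/items.json').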

Once it’s ready, I run !cat /tmp/items.json | pbcopy in Jupyter to copy the JSON to my clipboard, then paste the JSON into a new GitHub Gist. I then copy the URL of that raw JSON and run the import against my production instance.

Heroku tip: running heroku run bash will start a bash prompt in a dyno hooked up to your application. You can then run ./manage.py ... commands against your production environment.

So… I just have to run heroku run bash followed by ./manage.py import_blog_json https://gist.github.com/path-to-json --tag_with=askmetafilter and the new content will be live on my site.

The tag_with option allows me to specify a tag to apply to all of that imported content, useful for checking that everything worked as expected.
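To illustrate how a command could accept either a local path or a URL and apply an extra tag, here is a minimal standard-library sketch. The load_items and add_tag helpers are hypothetical, and the real import_blog_json command may work quite differently, for example by tagging the saved objects rather than the incoming items.

```python
import json
from urllib.request import urlopen


def load_items(source):
    """Load a list of import items from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # Remote JSON, e.g. the raw URL of a Gist.
        with urlopen(source) as response:
            return json.loads(response.read().decode("utf-8"))
    # Otherwise treat the argument as a path on disk.
    with open(source) as fp:
        return json.load(fp)


def add_tag(items, tag):
    """Return copies of the items with `tag` added to each tag list."""
    tagged = []
    for item in items:
        tags = list(item.get("tags", []))
        if tag not in tags:
            tags.append(tag)
        # dict(item, tags=tags) copies the item and replaces its tags.
        tagged.append(dict(item, tags=tags))
    return tagged
```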

This is Using “import refs” to iteratively import data into Django by Simon Willison, posted on 4th November 2017.
