A CSV export, JSON import workflow for bulk updating our data
28th April 2021
Originally posted to my internal blog at VaccinateCA
I just added missing counties to around 1200 of our locations using a combination of tricks, and I thought they’d make a good blog post.
County is an optional field on our location model, and we sometimes import new locations without attaching a county to them.
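For context, here's a simplified sketch of the relevant model (VIAL is a Django app; these field definitions are my guesses based on the SQL queries below, not the real model, but the key detail is that county allows null):

from django.db import models

class Location(models.Model):
    # Simplified sketch, not the real VIAL model definition: field
    # types are assumptions, but county really is nullable.
    public_id = models.CharField(max_length=64, unique=True)
    state = models.ForeignKey("State", on_delete=models.PROTECT)
    county = models.ForeignKey(
        "County", null=True, blank=True, on_delete=models.PROTECT
    )
    latitude = models.FloatField(null=True)
    longitude = models.FloatField(null=True)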
How big is the problem?
I ran this SQL query in our dashboard to figure out how bad the problem was:
select
  state.name,
  count(*) as number_of_locations_with_no_county
from
  location join state on state.id = location.state_id
where
  location.county_id is null
group by
  state.name
I got back this:
name         number_of_locations_with_no_county
Arkansas        1
Arizona        90
California      1
Oregon       1113
I decided to start with Oregon, since I knew that was my fault (I imported a bunch of data for Oregon over a month ago and clearly didn’t bother with counties).
Exporting the bad data as CSV
I used another dashboard query to export the location public ID, latitude and longitude for every location in Oregon that was missing a county:
select
  public_id, latitude, longitude
from
  location
where
  state_id = (
    select id from state where name = 'Oregon'
  )
  and county_id is null
I exported those results as CSV, then switched into a Jupyter notebook running on my laptop.
The full script
Here’s the script I ran there:
import csv
import httpx
import json

rows = list(csv.DictReader(open("select-public-id--latitude--lof05de6.csv")))

def county_for_point(latitude, longitude):
    return httpx.get("https://us-counties.datasette.io/counties/county_for_latitude_longitude.json?_shape=array", params={
        "latitude": latitude,
        "longitude": longitude,
    }).json()[0]['county_name']

counties = {}

# Progress bar:
from tqdm.notebook import tqdm

for row in tqdm(rows):
    counties[row["public_id"]] = county_for_point(
        row["latitude"], row["longitude"]
    )

update = {
    "update": {
        key: {
            "state": "OR",
            "county": value
        } for key, value in counties.items()
    },
    "revision_comment": "Added missing counties"
}
print(json.dumps(update))
Finding the county for a latitude/longitude point
The county_for_point() function uses my us-counties.datasette.io API, which looks up the county containing a given latitude/longitude point using US Census shapefiles—source code for that is at simonw/us-counties-datasette.
You can try the API out here: https://us-counties.datasette.io/counties/county_for_latitude_longitude?longitude=-122.676968&latitude=45.266289
Add .json and ?_shape=array to get the result as a JSON array:
[
    {
        "state_fips": "41",
        "state": "OR",
        "county_fips": "41005",
        "county_name": "Clackamas",
        "COUNTYNS": "01155127",
        "AFFGEOID": "0500000US41005",
        "GEOID": "41005",
        "LSAD": "06",
        "ALAND": 4845034547,
        "AWATER": 31873103
    }
]
Progress bars in Jupyter
I used tqdm to display a progress bar inside my Jupyter notebook. It makes it easy to wrap any iterable Python object so that a progress bar is displayed as the loop works through it:
from tqdm.notebook import tqdm

for row in tqdm(rows):
    counties[row["public_id"]] = county_for_point(
        row["latitude"], row["longitude"]
    )
Building JSON to send to our API
The /api/updateLocations API is documented here. It's a bulk update API which accepts a JSON body looking something like this:
{
    "update": {
        "rec9Zc6A08cEWyNpR": {
            "name": "Berkeley Clinic II"
        },
        "lgzgq": {
            "phone_number": "(555) 555-5551"
        }
    },
    "revision_comment": "New details"
}
It lets you apply a partial update to a whole list of locations at once.
In our case we only want to update the county (though we need to send the state too in order to correctly validate that the county name belongs to that state). My Python code to generate the update document looks like this:
update = {
    "update": {
        key: {
            "state": "OR",
            "county": value
        } for key, value in counties.items()
    },
    "revision_comment": "Added missing counties"
}
print(json.dumps(update))
A pretty-printed subset of the output looks like this:
{
    "update": {
        "lrzkd": {
            "state": "AZ",
            "county": "Maricopa"
        },
        "lrzmg": {
            "state": "AZ",
            "county": "Pima"
        },
        "lrzmr": {
            "state": "AZ",
            "county": "Cochise"
        }
    },
    "revision_comment": "Added missing counties"
}
Pasting that into our API explorer
Having generated this API update document, I used our interactive API explorer tool to execute the update:
https://vial.calltheshots.us/api/updateLocations/debug
Almost all of our API endpoints have an interface like this, which is mainly intended for trying things out but also works for making ad-hoc requests like this one.
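If you'd rather not use the web UI, the same update document can be POSTed to the endpoint directly. Here's a minimal sketch from the notebook; the bearer-token authentication shown is an assumption, so check the API documentation for the actual mechanism:

import httpx

# Minimal sketch: POST the update document built earlier in the
# notebook. The Authorization header scheme is an assumption;
# consult the API documentation for the real auth mechanism.
API_TOKEN = "your-api-token"  # hypothetical placeholder

response = httpx.post(
    "https://vial.calltheshots.us/api/updateLocations",
    json=update,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=60.0,
)
response.raise_for_status()
print(response.json())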
Tying it together with a saved dashboard
I built a saved dashboard (using the VIAL admin) called locations-with-no-county to tie all of this together.
Initially I thought the dashboard could be used by our web bankers to find locations missing a county—so I had it return ten random locations without a county, so multiple web bankers could use it at the same time with a low risk of overlapping efforts.
Then I realized I could automate it myself, so that aspect of the dashboard turned out not to be useful—though we may well use it for smaller fixing jobs in the future.
https://vial.calltheshots.us/dashboard/locations-with-no-county/
Doing this by generating JSON entirely inside PostgreSQL
I used another variant of this trick last week to add concordance identifiers to our locations, using the /api/updateLocationConcordances API.
I wanted to add a walgreens:12134 concordance identifier to every one of our locations that was a Walgreens. You can follow my progress in this issue—I used this SQL query to identify all of our Walgreens locations by executing a PostgreSQL regular expression against the name column, then generate the API update JSON using PostgreSQL's JSON functions:
with extracted_idrefs as (
  select
    public_id, name,
    (regexp_match(name, 'Walgreens.*#0*([0-9]{1,8})', 'i'))[1] as idref
  from
    location
  where
    regexp_match(name, 'Walgreens.*#0*([0-9]{1,8})', 'i') is not null
)
select
  json_build_object(
    'update',
    json_object_agg(
      public_id,
      json_build_object(
        'add', json_build_array('walgreens:' || idref),
        'comment', name
      )
    )
  )
from
  extracted_idrefs
where
  public_id not in (
    select
      public_id
    from
      location
    where
      id in (
        select
          location_id
        from
          concordance_location
        where
          concordanceidentifier_id in (
            select
              id
            from
              concordance_identifier
            where
              authority = 'walgreens'
          )
      )
  )
That last where clause avoids generating JSON for locations that already have a walgreens:X concordance identifier.
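A quick way to sanity-check what that pattern captures is to run the equivalent regular expression in Python (the sample names here are invented):

import re

# Same pattern the SQL uses: case-insensitive, skips leading
# zeros, captures up to eight digits of the store number.
pattern = re.compile(r"Walgreens.*#0*([0-9]{1,8})", re.IGNORECASE)

for name in ("Walgreens Co. #10076", "WALGREENS PHARMACY #012134", "CVS #1234"):
    match = pattern.search(name)
    print(name, "->", match.group(1) if match else None)

# Walgreens Co. #10076 -> 10076
# WALGREENS PHARMACY #012134 -> 12134
# CVS #1234 -> None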
The output of the query looks something like this:
{
    "update": {
        "lrxzc": {
            "add": [
                "walgreens:10076"
            ],
            "comment": "Walgreens Co. #10076"
        },
        "lrxzd": {
            "add": [
                "walgreens:10082"
            ],
            "comment": "Walgreens Co. #10082"
        },
        "lrxzf": {
            "add": [
                "walgreens:10160"
            ],
            "comment": "Walgreens Co. #10160"
        }
    }
}
As before, this can be executed directly using this API debugging tool: https://vial.calltheshots.us/api/updateLocationConcordances/debug