Simon Willison’s Weblog

18 items tagged “datasette-lite”

Datasette running entirely in the browser using Pyodide and WebAssembly, at lite.datasette.io

2025

URL-addressable Pyodide Python environments

This evening I spotted an obscure bug in Datasette, using Datasette Lite. I figure it’s a good opportunity to highlight how useful it is to have a URL-addressable Python environment, powered by Pyodide and WebAssembly.
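Everything that defines a Datasette Lite environment, plugins and data files included, can be captured in the URL itself. An illustrative example (the plugin is real, the CSV URL is a placeholder):

https://lite.datasette.io/?install=datasette-cluster-map&csv=https://example.com/data.csv

Anyone opening that link gets the same environment: the datasette-cluster-map plugin installed and that CSV loaded into an in-browser SQLite database.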

[... 1,905 words]

S1: The $6 R1 Competitor? Tim Kellogg shares his notes on a new paper, s1: Simple test-time scaling, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for 26 minutes on 16 NVIDIA H100 GPUs.

Tim highlights the most exciting result:

After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that's needed to achieve o1-preview performance on a 32B model.

The paper describes a technique called "Budget forcing":

To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation

That's the same trick Theia Vogel described a few weeks ago.
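Here's roughly what that loop looks like (prompt, generate() and count_tokens() are hypothetical stand-ins, not the paper's actual code):

# Budget forcing, sketched: hold back the end-of-thinking delimiter until
# a minimum number of reasoning tokens has been produced, appending "Wait"
# to push the model into further reflection.
MIN_THINKING_TOKENS = 1024
END_OF_THINKING = "</think>"

trace = ""
while count_tokens(trace) < MIN_THINKING_TOKENS:
    # generate() is assumed to stop before emitting the delimiter
    trace += generate(prompt + trace, stop=[END_OF_THINKING])
    if count_tokens(trace) < MIN_THINKING_TOKENS:
        trace += "Wait"  # nudge the model to keep reasoning

answer = generate(prompt + trace + END_OF_THINKING)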

Here's the s1-32B model on Hugging Face. I found a GGUF version of it at brittlewis12/s1-32B-GGUF, which I ran using Ollama like so:

ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0

I also found those 1,000 samples in the simplescaling/s1K data repository on Hugging Face.

I used DuckDB to convert the parquet file to CSV (and turn one VARCHAR[] column into JSON):

COPY (
    SELECT 
      solution,
      question,
      cot_type,
      source_type,
      metadata,
      cot,
      json_array(thinking_trajectories) as thinking_trajectories,
      attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');

Then I loaded that CSV into sqlite-utils so I could use the convert command to turn a Python data structure into JSON using json.dumps() and eval():

# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv > s1k.csv

Here's that CSV in a Gist, which means I can load it into Datasette Lite.
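Loading it is a single query-string parameter; the pattern looks like this (Gist raw URL truncated as a placeholder):

https://lite.datasette.io/?csv=https://gist.githubusercontent.com/.../s1k.csv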

Screenshot of Datasette Lite faceted by cot_type (crossword: 15), showing row 93. Solution: "### Answer: INCIDENT ROOM ### Explanation: Definition: investigators' facility. Anagram of (... changes) NOTICED MINOR. Defn: ... in a police station." Question: "Solve the crossword puzzle. You are presented with a clue as input and the number of letters in brackets. Noticed minor changes in investigators' facility (8,4)". cot_type: crossword. source_type: 0xharib/xword1. metadata: { "instruction": "You are an expert level solver of cryptic crosswords. You are presented with a clue as input. Respond with the answer and explanation." }

It really is a tiny amount of training data. It's mostly math and science, but there are also 15 cryptic crossword examples.

# 5th February 2025, 8 pm / duckdb, datasette-lite, inference-scaling, ai, ollama, llms, datasette, generative-ai, qwen

2024

Foursquare Open Source Places: A new foundational dataset for the geospatial community (via) I did not expect this!

[...] we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places ("FSQ OS Places"). This base layer of 100mm+ global places of interest ("POI") includes 22 core attributes (see schema here) that will be updated monthly and available for commercial use under the Apache 2.0 license framework.

The data is available as Parquet files hosted on Amazon S3.

Here's how to list the available files:

aws s3 ls s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/

I got back places-00000.snappy.parquet through places-00024.snappy.parquet, each file around 455MB for a total of 10.6GB of data.

I ran duckdb and then used DuckDB's ability to remotely query Parquet on S3 to explore the data a bit more without downloading it to my laptop first:

select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet';

This got back 4,180,424 - that number is similar for each file, suggesting around 104,000,000 records total.

Update: DuckDB can use wildcards in S3 paths (thanks, Paul) so this query provides an exact count:

select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet';

That returned 104,511,073 - and Activity Monitor on my Mac confirmed that DuckDB only needed to fetch 1.2MB of data to answer that query. That's possible because Parquet files record row counts in their footer metadata, so a count(*) only has to read those footers, not the column data itself.

I ran this query to retrieve 1,000 places from that first file as newline-delimited JSON:

copy (
    select * from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet'
    limit 1000
) to '/tmp/places.json';

Here's that places.json file, and here it is imported into Datasette Lite.

Finally, I got ChatGPT Code Interpreter to convert that file to GeoJSON and pasted the result into this Gist, giving me a map of those thousand places (because Gists automatically render GeoJSON):

A map of the world with 1,000 markers on it. A marker in Colombia shows a dialog for Raisbeck, Bogota Dv, Cra 47 A 114 05 Second Floor
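Here's a sketch of the kind of conversion Code Interpreter performed (my code, not its output; it assumes the latitude, longitude and name attributes from the FSQ OS Places schema):

import json

# Build a GeoJSON FeatureCollection from the newline-delimited JSON export
features = []
with open("/tmp/places.json") as f:
    for line in f:
        place = json.loads(line)
        if place.get("latitude") is None or place.get("longitude") is None:
            continue  # skip records without coordinates
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [place["longitude"], place["latitude"]],
            },
            "properties": {"name": place.get("name")},
        })

with open("places.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)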

# 20th November 2024, 5:52 am / open-source, gis, foursquare, datasette-lite, parquet, duckdb, code-interpreter, ai-assisted-programming, geojson

2023

Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking

I’ve fallen a bit behind on my weeknotes. Here’s a catchup for the last few weeks.

[... 769 words]

MMS Language Coverage in Datasette Lite. I converted the HTML table of 4,021 languages supported by Meta’s new Massively Multilingual Speech models to newline-delimited JSON and loaded it into Datasette Lite. Faceting by Language Family is particularly interesting—the top five families represented are Niger-Congo with 1,019, Austronesian with 609, Sino-Tibetan with 288, Indo-European with 278 and Afro-Asiatic with 222.
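That kind of conversion only takes a few lines (a sketch assuming a saved copy of the page and BeautifulSoup, not necessarily the exact script used):

import json
from bs4 import BeautifulSoup

# Parse the saved HTML table and emit one JSON object per row
table = BeautifulSoup(open("mms-languages.html").read(), "html.parser").find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
with open("languages.json", "w") as f:
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            f.write(json.dumps(dict(zip(headers, cells))) + "\n")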

# 22nd May 2023, 8:01 pm / facebook, datasette-lite, ai, datasette

Data analysis with SQLite and Python for PyCon 2023

I’m at PyCon 2023 in Salt Lake City this week.

[... 347 words]

What’s in the RedPajama-Data-1T LLM training set

RedPajama is “a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens”. It’s a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.

[... 1,077 words]

2022

Tracking Mastodon user numbers over time with a bucket of tricks

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.

[... 1,534 words]

Datasette Lite: Loading JSON data (via) I added a new feature to Datasette Lite: you can now pass it the URL to a JSON file (hosted on a CORS-compatible hosting provider such as GitHub or GitHub Gists) and it will load that file into a database table for you. It expects an array of objects, but if your file has an object as the root it will search through it looking for the first key that is an array of objects and load those instead.
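The search logic works roughly like this (my sketch, not Datasette Lite's actual implementation):

def find_rows(data):
    # Root is already an array of objects: use it directly
    if isinstance(data, list):
        return data
    # Otherwise return the first value that is a non-empty list of objects
    if isinstance(data, dict):
        for value in data.values():
            if isinstance(value, list) and value and all(
                isinstance(item, dict) for item in value
            ):
                return value
    raise ValueError("no array of objects found")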

# 18th November 2022, 6:43 pm / datasette-lite, json, projects, datasette, cors

Measuring traffic during the Half Moon Bay Pumpkin Festival

This weekend was the 50th annual Half Moon Bay Pumpkin Festival.

[... 2,693 words]

Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more

Despite distractions from AI I managed to make progress on a bunch of different projects this week, including new releases of s3-credentials and shot-scraper, a new datasette-edit-templates plugin and a small but neat improvement to Datasette Lite.

[... 1,562 words]

Open every CSV file in a GitHub repository in Datasette Lite (via) I built an Observable notebook that accepts a GitHub repository as input, scans it for CSV files and generates a link to open all of those CSV files in Datasette Lite.
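The notebook itself is JavaScript, but the idea fits in a few lines of Python (repo and branch are placeholders; this is my sketch, not the notebook's code):

import json, urllib.request

repo = "owner/repo"  # placeholder: the repository to scan
branch = "main"      # assumes the default branch is main

# List every file in the repository via the GitHub trees API
tree = json.load(urllib.request.urlopen(
    f"https://api.github.com/repos/{repo}/git/trees/{branch}?recursive=1"
))
csv_urls = [
    f"https://raw.githubusercontent.com/{repo}/{branch}/{item['path']}"
    for item in tree["tree"] if item["path"].endswith(".csv")
]

# Datasette Lite accepts multiple ?csv= parameters in one URL
print("https://lite.datasette.io/?" + "&".join(f"csv={u}" for u in csv_urls))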

# 1st September 2022, 7:24 pm / datasette-lite, projects, observable, github

Analyzing ScotRail audio announcements with Datasette—from prototype to production

Scottish train operator ScotRail released a two-hour long MP3 file containing all of the components of its automated station announcements. Messing around with them is proving to be a huge amount of fun.

[... 4,428 words]

Plugin support for Datasette Lite

I’ve added a new feature to Datasette Lite, my distribution of Datasette that runs entirely in the browser using Python and SQLite compiled to WebAssembly. You can now install additional Datasette plugins by passing them in the URL.

[... 865 words]

Weeknotes: Joining the board of the Python Software Foundation

A few weeks ago I was elected to the board of directors for the Python Software Foundation.

[... 2,081 words]

Joining CSV files in your browser using Datasette Lite

I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.

[... 546 words]

Weeknotes: Datasette Lite, nogil Python, HYTRADBOI

My big project this week was Datasette Lite, a new way to run Datasette directly in a browser, powered by WebAssembly and Pyodide. I also continued my research into running SQL queries in parallel, described last week. Plus I spoke at HYTRADBOI.

[... 1,434 words]

Datasette Lite: a server-side Python web application running in a browser

Datasette Lite is a new way to run Datasette: entirely in a browser, taking advantage of the incredible Pyodide project which provides Python compiled to WebAssembly plus a whole suite of useful extras.

[... 4,800 words]