Simon Willison’s Weblog

7 items tagged “screenscraping”

2017

Changelogs to help understand the fires in the North Bay

The situation in the counties north of San Francisco is horrifying right now. I’ve repurposed some of the tools I built for the Irma Response project last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.

[... 383 words]

Scraping hurricane Irma

The Irma Response project is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The Irma API is an attempt to gather key information in one place, verify it and publish it in a reusable way. It currently powers the irmashelters.org website.

[... 438 words]

2009

Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.

# 24th January 2009, 11:52 pm / rdf, xml, screenscraping, gecko, xulrunner, mozilla, dom, greasemonkey, webservice, crowbar
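
To make the Crowbar calling pattern concrete, here is a rough sketch of building such a request: one URL for the page to load, one for the scraping script to run against it. The endpoint address and the `url` and `scraper` parameter names are assumptions for illustration, not taken from the Crowbar documentation, so check them against a real install before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: assumes a Crowbar instance listening locally.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(page_url: str, scraper_url: str) -> str:
    """Return a GET URL asking the service to run scraper_url against page_url."""
    # "url" and "scraper" are assumed parameter names, used for illustration only.
    query = urlencode({"url": page_url, "scraper": scraper_url})
    return f"{CROWBAR_ENDPOINT}?{query}"


request_url = build_crowbar_request(
    "http://example.com/article.html",
    "http://example.com/scrapers/article.js",
)
print(request_url)
```

The response would then be the RDF/XML produced by the scraper script, ready to hand to whatever RDF or XML parser you prefer.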

2008

YQL—converting the web to JSON with mock SQL. YQL just got a whole lot more interesting to me—I had no idea they were exposing an HTML and RSS scraping tool over a JSONP API in addition to all of the Yahoo! web service methods.

# 13th December 2008, 9:39 am / yql, scraping, json, yahoo, html, screenscraping, jsonp, sql
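
As an illustration of the "mock SQL" idea, the sketch below builds a YQL-style query against the `html` table and unwraps a JSONP response. The endpoint URL and query shape are written from memory of the public YQL API of the time and should be treated as assumptions; the `handle` callback name and the sample response are made up. The service has since been shut down, so this only constructs the request and parses a canned response rather than calling anything.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint, as recalled from the original service.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, callback=None):
    """Build a request URL that selects nodes from page_url matching xpath."""
    statement = f'select * from html where url="{page_url}" and xpath="{xpath}"'
    params = {"q": statement, "format": "json"}
    if callback:
        # Naming a callback asks for JSONP rather than plain JSON.
        params["callback"] = callback
    return f"{YQL_ENDPOINT}?{urlencode(params)}"


def unwrap_jsonp(body, callback):
    """Strip the callback(...) wrapper from a JSONP body and parse the JSON."""
    body = body.strip().rstrip(";")
    prefix = callback + "("
    if not (body.startswith(prefix) and body.endswith(")")):
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(body[len(prefix):-1])


url = build_yql_url("http://example.com/", "//h1", callback="handle")
sample = 'handle({"query": {"count": 1}});'  # made-up response for illustration
print(url)
print(unwrap_jsonp(sample, "handle"))
```

In a browser the JSONP wrapper would be consumed by a script tag; `unwrap_jsonp` is only there to show what the payload looks like from Python.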

lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts.

# 11th December 2008, 9:54 am / lxml, macports, python, screenscraping, ian-bicking

PDFMiner. Useful-looking PDF parsing library in Python. It can produce an XML representation of the text and style information in a PDF document.

# 3rd August 2008, 3:29 pm / pdf, python, xml, screenscraping, pdfminer

2007

/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python.

# 11th October 2007, 4:10 pm / python, open-source, journalist, screenscraping