7 items tagged “screenscraping”
2017
Changelogs to help understand the fires in the North Bay
The situation in the counties north of San Francisco is horrifying right now. I’ve repurposed some of the tools I built to for the Irma Response project last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.
[... 383 words]Scraping hurricane Irma
The Irma Response project is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The Irma API is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the irmashelters.org website.
[... 438 words]2009
Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.
2008
YQL—converting the web to JSON with mock SQL. YQL just got a whole lot more interesting to me—I had no idea they were exposing an HTML and RSS scraping tool over a JSONP API in addition to all of the Yahoo! web service methods.
lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts.
PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document.
2007
/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python.