Simon Willison on xml

105 posts tagged “xml”

2025

Removing XSLT for a more secure browser (via) Previously discussed back in August, it looks like it's now official:

Chrome intends to deprecate and remove XSLT from the browser. [...] We intend to remove support from version 155 (November 17, 2026). The Firefox and WebKit projects have also indicated plans to remove XSLT from their browser engines. [...]

The continued inclusion of XSLT 1.0 in web browsers presents a significant and unnecessary security risk. The underlying libraries that process these transformations, such as libxslt (used by Chromium browsers), are complex, aging C/C++ codebases. This type of code is notoriously susceptible to memory safety vulnerabilities like buffer overflows, which can lead to arbitrary code execution.

I mostly encounter XSLT on people's Atom/RSS feeds, converting those to a more readable format in case someone should navigate directly to that link. Jake Archibald shared an alternative solution to that back in September.

# 5th November 2025, 10:24 pm / browsers, chrome, security, web-standards, xml, xslt, jake-archibald

Making XML human-readable without XSLT. In response to the recent discourse about XSLT support in browsers, Jake Archibald shares a new-to-me alternative trick for making an XML document readable in a browser: adding the following element near the top of the XML:

<script
  xmlns="http://www.w3.org/1999/xhtml"
  src="script.js" defer="" />

That script.js will then be executed by the browser, and can swap out the XML with HTML by creating new elements using the correct namespace:

const htmlEl = document.createElementNS(
  'http://www.w3.org/1999/xhtml',
  'html',
);
document.documentElement.replaceWith(htmlEl);
// Now populate the new DOM

# 2nd September 2025, 7:32 pm / browsers, javascript, rss, xml, xslt, jake-archibald

My First Open Source AI Generated Library (via) Armin Ronacher had Claude and Claude Code do almost all of the work in building, testing, packaging and publishing a new Python library based on his design:

It wrote ~1100 lines of code for the parser

It wrote ~1000 lines of tests

It configured the entire Python package, CI, PyPI publishing

Generated a README, drafted a changelog, designed a logo, made it theme-aware

Did multiple refactorings to make me happier

The project? sloppy-xml-py, a lax XML parser (and violation of everything the XML Working Group hold sacred) which ironically is necessary because LLMs themselves frequently output "XML" that includes validation errors.

Claude's SVG logo design is actually pretty decent, turns out it can draw more than just bad pelicans!

Hand drawn style, orange rough rectangly containing < { s } > - then the text Sloppy XML below in black

I think experiments like this are a really valuable way to explore the capabilities of these models. Armin's conclusion:

This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.

Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.

I'd like to present a slightly different conclusion here. The most interesting thing about this project is that the code is good.

My criteria for good code these days is the following:

Solves a defined problem, well enough that I'm not tempted to solve it in a different way
Uses minimal dependencies
Clear and easy to understand
Well tested, with tests prove that the code does what it's meant to do
Comprehensive documentation
Packaged and published in a way that makes it convenient for me to use
Designed to be easy to maintain and make changes in the future

sloppy-xml-py fits all of those criteria. It's useful, well defined, the code is readable with just about the right level of comments, everything is tested, the documentation explains everything I need to know, and it's been shipped to PyPI.

I'd be proud to have written this myself.

This example is not an argument for replacing programmers with LLMs. The code is good because Armin is an expert programmer who stayed in full control throughout the process. As I wrote the other day, a skilled individual with both deep domain understanding and deep understanding of the capabilities of the agent.

# 21st June 2025, 11:22 pm / armin-ronacher, open-source, pypi, python, xml, ai, generative-ai, llms, ai-assisted-programming, claude, claude-code

Cracking The Dave & Buster’s Anomaly. Guilherme Rambo reports on a weird iOS messages bug:

The bug is that, if you try to send an audio message using the Messages app to someone who’s also using the Messages app, and that message happens to include the name “Dave and Buster’s”, the message will never be received.

Guilherme captured the logs from an affected device and spotted an XHTMLParseFailure error.

It turned out the iOS automatic transcription mechanism was recognizing the brand name and converting it to the official restaurant chain's preferred spelling "Dave & Buster’s"... which was then incorrectly escaped and triggered a parse error!

# 5th June 2025, 10:23 am / xhtml, xml, ios

2024

Anthropic’s Prompt Engineering Interactive Tutorial (via) Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try uvx like this:

git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses

This installed a working Jupyter system, started the server and launched my browser within a few seconds.

The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used %pip install anthropic instead of !pip install anthropic to make sure the package was installed in the correct virtual environment, then filed an issue and a PR.

One new-to-me trick: in the first chapter the tutorial suggests running this:

API_KEY = "your_api_key_here"
%store API_KEY

This stashes your Anthropic API key in the IPython store. In subsequent notebooks you can restore the API_KEY variable like this:

%store -r API_KEY

I poked around and on macOS those variables are stored in files of the same name in ~/.ipython/profile_default/db/autorestore.

Chapter 4: Separating Data and Instructions included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:

Note: While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you use specifically XML tags as separators for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance. We have purposefully made Claude very malleable and customizable this way.

Plus this note on the importance of avoiding typos, with a nod back to the problem of sandbagging where models match their intelligence and tone to that of their prompts:

This is an important lesson about prompting: small details matter! It's always worth it to scrub your prompts for typos and grammatical errors. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.

Chapter 5: Formatting Output and Speaking for Claude includes notes on one of Claude's most interesting features: prefill, where you can tell it how to start its response:

client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[
        {"role": "user", "content": "JSON facts about cats"},
        {"role": "assistant", "content": "{"}
    ]
)

Things start to get really interesting in Chapter 6: Precognition (Thinking Step by Step), which suggests using XML tags to help the model consider different arguments prior to generating a final answer:

Is this review sentiment positive or negative? First, write the best arguments for each side in <positive-argument> and <negative-argument> XML tags, then answer.

The tags make it easy to strip out the "thinking out loud" portions of the response.

It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):

In most situations (but not all, confusingly enough), Claude is more likely to choose the second of two options, possibly because in its training data from the web, second options were more likely to be correct.

This effect can be reduced using the thinking out loud / brainstorming prompting techniques.

A related tip is proposed in Chapter 8: Avoiding Hallucinations:

How do we fix this? Well, a great way to reduce hallucinations on long documents is to make Claude gather evidence first.

In this case, we tell Claude to first extract relevant quotes, then base its answer on those quotes. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.

I really like the example prompt they provide here, for answering complex questions against a long document:

<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>

Please read the below document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in <answer> tags.

# 30th August 2024, 2:52 am / python, xml, ai, jupyter, prompt-engineering, generative-ai, llms, anthropic, claude, uv

2022

SIARD: Software Independent Archiving of Relational Databases (via) I hadn’t heard of this before but it looks really interesting: the Federal Archives of Switzerland developed a standard for archiving any relational database as a zip file full of XML which is “is used in over 50 countries around the globe”.

# 4th May 2022, 10:40 pm / archives, databases, xml

2020

Building an Evernote to SQLite exporter

I’ve been using Evernote for over a decade, and I’ve long wanted to export my data from it so I can do interesting things with it.

[... 1,879 words]

8:12 pm / 16th October 2020 / projects, sqlite, xml, datasette, dogsheep, sqlite-utils

xml-analyser. In building evernote-to-sqlite I dusted off an ancient (2009) project I built that scans through an XML file and provides a summary of what elements are present in the document and how they relate to each other. I’ve now packaged it up as a CLI app and published it on PyPI.

# 12th October 2020, 12:41 am / cli, projects, xml

2019

Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.

# 24th July 2019, 8:25 am / elementtree, memory, profiling, python, xml

Convert Locations.kml (pulled from an iPhone backup) to SQLite. I’ve been playing around with data from my iPhone using the iPhone Backup Extractor app and one of the things it exports for you is a Locations.kml file full of location history data. I wrote a tiny script using Python’s ElementTree XMLPullParser to efficiently iterate through the Placemarks and yield them as dictionaries, which I then batch-inserted into sqlite-utils to create a SQLite database.

# 14th June 2019, 12:45 am / kml, projects, sqlite, xml, sqlite-utils

2018

Exploring the UK Register of Members Interests with SQL and Datasette

Ever wondered which UK Members of Parliament get gifted the most helicopter rides? How about which MPs have been given Christmas hampers by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.

[... 1,167 words]

3:49 pm / 25th April 2018 / mysociety, political-hacking, politics, projects, sqlite, xml, datasette

2012

Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?

It’s replaced XML as the default format for most APIs. XML is still necessary for Atom/RSS feeds and other existing standards built on top of XML. It’s also a better choice than JSON for markup-style data—stuff like XHTML where tags are applied to sequences of characters within larger chunks of text.

[... 81 words]

5:17 pm / 25th February 2012 / json, web-development, xml, quora

What are XML feed best practices?

It sounds like you’re pretty much screwed already, if you’re dealing with companies that still think FTPing XML around is a sensible thing to do.

[... 364 words]

2:29 pm / 31st January 2012 / databases, mysql, php, xml, quora

What is the difference between XHTML 1.0 strict and transitional?

Not a lot. XHTML transitional lets you use a few presentational attributes and elements that aren’t available in XHTML strict. Here’s a more detailed overview from back in 2005: http://24ways.org/2005/transitio...

[... 59 words]

1:09 pm / 14th January 2012 / html, web-development, xhtml, xml, quora

2010

Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML.

# 10th December 2010, 9:46 am / json, search, solr, xml, recovered

I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

— James Clark

# 2nd December 2010, 6:48 pm / html5, json, xml, recovered

2009

Introducing BERT and BERT-RPC. Justification for inventing a brand new serialisation protocol: Thrift and Protocol Buffers both use IDLs and code generation, XML “is not convertible to a simple unambiguous data structure in any language I’ve ever used” and JSON lacks support for unencoded binary data. The result is BERT—Binary ERlang Term—which extracts a format from Erlang in much the same way that JSON extracted one from JavaScript.

# 21st October 2009, 10:11 pm / erlang, github, javascript, json, protocolbuffers, serialisation, thrift, xml

minixsv (via) As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing!

# 12th August 2009, 4:59 pm / elementtree, minixsv, python, relaxng, validation, xml, xmlschema

xmlwitch. An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement.

# 24th July 2009, 12:33 am / python, withstatement, xml, xmlwitch

There are two meanings to XHTML: technical and marketing. The technical kind (XHTML served using the application/xhtml xml MIME type) is a formulation of HTML as an XML vocabulary. The marketing kind (XHTML served using the text/html MIME type) is processed just like HTML by browsers but the authors attempt to observe slightly different syntax rules in order to make it seem that they are doing something newer and shinier compared to HTML.

— Henri Sivonen

# 6th July 2009, 12:46 pm / buzzwords, henri-sivonen, xhtml, xml

With YQL Execute, the Internet becomes your database. This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.

# 29th April 2009, 10:50 pm / apis, e4x, javascript, json, jsonp, sql, xml, yahoo, yql

A few notes on the Guardian Open Platform

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

[... 839 words]

2:28 pm / 10th March 2009 / apis, atom, contentapi, data, data-journalism, datastore, guardian, javascript, journalism, jquery, json, openplatform, python, simon-rogers, tom-watson, xml

JsonML (JSON Markup Language). An almost non-lossy serialization format for sending XML as JSON (plain text in between elements is ignored). Uses the (element-name, attribute-dictionary, list-of-children) tuple format, which sadly means many common cases end up taking more bytes than the original XML. Still an improvement on serializations that behave differently when a list of children has only one item in it.

# 10th February 2009, 3:03 pm / json, jsonml, serialization, xml

Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.

# 24th January 2009, 11:52 pm / crowbar, dom, gecko, greasemonkey, mozilla, rdf, scraping, webservice, xml, xulrunner

2008

How to install lxml python module on mac os 10.5 (leopard). Instructions that work! Finally, I can find out what all the fuss is about.

# 15th December 2008, 12:05 am / leopard, libxml2, lxml, macos, python, xml

pyquery. “A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document.

# 6th December 2008, 9:53 am / jquery, lxml, pyquery, python, xml

Magnificent Seven—the value of Atom. The seven core things that Atom solves so that you don’t have to.

# 19th October 2008, 10:24 pm / atom, bill-de-hora, rest, xml

cascadenik: cascading sheets of style for mapnik. Great idea. Mapnik (the open source tile rendering system used by OpenStreetMap and others) has a complex style configuration based on XML. Michal Migurski has build a CSS-style equivalent which compiles down to XML, hopefully making it much quicker and easier to get started with Mapnik customisation.

# 30th August 2008, 10:04 am / cascadenik, css, mapnik, mapping, michal-migurski, openstreetmap, xml

Tip: Configure SAX parsers for secure processing. Explains the billion laughs attack, among others.

# 23rd August 2008, 11:12 am / billionlaughs, elliotte-rusty-harold, sax, security, xml

DoS vulnerability in REXML. Ruby’s REXML library is susceptible to the “billion laughs” denial of service attack where recursively nested entities expand a single entitity reference to a billion characters (kind of like the exploding zip file attack). Rails applications that process user-supplied XML should apply the monkey-patch ASAP; a proper gem update is forthcoming.

# 23rd August 2008, 11:11 am / billionlaughs, denial-of-service, rails, rexml, ruby, security, xml

page 1 / 4 next » last »»

Simon Willison’s Weblog