Simon Willison's Weblog: Blogmarkshttp://simonwillison.net/2024-03-28T05:37:31+00:00Simon WillisonMerge pull request #1757 from simonw/heic-heif2024-03-28T05:37:31+00:002024-03-28T05:37:31+00:00https://simonwillison.net/2024/Mar/28/merge-pull-request/#atom-blogmarks<p><a href="https://github.com/gchq/CyberChef/commit/674c8c7c87eff167f03ee42c998c7fff18da4fa3">Merge pull request #1757 from simonw/heic-heif</a></p>
<p>I got a PR into GCHQ's CyberChef this morning! I added support for detecting heic/heif files to the Forensics -> Detect File Type tool.</p>
<p>The change was landed by the delightfully mysterious a3957273.</p>
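<p>The underlying technique is magic-byte sniffing. HEIC/HEIF files are ISO Base Media Format containers, so detection means checking for "ftyp" at byte offset 4 followed by a recognized brand. Here's a minimal Python sketch of that idea - the actual PR is JavaScript inside CyberChef, and this brand list is illustrative rather than exhaustive:</p>

```python
# Hypothetical sketch of HEIC/HEIF magic-byte detection. The brand list
# below is illustrative - the real PR's list may differ.
HEIF_BRANDS = {b"heic", b"heix", b"heif", b"mif1", b"msf1"}

def looks_like_heif(data: bytes) -> bool:
    # ISO BMFF files start with a 4-byte box size, then "ftyp", then the brand
    return len(data) >= 12 and data[4:8] == b"ftyp" and data[8:12] in HEIF_BRANDS

print(looks_like_heif(b"\x00\x00\x00\x18ftypheic" + b"\x00" * 8))  # True
```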
Wrap text at specified width2024-03-28T03:36:01+00:002024-03-28T03:36:01+00:00https://simonwillison.net/2024/Mar/28/wrap-text-at-specified-width/#atom-blogmarks<p><a href="https://observablehq.com/@simonw/wrap-text-at-specified-width">Wrap text at specified width</a></p>
<p>New Observable notebook. I built this with the help of Claude 3 Opus - it's a text wrapping tool which lets you set the width and also lets you optionally add a four space indent.</p>
<p>The four space indent is handy for posting on forums such as Hacker News that treat a four space indent as a code block.</p>
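<p>The notebook itself is JavaScript, but the same tool is a few lines of Python using the standard library's textwrap module - a sketch with parameter names of my own invention:</p>

```python
import textwrap

# Stdlib sketch of the notebook's behaviour: wrap each paragraph to a given
# width, optionally prefixing every line with four spaces for a code block.
def wrap(text, width=80, code_indent=False):
    prefix = "    " if code_indent else ""
    return "\n\n".join(
        textwrap.fill(paragraph, width=width,
                      initial_indent=prefix, subsequent_indent=prefix)
        for paragraph in text.split("\n\n")
    )

print(wrap("one two three four five", width=10))
```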
llm-gemini 0.1a12024-03-28T03:32:15+00:002024-03-28T03:32:15+00:00https://simonwillison.net/2024/Mar/28/llm-gemini-01a1/#atom-blogmarks<p><a href="https://github.com/simonw/llm-gemini/releases/tag/0.1a1">llm-gemini 0.1a1</a></p>
<p>I upgraded my llm-gemini plugin to add support for the new Google Gemini Pro 1.5 model, which is beginning to roll out in early access.</p>
<p>The 1.5 model supports 1,048,576 input tokens and generates up to 8,192 output tokens - a big step up from Gemini 1.0 Pro which handled 30,720 and 2,048 respectively.</p>
<p>The big missing feature from my LLM tool at the moment is image input - a fantastic way to take advantage of that huge context window. I have a branch for this which I really need to get into a useful state.</p>
“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time2024-03-27T16:58:20+00:002024-03-27T16:58:20+00:00https://simonwillison.net/2024/Mar/27/the-king-is-dead/#atom-blogmarks<p><a href="https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/">“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time</a></p>
<p>I'm quoted in this piece by Benj Edwards for Ars Technica:</p>
<p>"For the first time, the best available models—Opus for advanced tasks, Haiku for cost and efficiency—are from a vendor that isn't OpenAI. That's reassuring—we all benefit from a diversity of top vendors in this space. But GPT-4 is over a year old at this point, and it took that year for anyone else to catch up."</p>
Annotated DBRX system prompt2024-03-27T15:33:17+00:002024-03-27T15:33:17+00:00https://simonwillison.net/2024/Mar/27/the-dbrx-system-prompt/#atom-blogmarks<p><a href="https://huggingface.co/spaces/databricks/dbrx-instruct/blob/73f0fe25ed8eeb14ee2279b2ecff15dbd863d63d/app.py#L109-L134">Annotated DBRX system prompt</a></p>
<p>DBRX is an exciting new openly licensed LLM released today by Databricks.</p>
<p>They haven't (yet) disclosed what was in the training data for it.</p>
<p>The source code for their Instruct demo has an annotated version of a system prompt, which includes this:</p>
<p>"You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data. You do not provide song lyrics, poems, or news articles and instead refer the user to find them online or in a store."</p>
<p>The comment that precedes that text is illuminating:</p>
<p>"The following is likely not entirely accurate, but the model tends to think that everything it knows about was in its training data, which it was not (sometimes only references were). So this produces more accurate answers when the model is asked to introspect"</p>
<p>Via <a href="https://twitter.com/natolambert/status/1773022947734589769">Nathan Lambert</a></p>
gchq.github.io/CyberChef2024-03-26T17:08:34+00:002024-03-26T17:08:34+00:00https://simonwillison.net/2024/Mar/26/cyberchef/#atom-blogmarks<p><a href="https://gchq.github.io/CyberChef/">gchq.github.io/CyberChef</a></p>
<p>CyberChef is "the Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis" - entirely client-side JavaScript with dozens of useful tools for working with different formats and encodings.</p>
<p>It's maintained and released by GCHQ - the UK government's signals intelligence security agency.</p>
<p>I didn't know GCHQ had a presence on GitHub, and I find the URL to this tool absolutely delightful. They first released it back in 2016 and it has over 3,700 commits.</p>
<p>The top maintainers also have suitably anonymous usernames - great work, n1474335, j433866, d98762625 and n1073645.</p>
<p>Via <a href="https://mastodon.social/@Jermolene/112161646718885929">Jeremy Ruston</a></p>
GGML GGUF File Format Vulnerabilities2024-03-26T06:47:17+00:002024-03-26T06:47:17+00:00https://simonwillison.net/2024/Mar/26/ggml-gguf-file-format-vulnerabilities/#atom-blogmarks<p><a href="https://www.databricks.com/blog/ggml-gguf-file-format-vulnerabilities">GGML GGUF File Format Vulnerabilities</a></p>
<p>The GGML and GGUF formats are used by llama.cpp to package and distribute model weights.</p>
<p>Neil Archibald: "The GGML library performs insufficient validation on the input file and, therefore, contains a selection of potentially exploitable memory corruption vulnerabilities during parsing."</p>
<p>These vulnerabilities were shared with the library authors on 23rd January and patches landed on the 29th.</p>
<p>If you have a llama.cpp or llama-cpp-python installation that's more than a month old you should upgrade ASAP.</p>
Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets2024-03-26T06:19:30+00:002024-03-26T06:19:30+00:00https://simonwillison.net/2024/Mar/26/cohere-int8-binary-embeddings/#atom-blogmarks<p><a href="https://txt.cohere.com/int8-binary-embeddings/">Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets</a></p>
<p>Jo Kristian Bergum told me "The accuracy retention [of binary embedding vectors] is sensitive to whether the model has been using this binarization as part of the loss function."</p>
<p>Cohere provide an API for embeddings, and last week added support for returning binary vectors specifically tuned in this way.</p>
<p>250M embeddings (Cohere provide a downloadable dataset of 250M embedded documents from Wikipedia) at float32 (4 bytes per dimension) is 954GB.</p>
<p>Cohere claim that reducing to 1 bit per dimension knocks that down to 30 GB (954/32) while keeping "90-98% of the original search quality".</p>
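<p>The arithmetic checks out if you assume 1024-dimensional embeddings (my assumption - it's the figure that makes 250M float32 vectors come out at ~954GB):</p>

```python
# Back-of-the-envelope check of Cohere's storage numbers.
# The 1024-dimension figure is an assumption, not from the post.
n_docs = 250_000_000
dims = 1024
float32_bytes = n_docs * dims * 4   # 4 bytes per dimension
binary_bytes = float32_bytes // 32  # 1 bit instead of 32 bits per dimension
gib = 1024 ** 3

print(round(float32_bytes / gib))  # 954
print(round(binary_bytes / gib))   # 30
```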
<p>Via <a href="https://twitter.com/jobergum/status/1772507515076415803">@jobergum</a></p>
My binary vector search is better than your FP32 vectors2024-03-26T04:56:25+00:002024-03-26T04:56:25+00:00https://simonwillison.net/2024/Mar/26/binary-vector-search/#atom-blogmarks<p><a href="https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors">My binary vector search is better than your FP32 vectors</a></p>
<p>I'm still trying to get my head around this, but here's what I understand so far.</p>
<p>Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this:</p>
<p>[0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...] - 1536 elements long</p>
<p>Different embedding models produce different lengths, but they tend to be several hundred to a few thousand numbers long. At 32 bits (4 bytes) per float, that adds up to a lot of memory if you have millions of embedding vectors to compare.</p>
<p>If you look at those numbers you'll note that they are all pretty small positive or negative numbers, close to 0.</p>
<p>Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector - just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise.</p>
<p>For the above example, this would start [1, 1, 0, 0, 0...]</p>
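<p>The binarization step is a one-liner - here it is against the example values from above:</p>

```python
# Sign binarization: 1 if the float is positive, 0 otherwise.
# Example values shortened from the vector quoted above.
floats = [0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188]

def binarize(vector):
    return [1 if value > 0 else 0 for value in vector]

print(binarize(floats))  # [1, 1, 0, 0, 0]
```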
<p>Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantically relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results!</p>
<p>Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.</p>
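<p>Here's a dependency-free Python sketch of that two-pass idea - my illustration, not Ce Gao's code. A real implementation would pack the bits into integers and use popcount for the Hamming distances, and would precompute the binary corpus rather than binarizing on every query:</p>

```python
import math

def binarize(vector):
    return [1 if v > 0 else 0 for v in vector]

def hamming(a, b):
    # Distance between binary vectors: count of differing positions
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def two_pass_search(query, corpus, shortlist=200, top_k=10):
    # Cheap first pass: brute-force Hamming distance over the binary vectors
    query_bits = binarize(query)
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: hamming(query_bits, binarize(corpus[i])),
    )[:shortlist]
    # Expensive second pass: re-rank the shortlist with the full float vectors
    return sorted(
        candidates,
        key=lambda i: cosine_similarity(query, corpus[i]),
        reverse=True,
    )[:top_k]
```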
Semgrep: AutoFixes using LLMs2024-03-26T00:51:37+00:002024-03-26T00:51:37+00:00https://simonwillison.net/2024/Mar/26/semgrep-autofixes-using-llms/#atom-blogmarks<p><a href="https://choly.ca/post/semgrep-autofix-llm/">Semgrep: AutoFixes using LLMs</a></p>
<p>semgrep is a really neat tool for semantic grep against source code - you can give it a pattern like "log.$A(...)" to match all forms of log.warning(...) / log.error(...) etc.</p>
<p>Ilia Choly built semgrepx - xargs for semgrep - and here shows how it can be used along with my llm CLI tool to execute code replacements against matches by passing them through an LLM such as Claude 3 Opus.</p>
<p>Via <a href="https://lobste.rs/s/qtokfw/semgrep_autofixes_using_llms">lobste.rs</a></p>
sqlite-schema-diagram.sql2024-03-25T05:12:47+00:002024-03-25T05:12:47+00:00https://simonwillison.net/2024/Mar/25/sqlite-schema-diagramsql/#atom-blogmarks<p><a href="https://gitlab.com/Screwtapello/sqlite-schema-diagram/-/blob/main/sqlite-schema-diagram.sql">sqlite-schema-diagram.sql</a></p>
<p>A SQLite SQL query that directly returns a GraphViz definition that renders a diagram of the database schema, by Tim Allen.</p>
<p>The SQL is beautifully commented. It works as a big set of UNION ALL statements against queries that join data from pragma_table_list(), pragma_table_info() and pragma_foreign_key_list().</p>
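<p>Those pragma table-valued functions are easy to play with from Python's sqlite3 module. A quick demo of the two that describe columns and foreign keys (pragma_table_list() needs SQLite 3.37+, so this sticks to the other two):</p>

```python
import sqlite3

# Hypothetical two-table schema to exercise the pragma functions against
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES author (id)
);
""")

# Column names and types for a table
columns = conn.execute(
    "SELECT name, type FROM pragma_table_info('book')"
).fetchall()
print(columns)  # [('id', 'INTEGER'), ('title', 'TEXT'), ('author_id', 'INTEGER')]

# Foreign keys out of a table - these become the arrows in the diagram
fks = conn.execute(
    "SELECT \"table\", \"from\", \"to\" FROM pragma_foreign_key_list('book')"
).fetchall()
print(fks)  # [('author', 'author_id', 'id')]
```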
<p>Via <a href="https://news.ycombinator.com/item?id=39798115">Hacker News</a></p>
Reviving PyMiniRacer2024-03-24T17:00:55+00:002024-03-24T17:00:55+00:00https://simonwillison.net/2024/Mar/24/reviving-pyminiracer/#atom-blogmarks<p><a href="https://bpcreech.com/post/mini-racer/">Reviving PyMiniRacer</a></p>
<p>PyMiniRacer is "a V8 bridge in Python" - it's a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.</p>
<p>It was originally released in 2016 by Sqreen, a web app security startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.</p>
<p>I'm always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can't access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.</p>
<p>The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes - they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.</p>
<p>I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.</p>
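<p>The same ctypes pattern in miniature, shown here against libc rather than PyMiniRacer's custom V8 wrapper: load a shared library, declare the C signature, call the function from Python.</p>

```python
import ctypes
import ctypes.util

# Load the C standard library (stand-in for PyMiniRacer's V8 wrapper .so)
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```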
<p>This blog post is fun too: it's a good, detailed description of the process of updating something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive - the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla's sccache to cache compilation steps, retrying until the build finally completes.</p>
<p>On macOS (Apple Silicon) installing the package with "pip install mini-racer" got me a 37MB dylib and a 17KB ctypes wrapper module.</p>
<p>Via <a href="https://news.ycombinator.com/item?id=39754885">Hacker News</a></p>
shelmet2024-03-24T04:37:52+00:002024-03-24T04:37:52+00:00https://simonwillison.net/2024/Mar/24/shelmet/#atom-blogmarks<p><a href="https://shelmet.readthedocs.io/en/latest/">shelmet</a></p>
<p>This looks like a pleasant ergonomic alternative to Python's subprocess module, plus a whole bunch of other useful utilities. Lets you do things like this:</p>
<p>sh.cmd("ps", "aux").pipe("grep", "-i", check=False).run("search term")</p>
<p>I like the way it uses context managers as well: 'with sh.environ({"KEY1": "val1"})' sets new environment variables for the duration of the block, 'with sh.cd("path/to/dir")' temporarily changes the working directory and 'with sh.atomicfile("file.txt") as fp' lets you write to a temporary file that will be atomically renamed when the block finishes.</p>
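<p>The environ() one is easy to sketch with the standard library's contextlib - this is my reimplementation of the idea, not shelmet's own code:</p>

```python
import contextlib
import os

# Stdlib sketch of shelmet's environ() idea: set variables on entry,
# restore the previous state (including absence) on exit.
@contextlib.contextmanager
def environ(**overrides):
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

with environ(SHELMET_DEMO_KEY="val1"):
    print(os.environ["SHELMET_DEMO_KEY"])  # val1
```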
<p>Via <a href="https://micro.webology.dev/2024/03/23/on-scratching-itches.html">Jeff Triplett</a></p>
Strachey love letter algorithm2024-03-23T21:55:59+00:002024-03-23T21:55:59+00:00https://simonwillison.net/2024/Mar/23/strachey-love-letter-algorithm/#atom-blogmarks<p><a href="https://en.wikipedia.org/wiki/Strachey_love_letter_algorithm">Strachey love letter algorithm</a></p>
<p>This is a beautiful piece of computer history. In 1952, Christopher Strachey - a contemporary of Alan Turing - wrote a love letter generation program for a Manchester Mark 1 computer. It produced output like this:</p>
<p>"Darling Sweetheart,</p>
<p>You are my avid fellow feeling. My affection curiously clings to your passionate wish. My liking yearns for your heart. You are my wistful sympathy: my tender liking.</p>
<p>Yours beautifully</p>
<p>M. U. C."</p>
<p>The algorithm simply combined a small set of predefined sentence structures, filled in with random adjectives.</p>
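<p>A toy generator in the same spirit - the word lists here are lifted from the sample letter above, but the templates are my invention, not Strachey's originals:</p>

```python
import random

# Toy sketch in the spirit of Strachey's generator. Word lists come from
# the sample letter; the sentence templates are invented for illustration.
ADJECTIVES = ["avid", "wistful", "tender", "passionate", "curious"]
NOUNS = ["fellow feeling", "affection", "liking", "sympathy", "heart", "wish"]
TEMPLATES = [
    "You are my {adj} {noun}.",
    "My {adj} {noun} clings to your {adj2} {noun2}.",
    "My {noun} yearns for your {noun2}.",
]

def love_letter(rng, sentences=3):
    body = " ".join(
        rng.choice(TEMPLATES).format(
            adj=rng.choice(ADJECTIVES), noun=rng.choice(NOUNS),
            adj2=rng.choice(ADJECTIVES), noun2=rng.choice(NOUNS),
        )
        for _ in range(sentences)
    )
    return f"Darling Sweetheart,\n\n{body}\n\nYours beautifully,\nM. U. C."

print(love_letter(random.Random(1952)))
```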
<p>Wikipedia notes that "Strachey wrote about his interest in how “a rather simple trick” can produce an illusion that the computer is thinking, and that “these tricks can lead to quite unexpected and interesting results”".</p>
<p>LLMs, 1952 edition!</p>
<p>Via <a href="https://twitter.com/grady_booch/status/1771625974322356260">Grady Booch</a></p>
time-machine example test for a segfault in Python2024-03-23T19:44:07+00:002024-03-23T19:44:07+00:00https://simonwillison.net/2024/Mar/23/test-segfault-in-python/#atom-blogmarks<p><a href="https://github.com/adamchainz/time-machine/pull/433/files#diff-92ea7165ddf0128246b9758ee9554b3eccb4eceb3d4719bdea9f5495ebbe10a1R477-R495">time-machine example test for a segfault in Python</a></p>
<p>Here's a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How do you write a unit test that exercises a segfault without crashing the entire test suite?</p>
<p>Adam's solution is a test that does this:</p>
<p>subprocess.run([sys.executable, "-c", code_that_crashes_python], check=True)</p>
<p>sys.executable is the path to the current Python executable - ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise a CalledProcessError if the subprocess exits with a non-zero return code.</p>
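<p>A standalone sketch of the same pattern (not Adam's exact test): deliberately crash a child interpreter and observe the failure from the parent process, which survives unharmed.</p>

```python
import subprocess
import sys

# ctypes.string_at(0) dereferences a NULL pointer, which segfaults CPython -
# but only in the child process, so the parent can inspect the result.
crashing_code = "import ctypes; ctypes.string_at(0)"
result = subprocess.run([sys.executable, "-c", crashing_code])
print(result.returncode)  # non-zero (-11 on Linux, i.e. killed by SIGSEGV)
```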
<p>I'm absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects.</p>
<p>Via <a href="https://fosstodon.org/@adamchainz/112144774490159195">@adamchainz</a></p>