1,410 posts tagged “ai”
"AI is whatever hasn't been done yet"—Larry Tesler
2024
Box shadow CSS generator (via) Another example of a tiny personal tool I built using Claude 3.5 Sonnet and artifacts. In this case my prompt was:
CSS for a slight box shadow, build me a tool that helps me twiddle settings and preview them and copy and paste out the CSS
I changed my mind half way through typing the prompt and asked it for a custom tool, and it built me this!
Here's the full transcript - in a follow-up prompt I asked for help deploying it and it rewrote the tool to use <script type="text/babel"> and the babel-standalone library to add React JSX support directly in the browser - a bit of a hefty dependency (387KB compressed / 2.79MB total) but I think acceptable for this kind of one-off tool.
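That pattern is worth seeing spelled out: babel-standalone compiles JSX in the browser at load time, so a single HTML file is enough. A minimal sketch (not Claude's actual output - the CDN URLs and component are illustrative):

```html
<!-- Minimal sketch: React + JSX in one HTML file via babel-standalone, no build step -->
<div id="root"></div>
<script src="https://unpkg.com/react@18/umd/react.production.min.js"></script>
<script src="https://unpkg.com/react-dom@18/umd/react-dom.production.min.js"></script>
<script src="https://unpkg.com/@babel/standalone/babel.min.js"></script>
<script type="text/babel">
  // Babel compiles this JSX in the browser before it runs
  function ShadowPreview() {
    const [blur, setBlur] = React.useState(10);
    const css = `0 4px ${blur}px rgba(0, 0, 0, 0.3)`;
    return (
      <div>
        <input type="range" min="0" max="50" value={blur}
               onChange={(e) => setBlur(e.target.value)} />
        <div style={{ boxShadow: css, padding: 40 }}>box-shadow: {css}</div>
      </div>
    );
  }
  ReactDOM.createRoot(document.getElementById("root")).render(<ShadowPreview />);
</script>
```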
Being able to knock out tiny custom tools like this on a whim is a really interesting new capability. It's also a lot of fun!
Voters in the Clapham and Brixton Hill constituency can rest easy - despite appearances, their Reform candidate Mark Matlock really does exist. [...] Matlock - based in the South Cotswolds, some 100 miles from the constituency in which he is standing - confirmed: "I am a real person." Although his campaign image is AI-generated, he said this was for lack of a real photo of him wearing a tie in Reform's trademark turquoise.
Home-Cooked Software and Barefoot Developers. I really enjoyed this talk by Maggie Appleton from this year's Local-first Conference in Berlin.
For the last ~year I've been keeping a close eye on how language model capabilities meaningfully change the speed, ease, and accessibility of software development. The slightly bold theory I put forward in this talk is that we're on the verge of a golden age of local, home-cooked software and a new kind of developer – what I've called the barefoot developer.
It's a great talk, and the design of the slides is outstanding.
It reminded me of Robin Sloan's An app can be a home-cooked meal, which Maggie references in the talk. Also relevant: this delightful recent Hacker News thread, Ask HN: Is there any software you only made for your own use but nobody else?
My favourite version of our weird new LLM future is one where the pool of people who can use computers to automate things in their life is massively expanded.
The other videos from the conference are worth checking out too.
The expansion of the jagged frontier of AI capability is subtle and requires a lot of experience with various models to understand what they can, and can’t, do. That is why I suggest that people and organizations keep an “impossibility list” - things that their experiments have shown that AI can definitely not do today but which it can almost do. For example, no AI can create a satisfying puzzle or mystery for you to solve, but they are getting closer. When AI models are updated, test them on your impossibility list to see if they can now do these impossible tasks.
Exorcising us of the Primer (via) Andy Matuschak talks about the need for educational technologists to break free from the siren's call of "The Young Lady’s Illustrated Primer" - the universal interactive textbook described by Neal Stephenson in his novel The Diamond Age.
The Primer offers an incredibly compelling vision, and Andy uses fifteen years of his own experience exploring related ideas to pick it apart and highlight its flaws.
I want to exorcise myself of the Primer. I want to clearly delineate what makes its vision so compelling—what I want to carry in my heart as a creative fuel. But I also want to sharply clarify the lessons we shouldn’t take from the Primer, and what it simply ignores. Then I want to reconstitute all that into something new, a vision I can use to drive my work forward.
On the Primer's authoritarianism:
The Primer has an agenda. It is designed to instill a set of values and ideas, and while it’s supportive of Nell’s curiosities, those are “side quests” to its central structure. Each of the twelve “Lands Beyond” focuses on different topics, but they’re not specific to Nell, and Nell didn’t choose them. In fact, Nell doesn’t even know the Primer’s goals for her—she’s never told. Its goals are its own privileged secret. Nell is manipulated so completely by the Primer, for so much of her life, that it’s hard to determine whether she has meaningful goals or values, other than those the Primer’s creators have deemed “good for her”.
I'm also reminded of Stephenson's piece of advice to people who may have missed an important lesson from the novel:
Kids need to get answers from humans who love them.
If you own the tracks between San Francisco and Los Angeles, you likely have some kind of monopolistic pricing power, because there can only be so many tracks laid between place A and place B. In the case of GPU data centers, there is much less pricing power. GPU computing is increasingly turning into a commodity, metered per hour. Unlike the CPU cloud, which became an oligopoly, new entrants building dedicated AI clouds continue to flood the market. Without a monopoly or oligopoly, high fixed cost + low marginal cost businesses almost always see prices competed down to marginal cost (e.g., airlines).
Chrome Prompt Playground.
Google Chrome Canary is currently shipping an experimental on-device LLM, in the form of Gemini Nano. You can access it via the new window.ai API, after first enabling the "Prompt API for Gemini Nano" experiment in chrome://flags (and then waiting an indeterminate amount of time for the ~1.7GB model file to download - I eventually spotted it in ~/Library/Application Support/Google/Chrome Canary/OptGuideOnDeviceModel).
I got Claude 3.5 Sonnet to build me this playground interface for experimenting with the model. You can execute prompts, stream the responses, and all previous prompts and responses are stored in localStorage.
Here's the full Sonnet transcript, and the final source code for the app.
The best documentation I've found for the new API is explainers-by-googlers/prompt-api on GitHub.
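For reference, here's roughly what driving that API looked like at the time - a minimal sketch assuming the createTextSession() / promptStreaming() shape from that explainer (the API surface has changed in later Chrome releases):

```javascript
// Sketch of the experimental Prompt API as it appeared in Chrome Canary then -
// method names follow explainers-by-googlers/prompt-api and may differ today.
async function runPrompt(promptText, outputElement) {
  // "readily" meant Gemini Nano had already been downloaded
  if ((await window.ai.canCreateTextSession()) !== "readily") return;

  const session = await window.ai.createTextSession();

  // One-shot prompt: resolves with the complete response
  console.log(await session.prompt("Reply with one word: ready?"));

  // Streaming: at the time each chunk was the full response so far, not a delta
  const reader = session.promptStreaming(promptText).getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    outputElement.textContent = value;
  }
}
```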
gemma-2-27b-it-llamafile (via) Justine Tunney shipped llamafile packages of Google's new openly licensed (though definitely not open source) Gemma 2 27b model this morning.
I downloaded the gemma-2-27b-it.Q5_1.llamafile version (20.5GB) to my Mac, ran chmod 755 gemma-2-27b-it.Q5_1.llamafile and then ./gemma-2-27b-it.Q5_1.llamafile, and now I'm trying it out through the llama.cpp default web UI in my browser. It works great.
It's a very capable model - currently sitting at position 12 on the LMSYS Arena making it the highest ranked open weights model - one position ahead of Llama-3-70b-Instruct and within striking distance of the GPT-4 class models.
The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images.
A deep dive into embeddings from Max Woolf, exploring 1,000 different Pokémon (loaded from PokéAPI using this epic GraphQL query) and then embedding the cleaned up JSON data using nomic-embed-text-v1.5 and the official Pokémon image representations using nomic-embed-vision-v1.5.
I hadn't seen nomic-embed-vision-v1.5 before: it brings multimodality to Nomic embeddings and operates in the same embedding space as nomic-embed-text-v1.5, which means you can use it to perform CLIP-style tricks comparing text and images. Here's their announcement from June 5th:
Together, Nomic Embed is the only unified embedding space that outperforms OpenAI CLIP and OpenAI Text Embedding 3 Small on multimodal and text tasks respectively.
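Those CLIP-style tricks come down to cosine similarity between vectors that share a space. A minimal sketch, where textEmbedding and imageEmbeddings stand in for outputs from the two Nomic models:

```javascript
// Because the text and vision models share an embedding space, a text vector
// can be compared directly against image vectors. textEmbedding and
// imageEmbeddings are placeholders for the models' outputs.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank images by how well they match a text query
const ranked = imageEmbeddings
  .map((vector, index) => ({ index, score: cosineSimilarity(textEmbedding, vector) }))
  .sort((a, b) => b.score - a.score);
```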
Sadly the new vision weights are available under a non-commercial Creative Commons license (unlike the text weights which are Apache 2), so if you want to use the vision weights commercially you'll need to access them via Nomic's paid API.
Nomic do say this though:
As Nomic releases future models, we intend to re-license less recent models in our catalogue under the Apache-2.0 license.
Update 17th January 2025: Nomic Embed Vision 1.5 is now Apache 2.0 licensed.
Absolutely any time I try to explore something even slightly against commonly accepted beliefs, LLMs always just rehash the commonly accepted beliefs.
As a researcher, I find this behaviour worse than unhelpful. It gives the mistaken impression that there's nothing to explore.
We argued that ChatGPT is not designed to produce true utterances; rather, it is designed to produce text which is indistinguishable from the text produced by humans. It is aimed at being convincing rather than accurate. The basic architecture of these models reveals this: they are designed to come up with a likely continuation of a string of text. It’s reasonable to assume that one way of being a likely continuation of a text is by being true; if humans are roughly more accurate than chance, true sentences will be more likely than false ones. This might make the chatbot more accurate than chance, but it does not give the chatbot any intention to convey truths. This is similar to standard cases of human bullshitters, who don’t care whether their utterances are true; good bullshit often contains some degree of truth, that’s part of what makes it convincing.
Accidental GPT-4o voice preview (via) Reddit user RozziTheCreator was one of a small group who were accidentally granted access to the new multimodal GPT-4o audio voice feature. They captured this video of it telling them a spooky story, complete with thunder sound effects added to the background, in a very realistic voice that clearly wasn't the one from the 4o demo that sounded similar to Scarlett Johansson.
OpenAI provided a comment for this Tom's Guide story confirming the accidental rollout so I don't think this is a faked video.
Open challenges for AI engineering
I gave the opening keynote at the AI Engineer World’s Fair yesterday. I was a late addition to the schedule: OpenAI pulled out of their slot at the last minute, and I was invited to put together a 20 minute talk with just under 24 hours notice!
[... 5,640 words]
picopilot (via) Kyle Carberry's "GitHub Copilot in 70 lines of JavaScript". The title is a little hyperbolic, but the code itself really does implement an OpenAI-powered Visual Studio Code text completion extension in 71 lines of code. This is an excellent example for learning what a minimal VS Code extension looks like.
Here's the system prompt it uses:
You provide code completion results given a prefix and suffix. Respond with a JSON object with the key 'completion' containing a suggestion to place between the prefix and suffix. Follow existing code styles. Listen to comments at the end of the prefix. The language is "{language}".
Then it passes the prefix and suffix as two user messages, and uses the "response_format": {"type": "json_object"} option to enforce JSON output from the GPT-4o API.
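The core request boils down to a single Chat Completions call. Here's a sketch of that shape - simplified, not the extension's exact code; systemPrompt, prefix and suffix stand in for the values described above:

```javascript
// Sketch of the OpenAI request described above - simplified, not picopilot's
// exact code. prefix/suffix are the text around the cursor; systemPrompt is
// the prompt quoted earlier.
async function complete(systemPrompt, prefix, suffix, apiKey) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      response_format: { type: "json_object" }, // guarantees parseable JSON
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: prefix }, // code before the cursor
        { role: "user", content: suffix }, // code after the cursor
      ],
    }),
  });
  const data = await response.json();
  // The system prompt asks for a {"completion": "..."} object
  return JSON.parse(data.choices[0].message.content).completion;
}
```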
The feature this is missing is the thing that makes GitHub Copilot so impressive: Copilot does a whole bunch of clever tricks to find snippets of relevant code from the current and other nearby files and includes them with the prompt, resulting in much higher quality completions.
Claude Projects. New Claude feature, quietly launched this morning for Claude Pro users. Looks like their version of OpenAI's GPTs, designed to take advantage of Claude's 200,000 token context limit:
You can upload relevant documents, text, code, or other files to a project’s knowledge base, which Claude will use to better understand the context and background for your individual chats within that project. Each project includes a 200K context window, the equivalent of a 500-page book, so users can add all of the insights needed to enhance Claude’s effectiveness.
You can also set custom instructions, which presumably get added to the system prompt.
I tried dropping in all of Datasette's existing documentation - 693KB of .rst files (which I had to rename to .rst.txt for it to let me upload them) - and it worked and showed "63% of knowledge size used".
This is a slightly different approach from OpenAI, where the GPT knowledge feature supports attaching up to 20 files each with up to 2 million tokens, which get ingested into a vector database (likely Qdrant) and used for RAG.
It looks like Claude instead handles a smaller amount of extra knowledge but pastes the whole thing into the context window, which avoids some of the weirdness around semantic search chunking but greatly limits the size of the data.
My big frustration with the knowledge feature in GPTs remains the lack of documentation on what it's actually doing under the hood. Without that it's difficult to make informed decisions about how to use it - with Claude Projects I can at least develop a robust understanding of what the tool is doing for me and how best to put it to work.
No equivalent (yet) for the GPT actions feature where you can grant GPTs the ability to make API calls out to external systems.
Listen to the AI-generated ripoff songs that got Udio and Suno sued. Jason Koebler reports on the lawsuit filed today by the RIAA against Udio and Suno, the two leading generative music startups.
The lawsuit includes examples of prompts that the record labels used to recreate famous songs that were almost certainly included in the (undisclosed) training data. Jason collected some of these together into a three minute video, and the result is pretty damning. Arguing "fair use" isn't going to be easy here.
What Apple unveiled last week with Apple Intelligence wasn't so much new products, but new features—a slew of them—for existing products, powered by generative AI.
[...] These aren't new apps or new products. They're the most used, most important apps Apple makes, the core apps that define the Apple platforms ecosystem, and Apple is using generative AI to make them better and more useful—without, in any way, rendering them unfamiliar.
llama.ttf (via) llama.ttf is "a font file which is also a large language model and an inference engine for that model".
You can see it kick into action at 8m28s in this video, where creator Søren Fuglede Jørgensen types "Once upon a time" followed by dozens of exclamation marks, and those exclamation marks then switch out to render a continuation of the story. But... when they paste the text back out of the editor the original exclamation marks are still there - the LLM output exists only in the way the font renders those characters.
The key trick here is that the font renderer library HarfBuzz (used by Firefox, Chrome, Android, GNOME and more) added a new WebAssembly extension in version 8.0 last year, which is powerful enough to run a full LLM based on the tinyllama-15M model - which fits in a 60MB font file.
(Here's a related demo from Valdemar Erk showing Tetris running in a WASM font, at 22m56s in this video.)
The source code for llama.ttf is available on GitHub.
The people who are most confident AI can replace writers are the ones who think writing is typing.
In our “who validates the validators” user studies, we found that people expected—and also desired—for the LLM to learn from any human interaction. That too, “as efficiently as possible” (ie after 1-2 demonstrations, the LLM should “get it”)
Building search-based RAG using Claude, Datasette and Val Town
Retrieval Augmented Generation (RAG) is a technique for adding extra “knowledge” to systems built on LLMs, allowing them to answer questions against custom information not included in their training data. A common way to implement this is to take a question from a user, translate that into a set of search queries, run those against a search engine and then feed the results back into the LLM to generate an answer.
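That loop is easy to sketch - this is a generic illustration of the pattern, not the implementation from the article; askLLM() and searchEngine() are hypothetical stand-ins for your model API and search backend:

```javascript
// Generic search-based RAG loop - illustrative only. askLLM() and
// searchEngine() are hypothetical placeholders.
async function answerWithRAG(question) {
  // 1. Ask the LLM to turn the user's question into search queries
  const queries = JSON.parse(await askLLM(
    `Return a JSON array of search queries that would help answer: ${question}`
  ));

  // 2. Run those queries against the search engine and collect results
  const results = [];
  for (const q of queries) {
    results.push(...(await searchEngine(q)));
  }

  // 3. Feed the results back to the LLM as context for the final answer
  return await askLLM(
    "Answer the question using only this context.\n\nContext:\n" +
    results.map((r) => r.text).join("\n---\n") +
    `\n\nQuestion: ${question}`
  );
}
```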
[... 3,372 words]
OpenAI was founded to build artificial general intelligence safely, free of outside commercial pressures. And now every once in a while it shoots out a new AI firm whose mission is to build artificial general intelligence safely, free of the commercial pressures at OpenAI.
Val Vibes: Semantic search in Val Town. A neat case-study by JP Posma on how Val Town's developers can use Val Town Vals to build prototypes of new features that later make it into Val Town core.
This one explores building out semantic search against Vals using OpenAI embeddings and the PostgreSQL pgvector extension.
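The core of that approach looks something like the sketch below - hedged, with an illustrative embedding model and table name and a placeholder sql client, not Val Town's actual schema:

```javascript
// Embed the query with OpenAI, then let pgvector rank stored embeddings by
// cosine distance. Model, table and column names are illustrative; sql`` is a
// placeholder Postgres client and OPENAI_API_KEY a placeholder secret.
const query = "vals that send a daily email";
const resp = await fetch("https://api.openai.com/v1/embeddings", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${OPENAI_API_KEY}`,
  },
  body: JSON.stringify({ model: "text-embedding-3-small", input: query }),
});
const embedding = (await resp.json()).data[0].embedding;

// pgvector's <=> operator is cosine distance: smaller means more similar
const rows = await sql`
  SELECT val_id, 1 - (embedding <=> ${JSON.stringify(embedding)}::vector) AS similarity
  FROM val_embeddings
  ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
  LIMIT 10
`;
```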
It is in the public good to have AI produce quality and credible (if ‘hallucinations’ can be overcome) output. It is in the public good that there be the creation of original quality, credible, and artistic content. It is not in the public good if quality, credible content is excluded from AI training and output OR if quality, credible content is not created.
llm-claude-3 0.4. LLM plugin release adding support for the new Claude 3.5 Sonnet model:
pipx install llm
llm install -U llm-claude-3
llm keys set claude
# paste API key here
llm -m claude-3.5-sonnet \
'a joke about a pelican and a walrus having lunch'
One of the core constitutional principles that guides our AI model development is privacy. We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. To date we have not used any customer or user-submitted data to train our generative models.
Claude 3.5 Sonnet. Anthropic released a new model this morning, and I think it's likely now the single best available LLM. Claude 3 Opus was already mostly on-par with GPT-4o, and the new 3.5 Sonnet scores higher than Opus on almost all of Anthropic's internal evals.
It's also twice the speed and one fifth of the price of Opus (it's the same price as the previous Claude 3 Sonnet). To compare:
- gpt-4o: $5/million input tokens and $15/million output
- Claude 3.5 Sonnet: $3/million input, $15/million output
- Claude 3 Opus: $15/million input, $75/million output
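To make that concrete with a hypothetical workload: a call using 10,000 input tokens and producing 1,000 output tokens works out to roughly $0.045 on Claude 3.5 Sonnet ($0.03 + $0.015), $0.065 on GPT-4o, and $0.225 on Claude 3 Opus ($0.15 + $0.075) - five times the Sonnet price.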
Similar to Claude 3 Haiku then, which both under-cuts and out-performs OpenAI's GPT-3.5 model.
In addition to the new model, Anthropic also added an "artifacts" feature to their Claude web interface. The most exciting part of this is that any of the Claude models can now build and then render web pages and SPAs, directly in the Claude interface.
This means you can prompt them to e.g. "Build me a web app that teaches me about mandelbrot fractals, with interactive widgets" and they'll do exactly that - I tried that prompt on Claude 3.5 Sonnet earlier and the results were spectacular (video demo).
An unsurprising note at the end of the post:
To complete the Claude 3.5 model family, we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.
If the pricing stays consistent with Claude 3, Claude 3.5 Haiku is going to be a very exciting model indeed.
[...] And then some absolute son of a bitch created ChatGPT, and now look at us. Look at us, resplendent in our pauper's robes, stitched from corpulent greed and breathless credulity, spending half of the planet's engineering efforts to add chatbot support to every application under the sun when half of the industry hasn't worked out how to test database backups regularly.
State-of-the-art music scanning by Soundslice. It's been a while since I checked in on Soundslice, Adrian Holovaty's beautiful web application focused on music education.
The latest feature is spectacular. The Soundslice music editor - already one of the most impressive web applications I've ever experienced - can now import notation directly from scans or photos of sheet music.
The attention to detail is immaculate. The custom machine learning model can handle a wide variety of notation details, and the system asks the user to verify or correct details that it couldn't perfectly determine using a neatly designed flow.
Free accounts can scan two single page documents a month, and paid plans get a much higher allowance. I tried it out just now on a low resolution image I found on Wikipedia and it did a fantastic job, even allowing me to listen to a simulated piano rendition of the music once it had finished processing.
It's worth spending some time with the release notes for the feature to appreciate how much work they've put into improving it since the initial release.
If you're new to Soundslice, here's an example of their core player interface which syncs the display of music notation to an accompanying video.
Adrian wrote up some detailed notes on the machine learning behind the feature when they first launched it in beta back in November 2022.
OMR [Optical Music Recognition] is an inherently hard problem, significantly more difficult than text OCR. For one, music symbols have complex spatial relationships, and mistakes have a tendency to cascade. A single misdetected key signature might result in multiple incorrect note pitches. And there’s a wide diversity of symbols, each with its own behavior and semantics — meaning the problems and subproblems aren’t just hard, there are many of them.
Tags with descriptions. Tiny new feature on my blog: I can now add optional descriptions to my tag pages, for example on datasette and sqlite-utils and prompt-injection.
I built this feature on a live call this morning as an unplanned demonstration of GitHub's new Copilot Workspace feature, where you can run a prompt against a repository and have it plan, implement and file a pull request implementing a change to the code.
My prompt was:
Add a feature that lets me add a description to my tag pages, stored in the database table for tags and visible on the /tags/x/ page at the top
It wasn't as compelling a demo as I expected: Copilot Workspace currently has to stream an entire copy of each file it modifies, which can take a long time if your codebase includes several large files that need to be changed.
It did create a working implementation on its first try, though I had given it an extra tip not to forget the database migration. I ended up making a bunch of changes myself before I shipped it, listed in the pull request.
I've been using Copilot Workspace quite a bit recently as a code explanation tool - I'll prompt it to e.g. "add architecture documentation to the README" on a random repository not owned by me, then read its initial plan to see what it's figured out without going all the way through to the implementation and PR phases. Example in this tweet where I figured out the rough design of the Jina AI Reader API for this post.