13 items tagged “audio”
2024
Prompt GPT-4o audio. A week and a half ago I built a tool for experimenting with OpenAI's new audio input. I just put together the other side of that, for experimenting with audio output.
Once you've provided an API key (which is saved in localStorage) you can use this to prompt the gpt-4o-audio-preview
model with a system and regular prompt and select a voice for the response.
I built it with assistance from Claude: initial app, adding system prompt support.
You can preview and download the resulting wav
file, and you can also copy out the raw JSON. If you save that in a Gist you can then feed its Gist ID to https://tools.simonwillison.net/gpt-4o-audio-player?gist=GIST_ID_HERE
(Claude transcript) to play it back again.
You can try using that to listen to my French accented pelican description.
There's something really interesting to me here about this form of application which exists entirely as HTML and JavaScript that uses CORS to talk to various APIs. GitHub's Gist API is accessible via CORS too, so it wouldn't take much more work to add a "save" button which writes out a new Gist after prompting for a personal access token. I prototyped that a bit here.
Experimenting with audio input and output for the OpenAI Chat Completion API
OpenAI promised this at DevDay a few weeks ago and now it’s here: their Chat Completion API can now accept audio as input and return it as output. OpenAI still recommend their WebSocket-based Realtime API for audio tasks, but the Chat Completion API is a whole lot easier to write code against.
[... 1,522 words]NotebookLM’s automatically generated podcasts are surprisingly effective
Audio Overview is a fun new feature of Google’s NotebookLM which is getting a lot of attention right now. It generates a one-off custom podcast against content you provide, where two AI hosts start up a “deep dive” discussion about the collected content. These last around ten minutes and are very podcast, with an astonishingly convincing audio back-and-forth conversation.
[... 1,489 words]2023
Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning. Andy Baio reports back on his investigations into the world of AI voice cloning.
This is no longer a niche interest. There’s a Discord with 500,000 members sharing tips and tricks on cloning celebrity voices in order to make their own cover songs, often built with Google Colab using models distributed through Hugging Face.
Andy then makes his own, playing with the concept “What if every Weird Al song was the original, and every other artist was covering his songs instead?”
I particularly enjoyed Madonna’s cover of “Like A Surgeon”, Lady Gaga’s “Perform This Way” and Lorde’s “Foil”.
textra (via) Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!
2010
Audio Sprites (and fixes for iOS). Remy Sharp on the limitations of HTML5 audio support in iOS.
ZOMBO.com in HTML5. Uses SVG (scripted by JavaScript) and the audio element. Finally, Zombo.com comes to the iPad.
Video on the Web—Dive Into HTML5. Everything a web developer needs to know about video containers, video codecs, adio containers, audio codecs, h.264, theora, vorbis, licensing, encoding, batch encoding and the html5 video element.
HTML 5 audio player demo. Scott Andrew’s experiments with the HTML5 audio element (and jQuery)—straight forward and works a treat in Safari, but Firefox doesn’t support MP3. Presumably it’s not too hard to set up a fallback for Ogg.
2009
Codecs for <audio> and <video>. HTML 5 will not be requiring support for specific audio and video codecs—Ian Hickson explains why, in great detail. Short version: Apple won’t implement Theora due to lack of hardware support and an “uncertain patent landscape”, while open source browsers (Chromium and Mozilla) can’t support H.264 due to the cost of the licenses.
Firefox 3.5 for developers. It’s out today, and the feature list is huge. Highlights include HTML 5 drag ’n’ drop, audio and video elements, offline resources, downloadable fonts, text-shadow, CSS transforms with -moz-transform, localStorage, geolocation, web workers, trackpad swipe events, native JSON, cross-site HTTP requests, text API for canvas, defer attribute for the script element and TraceMonkey for better JS performance!
2007
HTML5 Media Support in WebKit. WebKit continues to lead the pack when it comes to trying out new HTML5 proposals. The new audio and video elements make embedding media easy, and provide a neat listener API for hooking in to “playback ended” events.
Audio Fingerprinting for Clean Metadata. Last.fm have started using audio fingerprints to help clean up misspelled artists and duplicate track information.