Moshi (via) Moshi is "a speech-text foundation model and full-duplex spoken dialogue framework". It's effectively a text-to-text model - like an LLM but you input audio directly to it and it replies with its own audio.
It's fun to play around with, but it's not particularly useful in comparison to other pure text models: I tried to talk to it about California Brown Pelicans and it gave me some very basic hallucinated thoughts about California Condors instead.
It's very easy to run locally, at least on a Mac (and likely on other systems too). I used uv
and got the 8 bit quantized version running as a local web server using this one-liner:
uv run --with moshi_mlx python -m moshi_mlx.local_web -q 8
That downloads ~8.17G of model to a folder in ~/.cache/huggingface/hub/
- or you can use -q 4
and get a 4.81G version instead (albeit even lower quality).
Recent articles
- The last six months in LLMs, illustrated by pelicans on bicycles - 6th June 2025
- Tips on prompting ChatGPT for UK technology secretary Peter Kyle - 3rd June 2025
- How often do LLMs snitch? Recreating Theo's SnitchBench with LLM - 31st May 2025