Archive for Monday, 22nd May 2023

Monday, 22nd May 2023

TIL hexdump and hexdump -C — While exploring null bytes in [this issue](https://github.com/simonw/ttok/issues/3) I learned that the `hexdump` command on macOS (and presumably other Unix systems) has a confusing default output.

22nd May 2023, 6:01 pm
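A minimal sketch of why the default output looks odd, assuming the confusion is the usual one: with no flags, `hexdump` groups input into 16-bit words in host byte order, so on a little-endian machine each pair of bytes appears swapped, while `hexdump -C` prints the bytes in file order. The helper names `words_le` and `canonical` are hypothetical, and the zero padding for odd-length input is an assumption of this sketch.

```python
import struct


def words_le(data: bytes) -> str:
    """Format bytes as 16-bit little-endian words (roughly the default hexdump view)."""
    if len(data) % 2:
        data += b"\x00"  # assumption: pad odd-length input with a zero byte
    words = struct.unpack(f"<{len(data) // 2}H", data)
    return " ".join(f"{w:04x}" for w in words)


def canonical(data: bytes) -> str:
    """Format bytes one at a time in file order (roughly the hexdump -C view)."""
    return " ".join(f"{b:02x}" for b in data)


sample = b"a\x00bc"  # contains a null byte after the "a"
print(words_le(sample))   # 0061 6362
print(canonical(sample))  # 61 00 62 63
```

In the word view the null byte shows up before the `61` it actually follows, which is easy to misread when hunting for null bytes.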

TIL mlc-chat - RedPajama-INCITE-Chat-3B on macOS — MLC (Machine Learning Compilation) on May 22nd 2023: [Bringing Open Large Language Models to Consumer Devices](https://mlc.ai/blog/2023/05/22/bringing-open-large-language-models-to-consumer-devices)

22nd May 2023, 7:04 pm

Introducing speech-to-text, text-to-speech, and more for 1,100+ languages (via). New from Meta AI: Massively Multilingual Speech. “MMS supports speech-to-text and text-to-speech for 1,107 languages and language identification for over 4,000 languages. [...] Some of these, such as the Tatuyo language, have only a few hundred speakers, and for most of these languages, no prior speech technology exists.”

It’s licensed CC-BY-NC 4.0 though, so it’s not available for commercial use.

“In a like-for-like comparison with OpenAI’s Whisper, we found that models trained on the Massively Multilingual Speech data achieve half the word error rate, but Massively Multilingual Speech covers 11 times more languages.”

The training data was mostly sourced from audio Bible translations.

# 7:22 pm / facebook, translation, ai, training-data

MLC: Bringing Open Large Language Models to Consumer Devices (via). “We bring RedPajama, a permissive open language model to WebGPU, iOS, GPUs, and various other platforms.” I managed to get this running on my Mac (see via link) with a few tweaks to their official instructions.

# 7:25 pm / ai, generative-ai, local-llms, llms, mlc, redpajama, webgpu, gpus

MMS Language Coverage in Datasette Lite. I converted the HTML table of 4,021 languages supported by Meta’s new Massively Multilingual Speech models to newline-delimited JSON and loaded it into Datasette Lite. Faceting by Language Family is particularly interesting—the top five families represented are Niger-Congo with 1,019, Austronesian with 609, Sino-Tibetan with 288, Indo-European with 278 and Afro-Asiatic with 222.

# 8:01 pm / facebook, ai, datasette, datasette-lite
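The HTML-table-to-newline-delimited-JSON step could be sketched roughly like this, using only the Python standard library. This is not necessarily how the conversion was done for the post: the `TableRows` class, the `table_to_ndjson` helper, and the sample column names and rows are all hypothetical, and the sketch assumes a single table whose first row is the header.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_ndjson(html: str) -> str:
    """Return one JSON object per line, keyed by the header row's cells."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return "\n".join(
        json.dumps(dict(zip(header, row)), ensure_ascii=False) for row in body
    )


# Hypothetical sample data, not taken from the MMS coverage table.
sample = """
<table>
  <tr><th>Iso Code</th><th>Language Name</th></tr>
  <tr><td>xxa</td><td>Example A</td></tr>
  <tr><td>xxb</td><td>Example B</td></tr>
</table>
"""
print(table_to_ndjson(sample))
```

Each output line is a standalone JSON object, which is what makes the format convenient to load row by row into a tool such as Datasette Lite.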



Simon Willison’s Weblog
