Simon Willison’s Weblog

Subscribe

Friday, 2nd February 2024

ChunkViz (via) Handy tool by Greg Kamradt to help understand how different text chunking mechanisms work by visualizing them. Chunking is an important part of preparing text to be embedded for semantic search, and thanks to this tool I’ve finally got a solid mental model of what recursive character text splitting does.

# 2:23 am / ai, embeddings

unstructured. Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.

I got some good initial results against a PDF by running “pip install ’unstructured[pdf]’” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.

There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.

# 2:47 am / ocr, pdf, python

For many people in many organizations, their measurable output is words - words in emails, in reports, in presentations. We use words as proxy for many things: the number of words is an indicator of effort, the quality of the words is an indicator of intelligence, the degree to which the words are error-free is an indicator of care.

[...] But now every employee with Copilot can produce work that checks all the boxes of a formal report without necessarily representing underlying effort.

Ethan Mollick

# 3:34 am / ethics, ai, generative-ai, llms, ethan-mollick

Samattical (via) Automattic (the company behind WordPress) have a benefit that’s provided to all 1,900+ of their employees: a paid three month sabbatical every five years.

CEO Matt Mullenweg is taking advantage of this for the first time, and here shares an Ignite talk in which he talks about the way the benefit encourages the company to plan for 5% of the company to be unavailable at any one time, helping avoid any single employee becoming a bottleneck.

# 3:42 am / automattic, matt-mullenweg

LLMs may offer immense value to society. But that does not warrant the violation of copyright law or its underpinning principles. We do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process. There is compelling evidence that the UK benefits economically, politically and societally from upholding a globally respected copyright regime.

UK House of Lords report on Generative AI

# 3:54 am / ethics, politics, ai, generative-ai, llms

Open Language Models (OLMos) and the LLM landscape (via) OLMo is a newly released LLM from the Allen Institute for AI (AI2) currently available in 7b and 1b parameters (OLMo-65b is on the way) and trained on a fully openly published dataset called Dolma.

The model and code are Apache 2, while the data is under the “AI2 ImpACT license”.

From the benchmark scores shared here by Nathan Lambert it looks like this may be the highest performing model currently available that was built using a fully documented training set.

What’s in Dolma? It’s mainly Common Crawl, Wikipedia, Project Gutenberg and the Stack.

# 4:11 am / ai, generative-ai, llms, training-data