26 items tagged “text-to-image”
2024
Recraft V3. Recraft are a generative AI design tool startup based out of London who released their v3 model a few weeks ago. It's currently sat at the top of the Artificial Analysis Image Arena Leaderboard, beating Midjourney and Flux 1.1 pro.
The thing that impressed me is that it can generate both raster and vector graphics... and the vector graphics can be exported as SVG!
Here's what I got for raccoon with a sign that says "I love trash"
- SVG here.
That's an editable SVG - when I open it up in Pixelmator I can select and modify the individual paths and shapes:
They also have an API. I spent $1 on 1000 credits and then spent 80 credits (8 cents) making this SVG of a pelican riding a bicycle, using my API key stored in 1Password:
export RECRAFT_API_TOKEN="$(
op item get recraft.ai --fields label=password \
--format json | jq .value -r)"
curl https://external.api.recraft.ai/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $RECRAFT_API_TOKEN" \
-d '{
"prompt": "california brown pelican riding a bicycle",
"style": "vector_illustration",
"model": "recraftv3"
}'
Announcing FLUX1.1 [pro] and the BFL API (via) FLUX is the image generation model family from Black Forest Labs, a startup founded by members of the team that previously created Stable Diffusion.
Released today, FLUX1.1 [pro] continues the general trend of AI models getting both better and more efficient:
FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity.
Black Forest Labs appear to have settled on a potentially workable business model: their smallest, fastest model FLUX.1 [schnell] is Apache 2 licensed. The next step up is FLUX.1 [dev] which is open weights for non-commercial use only. The [pro] models are closed weights, made available exclusively through their API or partnerships with other API providers.
I tried the new 1.1 model out using black-forest-labs/flux-1.1-pro on Replicate just now. Here's my prompt:
Photograph of a Faberge egg representing the California coast. It should be decorated with ornate pelicans and sea lions and a humpback whale.
The FLUX models have a reputation for being really good at following complex prompts. In this case I wanted the sea lions to appear in the egg design rather than looking at the egg from the beach, but I imagine I could get better results if I continued to iterate on my prompt.
The FLUX models are also better at applying text than any other image models I've tried myself.
System prompt for val.town/townie (via) Val Town (previously) provides hosting and a web-based coding environment for Vals - snippets of JavaScript/TypeScript that can run server-side as scripts, on a schedule or hosting a web service.
Townie is Val's new AI bot, providing a conversational chat interface for creating fullstack web apps (with blob or SQLite persistence) as Vals.
In the most recent release of Townie Val added the ability to inspect and edit its system prompt!
I've archived a copy in this Gist, as a snapshot of how Townie works today. It's surprisingly short, relying heavily on the model's existing knowledge of Deno and TypeScript.
I enjoyed the use of "tastefully" in this bit:
Tastefully add a view source link back to the user's val if there's a natural spot for it and it fits in the context of what they're building. You can generate the val source url via import.meta.url.replace("esm.town", "val.town").
The prompt includes a few code samples, like this one demonstrating how to use Val's SQLite package:
import { sqlite } from "https://esm.town/v/stevekrouse/sqlite";
let KEY = new URL(import.meta.url).pathname.split("/").at(-1);
(await sqlite.execute(`select * from ${KEY}_users where id = ?`, [1])).rows[0].id
It also reveals the existence of Val's very own delightfully simple image generation endpoint Val, currently powered by Stable Diffusion XL Lightning on fal.ai.
If you want an AI generated image, use https://maxm-imggenurl.web.val.run/the-description-of-your-image to dynamically generate one.
Here's a fun colorful raccoon with a wildly inappropriate hat.
Val are also running their own gpt-4o-mini proxy, free to users of their platform:
import { OpenAI } from "https://esm.town/v/std/openai";
const openai = new OpenAI();
const completion = await openai.chat.completions.create({
messages: [
{ role: "user", content: "Say hello in a creative way" },
],
model: "gpt-4o-mini",
max_tokens: 30,
});
Val developer JP Posma wrote a lot more about Townie in How we built Townie – an app that generates fullstack apps, describing their prototyping process and revealing that the current model it's using is Claude 3.5 Sonnet.
Their current system prompt was refined over many different versions - initially they were including 50 example Vals at quite a high token cost, but they were able to reduce that down to the linked system prompt which includes condensed documentation and just one templated example.
On being listed in the court document as one of the artists whose work was used to train Midjourney, alongside 4,000 of my closest friends (via) Poignant webcomic from Cat and Girl.
“I want to make my little thing and put it out in the world and hope that sometimes it means something to somebody else.
Without exploiting anyone.
And without being exploited.”
2023
I’ve resigned from my role leading the Audio team at Stability AI, because I don’t agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.
[...] I disagree because one of the factors affecting whether the act of copying is fair use, according to Congress, is “the effect of the use upon the potential market for or value of the copyrighted work”. Today’s generative AI models can clearly be used to create works that compete with the copyrighted works they are trained on. So I don’t see how using copyrighted works to train generative AI models of this nature can be considered fair use.
But setting aside the fair use argument for a moment — since ‘fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong. Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works.
Now add a walrus: Prompt engineering in DALL‑E 3
Last year I wrote about my initial experiments with DALL-E 2, OpenAI’s image generation model. I’ve been having an absurd amount of fun playing with its sequel, DALL-E 3 recently. Here are some notes, including a peek under the hood and some notes on the leaked system prompt.
[... 3,505 words]Midjourney 5.1
Midjourney released version 5.1 of their image generation model on Tuesday. Here’s their announcement on Twitter—if you have a Discord account there’s a more detailed Discord announcement here.
[... 396 words]How I Used Stable Diffusion and Dreambooth to Create A Painted Portrait of My Dog (via) I like posts like this that go into detail in terms of how much work it takes to deliberately get the kind of result you really want using generative AI tools. Jake Dahn trained a Dreambooth model from 40 photos of Queso—his photogenic Golden Retriever—using Replicate, then gathered the prompts from ten images he liked on Lexica and generated over 1,000 different candidate images, picked his favourite, used Draw Things img2img resizing to expand the image beyond the initial crop, then Automatic1111 inpainting to tweak the ears, then Real-ESRGAN 4x+ to upscale for the final print.
My guess is that MidJourney has been doing a massive-scale reinforcement learning from human feedback ("RLHF") - possibly the largest ever for text-to-image.
When human users choose to upscale an image, it's because they prefer it over the alternatives. It'd be a huge waste not to use this as a reward signal - cheap to collect, and exactly aligned with what your user base wants.
The more users you have, the better RLHF you can do. And then the more users you gain.
— Jim Fan
From Deep Learning Foundations to Stable Diffusion. Brand new free online video course from Jeremy Howard: 30 hours of content, covering everything you need to know to implement the Stable Diffusion image generation algorithm from scratch. I previewed parts of this course back in December and it was fascinating: this field is moving so fast that some of the lectures covered papers that had been released just a few days before.
Stable Diffusion copyright lawsuits could be a legal earthquake for AI. Timothy B. Lee provides a thorough discussion of the copyright lawsuits currently targeting Stable Diffusion and GitHub Copilot, including subtle points about how the interpretation of “fair use” might be applied to the new field of generative AI.
I lost everything that made me love my job through Midjourney over night. A poster on r/blender describes how their job creating graphics for mobile games has switched from creating 3D models for rendering 2D art to prompting Midjourney v5 and cleaning up the results in Photoshop. “I am now able to create, rig and animate a character thats spit out from MJ in 2-3 days. Before, it took us several weeks in 3D. [...] I always was very sure I wouldn’t lose my job, because I produce slightly better quality. This advantage is gone, and so is my hope for using my own creative energy to create.”
Web Stable Diffusion (via) I just ran the full Stable Diffusion image generation model entirely in my browser, and used it to generate an image (of two raccoons eating pie in the woods, see “via” link). I had to use Google Chrome Canary since this depends on WebGPU which still isn’t fully rolled out, but it worked perfectly.
ControlNet (via) A spectacular step forward in image generation—using “conditional control” to control models like Stable Diffusion. The README here is full of examples of what this enables. Extremely finely grained control of generated images based on a sketch, or in input image—including tricks like using Canny edge detection (an algorithm from 1986) to convert any image into an outline which can then be used as input to the model.
2022
Stable Diffusion 2.0 and the Importance of Negative Prompts for Good Results. Stable Diffusion 2.0 is out, and it’s a very different model from 1.4/1.5. It’s trained using a new text encoder (OpenCLIP, in place of OpenAI’s CLIP) which means a lot of the old tricks—notably using “Greg Rutkowski” to get high quality fantasy art—no longer work. What DOES work, incredibly well, is negative prompting—saying things like “cyberpunk forest by Salvador Dali” but negative on “trees, green”. Max Woolf explores negative prompting in depth in this article, including how to combine it with textual inversion.
The AI that creates any picture you want, explained. Vox made this explainer video about text-to-image generative AI models back in June, months before Stable Diffusion was released and shortly before the DALL-E preview started rolling out to a wider audience. It’s a really good video—in particular the animation that explains at a high level how diffusion models work, which starts about 5m30s in.
The Illustrated Stable Diffusion (via) Jay Alammar provides a detailed, clearly explained description of how the Stable Diffusion image generation model actually works under the hood..
I Resurrected “Ugly Sonic” with Stable Diffusion Textual Inversion (via) “I trained an Ugly Sonic object concept on 5 image crops from the movie trailer, with 6,000 steps [...] (on a T4 GPU, this took about 1.5 hours and cost about $0.21 on a GCP Spot instance)”
The Changelog: Stable Diffusion breaks the internet. I’m on this week’s episode of The Changelog podcast, talking about Stable Diffusion, AI ethics and a little bit about prompt injection attacks too.
Exploring the training data behind Stable Diffusion
Two weeks ago, the Stable Diffusion image generation model was released to the public. I wrote about this last week, in Stable Diffusion is a really big deal—a post which has since become one of the top ten results for “stable diffusion” on Google and shown up in all sorts of different places online.
[... 2,897 words]For these reasons, I don’t think I’ll be using Midjourney or any similar tool to illustrate my newsletter going forward (an exception would be if I were writing about the technology at a later date and wanted to show examples). Even though the job wouldn’t go to a different, deserving, human artist, I think the optics are shitty, and I do worry about having any role in helping to set any kind of precedent in this direction.
Grokking Stable Diffusion (via) Jonathan Whitaker built this interactive Jupyter notebook that walks through how to use Stable Diffusion from Python step-by-step, and then dives deep into helping understand the different components of the implementation, including how text is encoded, how the diffusion loop works and more. This is by far the most useful tool I’ve seen yet for understanding how this model actually works. You can run Jonathan’s notebook directly on Google Colab, with a GPU.
Run Stable Diffusion on your M1 Mac’s GPU. Ben Firshman provides detailed instructions for getting Stable Diffusion running on an M1 Mac.
Stable Diffusion is a really big deal
If you haven’t been paying attention to what’s going on with Stable Diffusion, you really should be.
[... 1,443 words]Stable Diffusion Public Release (via) New AI just dropped. Stable Diffusion is similar to DALL-E, but completely open source and with a CC0 license applied to everything it generates. I have a Twitter thread (the via) link of comparisons I’ve made between its output and my previous DALL-E experiments. The announcement buries the lede somewhat: to try it out, visit beta.dreamstudio.ai—which you can use for free at the moment, but it’s unclear to me how billing is supposed to work.
First impressions of DALL-E, generating images from text
I made it off the DALL-E waiting list a few days ago and I’ve been having an enormous amount of fun experimenting with it. Here are some notes on what I’ve learned so far (and a bunch of example images too).
[... 2,102 words]