24th January 2024 - Link Blog
Google Research: Lumiere. The latest in text-to-video from Google Research, described as “a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion”.
Most existing text-to-video models generate a sparse set of keyframes and then use separate models to fill in the gaps between them, which frequently produces incoherent motion. Lumiere “generates the full temporal duration of the video at once”, which avoids this problem (see the toy sketch below).
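To make that contrast concrete, here is a toy Python sketch. Everything in it is hypothetical: the function names and the stub implementations stand in for real diffusion models and are not Lumiere’s actual pipeline. It illustrates why a separate interpolation stage has to invent the motion between keyframes after the fact, while a single-pass model produces every frame jointly.

```python
# Schematic contrast between cascaded keyframe generation and
# single-pass generation. All names and stubs are hypothetical
# stand-ins for real diffusion models, not Lumiere's code.

import numpy as np

FPS = 16
SECONDS = 5
NUM_FRAMES = FPS * SECONDS  # 80 frames, matching the paper's training clips
H = W = 64                  # arbitrary toy resolution


def generate_keyframes(prompt: str, count: int) -> np.ndarray:
    """Stub for a text-to-image model that produces sparse keyframes."""
    return np.random.rand(count, H, W, 3)


def temporal_super_resolution(keyframes: np.ndarray, num_frames: int) -> np.ndarray:
    """Stub for a separate interpolation model: blends linearly between
    keyframes, inventing the in-between motion after the fact."""
    idx = np.linspace(0, len(keyframes) - 1, num_frames)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    t = (idx - lo)[:, None, None, None]
    return (1 - t) * keyframes[lo] + t * keyframes[hi]


def generate_full_clip(prompt: str, num_frames: int) -> np.ndarray:
    """Stub for a model that denoises every frame jointly, so the whole
    motion trajectory is produced in one pass (Lumiere's claim)."""
    return np.random.rand(num_frames, H, W, 3)


# Cascaded approach: a second model reconstructs motion between keyframes.
cascaded = temporal_super_resolution(generate_keyframes("a cat", 8), NUM_FRAMES)

# Lumiere-style approach: the full temporal duration is generated at once.
single_pass = generate_full_clip("a cat", NUM_FRAMES)

assert cascaded.shape == single_pass.shape == (NUM_FRAMES, H, W, 3)
```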
Disappointingly but unsurprisingly, the paper doesn’t go into much detail on the training data, beyond stating “We train our T2V model on a dataset containing 30M videos along with their text caption. The videos are 80 frames long at 16 fps (5 seconds)”.
The examples of “stylized generation” which combine a text prompt with a single reference image for style are particularly impressive.