Context caching for Google Gemini (via) Another new Gemini feature announced today. Long context models enable answering questions against large chunks of text, but the price of those long prompts can be prohibitive - $3.50/million for Gemini Pro 1.5 up to 128,000 tokens and $7/million beyond that.
Context caching offers a price optimization, where the long prefix prompt can be reused between requests, halving the cost per prompt but at an additional cost of $4.50 / 1 million tokens per hour to keep that context cache warm.
Given that hourly extra charge this isn't a default optimization for all cases, but certain high traffic applications might be able to save quite a bit on their longer prompt systems.
It will be interesting to see if other vendors such as OpenAI and Anthropic offer a similar optimization in the future.
Update 14th August 2024: Anthropic's Claude now has its own version of prompt caching.
Recent articles
- Notes on using LLMs for code - 20th September 2024
- Things I've learned serving on the board of the Python Software Foundation - 18th September 2024
- Notes on OpenAI's new o1 chain-of-thought models - 12th September 2024