Highlights:

  • With Gemini 1.5 Flash-8B, Google is presenting one of the most cost-effective lightweight LLMs currently available on the market.
  • Gemini 1.5 Flash-8B is freely accessible to developers via the Gemini API and Google AI Studio.

Google LLC is releasing a new version of its popular Gemini 1.5 Flash artificial intelligence model that is smaller and faster than the original.

This version, called Gemini 1.5 Flash-8B, is much more affordable, priced at half the cost of its predecessor. Gemini 1.5 Flash is a lightweight iteration of Google’s Gemini family of large language models, optimized for speed and efficiency and designed for deployment on low-powered devices like smartphones and sensors.

The company unveiled Gemini 1.5 Flash at Google I/O 2024 in May, and it was made available to select paying customers a few weeks later. It subsequently became accessible for free through the Gemini mobile app, though with certain usage restrictions.

It officially became widely available at the end of June, featuring competitive pricing and a one million-token context window alongside high-speed processing. At its launch, Google highlighted that its input size is 60 times greater than that of OpenAI’s GPT-3.5 Turbo and is, on average, 40% faster.

The initial version was designed with a very low token input cost, making it highly competitive for developers. It was adopted by customers such as Uber Technologies Inc., which uses it to power the Eats AI assistant for its Uber Eats food delivery service.

With Gemini 1.5 Flash-8B, Google is launching one of the most affordable lightweight LLMs on the market, featuring a 50% lower price and double the rate limits compared to the original 1.5 Flash. Additionally, the company stated that it provides reduced latency on smaller prompts.

Gemini 1.5 Flash-8B is available to developers for free via the Gemini API and Google AI Studio.
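As a rough sketch of what calling the model through the Gemini API looks like, the snippet below builds a `generateContent` request against the public REST endpoint using only the Python standard library. The model name and endpoint follow the documented v1beta API; the API key is a placeholder you would obtain from Google AI Studio.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: get a real key from Google AI Studio
MODEL = "gemini-1.5-flash-8b"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent?key={API_KEY}"
)

def build_payload(prompt: str) -> dict:
    """Minimal request body for a single-turn text prompt."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

payload = build_payload("Summarize this article in one sentence.")
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment to actually send the request (requires a valid API key):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Google's official client libraries wrap this same endpoint; the raw request is shown here only to make the shape of the API concrete.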

In a blog post, Logan Kilpatrick, Senior Product Manager for the Gemini API, stated that the company has made “significant progress” in enhancing 1.5 Flash by considering feedback from developers and “testing the limits” of what can be achieved with lightweight LLMs.

He noted that the company introduced an experimental version of Gemini 1.5 Flash-8B last month. Since then, it has undergone further refinement and is now available for general production use.

Kilpatrick stated that the 8B version nearly matches the performance of the original 1.5 Flash model across many key benchmarks and has proven particularly effective in tasks such as chat, transcription, and long-context language translation.

“Our release of best-in-class small models continues to be informed by developer feedback and our own testing of what is possible with these models. We see the most potential for this model in tasks ranging from high volume multimodal use cases to long context summarization tasks,” Kilpatrick added.

The pricing is competitive with similar models from OpenAI and Anthropic PBC. OpenAI’s least expensive model remains GPT-4o mini, priced at USD 0.15 per million input tokens, though this cost is halved for reused prompt prefixes and batched requests. Anthropic’s most affordable model, Claude 3 Haiku, costs USD 0.25 per million input tokens, with the price dropping to USD 0.03 per million for cached tokens.
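To put those per-million-token rates in concrete terms, the back-of-the-envelope calculation below prices a single large prompt under each quoted rate. The figures are the ones cited above; actual vendor pricing can change over time.

```python
# Per-million-input-token prices as quoted in the article (USD).
PRICES_PER_M = {
    "gpt-4o-mini": 0.15,
    "gpt-4o-mini (cached/batched)": 0.075,  # halved for reused prefixes / batching
    "claude-3-haiku": 0.25,
    "claude-3-haiku (cached)": 0.03,
}

def input_cost(model: str, tokens: int) -> float:
    """USD cost to process `tokens` input tokens at the listed rate."""
    return PRICES_PER_M[model] * tokens / 1_000_000

# Example: one 200,000-token prompt under each rate.
for model in PRICES_PER_M:
    print(f"{model}: ${input_cost(model, 200_000):.4f}")
```

At these rates a 200,000-token prompt costs three cents on GPT-4o mini versus five cents on Claude 3 Haiku, which is why cached-token discounts matter so much at high volume.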

Furthermore, Kilpatrick mentioned that the company is doubling the rate limits of 1.5 Flash-8B to enhance its utility for straightforward, high-volume tasks. Consequently, developers can now submit up to 4,000 requests per minute, he stated.
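For high-volume workloads, staying under a requests-per-minute cap like that one is typically handled client-side. The sketch below is a hypothetical sliding-window throttle (not part of any Google SDK) that blocks until a request slot is free; the 4,000 figure matches the limit mentioned above.

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window throttle: at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # monotonic times of recent requests

    def acquire(self) -> None:
        """Block until a request slot is available, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window, then retry.
            time.sleep(self.window - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(now)

limiter = RateLimiter(max_requests=4000)
# Call limiter.acquire() before each API request to stay under the cap.
```

Production systems usually also back off on HTTP 429 responses, but a local throttle like this avoids hitting the limit in the first place.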