Gemini 3.1 Flash-Lite Is Now Generally Available — Google's Cheapest Production-Ready Model in the 3.x Family

Google’s Gemini 3.1 Flash-Lite moved from preview to general availability on May 7, 2026, and with it comes locked-in production pricing and a clear deprecation deadline for anyone still on the preview endpoint.

The headline numbers: $0.25 per million input tokens, $1.50 per million output tokens. That makes it the cheapest GA model in the Gemini 3.x family, and by a significant margin. Gemini 3.5 Flash, which reached GA shortly after on May 19, comes in at $1.50 input and $9.00 output. Flash-Lite is six times cheaper on output tokens.

What’s Actually in the Box

Flash-Lite is positioned as Google’s speed-and-cost workhorse for high-volume tasks. The model supports text, image, video, audio, and PDF inputs, with a 1,048,576-token context window and a maximum output of 65,536 tokens. Knowledge cutoff is January 2025.

One detail worth understanding: Flash-Lite ships with four configurable thinking levels (minimal, low, medium, high), and it defaults to minimal. That’s a deliberate choice. Where other Gemini 3 models default to high, Flash-Lite is tuned to be fast and cheap by default, with the option to dial up reasoning when a specific task demands it. A single deployment can handle mixed workloads without routing traffic across different models.

On the benchmarks side, Artificial Analysis reports 2.5x faster Time to First Answer Token and 45% better output speed compared to Gemini 2.5 Flash, while maintaining comparable or better quality. It also scores 86.9% on GPQA Diamond, 76.8% on MMMU Pro, and sits at an Elo of 1432 on the Arena.ai leaderboard.

Google describes it as architecturally based on Gemini 3 Pro and trained on TPUs.

The Deprecation Timeline You Need to Know

If you’re using gemini-3.1-flash-lite-preview, here are the two dates that matter:

Gemini API (ai.google.dev): Preview endpoint shut down May 25, 2026
Vertex AI / Google Cloud: Preview endpoint shuts down July 9, 2026

The Gemini API deadline has already passed. If your application is still pointing at gemini-3.1-flash-lite-preview on the Gemini API side, it’s already broken. The Vertex AI window gives you until early July, but that’s not a reason to wait.

The migration itself is straightforward. Update your model ID from gemini-3.1-flash-lite-preview to gemini-3.1-flash-lite and you’re done. The GA model is a direct replacement.

What Does This Mean for You?

The practical implications split a few ways depending on where you sit.

If you’re running high-volume, latency-sensitive workloads, Flash-Lite is worth a serious look if you haven’t already. Customer support pipelines, document classification, structured data extraction, real-time chat responses — these are the use cases it’s designed around. Gladly, for example, runs its core text-channel AI agent on Flash-Lite across SMS, WhatsApp, and Instagram, handling millions of interactions weekly at roughly 60% lower cost than comparable thinking-tier models.

If you’re building developer tools, JetBrains has noted that the balance of speed and intelligence makes it well-suited to real-time code completion and agentic tooling. The low latency matters a lot when users are waiting on in-editor suggestions.

If you’re deciding between Flash-Lite and 3.5 Flash, the honest answer is that they’re not competing for the same jobs. Flash-Lite is for tasks where cost and speed are the primary constraints and you don’t need deep reasoning. Gemini 3.5 Flash is for more complex tasks where you’re willing to pay for better reasoning quality. The six-times price difference is real, and so is the capability gap. Pick based on what your task actually requires.

If you’re on 2.0 Flash-Lite or 2.5 Flash-Lite, Google says 3.1 Flash-Lite delivers a meaningful quality improvement over both, with performance in some benchmark areas matching 2.5 Flash. That’s a notable step up for a model at this price point.

The Broader Context

The GA release puts Flash-Lite in an interesting position within the Gemini 3.x family. It launched into preview on March 3, 2026, spent about two months in testing, and has now graduated with locked pricing. The two-month preview-to-GA cycle is relatively quick, and the simultaneous expansion with 3.5 Flash signals that Google is building out a proper tiered model family rather than just offering a single all-purpose option.

For developers, a GA release means pricing stability. Preview models can change without notice. With GA, you can build cost models and SLAs around the numbers, which is a meaningful operational difference even if the model itself hasn’t changed.

Full details are available in the Gemini API changelog, the model card from Google DeepMind, and the deprecations page for migration specifics.