Gemini Omni: Google's Multimodal Video Model That Lets You Edit With Conversation

Google announced Gemini Omni at I/O 2026, and it represents a meaningful shift in how the company is approaching AI video. The first model in the family, Gemini Omni Flash, is rolling out now to Gemini Plus, Pro and Ultra subscribers, and is also available at no cost on YouTube Shorts and the YouTube Create app starting this week.

Veo, Google’s previous video generation model in the Gemini app, is being replaced. Omni is what comes next.

What Omni actually does differently

Most multimodal AI systems today are hybrid arrangements: separate models handling separate inputs, stitched together behind a clean interface. Omni is built differently. Google trained it as a single unified architecture where text, image, video, audio and code are all treated as first-class inputs from the start, not bolted on afterwards.

The practical result is that you can mix reference materials in a single prompt. Drop in a photo, some audio, a short video clip and a text description, and Omni generates a coherent output that draws on all of them. That is genuinely new compared to what Veo offered.

The other headline capability is conversational editing. Rather than re-prompting from scratch each time you want to adjust something, Omni retains context across a sequence of instructions. You can tell it to add a cinematic zoom, then swap the background, then adjust the lighting, and it keeps track of what you wanted to preserve. This matters because one of the persistent frustrations with AI video tools has been that small edits can unintentionally break things elsewhere in the clip. Omni is designed to hold continuity across characters, scenes and structure as you work through changes.

One practical note: editing prompts will need to be reasonably specific. Vague instructions risk over-editing, with Omni changing things you intended to keep. That is worth bearing in mind if you are experimenting early on.
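
One way to keep editing prompts specific is to state what must stay untouched alongside each requested change. The sketch below is only an illustration of that habit and assumes nothing about Omni's actual interface: EditSession and its instruct method are hypothetical helpers that build prompt text on your side, and the wording of the prompts is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EditSession:
    """Hypothetical client-side record of a conversational edit; not an Omni API."""

    preserve: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)

    def instruct(self, change: str, keep: tuple[str, ...] = ()) -> str:
        # Remember anything newly marked as off-limits, without duplicates.
        for item in keep:
            if item not in self.preserve:
                self.preserve.append(item)
        self.history.append(change)
        # Build an explicit prompt: the change, then everything to leave alone.
        prompt = change.strip().rstrip(".") + "."
        if self.preserve:
            prompt += " Keep unchanged: " + ", ".join(self.preserve) + "."
        return prompt


session = EditSession()
first = session.instruct(
    "Add a slow cinematic zoom on the subject", keep=("the subject's face",)
)
second = session.instruct(
    "Swap the background for a beach at sunset", keep=("the camera zoom",)
)
print(first)
print(second)
```

Each later prompt carries forward the full list of things to preserve, so a vague follow-up is less likely to be read as permission to change earlier work.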

What you get right now

Omni Flash generates videos up to 10 seconds long. Google has been clear that this is a deliberate rollout decision rather than a technical ceiling. Longer durations are coming, but the thinking is that most users are not making longer videos yet, and Google wanted broader access sooner.

The feature set at launch includes:

Image-to-video and video-to-video editing via conversational prompts.
AI Avatars, a way to create a digital version of yourself that can appear in generated videos. The feature is entirely optional and requires a dedicated onboarding process, in which you record yourself and speak a series of numbers to guard against misuse. Only you can use your own avatar.
Google Flow Music integration, where you can use Omni conversationally to direct music videos that match the narrative and pacing of a track.
YouTube Shorts Remix, which lets you select an eligible Short, prompt the changes you want, and get an edited version back.

For developers and enterprise customers, API access is coming in the next few weeks through the same interface teams already use for Gemini, so no migration is required.

The physics and world knowledge angle

One thing worth highlighting beyond the obvious creative tools: Omni incorporates Gemini’s knowledge of physics, history, science and cultural context into how it generates video. It is not just rendering scenes that look plausible; it is reasoning about what should happen based on how the world actually works, including gravity, fluid dynamics, kinetic energy, and cause and effect across a scene. Google describes this as bridging photorealism with meaningful storytelling, and it is the part that separates Omni from tools that produce visually impressive but physically incoherent output.

Safety and verification built in

Every video generated by Omni ships with two layers of provenance. SynthID is an invisible watermark embedded into the pixels at generation time, designed to survive cropping, filters and re-encoding. C2PA content credentials sit alongside it as a signed cryptographic manifest attached to the file itself.
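
To make the idea of a signed manifest concrete, here is a toy sketch of how a claim about a file can be bound to that file's exact bytes and then checked. It is not the C2PA format and not how Omni produces credentials: the function names, the claim fields and the shared DEMO_KEY are all assumptions for illustration, and an HMAC stands in for the certificate-based signatures that real Content Credentials use.

```python
import hashlib
import hmac
import json

# Stand-in secret for illustration only. Real Content Credentials are signed
# with certificate-based signatures, not a shared HMAC key.
DEMO_KEY = b"demo-signing-key"


def make_manifest(video_bytes: bytes, generator: str) -> dict:
    """Bind a claim about the video to its exact bytes and sign the claim."""
    claim = {
        "generator": generator,
        "content_sha256": hashlib.sha256(video_bytes).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(video_bytes: bytes, manifest: dict) -> bool:
    """Check the signature first, then check the claim matches these bytes."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    digest = hashlib.sha256(video_bytes).hexdigest()
    return manifest["claim"]["content_sha256"] == digest


video = b"placeholder video bytes"
manifest = make_manifest(video, generator="example-generator")
print(verify_manifest(video, manifest))         # True
print(verify_manifest(video + b"x", manifest))  # False
```

In this toy version any change to the bytes fails verification, which hints at why a watermark designed to survive re-encoding is a useful complement to a manifest attached to the file.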

You can verify whether a video was made with Omni through the Gemini app, Gemini in Chrome, and Google Search. Google is also expanding Content Credentials verification across its products, so you will be able to see whether content originated from a camera or was AI-generated, and whether generative tools were used in editing.

This is not a minor detail. Research shows people correctly identify high-quality AI-generated video roughly a quarter of the time. Google is putting significant weight behind provenance tools, and it is worth noting that OpenAI, Kakao and ElevenLabs are also adopting SynthID, which gives the watermarking system more practical reach across the industry.

What is still coming

Omni Flash is the first model in the family. Omni Pro is in development and will target more professional use cases, though Google has not committed to a release date beyond saying it will launch when the performance represents a genuine step up from Flash.

Audio input at launch is limited to voice references. Other audio input types are on the roadmap. And while video is the first output modality, Google has said image and text outputs will be enabled over time.

There are also no published benchmarks yet. Google led the announcement with capability claims and demonstrations rather than standardised evaluation scores. Independent third-party assessments will take a few more weeks to arrive, so treat the performance claims as directional for now.

The bigger picture

Google built Gemini from the start to handle multiple modalities in a single model. Omni is the video chapter of that story. The free rollout on YouTube Shorts is not incidental: Google is clearly thinking about the creator pipeline and what happens when a significant volume of YouTube content is being produced with or refined by Omni.

For anyone making content, editing footage, building creative workflows or developing products that involve video, this is worth paying close attention to. The API rollout in the coming weeks will be the moment developers can start understanding what is actually possible under the hood.