Gemini API File Search now handles images, custom metadata, and page-level citations

Google has updated the Gemini API File Search tool with three additions: multimodal retrieval, custom metadata filtering, and page-level citations. Taken together, they move File Search from a convenient text-retrieval shortcut to something that can handle the messy, mixed-format content libraries most organisations have.

What changed

Multimodal support is the headline addition. File Search can now process images and text in the same retrieval pipeline, powered by Gemini Embedding 2, Google’s first embedding model to map text, images, video, audio, and documents into a single shared vector space. In practice, this means you can upload a PDF alongside product photos, scanned diagrams, or slide exports and query across all of them in one pass, without a separate OCR step or a second pipeline stitching things together.

For image retrieval specifically, the API currently supports PNG and JPEG files (up to 4K x 4K pixels, up to 6 images per request). Audio and video formats are not yet supported within File Search stores, even though Gemini Embedding 2 supports them more broadly.

Custom metadata filtering lets you tag files with labels of your choosing, such as department, status, document type, or date range, and then narrow queries to specific subsets of your index at retrieval time. If you have a 50,000-document knowledge base and a user query that only applies to legal documents from Q1, you no longer have to surface and re-rank everything. You filter first, retrieve second.
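The filter-first, retrieve-second idea can be sketched locally in a few lines of Python. This is a toy illustration only: it does not call the Gemini API, and the Doc class, the filtered_search function, the two-dimensional vectors, and the department and quarter tags are all hypothetical stand-ins for what the managed service does internally.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Doc:
    name: str
    vector: list[float]  # toy embedding, standing in for a real one
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(docs, query_vector, required, top_k=3):
    # Step 1 (filter): keep only documents whose metadata matches every required tag.
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in required.items())
    ]
    # Step 2 (retrieve): rank the survivors by similarity to the query.
    ranked = sorted(candidates, key=lambda d: cosine(d.vector, query_vector), reverse=True)
    return ranked[:top_k]


# Hypothetical mini-index; file names and tags are made up for illustration.
docs = [
    Doc("nda_template.pdf", [0.9, 0.1], {"department": "legal", "quarter": "Q1"}),
    Doc("supplier_contract.pdf", [0.7, 0.3], {"department": "legal", "quarter": "Q1"}),
    Doc("brand_guidelines.pdf", [0.95, 0.05], {"department": "marketing", "quarter": "Q1"}),
    Doc("litigation_summary.pdf", [0.8, 0.2], {"department": "legal", "quarter": "Q3"}),
]

hits = filtered_search(docs, [1.0, 0.0], {"department": "legal", "quarter": "Q1"})
print([d.name for d in hits])  # ['nda_template.pdf', 'supplier_contract.pdf']
```

Note that brand_guidelines.pdf is the closest vector to the query, yet it never reaches the ranking step because the filter removes it first. That is the point of filtering before retrieval: irrelevant subsets cannot crowd out the documents the user actually asked about.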

Page-level citations close the loop on verifiability. Every response now includes grounding metadata that ties the answer to a specific document and page number. For multimodal stores, that also includes downloadable image references. The citation information comes back via the grounding_metadata attribute on the response object, so it is straightforward to surface in your UI.
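As a rough sketch of what surfacing citations in a UI might involve, the helper below turns grounding metadata into short, de-duplicated citation labels. Only the grounding_metadata attribute is named in the announcement; the dictionary keys used here (grounding_chunks, retrieved_context, title, page_number) are assumptions about the payload shape, and the sample data is invented, so check the API reference before relying on any of them.

```python
def format_citations(grounding_metadata: dict) -> list[str]:
    """Build de-duplicated 'title, p. N' labels from grounding metadata.

    The key names below are assumptions about the payload shape,
    not a confirmed schema.
    """
    seen = set()
    citations = []
    for chunk in grounding_metadata.get("grounding_chunks", []):
        context = chunk.get("retrieved_context", {})
        title = context.get("title", "unknown document")
        page = context.get("page_number")  # hypothetical field name
        # Fall back to the bare title when no page number is present (e.g. an image).
        label = f"{title}, p. {page}" if page is not None else title
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations


# Invented sample payload, shaped like the assumed structure above.
sample = {
    "grounding_chunks": [
        {"retrieved_context": {"title": "msa_2024.pdf", "page_number": 14}},
        {"retrieved_context": {"title": "msa_2024.pdf", "page_number": 14}},
        {"retrieved_context": {"title": "site_photo.png"}},
    ]
}

print(format_citations(sample))  # ['msa_2024.pdf, p. 14', 'site_photo.png']
```

De-duplicating matters because several retrieved chunks often come from the same page, and a UI that lists the same source three times reads as noise rather than evidence.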

What this means for developers

Before this update, building a retrieval-augmented generation system that handled both text and images meant running separate embedding pipelines, managing different vector stores, and writing the logic to merge results before passing them to a model. That is not an impossible problem, but it is a lot of infrastructure to maintain, and every extra component is another point of failure.

File Search handles chunking, embedding, storage, and retrieval as a managed service. You upload your files via the API, configure your metadata schema, and query against it. Google handles the rest. The practical result is that a working multimodal RAG setup is now closer to dozens of lines of code than months of infrastructure work.

Page-level citations deserve particular attention. For anyone building tools where users need to verify answers, such as legal research, compliance review, or internal knowledge bases, pointing to page 14 of a specific document rather than just naming the file makes answers far quicker to check and easier to trust.

Real-world examples

Google highlighted a few early adopters. K-Dense Web is using File Search for scientific research, where material routinely mixes charts, figures, and written text in the same documents. Klipy is using it to improve text recognition inside image-heavy GIF libraries, a use case where conventional document search would simply miss embedded visual content. Harvey, a legal research platform, reported a 3% improvement in Recall@20 on legal benchmarks after adopting Gemini Embedding 2. Recall@20 measures how many of the relevant passages appear among the top 20 results, so the gain should translate to fewer missed references and better-grounded citations.

These are not abstract demos. They reflect the core problem File Search is now better equipped to solve: real content does not arrive cleanly formatted as plain text.

A few things to know before you build

If you already have a text-only File Search index using gemini-embedding-001, you will need to re-embed it to use the multimodal features. Multimodal stores require gemini-embedding-2, and the two models use different embedding spaces, so they are not compatible.

On pricing, file storage and query-time embedding generation remain free. You pay for the initial indexing embeddings and normal Gemini model input and output tokens. Google recommends keeping individual stores under 20 GB for best retrieval performance and lower latency.
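If a corpus is larger than the recommended store size, one way to plan the split is a simple first-fit-decreasing pass over file sizes. The sketch below is an assumption-laden illustration, not Google's guidance: plan_stores is a hypothetical helper, the corpus names and sizes are invented, and 20 GB is read as decimal gigabytes.

```python
GB = 10**9  # assumption: "20 GB" is read as decimal gigabytes
RECOMMENDED_STORE_LIMIT = 20 * GB


def plan_stores(file_sizes: dict[str, int], limit: int = RECOMMENDED_STORE_LIMIT) -> list[list[str]]:
    """Group files so each group's total size stays within the limit.

    Uses first-fit decreasing: take files largest first and place each
    into the first group that still has room, opening a new group otherwise.
    """
    stores = []  # each entry is [remaining_bytes, [file names]]
    for name, size in sorted(file_sizes.items(), key=lambda item: item[1], reverse=True):
        if size > limit:
            raise ValueError(f"{name} alone exceeds the store limit")
        for store in stores:
            if store[0] >= size:
                store[0] -= size
                store[1].append(name)
                break
        else:
            # No existing group had room, so start a new one.
            stores.append([limit - size, [name]])
    return [names for _, names in stores]


# Invented corpus sizes for illustration.
sizes = {
    "contracts": 12 * GB,
    "manuals": 9 * GB,
    "scans": 8 * GB,
    "photos": 3 * GB,
}

print(plan_stores(sizes))  # [['contracts', 'scans'], ['manuals', 'photos']]
```

In practice a split along a natural boundary, such as department or document type, is likely to be easier to query than a purely size-driven one; the packing pass is mainly useful as a quick check on how many stores a corpus would need.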

The tool is compatible with Gemini 2.5 Pro, Gemini 2.5 Flash-Lite, Gemini 3 Flash Preview, Gemini 3.1 Pro Preview, and Gemini 3.1 Flash-Lite Preview.

Google has also published a demo app in AI Studio that lets you upload PDFs and images and query across them with citations and page numbers returned in real time, which is a useful starting point if you want to see the feature in action before writing any code.

The bigger picture

File Search launched in November 2025 as a managed RAG service with text retrieval and straightforward pricing. This update is a substantial expansion of that foundation. The combination of a unified embedding space for text and images, metadata filtering for precision, and page-level citations for verifiability addresses three of the most common reasons RAG implementations fall short in production: they cannot handle visual content, they retrieve too broadly, and they do not give users a way to check the answer.

None of these problems are new. What is new is that Google is now handling them at the infrastructure level, so individual developers do not have to solve them from scratch each time. For teams building internal tools on mixed business content, that is a practical shift in how much work it takes to get something reliable into production.