
Google's Gemini Embedding 2: How to Build Multimodal RAG That Actually Works

Gemini Embedding 2 lets you embed videos, images, audio, and documents directly into a vector database — but embedding a video is not the same as being able to analyze that video inside your RAG system. Most tutorials showing off this new model skip a critical architectural step, leaving you with a system that just hands back video clips instead of actual answers. I built a working multimodal RAG architecture that fixes this, and I'm going to walk you through exactly how it works and why the naive approach falls short.

What Is Gemini Embedding 2 and Why Does It Matter?

Gemini Embedding 2 is Google's first natively multimodal embedding model. It maps five distinct data types — text, images, video, audio, and PDFs — into a single unified vector space. Before this, we were basically stuck embedding text. If you wanted to work with video in a RAG system, you had to do hacky workarounds like generating text descriptions of the video and embedding those instead.

Now, the video itself gets embedded. That's a massive shift.

Here are the key specs:

  • Input types: Text, images (PNG/JPEG), video (MP4/MOV up to 120 seconds), audio, and PDF documents
  • Text token limit: 8,192 tokens (a 4x increase over the previous model)
  • Output dimensions: Flexible from 128 to 3,072, with recommended sizes of 768, 1,536, and 3,072
  • Availability: Public preview via Gemini API and Vertex AI

The business implications are huge. Think about organizations sitting on mountains of proprietary video data — training recordings, security footage, product demos, client calls. Before Gemini Embedding 2, analyzing that data through RAG was a nightmare. Now it's a real possibility. But only if you set up the architecture correctly.

Why Does a Naive Multimodal RAG Setup Fail With Video?

Here's where most people get this wrong, and it's the thing nobody is talking about in the Gemini Embedding 2 hype cycle.

The intuitive assumption goes like this: "I can embed videos now, so I'll embed a video, ask questions about it, and get detailed text answers." That sounds right. It's also wrong.

To understand why, you need to understand what embedding models actually do. An embedding model takes data and converts it into a vector — a list of numbers representing a point in high-dimensional space. When we use 1,536 dimensions (one of the recommended sizes for Gemini Embedding 2), your document becomes 1,536 numbers that represent where it sits semantically relative to everything else in your database.

Think of it like a graph from geometry class, except instead of two axes, you have hundreds or thousands of dimensions. A document about World War II battleships lands near vectors about ships and naval history. It doesn't land near bananas and green apples. That's semantic similarity in action.
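To make "semantic similarity" concrete, here's a toy sketch using hand-made four-dimensional vectors (real embeddings have hundreds to thousands of dimensions) and cosine similarity, the standard way to compare embedding vectors. The vector values are invented for illustration, not real model output:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means
    # "pointing the same direction," i.e. semantically similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones have 768-3,072 dimensions).
battleships   = [0.9, 0.8, 0.1, 0.0]
naval_history = [0.8, 0.9, 0.2, 0.1]
bananas       = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(battleships, naval_history))  # high, ~0.99
print(cosine_similarity(battleships, bananas))        # low, ~0.12
```

The battleships vector lands near naval history and far from bananas — exactly the neighborhood behavior the vector database exploits at query time.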

Here's the traditional flow with text:

  1. You send a text document through the embedding model
  2. It becomes a vector (a point in space with associated numbers)
  3. The document gets stored alongside that vector
  4. When you ask a question, your question also becomes a vector
  5. The system finds the closest vectors to your question
  6. It grabs the paired documents and the LLM ingests that text to build an answer
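The flow above can be sketched with the same kind of toy vectors. Everything here — the `database` entries, the vector values, the `retrieve` function — is an illustrative stand-in for a real vector store, not actual production code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A toy "vector database": each entry pairs a vector with its source document
# (steps 2-3 of the flow above).
database = [
    ([0.9, 0.1, 0.0], "Doc about naval history"),
    ([0.1, 0.9, 0.1], "Doc about tropical fruit"),
    ([0.8, 0.2, 0.1], "Doc about WWII battleships"),
]

def retrieve(query_vector, db, k=2):
    # Steps 4-6: rank stored vectors by similarity to the query vector
    # and return the paired documents for the LLM to read.
    ranked = sorted(db, key=lambda pair: cosine(query_vector, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

print(retrieve([0.85, 0.15, 0.05], database))  # the two naval docs, not the fruit
```

This works when the paired content is text the LLM can read. The whole problem this article is about is what happens when the paired content is an MP4 instead.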

This works beautifully with text because the LLM can actually read and process text documents. But when the paired content is a video file — an MP4 — most LLMs can't ingest and analyze that on the fly. So instead of a thoughtful, detailed answer, you get a two-minute video clip and essentially a "the answer's in there somewhere, good luck" response.

That's what every basic Gemini Embedding 2 tutorial is actually building. There's some value in video retrieval by itself, sure. But it's not the system people think they're getting.

How Do You Build a Multimodal RAG System That Actually Answers Questions About Videos?

The fix is conceptually simple once you see the problem. You don't just embed the video — you also generate a text description and transcript of the video at ingestion time, and store that alongside the vector.

This way, when a query pulls a video vector from the database, the LLM doesn't just get an MP4 file. It gets accompanying text it can actually process and use to generate a real answer.

Here's the architecture:

  1. Video comes in for ingestion
  2. The video goes through Gemini Embedding 2 to create the vector (for search/retrieval purposes)
  3. Simultaneously, the video gets sent through a Gemini model (like Gemini Flash) which generates a text description, transcript, and analysis
  4. Both the video file AND the text description get paired with the vector in the database
  5. At query time, when the vector gets pulled, the LLM has text it can actually work with to augment its answer — plus the video itself for the user to reference

The critical insight: we do this text generation on the front end during ingestion, not on the back end at query time. You don't want Gemini re-analyzing a video every single time someone asks a question. That's slow and expensive. You figure it out once when the video enters the system.
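Here's a minimal sketch of that ingestion step. The function name is mine, and the two model calls are passed in as plain callables so the sketch stays SDK-agnostic — this is an illustration of the pattern, not the repo's actual code:

```python
def ingest_video(video_path, embed_fn, describe_fn):
    """Build one ingestion record for a video chunk.

    embed_fn:    wraps the embedding call (e.g. Gemini Embedding 2).
    describe_fn: wraps the generative call (e.g. Gemini Flash) that
                 produces a description/transcript the LLM can read.
    """
    vector = embed_fn(video_path)          # step 2: vector for search/retrieval
    description = describe_fn(video_path)  # step 3: text generated ONCE, at ingestion
    # Step 4: the vector, the raw file, AND the readable text travel together.
    return {
        "vector": vector,
        "video_path": video_path,
        "description": description,
    }

# Stub callables stand in for real model calls in this sketch:
record = ingest_video(
    "playwright_demo.mp4",
    embed_fn=lambda path: [0.12, 0.87, 0.05],
    describe_fn=lambda path: "Transcript and description of the demo...",
)
print(record["description"])
```

At query time you'd hand `record["description"]` to the LLM for answer generation and surface `record["video_path"]` to the user as the media preview — no per-query video re-analysis needed.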

I tested this architecture side by side. Same RAG system, same question about Playwright browser automations from an embedded video. With the proper architecture (Gemini analysis included), I got a full text description, properly chunked source references, and a media preview of the video. Without it — the naive approach — the answer was essentially "insufficient information to explain this, here are some source files." Night and day difference.

What's the Deal With Video Chunking in RAG?

Chunking is a concept anyone who's worked with text-based RAG knows well. You can't shove an entire document into the system at once — there are limits. So you break it into chunks at intelligent breakpoints.

Video chunking is the same problem, but harder, and it's not a solved problem yet.

Consider an hour-long video. Where do you cut it? With text, you can at least chunk by paragraph, section, or semantic meaning. With video, the options are less clear. Do you chunk based on the transcript? Based on scene changes? Based on arbitrary time intervals?

Gemini Embedding 2 itself has a 120-second video input limit, which forces chunking regardless. The approach I used in my GitHub repo is straightforward: Claude Code automatically chops the video into two-minute segments with a 30-second overlap between segments. It's a simplistic method — basically the video equivalent of fixed-size text chunking with overlap — but it's a proven pattern from the text world.

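Here's a small sketch of that fixed-size-with-overlap chunking math. The function is my own illustration rather than the repo's code, and the ffmpeg invocation in the final comment is one common way to cut the resulting spans:

```python
def chunk_spans(duration_s, chunk_s=120, overlap_s=30):
    """Return (start, end) times in seconds covering the whole video:
    120 s chunks with a 30 s overlap, the fixed-size-with-overlap pattern."""
    spans, start = [], 0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break  # this chunk already reaches the end of the video
        start += chunk_s - overlap_s  # stride = 90 s, so chunks share 30 s

    return spans

# A 5-minute (300 s) video becomes three overlapping chunks:
print(chunk_spans(300))  # [(0, 120), (90, 210), (180, 300)]

# Each span then maps to an ffmpeg cut, e.g.:
#   ffmpeg -ss {start} -t {end - start} -i input.mp4 -c copy chunk_{i}.mp4
# (-c copy avoids re-encoding, but cuts land on keyframes, so boundaries
#  may shift slightly; re-encode if you need frame-accurate cuts.)
```

Each span stays within the model's 120-second input limit, and the 30-second overlap means content near a boundary appears in two chunks — the same insurance policy overlap provides in text chunking.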

This is an area ripe for improvement. I'd expect to see more sophisticated video chunking strategies emerge, probably incorporating:

  • Transcript-based semantic chunking
  • Scene detection algorithms
  • Re-rankers to improve retrieval accuracy across chunks
  • Hybrid approaches combining multiple signals

If you're building this for production, video chunking strategy is where you should spend your experimentation time.

How Do You Set Up This Multimodal RAG System Yourself?

I published a GitHub repo with the full architecture. Two ways to get started:

Option 1: Clone the repo and point Claude Code at it. Open Claude Code inside the project and tell it to recreate the system. Claude Code handles the rest.

Option 2: Use the Claude Code blueprint. The repo includes a markdown file called claude code blueprint. Copy the entire thing, paste it into Claude Code, and it will build the system for you.

Prerequisites you'll need:

  • A recent Python installation
  • FFmpeg (for video processing)
  • Supabase CLI (used for the vector database)
  • A Gemini API key
  • A Supabase project

For Supabase specifically, the free tier is more than enough. Here's where to find your keys:

  1. Create an account at supabase.com
  2. Create a project
  3. Go to Project Settings → API Keys → Legacy anon role API keys for your public API key
  4. Go to Connect, scroll down to find your Supabase URL
  5. If it's your first time with Supabase CLI, it'll ask for an access token — Claude Code will give you the link

Note: I used Supabase for the vector database, but you could swap it for Pinecone, Qdrant, or whatever you prefer. The architecture is the same regardless of your vector store.

Claude Code will literally hold your hand through the setup, including installing the Supabase CLI, creating database tables, and handling all the SQL. You don't need to write any database code manually.
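For a sense of what those tables look like, a pgvector schema for this pattern might be roughly as follows. The table name, columns, and 1,536-dimension choice are illustrative assumptions on my part, not the repo's actual schema — inspect what Claude Code generates in your own project:

```python
# Illustrative pgvector schema for the "vector + file + readable text"
# pattern. Names and the embedding dimension are assumptions, not the
# repo's actual SQL.
SCHEMA_SQL = """
create extension if not exists vector;

create table if not exists media_chunks (
    id          bigserial primary key,
    media_type  text not null,   -- 'video', 'image', 'pdf', ...
    media_url   text,            -- link to the stored file for previews
    description text,            -- Gemini-generated text the LLM reads
    embedding   vector(1536)     -- Gemini Embedding 2 vector for retrieval
);
"""
print(SCHEMA_SQL)
```

The key structural point is that `description` lives in the same row as `embedding` and `media_url`, so one retrieval pulls all three together.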

What Does the Finished System Look Like in Practice?

The system Claude Code builds includes a basic UI where you can upload files (videos, images, documents), query across all media types, and see results that include both text answers and media previews.

When you ask a question like "How can I run multiple browser automations at the same time?" the system:

  • Returns a full text answer synthesized from the embedded video's generated description
  • Shows matched source chunks so you can see what it pulled from
  • Displays a video preview you can actually watch
  • Includes matched images from embedded image content

Being able to get text answers accompanied by the actual source video and images is something you simply could not do before Gemini Embedding 2. This is a massive leap in RAG functionality — as long as the architecture accounts for the text generation step.

The UI is deliberately bare-bones. This is a 90% solution — a working skeleton you can build on. Things I'd recommend adding before production:

  • Better video chunking strategy
  • Data cleanup workflows (what happens when you update or delete a document?)
  • A more polished upload interface (currently handles one file at a time)
  • Re-ranking to improve retrieval quality

Why Can't the Embedding Model Just Explain the Video?

This is the question I get the most when I explain this architecture, and it's a good one.

If Gemini Embedding 2 can "understand" the video well enough to embed it semantically, why can't it also provide an explanation?

Because that's not what embedding models do. An embedding model's job is to take data and turn it into a vector — a numerical representation of semantic meaning. It's optimized for similarity and search, not explanation.

Think of it like the difference between recognizing a face in a crowd and being able to describe who that person is, what they do, and everything about them. The recognition part (embedding) and the description part (generation) are fundamentally different capabilities handled by different types of models.

That's why we need a generative model like Gemini Flash as a companion to the embedding model. The embedding model handles the "find relevant content" job. The generative model handles the "explain what's in this content" job. Both are essential for a multimodal RAG system that actually works.

FAQ

Can Gemini Embedding 2 handle videos longer than 2 minutes?

Not directly. Gemini Embedding 2 has a 120-second (2-minute) limit per video input. For longer videos, you need a chunking strategy that breaks the video into segments. The approach I use is two-minute chunks with 30-second overlaps, but more sophisticated methods using transcript-based semantic chunking are worth exploring.

Do I need Gemini as my LLM to use Gemini Embedding 2?

No. You can use Gemini Embedding 2 for the embedding step and any LLM you want for the generation step. That said, you do need a Gemini model (like Gemini Flash) during the ingestion pipeline to generate text descriptions of non-text media. At query time, any LLM that can process text can handle the response generation.

Is Supabase required for this architecture?

Not at all. I used Supabase because it's free, has solid pgvector support, and the CLI makes setup easy. You could swap in Pinecone, Qdrant, Weaviate, or any other vector database. The architectural pattern — generating text descriptions at ingestion time — works the same regardless of your vector store choice.

How is this different from just transcribing a video and embedding the transcript?

The old approach of transcribing and embedding text loses all visual information. Gemini Embedding 2 embeds the actual video, meaning the vector captures visual content, on-screen text, diagrams, and other non-verbal information. Combined with the Gemini-generated text description (which also analyzes visual elements), you get much richer retrieval than transcript-only approaches.

Is this architecture production-ready?

It's a 90% solution — a working skeleton that demonstrates the correct architectural pattern. For production, you'd want to invest in better video chunking strategies, data lifecycle management (updates and deletes), re-ranking for retrieval accuracy, and a proper UI. But the core pipeline is solid and gives you the right foundation to build on.


If you want to go deeper into multimodal RAG and AI development, join the free Chase AI community for templates, prompts, and live breakdowns. And if you're serious about building with AI, check out the paid community, Chase AI+, for hands-on guidance on how to make money with AI.