Claude Code + RAG-Anything: Multimodal RAG for PDFs and Charts
Most RAG systems can only handle text. Feed them a scanned PDF, a bar chart, or a LaTeX equation and they choke. RAG-Anything fixes this — it sits on top of LightRAG and adds multimodal ingestion for images, charts, graphs, tables, and equations. Once it's installed, Claude Code can pull specific numbers out of bar charts inside PDFs you never had to manually transcribe. I'll walk through how it works under the hood, how to install it, and the one ingestion quirk you need to know about.
RAG-Anything comes from the same team that built LightRAG. It plugs directly into an existing LightRAG instance, uses the same knowledge graph, and answers queries through the same API. If you already have LightRAG running from yesterday's setup, adding RAG-Anything is a drop-in upgrade.
What is RAG-Anything and how does it extend LightRAG?
RAG-Anything is a multimodal wrapper around LightRAG that handles non-text documents — scanned PDFs, charts, tables, LaTeX equations, and images. LightRAG by itself only works on text. RAG-Anything parses visual and non-text content into a format LightRAG can index, then merges everything into a single knowledge graph and vector database.
From the user's perspective, nothing changes. You still query through the same LightRAG API. You still hit the same knowledge graph. The only real difference is that your queries can now pull answers out of documents that LightRAG on its own would have ignored.
The trade-off is weight. RAG-Anything ships with local models (including MinerU) that have to run on your machine. That means more disk space, slower ingestion on CPU, and a separate ingestion path that requires a Python script rather than the web UI upload button.
How does RAG-Anything actually work under the hood?
This is worth understanding because it shows why it's both powerful and a little slower than plain LightRAG.
When you ingest a non-text document, RAG-Anything runs it through a pipeline:
- MinerU parses the document into components. MinerU is an open-source document parser that identifies the structure — "this is a header, this is a paragraph, this is a chart, this is a LaTeX equation." It doesn't understand the content yet. It just boxes things up.
- Specialized local models extract text from each component. PaddleOCR handles text blocks. A separate model handles LaTeX equations. These models run locally on your computer — no API call, no cost.
- Anything that can't be converted to text gets screenshotted. Charts, bar graphs, complex diagrams — RAG-Anything takes an image of them and puts them in the "image bucket."
- The text bucket gets sent to an LLM for entity + embedding extraction. By default this is GPT-5.4 nano in the install script. The LLM extracts entities, relationships, and embeddings from the text.
- The image bucket gets sent to the same LLM as screenshots. The LLM does OCR-style extraction plus entity/relationship pulling from the visuals.
- Everything merges into one vector database and one knowledge graph. Then that merges again with your existing LightRAG store.
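The bucket split above can be sketched in a few lines of Python. To be clear, this is an illustration of the routing idea, not RAG-Anything's actual code — the component types and function names here are hypothetical:

```python
# Illustrative sketch of the text/image bucket split -- the component
# types and extractor behavior here are hypothetical, not RAG-Anything's API.

def route_components(components):
    """Split parsed components into a cheap local-text path and an LLM image path."""
    text_bucket, image_bucket = [], []
    for comp in components:
        if comp["type"] in ("paragraph", "header", "table", "equation"):
            # Local models (PaddleOCR, LaTeX extractors) turn these into
            # text on your machine -- no API call, no cost.
            text_bucket.append(comp["content"])
        else:
            # Charts, diagrams, anything non-textual: screenshot for the LLM.
            image_bucket.append(comp["content"])
    return text_bucket, image_bucket

parsed = [
    {"type": "header", "content": "Revenue Analysis"},
    {"type": "paragraph", "content": "Revenue grew steadily."},
    {"type": "chart", "content": "<screenshot bytes>"},
]
text, images = route_components(parsed)
print(len(text), len(images))  # → 2 1
```

Only the image bucket costs LLM tokens; everything else rides the free local path, which is the whole economic argument for the split.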
The smart part is the split. Instead of throwing every page at GPT as a screenshot — which would be slow and expensive — RAG-Anything uses local models to extract text cheaply, and only sends images to the LLM when there's no other option. That's how it stays affordable at scale.
How do you install RAG-Anything on top of LightRAG?
The install is a one-shot Claude Code prompt. Three things have to happen:
- Update the storage path. Your existing Docker LightRAG instance has its own storage config. RAG-Anything needs to share it so the knowledge graphs merge correctly.
- Update the models in the example scripts. The GitHub repo's example scripts default to older models like GPT-4o-mini. Swap them for GPT-5.4 nano (or whatever you prefer) and keep text-embedding-3-large for embeddings so everything uses the same OpenAI pipeline.
- Fix the embedding double-wrap bug. Because RAG-Anything wraps LightRAG, some example scripts have an embedding call that gets double-wrapped and fails. Claude Code will fix this automatically if you point it at the error.
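The double-wrap bug is easiest to see in miniature. The class below is a minimal stand-in for a LightRAG-style embedding wrapper — not the real type — and shows both the failure mode and the unwrap fix:

```python
class EmbeddingFunc:
    """Minimal stand-in for a LightRAG-style embedding wrapper (illustrative only)."""

    def __init__(self, func, dim):
        # The fix: if a caller passes an already-wrapped function,
        # reach through to the raw callable instead of nesting wrappers.
        if isinstance(func, EmbeddingFunc):
            func = func.func
        self.func = func
        self.dim = dim

    def __call__(self, texts):
        return self.func(texts)

def fake_embed(texts):
    # Stand-in for the text-embedding-3-large call.
    return [[0.0] * 4 for _ in texts]

once = EmbeddingFunc(fake_embed, dim=4)
twice = EmbeddingFunc(once, dim=4)  # accidental double wrap, as in the example scripts
print(twice.func is fake_embed)  # → True: the inner callable, not a nested wrapper
```

Without the `isinstance` guard, `twice.func` would be another `EmbeddingFunc`, and downstream code expecting a raw embedding function breaks — which is the error Claude Code patches when you point it at the traceback.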
The full install prompt is inside the free Chase AI community — search "RAG-Anything" there. It downloads MinerU and all its model dependencies, which is why the install is heavier than plain LightRAG.
Heads up: by default MinerU runs on CPU. If ingestion feels too slow, tell Claude Code to switch MinerU to GPU. It'll install the right PyTorch version and reconfigure for you. Not required, but dramatically faster for bulk ingestion.
How do you ingest multimodal documents with RAG-Anything?
This is the one real annoyance. Plain LightRAG lets you drag a text document into the web UI and hit upload. RAG-Anything can't do that — multimodal ingestion has to go through a Python script.
The fix is to wrap the script as a Claude Code skill. Once you have the skill installed in your .claude/ folder, ingestion becomes:
"Claude Code, use the RAG-Anything skill to upload this PDF."
Claude Code runs the script, which runs MinerU, which extracts components, which get sent through the pipeline described above. When it's done, the skill automatically restarts the Docker container so LightRAG picks up the new data. From your end, it's one command.
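A skill like this is just a folder containing a SKILL.md. The sketch below shows one possible shape — the skill name, script name, and container name are all placeholders, not the community version:

```markdown
---
name: rag-anything-ingest
description: Ingest a multimodal document (PDF, charts, images) into the shared LightRAG store via RAG-Anything
---

When the user asks to ingest a document:

1. Run `python ingest.py <path-to-document>` from this skill's folder.
2. Wait for the script to exit successfully (MinerU parsing can take minutes on CPU).
3. Restart the LightRAG container so it picks up the new data:
   `docker restart lightrag`
```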
The skill is also in the free community alongside the install prompt. Drop it in .claude/skills/ and it's ready to use.
What does RAG-Anything let you query that LightRAG can't?
Here's a concrete example from the video. I ingested a PDF called "Novatech SaaS Revenue Analysis" — totally fake company, but the document contains a sorted bar chart of monthly revenue from January through September 2025. Plain LightRAG would see the chart as an unparseable image and skip it.
With RAG-Anything, I asked through Claude Code: "Query the LightRAG database about the monthly revenue trend for Novatech Inc. January through September 2025."
The response pulled the exact monthly numbers out of the bar chart — January 4.6, February 4.9, March 5.4, and so on through September. Those numbers were never typed out in the PDF as text. They only existed as bars on a chart. MinerU screenshotted the chart, GPT-5.4 extracted the values, and they went into the knowledge graph as structured data.
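You can also hit the same store programmatically instead of going through Claude Code — a query is just a POST to the LightRAG server. The port, endpoint shape, and response key below are assumptions based on a default LightRAG Server setup; adjust them to your instance:

```python
import json
from urllib import request

# Assumed default LightRAG Server address -- change if yours differs.
LIGHTRAG_URL = "http://localhost:9621/query"

def build_query(question: str, mode: str = "hybrid") -> bytes:
    """Build the JSON body for a LightRAG /query request."""
    return json.dumps({"query": question, "mode": mode}).encode()

def ask(question: str) -> str:
    """Send the query and return the answer text (assumes a 'response' key)."""
    req = request.Request(
        LIGHTRAG_URL,
        data=build_query(question),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
# ask("Monthly revenue trend for Novatech Inc., January through September 2025")
```

Because RAG-Anything writes into the same knowledge graph, this is the identical endpoint you'd use for plain-text LightRAG queries — multimodal data needs no special query path.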
That's the kind of query that used to require a human opening the PDF and reading off the chart. Now it's one Claude Code command.
Is RAG-Anything worth the extra setup over plain LightRAG?
Short answer: yes, if your documents contain anything visual. Charts, tables, scanned PDFs, financial reports, research papers with figures — any of that. The ingestion pipeline is heavier, but the query experience is identical, and you get access to a huge category of data that plain LightRAG just ignores.
If you're only working with pure text documents — blog posts, transcripts, plain markdown — you don't need it. LightRAG on its own is faster and lighter for that use case.
The honest read: install it anyway. Most real-world document sets have at least some visual content, and the moment you run into a scanned PDF or a chart-heavy report, you'll want RAG-Anything already configured. The install is a one-time cost. The capability is permanent.
FAQ
What's the difference between LightRAG and RAG-Anything?
LightRAG is a graph RAG system for text documents — it builds a knowledge graph plus a vector database from text. RAG-Anything is a wrapper that adds multimodal support — scanned PDFs, charts, images, tables, and LaTeX equations. RAG-Anything plugs directly into an existing LightRAG instance.
Do I need a GPU to run RAG-Anything?
No, but ingestion is much slower on CPU. MinerU runs locally and does most of the heavy parsing work. If you're ingesting large batches of documents, switch MinerU to GPU by asking Claude Code to install the correct PyTorch version.
Why can't I upload multimodal documents through the LightRAG web UI?
RAG-Anything uses a separate ingestion pathway that requires a Python script. The web UI only handles plain text. The workaround is to wrap the script as a Claude Code skill so you can ingest multimodal documents by telling Claude Code which files to process.
What does MinerU do in the RAG-Anything pipeline?
MinerU is an open-source document parser that runs locally. It identifies the structural components of a document — headers, text blocks, charts, images, equations — and routes each component to the right extractor (PaddleOCR for text, LaTeX models for equations, screenshots for complex visuals).
Which LLM should I use with RAG-Anything?
The install script defaults to GPT-5.4 nano, which keeps costs low. You can swap in any supported model — larger OpenAI models if you want better extraction quality, or a local model via Ollama if you want zero API cost. Keep text-embedding-3-large for embeddings for consistency.
If you want to go deeper into building multimodal RAG systems with Claude Code, join the free Chase AI community for templates, prompts, and live breakdowns. And if you're serious about building with AI, check out the paid community, Chase AI+, for hands-on guidance on how to make money with AI.


