AI Architecture for Decision-Makers

RAG Data Ingestion Pipeline

Most RAG failures don't start at query time. They start during ingestion. If your documents are poorly parsed, badly chunked, or embedded with the wrong model, no amount of prompt engineering will save you.

8 Steps from Document to Vector

Every step in the ingestion pipeline is a decision point. Get one wrong, and retrieval quality degrades silently. Here's what a production pipeline actually looks like.

Step 1 — Source Connectors

Data Connectors & Loaders

PDFs, web pages, databases, APIs, code repos, emails. Each source type needs a dedicated connector that normalises raw content into a unified Document object.

Capture file-level metadata at load time: source path, file type, creation date, owner
Six source categories, each with different extraction challenges
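
As a rough illustration of the unified Document object, here is a minimal sketch of a file-system connector. The names `Document` and `load_local_file` are hypothetical, not taken from any particular framework, and the file's last-modified time stands in for the creation date because true creation time is platform-dependent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Document:
    """Unified container that every connector is assumed to emit."""
    content: bytes                      # raw, unparsed payload
    metadata: dict = field(default_factory=dict)


def load_local_file(path: str, owner: str = "unknown") -> Document:
    """Hypothetical file-system connector: read bytes, record file-level metadata."""
    p = Path(path)
    stat = p.stat()
    return Document(
        content=p.read_bytes(),
        metadata={
            "source_path": str(p.resolve()),
            "file_type": p.suffix.lstrip(".").lower() or "unknown",
            # Last-modified time is used as a stand-in for the creation date.
            "modified_at": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
            "owner": owner,
        },
    )
```

Connectors for other source types (web pages, databases, APIs) would return the same `Document` shape, so every later step can stay source-agnostic.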

Step 2 — Parser

Document Parser

Extracts clean, structured text from raw documents. Handles OCR for scanned content, strips HTML/markdown, normalises encoding. Output: clean plain text.

Documents with <50 words after parsing are usually extraction failures
Scanned PDFs need OCR; layout-heavy PDFs lose structure on extraction
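
A minimal sketch of the HTML branch of a parser, using only the Python standard library. The function name `parse_html` and the returned `suspect` flag are assumptions made for illustration; OCR and PDF layout handling are out of scope here.

```python
import unicodedata
from html.parser import HTMLParser

MIN_WORDS = 50  # threshold taken from the guidance above


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def parse_html(raw: bytes, encoding: str = "utf-8") -> tuple[str, bool]:
    """Return (plain_text, suspect); suspect marks a likely extraction failure."""
    extractor = _TextExtractor()
    extractor.feed(raw.decode(encoding, errors="replace"))
    extractor.close()
    # NFKC folds compatibility characters (e.g. non-breaking spaces).
    plain = unicodedata.normalize("NFKC", " ".join(extractor.parts))
    plain = " ".join(plain.split())  # collapse whitespace runs
    suspect = len(plain.split()) < MIN_WORDS
    return plain, suspect
```

Documents flagged as `suspect` could be routed to a review queue rather than silently passed on to chunking.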

Step 3 — Metadata Extractor

Content Enrichment

Extracts content-level metadata from parsed text: detected language, word count, section headers, named entities, document summary, topic tags. Merges with file-level metadata from the loader.

Add a content_hash for efficient change detection on re-ingestion
Rich metadata enables filtered retrieval at query time
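
The content_hash idea can be sketched as follows, assuming SHA-256 over the parsed text. The helper names `enrich` and `has_changed` are hypothetical, and language detection, entity extraction, and summarisation are omitted because they need external models.

```python
import hashlib
from typing import Optional


def enrich(text: str, file_metadata: dict) -> dict:
    """Merge content-level metadata into the loader's file-level metadata."""
    content_meta = {
        "word_count": len(text.split()),
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    # Content-level keys win if a key appears in both dictionaries.
    return {**file_metadata, **content_meta}


def has_changed(text: str, stored_hash: Optional[str]) -> bool:
    """True if the document is new or its parsed text differs from last run."""
    current = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return stored_hash is None or stored_hash != current
```

On re-ingestion, a document whose hash is unchanged can skip chunking and embedding entirely.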

Step 4 — Preprocessing

Normalisation & Deduplication

Final text normalisation before chunking: whitespace normalisation, deduplication (MinHash LSH for near-duplicates, exact hashing for exact copies), PII detection and redaction.

MinHash LSH catches ~80% of near-duplicate issues that exact hashing misses
PII detection is a compliance requirement, not a nice-to-have
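
To make the two deduplication layers concrete, here is a small standard-library sketch: exact hashing first, then MinHash signatures with LSH banding for near-duplicates. The signature length, band layout, shingle size, and 0.8 threshold are all assumed values, and a production system would more likely use a dedicated library. PII detection is not covered.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (assumed value)
BANDS = 16         # 16 bands x 4 rows = 64 (assumed values)
ROWS = NUM_HASHES // BANDS
THRESHOLD = 0.8    # estimated-Jaccard cut-off for "near-duplicate" (assumed)


def normalise(text: str) -> str:
    """Collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams of the lower-cased text."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text: str) -> tuple[int, ...]:
    """One minimum per salted hash function over the shingle set."""
    sh = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in sh
        ))
    return tuple(sig)


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first document of each exact or near-duplicate group."""
    seen_exact = set()
    buckets = {}   # (band index, band values) -> signatures of kept documents
    kept = []
    for raw in docs:
        text = normalise(raw)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue                      # exact copy
        seen_exact.add(digest)
        sig = minhash(text)
        bands = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # Only documents sharing at least one full band are compared.
        candidates = [other for band in bands for other in buckets.get(band, [])]
        if any(similarity(sig, other) >= THRESHOLD for other in candidates):
            continue                      # near-duplicate
        for band in bands:
            buckets.setdefault(band, []).append(sig)
        kept.append(text)
    return kept
```

The banding step is what keeps this from being an all-pairs comparison: most unrelated documents never share a bucket and are never compared.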

Step 5 — Chunking

The Most Critical Decision

Splits clean text into overlapping chunks. This is the most impactful configuration decision in the entire pipeline: chunk size and strategy directly determine the ceiling on retrieval precision.

Benchmark 3 chunk sizes (256 / 512 / 1024 tokens) against real queries
Three strategies: Fixed-Size, Recursive (default), and Semantic
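
The simplest of the three strategies, fixed-size chunking with overlap, might look like the sketch below. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer), and the function name `chunk_fixed` and its default values are illustrative only.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

For the benchmark suggested above, the same text can be chunked at `size` 256, 512, and 1024 and each variant scored against a fixed set of real queries.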

Step 6 — Embeddings

Vectorisation

Converts each text chunk into a dense vector (typically 768–3072 dimensions). The embedding model used here MUST be the same model used at query time, because vectors produced by different models are not comparable. This is a hard architectural constraint with no workaround.

Lock your embedding model version in code
Changing the model requires re-embedding your ENTIRE corpus
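
One way to lock the model version in code is to stamp every stored vector with the model name and version and to refuse comparisons across versions. The sketch below assumes a hypothetical model name, and `fake_embed` is a hash-based placeholder for a real model call, not an actual embedding.

```python
import hashlib
from dataclasses import dataclass

EMBEDDING_MODEL = "example-embedder"   # hypothetical model name
EMBEDDING_VERSION = "v1"               # hypothetical pinned version
EMBEDDING_DIM = 8                      # tiny, for illustration only


@dataclass(frozen=True)
class EmbeddedChunk:
    text: str
    vector: tuple[float, ...]
    model: str
    version: str


def fake_embed(text: str) -> tuple[float, ...]:
    """Placeholder for a real model call: deterministic values from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:EMBEDDING_DIM])


def embed_chunk(text: str) -> EmbeddedChunk:
    """Embed a chunk and record which model and version produced the vector."""
    return EmbeddedChunk(text, fake_embed(text), EMBEDDING_MODEL, EMBEDDING_VERSION)


def check_compatible(chunk: EmbeddedChunk, query_model: str, query_version: str) -> None:
    """Raise if a query would be compared against vectors from another model."""
    if (chunk.model, chunk.version) != (query_model, query_version):
        raise ValueError(
            f"stored vectors use {chunk.model}/{chunk.version}, "
            f"query uses {query_model}/{query_version}: re-embed the corpus"
        )
```

Storing the model identity next to each vector turns a silent retrieval-quality bug into an explicit error.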

Step 7 — Quality Gate

Mandatory Validation

Every embedding MUST pass through this gate. It checks for zero vectors, NaN values, dimension mismatches, and near-zero-norm vectors, and routes any embedding that fails to a dead-letter queue for investigation.

A rejection rate >2% signals upstream parser or cleaner failures
This is not optional — it's the only path to storage
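
The gate's checks can be sketched in a few lines of standard-library Python. The expected dimension and the near-zero threshold are assumed values, the dead-letter queue is modelled as a plain list, and the infinity check is an addition beyond the checks named above.

```python
import math
from typing import Optional

EXPECTED_DIM = 4           # assumed; use your embedding model's dimension
MIN_NORM = 1e-6            # assumed threshold for "near-zero norm"
MAX_REJECTION_RATE = 0.02  # the 2% alert level mentioned above


def validate(vector: list[float], expected_dim: int = EXPECTED_DIM) -> Optional[str]:
    """Return a rejection reason, or None if the vector is acceptable."""
    if len(vector) != expected_dim:
        return "dimension_mismatch"
    if any(math.isnan(x) for x in vector):
        return "nan_value"
    if any(math.isinf(x) for x in vector):
        return "infinite_value"   # extra check, not in the list above
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return "zero_vector"
    if norm < MIN_NORM:
        return "near_zero_norm"
    return None


def quality_gate(vectors: list[list[float]]):
    """Split vectors into accepted and dead-lettered; report the rejection rate."""
    accepted, dead_letter = [], []
    for index, vector in enumerate(vectors):
        reason = validate(vector)
        if reason is None:
            accepted.append(vector)
        else:
            dead_letter.append((index, reason))
    rate = len(dead_letter) / len(vectors) if vectors else 0.0
    return accepted, dead_letter, rate
```

A caller would compare the returned rate against `MAX_REJECTION_RATE` and raise an alert when a batch exceeds it.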

Step 8 — Vector Store

Index & Store

Stores validated vectors alongside their metadata and source text in an ANN-optimised index (HNSW, IVF). The offline pipeline ends here. Online retrieval begins here.

Always store the original chunk text alongside the vector
Schedule index re-optimisation after every 10–20% corpus growth
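
The two storage practices above can be illustrated with a toy in-memory store. This is a brute-force cosine search, not an ANN index such as HNSW or IVF, and the class name `InMemoryVectorStore` and its methods are hypothetical. It assumes every vector has already passed the quality gate, so norms are non-zero.

```python
import math


class InMemoryVectorStore:
    """Toy store: keeps vector, original chunk text, and metadata together."""

    def __init__(self, reoptimise_growth: float = 0.10):
        self.records = []                  # (vector, text, metadata) tuples
        self.reoptimise_growth = reoptimise_growth
        self.size_at_last_optimise = 0

    def add(self, vector: list[float], text: str, metadata: dict) -> None:
        self.records.append((vector, text, metadata))

    def needs_reoptimisation(self) -> bool:
        """True once the corpus has grown by the configured fraction."""
        if self.size_at_last_optimise == 0:
            return len(self.records) > 0
        added = len(self.records) - self.size_at_last_optimise
        return added / self.size_at_last_optimise >= self.reoptimise_growth

    def mark_optimised(self) -> None:
        """Call after the (real) index has been rebuilt or compacted."""
        self.size_at_last_optimise = len(self.records)

    def search(self, query: list[float], k: int = 3):
        """Return the top-k (score, text, metadata) by cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b)

        scored = [(cosine(query, v), text, meta) for v, text, meta in self.records]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]
```

Because the chunk text is stored with the vector, a search result can be passed straight to the prompt without a second lookup against the source documents.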

Reality check: Your RAG success doesn't start at query time. It starts during ingestion. A retriever can only find what was correctly parsed, cleanly chunked, and faithfully embedded.

The interactive visualization below maps the complete production ingestion pipeline — from six source types through eight processing steps to the vector database.
