AI Architecture for Decision-Makers

RAG Data Ingestion Pipeline

Most RAG failures don't start at query time. They start during ingestion. If your documents are poorly parsed, badly chunked, or embedded with the wrong model, no amount of prompt engineering will save you.

8 Steps from Document to Vector

Every step in the ingestion pipeline is a decision point. Get one wrong, and retrieval quality degrades silently. Here's what a production pipeline actually looks like.

Step 1 — Source Connectors

Data Connectors & Loaders

PDFs, web pages, databases, APIs, code repos, emails. Each source type needs a dedicated connector that normalises raw content into a unified Document object.

  • Capture file-level metadata at load time: source path, file type, creation date, owner
  • Six source categories, each with different extraction challenges
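
As a rough illustration of the "unified Document object" idea, here is a minimal sketch of one connector. The `Document` class, the `load_local_file` function, and the metadata key names are all hypothetical; real connectors for PDFs, databases, or APIs would look quite different. Note that the sketch records the last-modified time, because true creation time is not available portably.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Document:
    """Unified container that every connector is assumed to return."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_local_file(path: str, owner: str = "unknown") -> Document:
    """Hypothetical connector for plain-text files on local disk."""
    p = Path(path)
    stat = p.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return Document(
        text=p.read_text(encoding="utf-8", errors="replace"),
        metadata={
            "source_path": str(p.resolve()),
            "file_type": p.suffix.lstrip(".").lower() or "unknown",
            # Last-modified time; creation time is platform-dependent.
            "modified_at": modified.isoformat(),
            "owner": owner,
        },
    )
```

Every other connector would return the same `Document` shape, so the later steps never need to know where the content came from.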

Step 2 — Parser

Document Parser

Extracts clean, structured text from raw documents. Handles OCR for scanned content, strips HTML/markdown, normalises encoding. Output: clean plain text.

  • Documents with <50 words after parsing are usually extraction failures
  • Scanned PDFs need OCR; layout-heavy PDFs lose structure on extraction
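
The HTML branch of a parser, plus the 50-word sanity check, could be sketched with the standard library alone. This is only an illustration: `parse_html` and `MIN_WORDS` are names invented here, OCR and PDF layout handling are out of scope, and joining text nodes with spaces can split a word that spans inline tags.

```python
import unicodedata
from html.parser import HTMLParser

MIN_WORDS = 50  # threshold taken from the rule of thumb above


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def parse_html(raw: str):
    """Return (clean_text, looks_like_extraction_failure)."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    # NFKC folds compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", " ".join(extractor.parts))
    text = " ".join(text.split())  # collapse runs of whitespace
    return text, len(text.split()) < MIN_WORDS
```

Documents flagged by the second return value are candidates for manual review or a different extraction path rather than silent ingestion.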

Step 3 — Metadata Extractor

Content Enrichment

Extracts content-level metadata from parsed text: detected language, word count, section headers, named entities, document summary, topic tags. Merges with file-level metadata from the loader.

  • Add a content_hash for efficient change detection on re-ingestion
  • Rich metadata enables filtered retrieval at query time
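
A minimal sketch of the `content_hash` idea and the metadata merge follows. The function names and the Markdown-style header pattern are assumptions made for illustration; language detection, entity extraction, and summaries are left out because they need external models.

```python
import hashlib
import re
from typing import Optional


def enrich(text: str, file_metadata: dict) -> dict:
    """Merge content-level metadata into the loader's file-level metadata."""
    # Hash a whitespace-normalised form so cosmetic reflows are not changes.
    canonical = " ".join(text.split())
    content_metadata = {
        "content_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "word_count": len(canonical.split()),
        # Assumption: section headers look like Markdown "# Title" lines.
        "section_headers": re.findall(r"^#{1,6}\s+(.+)$", text, flags=re.MULTILINE),
    }
    return {**file_metadata, **content_metadata}


def needs_reingestion(new_metadata: dict, stored_hash: Optional[str]) -> bool:
    """True when the document is new or its content changed since last run."""
    return new_metadata["content_hash"] != stored_hash
```

On re-ingestion, comparing the stored hash with the fresh one lets the pipeline skip unchanged documents before the expensive chunking and embedding steps.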

Step 4 — Preprocessing

Normalisation & Deduplication

Final text normalisation before chunking: whitespace normalisation, deduplication (MinHash LSH for near-duplicates, exact hashing for exact copies), PII detection and redaction.

  • MinHash LSH catches ~80% of near-duplicate issues that exact hashing misses
  • PII detection is a compliance requirement, not a nice-to-have
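
To make the two deduplication layers concrete, here is a small pure-Python sketch: SHA-256 for exact copies, then MinHash signatures with LSH banding for near-duplicates. The parameters (`NUM_PERM`, `BANDS`, `THRESHOLD`, 3-word shingles) are illustrative assumptions rather than recommended settings, and a production pipeline would more likely rely on a dedicated library. PII redaction is not covered here.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64     # signature length (illustrative value)
BANDS = 16        # 16 bands of 4 rows; tune together with THRESHOLD
THRESHOLD = 0.8   # estimated Jaccard at which docs count as near-duplicates


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    return tuple(
        min(hashlib.blake2b(s, digest_size=8,
                            salt=seed.to_bytes(8, "little")).digest()
            for s in encoded)
        for seed in range(NUM_PERM)
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


def find_duplicates(docs):
    """Return ids to drop, keeping the first document of each duplicate group."""
    rows = NUM_PERM // BANDS
    seen_exact, dropped = set(), set()
    signatures = {}
    buckets = defaultdict(list)          # (band index, band slice) -> doc ids
    for doc_id, text in docs.items():
        exact = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if exact in seen_exact:          # byte-identical copy
            dropped.add(doc_id)
            continue
        seen_exact.add(exact)
        sig = minhash(text)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        # Only documents sharing at least one band are compared in full.
        candidates = {other for key in keys for other in buckets[key]}
        if any(estimated_jaccard(sig, signatures[c]) >= THRESHOLD
               for c in candidates):
            dropped.add(doc_id)          # near-duplicate of a kept document
            continue
        signatures[doc_id] = sig
        for key in keys:
            buckets[key].append(doc_id)
    return dropped
```

The banding step is what keeps this cheaper than comparing every pair: a new document is only scored against earlier documents that landed in the same bucket for at least one band.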

Step 5 — Chunking

The Most Critical Decision

Splits clean text into overlapping chunks. This is the most impactful configuration decision in the entire pipeline: chunk size and strategy set the ceiling on retrieval precision.

  • Benchmark 3 chunk sizes (256 / 512 / 1024 tokens) against real queries
  • Three strategies: Fixed-Size, Recursive (default), and Semantic
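
The simplest of the three strategies, fixed-size chunking with overlap, can be sketched in a few lines. The overlap values and the whitespace "tokenizer" are assumptions made for illustration; a real benchmark would use the embedding model's own tokenizer and score each size against actual queries.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Fixed-size chunking: windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # this window already reached the end
            break
    return chunks


def chunk_sizes_report(text, sizes=(256, 512, 1024), overlap_ratio=0.125):
    """Chunk counts per candidate size, a starting point for a benchmark."""
    tokens = text.split()   # whitespace split stands in for a real tokenizer
    return {
        size: len(chunk_tokens(tokens, size, int(size * overlap_ratio)))
        for size in sizes
    }
```

For example, `chunk_tokens(list(range(10)), size=4, overlap=1)` yields three windows, each sharing one token with its neighbour. Chunk counts alone say nothing about quality; they only show how each candidate size carves up the corpus before retrieval metrics are measured.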

Step 6 — Embeddings

Vectorisation

Converts each text chunk into a dense vector (typically 768–3072 dimensions). The embedding model used here MUST be the same one used to embed queries at retrieval time. This is a hard architectural constraint with no workaround.

  • Lock your embedding model version in code
  • Changing the model requires re-embedding your ENTIRE corpus
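
One way to "lock the model version in code" is to store a small spec next to the index and compare it at query time. Everything below is hypothetical, including the `EmbeddingSpec` class and the placeholder model name and version; it sketches the guard and does not call any real embedding API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identity of the embedding model an index was built with."""
    model_name: str
    model_version: str
    dimensions: int


# Placeholder values; replace with the model you actually deploy.
INGESTION_SPEC = EmbeddingSpec("example-embedder", "2024-01-01", 768)


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Fail fast instead of silently searching with mismatched vectors."""
    if index_spec != query_spec:
        raise RuntimeError(
            f"Embedding mismatch: index built with {index_spec}, "
            f"query uses {query_spec}. Re-embed the corpus or revert the model."
        )
```

Calling `assert_compatible` when the retrieval service starts turns a silent relevance regression into a loud startup error.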

Step 7 — Quality Gate

Mandatory Validation

Every embedding MUST pass through this gate. It checks for zero vectors, NaN values, dimension mismatches, and near-zero-norm vectors, and routes rejected embeddings to a dead-letter queue for investigation.

  • A rejection rate >2% signals upstream parser or preprocessing failures
  • This is not optional — it's the only path to storage
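
The checks listed above fit in one small function. This is a sketch under stated assumptions: `EXPECTED_DIM` and `MIN_NORM` are placeholder values, the dead-letter queue is a plain list, and the infinity check is an addition beyond the four checks named in the text.

```python
import math

EXPECTED_DIM = 768   # assumption: must equal your model's output size
MIN_NORM = 1e-6      # assumption: cut-off for "near-zero" vectors
ALERT_RATE = 0.02    # the 2% rejection threshold mentioned above


def validate(vector, expected_dim=EXPECTED_DIM):
    """Return None if the vector is acceptable, else a rejection reason."""
    if len(vector) != expected_dim:
        return "dimension_mismatch"
    # isfinite rejects NaN and, as an extra precaution, infinities.
    if not all(math.isfinite(x) for x in vector):
        return "non_finite_value"
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return "zero_vector"
    if norm < MIN_NORM:
        return "near_zero_norm"
    return None


def quality_gate(batch, expected_dim=EXPECTED_DIM):
    """Split (chunk_id, vector) pairs into accepted and dead-letter lists."""
    accepted, dead_letter = [], []
    for chunk_id, vector in batch:
        reason = validate(vector, expected_dim)
        if reason is None:
            accepted.append((chunk_id, vector))
        else:
            dead_letter.append((chunk_id, reason))
    rejection_rate = len(dead_letter) / len(batch) if batch else 0.0
    return accepted, dead_letter, rejection_rate
```

A caller would compare the returned rejection rate with `ALERT_RATE` and raise an alert when it is exceeded, while writing only the accepted list to storage.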

Step 8 — Vector Store

Index & Store

Stores validated vectors alongside their metadata and source text in an ANN-optimised index (HNSW, IVF). The offline pipeline ends here. Online retrieval begins here.

  • Always store the original chunk text alongside the vector
  • Schedule index re-optimisation after every 10–20% corpus growth
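
The two storage rules above, keeping chunk text beside the vector and re-optimising after corpus growth, can be sketched with a toy in-memory store. `VectorRecord`, `InMemoryStore`, and the 10% default are illustrative assumptions; no ANN index is built here, and a real vector database exposes its own API for both concerns.

```python
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    chunk_id: str
    vector: list
    text: str                                   # original chunk text, kept with the vector
    metadata: dict = field(default_factory=dict)


class InMemoryStore:
    """Toy stand-in for a vector database; tracks growth since last optimise."""

    def __init__(self, reoptimise_growth: float = 0.10):
        self.records = {}
        self.reoptimise_growth = reoptimise_growth
        self._size_at_last_optimise = 0

    def upsert(self, record: VectorRecord) -> None:
        self.records[record.chunk_id] = record

    def needs_reoptimise(self) -> bool:
        baseline = self._size_at_last_optimise
        if baseline == 0:
            return len(self.records) > 0
        growth = (len(self.records) - baseline) / baseline
        return growth >= self.reoptimise_growth

    def mark_optimised(self) -> None:
        """Call after the (external) index rebuild has finished."""
        self._size_at_last_optimise = len(self.records)
```

Storing `text` in the record means retrieval can hand chunks straight to the prompt without a second lookup against the source system.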

Reality check: Your RAG success doesn't start at query time. It starts during ingestion. A retriever can only find what was correctly parsed, cleanly chunked, and faithfully embedded.

The interactive visualization below maps the complete production ingestion pipeline — from six source types through eight processing steps to the vector database.


