RAG Pipeline
A Retrieval-Augmented Generation (RAG) pipeline is a technique for interfacing with LLMs that helps to:
- Add Context
- Improve Accuracy
- Check / Filter Hallucinations
- Preserve Privacy
- Add Value
In this age where LLMs are becoming ubiquitous, you will very likely need to build one of these sooner or later.
Basic Flow
```mermaid
flowchart LR
    User((User))
    Vector[(Vector DB)]
    LLM{{LLM}}
    Embedding[Embedding Processor]
    Embedding -->|A| RAG -->|B| Vector
    Vector -->|2 & 4| RAG -->|5| User
    User -->|1| RAG -->|3| LLM
```
There are two distinct flows that need to be performed. The first is the embedding flow, which reads and stores data not directly accessible to the LLM: it vectorizes the data and hands it to the RAG service for storage in the Vector DB. Following the Single Responsibility Principle, the RAG service should be the only system with direct access to the Vector DB.
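A minimal sketch of the embedding flow in Python, assuming hypothetical `embedder` and `vector_db` clients (placeholders for whatever embedding model and vector store you use, not a specific library):

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def embed_and_store(doc: Document, embedder, vector_db) -> None:
    """Embedding flow: vectorize raw data, then store it in the Vector DB.

    embedder and vector_db are placeholder clients, not a specific library.
    """
    # Vectorize the document text (any model mapping text -> list[float]).
    vector = embedder.embed(doc.text)
    # Only the RAG service writes to the Vector DB (Single Responsibility).
    vector_db.upsert(id=doc.doc_id, vector=vector, metadata={"text": doc.text})
```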
Separately, the query flow starts with the user asking a question. The RAG system may (or may not) pull data from the Vector DB to augment the query with context the LLM doesn't have. Take care that no intellectual property leaks to the LLM: queries should be treated as public and may be used in future LLM training. After the LLM response is returned, the RAG service can additionally augment the response with embedded context from the Vector DB before handing it back to the user.
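The query flow, sketched under the same assumptions (`redact_sensitive` is a hypothetical filter standing in for whatever IP-scrubbing policy you apply before anything leaves your boundary):

```python
def answer_query(question: str, embedder, vector_db, llm, top_k: int = 3) -> str:
    """Query flow: retrieve context, scrub it, ask the LLM, then re-augment."""
    # Steps 1-2: vectorize the question and pull nearby context from the Vector DB.
    query_vector = embedder.embed(question)
    hits = vector_db.search(vector=query_vector, top_k=top_k)
    context = "\n".join(hit["text"] for hit in hits)

    # Anything sent to the LLM should be treated as public, so scrub IP first.
    safe_context = redact_sensitive(context)  # hypothetical filter

    # Step 3: send the augmented prompt to the LLM.
    prompt = f"Context:\n{safe_context}\n\nQuestion: {question}"
    response = llm.complete(prompt)

    # Step 5: optionally augment the response with the stored context.
    return f"{response}\n\nSources:\n{context}"
```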
RAG Service
The RAG service should be a simple HTTP service with two types of endpoints: a CRUD API for the embedding flow and a streaming API for the query flow.
Embedding API
The Embedding API should expose standard CRUD operations; you would be well served using an OpenAPI server. If at all possible, we recommend accepting the raw data at the API and vectorizing it inside the service before storing it in the Vector DB.
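A sketch of the create and delete endpoints using FastAPI (an assumption on our part; any framework that serves an OpenAPI spec works, and `embedder`/`vector_db` are the placeholder clients from the earlier sketch):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # FastAPI generates the OpenAPI spec automatically

class DocumentIn(BaseModel):
    doc_id: str
    text: str  # raw data; vectorization happens inside the service

@app.post("/documents")
def create_document(doc: DocumentIn):
    # Take the raw text and vectorize it here, so callers never
    # handle embeddings or touch the Vector DB directly.
    vector = embedder.embed(doc.text)
    vector_db.upsert(id=doc.doc_id, vector=vector, metadata={"text": doc.text})
    return {"doc_id": doc.doc_id, "status": "stored"}

@app.delete("/documents/{doc_id}")
def delete_document(doc_id: str):
    vector_db.delete(id=doc_id)
    return {"doc_id": doc_id, "status": "deleted"}
```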
Query API
The Query API should "stream" the response. There are a few ways to do this, and the choice affects which deployment options you have. One of the easier ways, from the server's perspective, is WebSockets; however, not all load balancers or WAFs support them. Another option is a cursor keep-alive, though that comes at the disadvantage of requiring a separate DB to keep track of the cursors.
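A WebSocket variant of the streaming endpoint, again sketched with FastAPI (`stream_llm_response` is a hypothetical async generator that yields chunks as the LLM produces them):

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/query")
async def query_ws(websocket: WebSocket):
    await websocket.accept()
    question = await websocket.receive_text()
    # Relay chunks to the client as they arrive instead of
    # buffering the full LLM response.
    async for chunk in stream_llm_response(question):
        await websocket.send_text(chunk)
    await websocket.close()
```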
Deployment Options
Your simplest option is a serverless function or container. The platform takes care of the API endpoint, and you largely just have to make sure that your code completes before the service times out.
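As an illustration, an AWS Lambda-style handler (a sketch; `answer_query` and the clients are the hypothetical pieces from the earlier examples):

```python
def handler(event, context):
    """Serverless entry point for the query flow."""
    question = event["question"]
    # All retrieval and LLM work must finish within the function's
    # configured timeout; on Lambda, context.get_remaining_time_in_millis()
    # lets you bail out early rather than being killed mid-response.
    answer = answer_query(question, embedder, vector_db, llm)
    return {"statusCode": 200, "body": answer}
```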
A Kubernetes Deployment is another viable option. Expose the endpoint via an Ingress through a high-performance Ingress controller and you should be all set.