RAG Pipeline – Integration Guide

Overview

Retrieval-Augmented Generation (RAG) lets the AI answer questions grounded in your own documents. Upload text, PDFs, or web content — ChimerAI chunks them, embeds them into a FAISS vector store, and retrieves the most relevant snippets before generating a response.

Quick Start

# Adds the RAG module to an existing ChimerAI project
chimerai add rag

⚠️ The RAG engine runs in the Python AI service (FastAPI, port 8002), not in Next.js. The Next.js frontend proxies requests to the Python service via HTTP.

# Start the Python AI service:
cd services/ai
pip install -r requirements.txt
python -m uvicorn main:app --reload --port 8002

💡 Use python -m uvicorn instead of bare uvicorn to ensure the correct Python version is used.

Minimum .env for local use (place in project root or services/ai/):

# OpenAI (cloud embeddings + chat)
OPENAI_API_KEY=sk-...
DEFAULT_CHAT_MODEL=gpt-3.5-turbo
DEFAULT_EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_DIMENSION=1536

# --- OR: Ollama (local, free) ---
OLLAMA_BASE_URL=http://localhost:11434
DEFAULT_EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_DIMENSION=768
DEFAULT_CHAT_MODEL=llama3.2

💡 The .env file can live in the project root — the AI service automatically falls back to ../../.env if no local .env exists in services/ai/.
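
The lookup order described above can be sketched as follows. This is an illustration only: the function name `find_env_file` is hypothetical, and the real service may resolve the file differently (for example through a dotenv library).

```python
from pathlib import Path
from typing import Optional


def find_env_file(service_dir: Path) -> Optional[Path]:
    """Return the first .env that exists: services/ai/.env, then ../../.env."""
    candidates = [
        service_dir / ".env",                # local file in services/ai/
        service_dir.parent.parent / ".env",  # project root, i.e. ../../.env
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # neither location has a .env
```

With this order, a `.env` in services/ai/ takes precedence whenever both files exist.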

Architecture

Document Upload
  → Chunker (RecursiveCharacterTextSplitter, 1000 chars / 200 overlap)
  → Embedder (OpenAI / Ollama / any LiteLLM-compatible model)
  → FAISS Vector Store (IndexFlatL2, persisted to disk)

User Query
  → Embed query with same model
  → Nearest-neighbour search (L2 distance → similarity score)
  → Top-k chunks injected into LLM system prompt
  → Streamed / non-streamed answer
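
The query path above can be sketched in plain Python. This is a toy illustration, not the service's code: it replaces FAISS with an exhaustive scan over squared L2 distances (the quantity a flat L2 index reports), and the `1 / (1 + distance)` similarity mapping, the prompt wording, and all function names are assumptions.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k nearest chunks."""
    scored = [(i, squared_l2(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    # Assumed mapping from distance to a similarity in (0, 1].
    return [(i, 1.0 / (1.0 + dist)) for i, dist in scored[:k]]


def build_system_prompt(chunks: Sequence[str]) -> str:
    """Inject the retrieved chunks into a system prompt (wording is hypothetical)."""
    context = "\n\n".join(f"[{n}] {text}" for n, text in enumerate(chunks, start=1))
    return "Answer using only the context below.\n\n" + context


# Toy 2-dimensional "embeddings" standing in for real model output.
chunks = ["FastAPI is a web framework.", "Python is a language.", "Cats purr."]
vectors = [[1.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
hits = top_k([0.9, 0.0], vectors, k=2)
prompt = build_system_prompt([chunks[i] for i, _ in hits])
```

The query must be embedded with the same model as the chunks; otherwise the distances compare vectors from different spaces.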

Component Overview

Layer	Technology
Vector Store	FAISS (IndexFlatL2, CPU)
Embeddings	OpenAI `text-embedding-ada-002` · Ollama `nomic-embed-text` · any LiteLLM model
LLM	Any model via LiteLLM (OpenAI, Anthropic, Ollama, Azure, …)
Text Splitting	LangChain `RecursiveCharacterTextSplitter`
API	FastAPI (Python, port 8002)
Frontend Proxy	Next.js API routes (`/api/rag/*`)

Environment Variables

All variables go in .env (project root or services/ai/):

Variable	Default	Description
`DEFAULT_EMBEDDING_MODEL`	`text-embedding-ada-002`	Model for generating embeddings
`EMBEDDING_DIMENSION`	`1536`	Vector dimension — must match the embedding model
`DEFAULT_CHAT_MODEL`	`gpt-3.5-turbo`	LLM used for RAG chat responses
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama server URL (keep it set to use local models)
`OPENAI_API_KEY`	(empty)	OpenAI API key (required for OpenAI models)

Auto-detection of embedding dimension

At startup the AI service sends a test embedding to the configured model and automatically detects the correct vector dimension — no manual EMBEDDING_DIMENSION config needed in most cases. The .env value is used only as a fallback when the test fails (e.g. Ollama not yet running).
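
A minimal sketch of this probe-then-fallback behaviour, assuming the embedder is available as a plain callable. The name `detect_embedding_dimension` and the probe text are hypothetical; the actual startup code is not shown in this guide.

```python
import os
from typing import Callable, List


def detect_embedding_dimension(
    embed: Callable[[str], List[float]],
    default: int = 1536,
) -> int:
    """Probe the model once; fall back to EMBEDDING_DIMENSION if the probe fails."""
    try:
        return len(embed("dimension probe"))
    except Exception:
        # Probe failed (e.g. Ollama not yet running): use the .env value,
        # or the documented default when the variable is unset.
        return int(os.environ.get("EMBEDDING_DIMENSION", default))
```

Because the fallback is silent, a wrong dimension only surfaces later, which is why the probe target should be reachable before the service starts.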

Model name auto-prefix

If OLLAMA_BASE_URL is set and your model name has no provider prefix, the service automatically prepends ollama/:

# These are equivalent:
DEFAULT_EMBEDDING_MODEL=nomic-embed-text
DEFAULT_EMBEDDING_MODEL=ollama/nomic-embed-text
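
The prefix rule can be sketched as below. It assumes a provider prefix is recognised by the presence of a `/` in the model name; that check and the function name `normalize_model_name` are assumptions, not the service's confirmed logic.

```python
from typing import Optional


def normalize_model_name(model: str, ollama_base_url: Optional[str]) -> str:
    """Prepend 'ollama/' when Ollama is configured and no provider prefix is present."""
    if ollama_base_url and "/" not in model:
        return f"ollama/{model}"
    return model  # already prefixed, or Ollama is not configured
```

Under this rule, an unprefixed OpenAI model name would also receive the `ollama/` prefix while OLLAMA_BASE_URL is set, so unset the variable or add an explicit prefix when mixing providers.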

Using Ollama (local, free)

Install Ollama and pull the models:

ollama pull nomic-embed-text   # embeddings (768-dim)
ollama pull llama3.2           # chat

Set in .env:

OLLAMA_BASE_URL=http://localhost:11434
DEFAULT_EMBEDDING_MODEL=nomic-embed-text
DEFAULT_CHAT_MODEL=llama3.2
# EMBEDDING_DIMENSION is auto-detected from the model

Start Ollama first, then the AI service — dimension is detected at startup:

# Terminal 1 — start Ollama
ollama serve

# Terminal 2 — start AI service
cd services/ai
python -m uvicorn main:app --reload --port 8002

⚠️ Start Ollama before the AI service. Dimension auto-detection sends a test embedding at startup. If Ollama is not yet running, the test fails silently and the service falls back to EMBEDDING_DIMENSION from .env (default: 1536), which does not match a 768-dim model such as nomic-embed-text and causes a mismatch when you later index documents. To recover, restart the AI service once Ollama is running: the mismatch is detected and the index is rebuilt automatically.

💡 If you switch embedding models (e.g. from OpenAI to Ollama), the FAISS index is automatically rebuilt on the next startup, so no manual file deletion is required. Documents indexed with the old model are lost in the rebuild and must be re-indexed.

Python AI Service REST API

Base URL: http://localhost:8002

Add documents

POST /api/rag/documents
{
  "documents": ["FastAPI is a modern Python web framework.", "…"],
  "metadatas": [{"source": "docs", "page": 1}]
}

// Response:
{
  "status": "success",
  "added": 4,
  "ids": [0, 1, 2, 3],
  "total_vectors": 4
}

Long documents (>1000 chars) are automatically split into overlapping chunks.
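
To illustrate what overlapping chunks look like, here is a simplified fixed-window splitter using the 1000 / 200 values from the Architecture section. It is not LangChain's `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences; the function name is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Cut text into windows of chunk_size chars, each sharing `overlap` chars with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short documents are stored as a single chunk
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A 2500-character document yields three chunks under these settings, which is why the `added` count in the response can exceed the number of documents submitted.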

Semantic search

POST /api/rag/search
{ "query": "What is FastAPI?", "k": 5 }

// Response:
{
  "results": [
    { "id": 0, "text": "FastAPI is…", "score": 0.92, "rank": 1, "metadata": {} }
  ]
}

RAG chat (retrieve + generate)

POST /api/rag/chat
{
  "query": "Tell me about FastAPI",
  "k": 3,
  "temperature": 0.7,
  "model": "gpt-3.5-turbo"
}

// Response:
{
  "choices": [{ "message": { "role": "assistant", "content": "FastAPI is…" } }],
  "rag_metadata": {
    "retrieved_documents": 3,
    "documents": [{ "text": "…", "score": 0.91, "metadata": {} }]
  }
}
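
A small Python client for this endpoint might look like the sketch below. It uses only the standard library and the request and response shapes shown above; the function names are hypothetical, and error handling is omitted.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple

BASE_URL = "http://localhost:8002"


def build_chat_request(query: str, k: int = 3) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/rag/chat."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/rag/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(response: Dict[str, Any]) -> Tuple[str, List[Dict[str, Any]]]:
    """Pull the answer text and the retrieved source chunks out of a chat response."""
    answer = response["choices"][0]["message"]["content"]
    sources = response.get("rag_metadata", {}).get("documents", [])
    return answer, sources


def ask(query: str, k: int = 3) -> Tuple[str, List[Dict[str, Any]]]:
    """Send the request to a running AI service and return (answer, sources)."""
    with urllib.request.urlopen(build_chat_request(query, k)) as resp:
        return extract_answer(json.load(resp))
```

Calling `ask("Tell me about FastAPI")` requires the AI service to be running on port 8002 with at least one document indexed.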

Other endpoints

Endpoint	Method	Description
`/api/rag/stats`	`GET`	Total vectors, dimension, index type
`/api/rag/delete`	`DELETE`	Delete documents by ID: `{ "ids": [0, 1] }`
`/api/rag/clear`	`DELETE`	Clear the entire vector store

Next.js Proxy Routes

The generated project includes proxy routes so the frontend can call RAG without exposing the Python service directly:

Route	Proxies to
`POST /api/rag`	`/api/rag/documents`
`POST /api/rag/query`	`/api/rag/search`
`GET /api/rag/stats`	`/api/rag/stats`
`DELETE /api/rag/clear`	`/api/rag/clear`
`DELETE /api/rag/delete`	`/api/rag/delete`

Usage from TypeScript

// Add documents
await fetch('/api/rag', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    documents: ['TypeScript adds static typing to JavaScript.'],
    metadatas: [{ source: 'blog' }],
  }),
});

// Semantic search (this proxy route forwards to /api/rag/search)
const res = await fetch('/api/rag/query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'What is TypeScript?', k: 3 }),
});
const data = await res.json();
// The search response carries a `results` array, not chat `choices`
for (const hit of data.results) {
  console.log(hit.rank, hit.score, hit.text);
}

FAISS Index Storage

The vector index is persisted to disk automatically:

File	Description
`services/ai/data/faiss_index`	Binary FAISS index
`services/ai/data/faiss_index.metadata.pkl`	Chunk texts + metadata

To reset the index, either call DELETE /api/rag/clear or delete both files. Switching to an embedding model with a different dimension will automatically rebuild the index on the next startup.
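
Deleting both files by hand can be scripted as in the sketch below. The file names come from the table above; the function name is hypothetical, and stopping the AI service before running it is an assumption rather than a documented requirement.

```python
from pathlib import Path
from typing import List

# File names taken from the storage table; both must go for a clean reset.
INDEX_FILES = ("faiss_index", "faiss_index.metadata.pkl")


def reset_index(data_dir: Path) -> List[str]:
    """Delete the persisted index files in data_dir and return the names removed."""
    removed = []
    for name in INDEX_FILES:
        path = data_dir / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

For a default project layout this would be called as `reset_index(Path("services/ai/data"))`. Removing only one of the two files would leave vectors and chunk metadata out of sync, so the sketch always handles them together.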

cURL Examples

# Add documents
curl -X POST http://localhost:8002/api/rag/documents \
  -H "Content-Type: application/json" \
  -d '{"documents": ["Document 1 text", "Document 2 text"]}'

# Search
curl -X POST http://localhost:8002/api/rag/search \
  -H "Content-Type: application/json" \
  -d '{"query": "search query", "k": 5}'

# RAG chat
curl -X POST http://localhost:8002/api/rag/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "Your question here", "k": 3}'

# Stats
curl http://localhost:8002/api/rag/stats

# Delete by IDs
curl -X DELETE http://localhost:8002/api/rag/delete \
  -H "Content-Type: application/json" \
  -d '{"ids": [0, 1, 2]}'

# Clear all
curl -X DELETE http://localhost:8002/api/rag/clear

Troubleshooting

`FAISS is not available`

pip install faiss-cpu numpy

Dimension mismatch after switching models

No manual cleanup is needed: the service detects the mismatch and rebuilds the index on startup. The rebuild discards all previously indexed documents, so re-index them after the restart.

Ollama connection refused

Make sure Ollama is running and OLLAMA_BASE_URL is correct:

ollama serve          # or: ollama run llama3.2
curl http://localhost:11434/api/tags

`.env` not found

The AI service looks for .env in services/ai/ first, then falls back to the project root (../../.env). Either location works.

`uvicorn: command not found`

Use python -m uvicorn instead of bare uvicorn to avoid PATH / Python version issues.