Document Processing Tool

ChimerAI's Document Processing tool extracts text and structured data from PDF, Word, Excel, PowerPoint, and plain text files — ready for AI ingestion or RAG pipelines.

What you get

PDF extraction — Text + table data via pdfplumber
Word/DOCX — Full paragraph + table extraction via python-docx
Excel/XLSX — Sheet data as structured JSON via openpyxl
PowerPoint/PPTX — Slide text extraction via python-pptx
Plain text / CSV / Markdown — Direct read
Chunking — Optional split into overlapping chunks for RAG
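
The formats listed above could be routed to their extractors by file extension. The sketch below is one possible shape, not the scaffolded implementation: the names `EXTRACTORS`, `register`, `extract`, and `read_plain` are hypothetical, and only the "direct read" formats are wired in so the example runs on the standard library alone.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type alias: an extractor takes a file path and returns text.
Extractor = Callable[[str], str]


def read_plain(path: str) -> str:
    """Direct read for plain text, CSV, and Markdown files."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a lowercase extension to its extractor.
EXTRACTORS: Dict[str, Extractor] = {
    ".txt": read_plain,
    ".csv": read_plain,
    ".md": read_plain,
}


def register(extension: str, extractor: Extractor) -> None:
    """Add an extractor for another format, e.g. register(".pdf", extract_pdf)."""
    EXTRACTORS[extension.lower()] = extractor


def extract(path: str) -> str:
    """Route a file to its extractor based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

Binary formats would be added with calls such as `register(".pdf", extract_pdf)`, assuming an `extract_pdf(path)` function like the one under Python implementation is in scope.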

Quick setup

npx chimerai add ai-tools --only documents

Scaffolds:

app/api/tools/documents/route.ts     ← Upload + extract endpoint
services/ai/tools/document_tools.py  ← Python extractor

Usage

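// fileInput is assumed to be an <input type="file"> element on the page
const formData = new FormData();
formData.append('file', fileInput.files[0]);
// FormData values are sent as strings
formData.append('chunkSize', '1000');
formData.append('chunkOverlap', '200');

const res = await fetch('/api/tools/documents', {
  method: 'POST',
  body: formData, // the browser sets the multipart Content-Type header
});
if (!res.ok) {
  throw new Error(`Document extraction failed with status ${res.status}`);
}

const { text, chunks, metadata } = await res.json();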
// text: full extracted text
// chunks: Array<{ content: string; index: number }>
// metadata: { pages, wordCount, fileType }
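
To make the `chunkSize` and `chunkOverlap` parameters concrete, here is a minimal sketch of overlapping chunking that produces the `{ content, index }` shape described above. It assumes sizes are measured in characters; the source does not say whether the tool counts characters or tokens, and `chunk_text` is a hypothetical name.

```python
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, int]]


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Chunk]:
    """Split text into windows of chunk_size characters (assumed unit).

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks: List[Chunk] = []
    start = 0
    while start < len(text):
        chunks.append({"content": text[start:start + chunk_size], "index": len(chunks)})
        # Stop once this window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, the string `"abcdefghij"` splits into `abcd`, `cdef`, `efgh`, `ghij`. The overlap keeps a sentence that straddles a boundary present in full in at least one chunk, at the cost of storing some text twice.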

RAG pipeline integration

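import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Extract text from the uploaded document.
// extractDocument, vectorDb, and docId are not defined in this snippet:
// supply your own upload helper, vector store client, and document ID.
const { chunks } = await extractDocument(file);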

// Embed each chunk
const embeddings = await Promise.all(
  chunks.map((c) => openai.embeddings.create({ model: 'text-embedding-3-small', input: c.content }))
);

// Store in vector DB
await vectorDb.upsert(
  chunks.map((c, i) => ({
    id: `${docId}-${i}`,
    values: embeddings[i].data[0].embedding,
    metadata: { text: c.content, docId },
  }))
);

Python implementation

import pdfplumber
from docx import Document as DocxDocument
import openpyxl

def extract_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n\n".join(
            page.extract_text() or "" for page in pdf.pages
        )

def extract_docx(path: str) -> str:
    doc = DocxDocument(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

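def extract_xlsx(path: str) -> str:
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        rows = []
        for sheet in wb.worksheets:
            for row in sheet.iter_rows(values_only=True):
                # Only empty cells (None) become ""; 0 and False are kept as values
                rows.append("\t".join("" if c is None else str(c) for c in row))
        return "\n".join(rows)
    finally:
        # Read-only workbooks keep the file open until closed explicitly
        wb.close()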
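
The `extract_xlsx` function above returns tab-separated text. For the "structured JSON" form of sheet data, one possible approach is to treat the first row of each sheet as a header and turn the remaining rows into records. This is a sketch under that assumption, not the scaffolded code; `rows_to_records` is a hypothetical helper that accepts any iterable of row tuples, such as the output of `sheet.iter_rows(values_only=True)`.

```python
import json
from typing import Any, Dict, Iterable, List, Sequence


def rows_to_records(rows: Iterable[Sequence[Any]]) -> List[Dict[str, Any]]:
    """Convert rows into a list of dicts keyed by the first (header) row."""
    iterator = iter(rows)
    header = next(iterator, None)
    if header is None:
        return []  # empty sheet

    # Blank header cells get a positional fallback name (assumption).
    keys = [str(h) if h is not None else f"column_{i + 1}" for i, h in enumerate(header)]

    records: List[Dict[str, Any]] = []
    for row in iterator:
        if all(cell is None for cell in row):
            continue  # skip fully empty rows
        # zip stops at the shorter sequence, so extra cells beyond the header are dropped
        records.append(dict(zip(keys, row)))
    return records


def records_to_json(records: List[Dict[str, Any]]) -> str:
    """Serialize records; default=str covers dates and other non-JSON cell types."""
    return json.dumps(records, default=str)
```

Unlike the tab-separated output, this keeps cell types intact (numbers stay numbers) until serialization.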
