Document Processing Tool
ChimerAI's Document Processing tool extracts text and structured data from PDF, Word, Excel, PowerPoint, and plain text files — ready for AI ingestion or RAG pipelines.
What you get
- PDF extraction: Text + table data via pdfplumber
- Word/DOCX: Full paragraph + table extraction via python-docx
- Excel/XLSX: Sheet data as structured JSON via openpyxl
- PowerPoint/PPTX: Slide text extraction via python-pptx
- Plain text / CSV / Markdown: Direct read
- Chunking: Optional split into overlapping chunks for RAG
Quick setup
npx chimerai add ai-tools --only documents
Scaffolds:
app/api/tools/documents/route.ts ← Upload + extract endpoint
services/ai/tools/document_tools.py ← Python extractor
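The full contents of document_tools.py are not shown on this page. As a rough sketch only, the extractor could pick a handler from the file extension along these lines. The names `EXTRACTORS`, `read_plain_text`, and `extract_document` are hypothetical, and only the direct-read formats are wired up here.

```python
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: str) -> str:
    """Direct read for text-like formats (assumes UTF-8; bad bytes are replaced)."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry: lowercase file extension -> extractor function.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
    ".csv": read_plain_text,
}


def extract_document(path: str) -> str:
    """Look up the extractor for the file's extension and run it."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

Format-specific functions can be added to the same registry, for example `EXTRACTORS[".pdf"] = extract_pdf` once such a function is defined.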
Usage
const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('chunkSize', '1000');
formData.append('chunkOverlap', '200');
const res = await fetch('/api/tools/documents', {
  method: 'POST',
  body: formData,
});
const { text, chunks, metadata } = await res.json();
// text: full extracted text
// chunks: Array<{ content: string; index: number }>
// metadata: { pages, wordCount, fileType }
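How `chunkSize` and `chunkOverlap` interact is easiest to see in code. The sketch below is one plausible reading, not the tool's confirmed behavior: it assumes both values are measured in characters (the real tool may count words or tokens), and `chunk_text` is a hypothetical name. It returns the same `{ content, index }` shape as the response above.

```python
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, int]]


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Chunk]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[Chunk] = []
    start = 0
    while start < len(text):
        chunks.append({"content": text[start:start + chunk_size], "index": len(chunks)})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string "abcdefghij" yields "abcd", "defg", and "ghij": each chunk starts with the last character of the previous one.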
RAG pipeline integration
// Extract text from uploaded document
const { chunks } = await extractDocument(file);
// Embed each chunk
const embeddings = await Promise.all(
  chunks.map((c) => openai.embeddings.create({ model: 'text-embedding-3-small', input: c.content }))
);
// Store in vector DB
await vectorDb.upsert(
  chunks.map((c, i) => ({
    id: `${docId}-${i}`,
    values: embeddings[i].data[0].embedding,
    metadata: { text: c.content, docId },
  }))
);
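For a Python backend, the record-building step above can be mirrored without committing to a particular embedding or vector-store client. This sketch assumes the embeddings have already been computed as one list of floats per chunk. `build_records` is a hypothetical helper, and the record layout simply copies the `id` / `values` / `metadata` shape used in the snippet above.

```python
from typing import Dict, List, Sequence


def build_records(
    doc_id: str,
    chunks: Sequence[Dict],
    vectors: Sequence[Sequence[float]],
) -> List[Dict]:
    """Pair each chunk with its embedding and give it a stable per-document ID."""
    if len(chunks) != len(vectors):
        raise ValueError(
            f"Expected one vector per chunk, got {len(chunks)} chunks and {len(vectors)} vectors"
        )
    return [
        {
            "id": f"{doc_id}-{i}",  # same `${docId}-${i}` scheme as above
            "values": list(vector),
            "metadata": {"text": chunk["content"], "docId": doc_id},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
```

Because the IDs are derived from the document ID and the chunk position, re-processing the same document overwrites its earlier records instead of duplicating them, provided the store treats upserts by ID that way.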
Python implementation
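import pdfplumber
from docx import Document as DocxDocument
import openpyxl


def extract_pdf(path: str) -> str:
    # Page text only, one block per page; "or" guards against an empty result.
    with pdfplumber.open(path) as pdf:
        return "\n\n".join(
            page.extract_text() or "" for page in pdf.pages
        )


def extract_docx(path: str) -> str:
    # Non-empty paragraphs only; tables (doc.tables) are not included here.
    doc = DocxDocument(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())


def extract_xlsx(path: str) -> str:
    # data_only=True returns cached formula results instead of formula strings.
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    rows = []
    try:
        for sheet in wb.worksheets:
            for row in sheet.iter_rows(values_only=True):
                # Compare against None so that 0 and False cells are kept.
                rows.append("\t".join("" if c is None else str(c) for c in row))
    finally:
        wb.close()  # read-only workbooks hold the file open until closed
    return "\n".join(rows)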
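The feature list mentions slide text extraction via python-pptx, but no such function appears above. A minimal sketch of what it might look like follows. `collect_slide_text` and `extract_pptx` are hypothetical names, and the sketch reads only top-level shapes that carry a text frame, so text inside tables or grouped shapes is not picked up.

```python
from typing import Iterable


def collect_slide_text(slides: Iterable) -> str:
    """Join the text of every text-bearing shape, one block per slide."""
    blocks = []
    for slide in slides:
        parts = [
            shape.text_frame.text
            for shape in slide.shapes
            # Pictures and similar shapes have no text frame; skip them.
            if getattr(shape, "has_text_frame", False) and shape.text_frame.text.strip()
        ]
        if parts:
            blocks.append("\n".join(parts))
    return "\n\n".join(blocks)


def extract_pptx(path: str) -> str:
    # Imported lazily so collect_slide_text stays usable without python-pptx.
    from pptx import Presentation

    return collect_slide_text(Presentation(path).slides)
```

Keeping the text-collection logic separate from the file loading means it can be exercised with simple stand-in objects, without a real .pptx file.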