Vision Tool (Image Analysis)

ChimerAI's Vision Tool analyses images with multimodal AI models (GPT-4o, Claude 3, Gemini). It can describe images, extract text (OCR), detect objects, read charts, and answer questions about visual content.

What you get

Image description — Natural language description of image contents
OCR / text extraction — Extract all text from images, screenshots, scanned documents
Object detection — List objects, people, locations in the image
Chart/graph reading — Extract data from bar charts, pie charts, line graphs
Custom questions — Ask any question about the image
URL or upload — Accepts image URLs or base64-encoded uploads
Multi-image comparison — Compare two images in one prompt

Quick setup

npx chimerai add ai-tools --only vision

Scaffolds:

app/api/tools/vision/route.ts        ← Vision endpoint
services/ai/tools/vision_tools.py    ← Python vision implementation

Usage — describe an image

const res = await fetch('/api/tools/vision/analyse', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    imageUrl: 'https://example.com/screenshot.png',
    task: 'describe',
    prompt: 'What is shown in this image?',
  }),
});

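// Surface HTTP errors instead of parsing an error body as a result
if (!res.ok) throw new Error(`Vision request failed: ${res.status}`);
const { result } = await res.json();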

Usage — OCR (text extraction)

const res = await fetch('/api/tools/vision/analyse', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    imageUrl: 'https://example.com/invoice.jpg',
    task: 'ocr',
    prompt: 'Extract all text from this image.',
  }),
});

if (!res.ok) throw new Error(`Vision request failed: ${res.status}`);
const { result } = await res.json();
// result: "Invoice #1234\nDate: 2025-01-15\nTotal: $499.00"
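
The OCR result comes back as one plain string, so downstream code usually has to pull individual fields out of it. A minimal sketch in Python, assuming the text follows the `Label: value` layout of the sample result above. `parse_invoice_text` and the field names it returns are hypothetical and not part of ChimerAI.

```python
import re
from decimal import Decimal


def parse_invoice_text(text: str) -> dict:
    """Pull a few fields out of OCR text shaped like the sample result.

    Hypothetical helper: real invoices vary, so every field is optional
    and simply omitted when its pattern is not found.
    """
    fields: dict = {}

    number = re.search(r"Invoice\s*#\s*(\w+)", text)
    if number:
        fields["invoice_number"] = number.group(1)

    date = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    if date:
        fields["date"] = date.group(1)

    # Accept an optional dollar sign and thousands separators
    total = re.search(r"Total:\s*\$?([\d,]+(?:\.\d+)?)", text)
    if total:
        fields["total"] = Decimal(total.group(1).replace(",", ""))

    return fields
```

Regex parsing like this is brittle across layouts. It is shown only to illustrate post-processing of the `result` string.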

Usage — upload (base64)

const file = fileInput.files?.[0];
if (!file) throw new Error('No file selected');

const reader = new FileReader();
reader.onload = async () => {
  // readAsDataURL yields "data:<mime>;base64,<payload>"; keep only the payload
  const base64 = (reader.result as string).split(',')[1];

  const res = await fetch('/api/tools/vision/analyse', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      imageBase64: base64,
      mimeType: file.type,
      task: 'custom',
      prompt: 'List all visible UI elements and their labels.',
    }),
  });

  const { result } = await res.json();
  console.log(result);
};
reader.readAsDataURL(file);
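
The three usage examples share one request shape: `imageUrl` or `imageBase64` (with `mimeType`), plus `task` and `prompt`. A server-side check along the following lines could reject malformed bodies before any model call is made. This is a sketch under assumptions: the rules "exactly one image source" and "prompt is required" are guesses, `ALLOWED_TASKS` lists only the task values that appear in the examples, and `validate_request` is a hypothetical name.

```python
# Task values seen in the usage examples; the real endpoint may accept more
ALLOWED_TASKS = {"describe", "ocr", "custom"}


def validate_request(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body looks usable."""
    errors: list[str] = []

    has_url = bool(body.get("imageUrl"))
    has_b64 = bool(body.get("imageBase64"))
    # Assumed rule: exactly one image source per request
    if has_url == has_b64:
        errors.append("provide exactly one of imageUrl or imageBase64")

    # A base64 payload needs its MIME type to be turned into a data URL
    if has_b64 and not str(body.get("mimeType", "")).startswith("image/"):
        errors.append("mimeType must be an image/* type when imageBase64 is set")

    if body.get("task") not in ALLOWED_TASKS:
        errors.append("unknown task")

    # Assumed rule: every request carries a prompt
    if not body.get("prompt"):
        errors.append("prompt is required")

    return errors
```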

Python implementation

from openai import AsyncOpenAI

# Reads OPENAI_API_KEY from the environment
client = AsyncOpenAI()

async def analyse_image(
    image_url: str | None,
    image_base64: str | None,
    mime_type: str,
    prompt: str,
) -> str:
    if image_base64:
        # Inline uploads are passed to the model as a data URL
        content_url = f"data:{mime_type};base64,{image_base64}"
    elif image_url:
        content_url = image_url
    else:
        raise ValueError("Provide either image_url or image_base64")

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": content_url}},
            ],
        }],
        max_tokens=1024,
    )
    # message.content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""
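
The feature list mentions comparing two images in one prompt. With the Chat Completions message format used above, this is typically done by placing several `image_url` parts in a single user message. The sketch below only builds the `messages` value. `build_comparison_messages` is a hypothetical helper, not something the scaffold is known to contain.

```python
def build_comparison_messages(prompt: str, image_urls: list[str]) -> list[dict]:
    """Build one user message with a text part followed by several image parts."""
    if len(image_urls) < 2:
        raise ValueError("comparison needs at least two images")

    content: list[dict] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        # Each entry may be an https URL or a data URL built from base64
        content.append({"type": "image_url", "image_url": {"url": url}})

    return [{"role": "user", "content": content}]
```

The returned list can be passed as `messages=` to `client.chat.completions.create` in place of the single-image message.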

Use cases

Invoice processing — OCR + structured data extraction
UI screenshot analysis — QA automation, accessibility checks
Product image tagging — Auto-generate alt text and tags
Chart data extraction — Parse dashboards and reports
Document digitisation — Scan and extract form fields
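
Several of these use cases (invoices, chart data, form fields) need structured data rather than prose. One common approach is to ask for JSON in the `prompt` and parse the reply defensively, since models sometimes wrap JSON in a Markdown code fence or add surrounding text. A minimal sketch, where `extract_json` is a hypothetical helper and not part of ChimerAI:

```python
import json
import re

# Three backticks, built at runtime so this snippet stays easy to embed
FENCE = "`" * 3


def extract_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a code fence."""
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply

    # Keep only the outermost {...} span in case of leading or trailing prose
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")

    return json.loads(candidate[start : end + 1])
```

`json.loads` still raises `json.JSONDecodeError` if the span is not valid JSON, so callers may want to catch that and retry or fall back to the raw text.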