Web Scrape Tool

ChimerAI's Web Scraper extracts clean text or Markdown from any public URL, with automatic JavaScript rendering fallback, SSRF protection, and rate limiting.

What you get

  • Clean text extraction — Strips ads, navbars, boilerplate
  • Markdown output — Converts HTML to clean Markdown via markdownify
  • JavaScript rendering — Falls back to Playwright for SPAs
  • SSRF protection — Blocks requests to private/internal IPs
  • Rate limited — 10 scrapes/min per IP
  • Content limit — Max 5MB per page
  • Concurrent limit — Max 5 concurrent scrapes (semaphore)
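The rate and concurrency limits listed above could be enforced along the following lines. This is only a sketch under stated assumptions: `SlidingWindowLimiter` and `guarded_scrape` are hypothetical names, the state lives in memory for a single process, and the scaffolded code may well do this differently.

```python
import asyncio
import time
from collections import defaultdict, deque
from typing import Optional

RATE_LIMIT = 10        # scrapes per window, per client IP (from the list above)
WINDOW_SECONDS = 60.0  # one minute
MAX_CONCURRENT = 5     # simultaneous scrapes (from the list above)


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit: int = RATE_LIMIT, window: float = WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
_scrape_slots: Optional[asyncio.Semaphore] = None


async def guarded_scrape(client_ip: str, scrape, url: str):
    """Run `await scrape(url)` if the caller is under the rate limit."""
    global _scrape_slots
    if _scrape_slots is None:
        # Created lazily so the semaphore belongs to the running event loop.
        _scrape_slots = asyncio.Semaphore(MAX_CONCURRENT)
    if not limiter.allow(client_ip):
        raise RuntimeError("Rate limit exceeded")
    # Requests beyond MAX_CONCURRENT wait here until a slot frees up.
    async with _scrape_slots:
        return await scrape(url)
```

A deployment with several workers would need shared state (for example a Redis counter) instead of the in-memory dictionary used here.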

Quick setup

npx chimerai add ai-tools --only web-scrape

Scaffolds:

app/api/tools/web/scrape/route.ts    ← Scrape endpoint
services/ai/tools/web_tools.py       ← Python scraper

Usage

const res = await fetch('/api/tools/web/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://nextjs.org/docs',
    format: 'markdown', // or 'text' | 'html'
    maxLength: 10000, // optional character limit
  }),
});

if (!res.ok) throw new Error(`Scrape failed with status ${res.status}`);
const { content, title, wordCount } = await res.json();

Feeding scraped content to an AI

// fetchScrape is assumed to be a small wrapper around the POST request shown under "Usage".
const { content } = await fetchScrape('https://docs.example.com/api');

const summary = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a documentation summariser.' },
    { role: 'user', content: `Summarise this page:\n\n${content.slice(0, 8000)}` },
  ],
});

Python implementation

import httpx
from bs4 import BeautifulSoup
from markdownify import markdownify as md

MAX_CONTENT_CHARS = 50_000  # cap on the number of characters returned


async def scrape_url(url: str, format: str = "markdown") -> dict:
    """Fetch a URL and return its body as Markdown, plain text, or raw HTML."""
    # Browser-like User-Agent; some sites reject default HTTP-client agents.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36"
    }
    async with httpx.AsyncClient(follow_redirects=True, timeout=20) as client:
        response = await client.get(url, headers=headers)

    content = response.text
    if format == "markdown":
        content = md(content, heading_style="ATX")
    elif format == "text":
        # Strip HTML tags, putting each text node on its own line.
        content = BeautifulSoup(content, "lxml").get_text(separator="\n")
    # Any other value (for example "html") returns the raw response body.

    return {"content": content[:MAX_CONTENT_CHARS], "statusCode": response.status_code}
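The feature list mentions a Playwright fallback for SPAs, while the function above only shows the static httpx path. One possible way to decide when to fall back is sketched here: if the static HTML contains almost no visible text, render the page in a headless browser. The names `needs_js_rendering` and `render_with_playwright` and the 200-character threshold are assumptions for illustration, not the scaffolded implementation.

```python
from html.parser import HTMLParser

MIN_VISIBLE_CHARS = 200  # assumed threshold; tune it for your pages


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring script, style, and noscript bodies."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def needs_js_rendering(html: str, min_chars: int = MIN_VISIBLE_CHARS) -> bool:
    """Guess whether the page is an empty SPA shell that needs a browser."""
    parser = _VisibleText()
    parser.feed(html)
    # Collapse whitespace so indentation does not count as content.
    visible = " ".join(" ".join(parser.parts).split())
    return len(visible) < min_chars


async def render_with_playwright(url: str, timeout_ms: int = 20_000) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the static path works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

A caller could fetch with httpx first and invoke `render_with_playwright(url)` only when `needs_js_rendering(response.text)` returns true, which keeps the slower browser path off the common case.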

SSRF protection

All scrape requests are validated to prevent internal network access:

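import ipaddress
import socket
from urllib.parse import urlparse

def _check_ssrf(url: str) -> None:
    """Raise ValueError if the URL's host resolves to a private or internal address."""
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError("URL has no hostname")
    try:
        # gethostbyname resolves to a single IPv4 address only.
        ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    except socket.gaierror:
        raise ValueError("Cannot resolve hostname")
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        raise ValueError(f"Blocked: private/internal IP {ip}")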
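Because `socket.gethostbyname` returns one IPv4 address, a hostname with IPv6 records or several A records is only partly checked. A stricter variant might inspect every address returned by `socket.getaddrinfo` and reject non-HTTP schemes. The sketch below is an illustration under those assumptions; `check_ssrf_all_addresses` and `is_blocked_ip` are hypothetical names, not part of the scaffolded code.

```python
import ipaddress
import socket
from typing import List
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(ip) -> bool:
    """True for addresses that should never be reachable from a scraper."""
    # Treat IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) as their IPv4 form.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def check_ssrf_all_addresses(url: str) -> List[str]:
    """Return the resolved addresses, or raise ValueError if any is internal."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Blocked: unsupported scheme {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise ValueError("Cannot resolve hostname") from exc

    addresses = set()
    for _family, _type, _proto, _canonname, sockaddr in infos:
        # Strip an IPv6 scope id such as "%eth0" before parsing.
        ip = ipaddress.ip_address(sockaddr[0].split("%", 1)[0])
        if is_blocked_ip(ip):
            raise ValueError(f"Blocked: private/internal IP {ip}")
        addresses.add(str(ip))
    return sorted(addresses)
```

A check like this validates only the URL it is given. When the HTTP client follows redirects, each redirect target would need the same validation, and a hostname can resolve differently between the check and the actual request, so connecting to the already-validated address is the more robust design.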
