Web Scrape Tool

ChimerAI's Web Scraper extracts clean text or Markdown from any public URL, with automatic JavaScript rendering fallback, SSRF protection, and rate limiting.

What you get

Clean text extraction — Strips ads, navbars, boilerplate
Markdown output — Converts HTML to clean Markdown via markdownify
JavaScript rendering — Falls back to Playwright for SPAs
SSRF protection — Blocks requests to private/internal IPs
Rate limited — 10 scrapes/min per IP
Content limit — Max 5MB per page
Concurrent limit — Max 5 concurrent scrapes (semaphore)
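
The rate limit and concurrency limit listed above could be enforced along these lines. This is a minimal Python sketch, not the scaffolded code: `SlidingWindowLimiter` and `guarded_scrape` are hypothetical names, and an in-memory sliding window is assumed, which would not be shared across multiple server processes.

```python
import asyncio
import time
from collections import defaultdict, deque

MAX_PER_MINUTE = 10    # from the feature list: 10 scrapes/min per IP
MAX_CONCURRENT = 5     # from the feature list: 5 concurrent scrapes
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key (hypothetical helper)."""

    def __init__(self, limit=MAX_PER_MINUTE, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of accepted calls

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


async def guarded_scrape(client_ip, scrape, limiter, slots):
    """Run the `scrape` coroutine function under both limits.

    `slots` is an asyncio.Semaphore, e.g. asyncio.Semaphore(MAX_CONCURRENT),
    created inside the running event loop.
    """
    if not limiter.allow(client_ip):
        raise RuntimeError("Rate limit exceeded")
    async with slots:  # waits while MAX_CONCURRENT scrapes are already in flight
        return await scrape()
```

In this sketch the rate check rejects immediately, while the semaphore makes extra requests wait for a free slot instead of failing.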

Quick setup

npx chimerai add ai-tools --only web-scrape

Scaffolds:

app/api/tools/web/scrape/route.ts    ← Scrape endpoint
services/ai/tools/web_tools.py       ← Python scraper

Usage

const res = await fetch('/api/tools/web/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://nextjs.org/docs',
    format: 'markdown', // or 'text' | 'html'
    maxLength: 10000, // optional character limit
  }),
});

const { content, title, wordCount } = await res.json();

Feeding scraped content to an AI

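// fetchScrape is assumed here: a thin wrapper around the POST request shown under Usage.
const fetchScrape = async (url: string) => {
  const res = await fetch('/api/tools/web/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, format: 'markdown' }),
  });
  return res.json();
};
const { content } = await fetchScrape('https://docs.example.com/api');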

const summary = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a documentation summariser.' },
    { role: 'user', content: `Summarise this page:\n\n${content.slice(0, 8000)}` },
  ],
});

Python implementation

import httpx
from bs4 import BeautifulSoup
from markdownify import markdownify as md

async def scrape_url(url: str, format: str = "markdown") -> dict:
    """Fetch `url` and return its content as Markdown, plain text, or raw HTML."""
    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36"
    }
    async with httpx.AsyncClient(follow_redirects=True, timeout=20) as client:
        response = await client.get(url, headers=headers)

    content = response.text
    if format == "markdown":
        content = md(content, heading_style="ATX")
    elif format == "text":
        # Strip HTML tags, putting each text node on its own line.
        content = BeautifulSoup(content, "lxml").get_text(separator="\n")
    # Any other value (e.g. "html") leaves the raw HTML unchanged.

    # Truncate to 50,000 characters before returning.
    return {"content": content[:50_000], "statusCode": response.status_code}
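
The function above uses plain HTTP only. One way the JavaScript rendering fallback from the feature list might work is sketched below, assuming a simple heuristic: if the fetched HTML contains very little visible text, load the page again in headless Chromium through Playwright. The names `looks_js_rendered` and `render_with_playwright` and the 200-character threshold are illustrative assumptions, not the scaffold's actual logic.

```python
from html.parser import HTMLParser

MIN_VISIBLE_CHARS = 200  # assumed threshold; tune for your pages


class _VisibleText(HTMLParser):
    """Count characters of text that sit outside script/style/noscript."""

    _HIDDEN = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._HIDDEN:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self._HIDDEN and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chars += len(data.strip())


def looks_js_rendered(html, min_chars=MIN_VISIBLE_CHARS):
    """Return True when the static HTML has too little visible text to be useful."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chars < min_chars


async def render_with_playwright(url, timeout_ms=20_000):
    """Load `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily so the static path works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

Under these assumptions, a caller would fetch with `httpx` first and call `render_with_playwright(url)` only when `looks_js_rendered(response.text)` is true, so the slower browser path is reserved for pages that need it.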

SSRF protection

All scrape requests are validated to prevent internal network access:

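import ipaddress
import socket
from urllib.parse import urlparse

def _check_ssrf(url: str) -> None:
    """Raise ValueError if `url` resolves to a private or internal address."""
    hostname = urlparse(url).hostname
    if not hostname:
        raise ValueError("URL has no hostname")
    try:
        # gethostbyname resolves to a single IPv4 address.
        ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    except socket.gaierror as exc:
        raise ValueError("Cannot resolve hostname") from exc
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        raise ValueError(f"Blocked: private/internal IP {ip}")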
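
`socket.gethostbyname` returns a single IPv4 address, so a hostname with several A records or an AAAA record is only partly covered by the check above. A stricter variant, sketched below under the assumption that every resolved address should be checked, uses `socket.getaddrinfo`. `is_blocked_ip` and `resolve_and_check` are hypothetical names, and the scheme check and the `is_reserved`, `is_multicast`, and `is_unspecified` tests go beyond what the original function does.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip):
    """True for addresses a public scraper should not reach (assumed policy)."""
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def resolve_and_check(url):
    """Resolve every address for `url` and raise ValueError if any is blocked.

    Returns the list of resolved IP strings when all of them are allowed.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Blocked: unsupported scheme {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise ValueError("Cannot resolve hostname") from exc

    addresses = []
    for *_, sockaddr in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        ip = ipaddress.ip_address(sockaddr[0].split("%", 1)[0])
        if is_blocked_ip(ip):
            raise ValueError(f"Blocked: private/internal IP {ip}")
        addresses.append(str(ip))
    return addresses
```

Because `scrape_url` is created with `follow_redirects=True`, a check on the initial URL alone does not cover redirect targets. One option is to disable automatic redirects and run the same check on each `Location` before following it.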