Web Scrape Tool
ChimerAI's Web Scraper extracts clean text or Markdown from any public URL, with automatic JavaScript rendering fallback, SSRF protection, and rate limiting.
What you get
- Clean text extraction — Strips ads, navbars, boilerplate
- Markdown output — Converts HTML to clean Markdown via markdownify
- JavaScript rendering — Falls back to Playwright for SPAs
- SSRF protection — Blocks requests to private/internal IPs
- Rate limited — 10 scrapes/min per IP
- Content limit — Max 5MB per page
- Concurrent limit — Max 5 concurrent scrapes (semaphore)
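The rate and concurrency limits listed above could be enforced along the following lines. This is a minimal sketch, not the scaffolded implementation: `SlidingWindowLimiter`, `guarded_scrape`, and the in-memory storage are hypothetical names and choices made for illustration, and the 429-style error payload is an assumption.

```python
import asyncio
import time
from collections import defaultdict, deque
from typing import Awaitable, Callable, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key (e.g. a client IP)."""

    def __init__(self, limit: int = 10, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of recent accepted calls

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


MAX_CONCURRENT = 5  # mirrors the "Max 5 concurrent scrapes" limit
limiter = SlidingWindowLimiter(limit=10, window=60.0)


async def guarded_scrape(
    client_ip: str,
    url: str,
    scrape: Callable[[str], Awaitable[dict]],
    semaphore: asyncio.Semaphore,
) -> dict:
    """Reject callers over the rate limit, then run `scrape(url)` under the semaphore."""
    if not limiter.allow(client_ip):
        # Assumed error shape; the real endpoint may respond differently.
        return {"error": "rate_limited", "statusCode": 429}
    async with semaphore:  # at most MAX_CONCURRENT scrapes run at once
        return await scrape(url)
```

Create the semaphore once inside the running event loop (for example `asyncio.Semaphore(MAX_CONCURRENT)` at application startup) and share it across requests. An in-memory limiter like this only works within a single process; a multi-instance deployment would need shared storage.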
Quick setup
npx chimerai add ai-tools --only web-scrape
Scaffolds:
app/api/tools/web/scrape/route.ts ← Scrape endpoint
services/ai/tools/web_tools.py ← Python scraper
Usage
const res = await fetch('/api/tools/web/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://nextjs.org/docs',
    format: 'markdown', // or 'text' | 'html'
    maxLength: 10000, // optional character limit
  }),
});
const { content, title, wordCount } = await res.json();
Feeding scraped content to an AI
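// fetchScrape is assumed to be a small helper that wraps the POST request shown under Usage.
const { content } = await fetchScrape('https://docs.example.com/api');
const summary = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a documentation summariser.' },
    // Cap the prompt at 8,000 characters to keep the request small.
    { role: 'user', content: `Summarise this page:\n\n${content.slice(0, 8000)}` },
  ],
});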
Python implementation
import httpx
from bs4 import BeautifulSoup
from markdownify import markdownify as md


async def scrape_url(url: str, format: str = "markdown") -> dict:
    # Send a browser-like User-Agent with every request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    }
    async with httpx.AsyncClient(follow_redirects=True, timeout=20) as client:
        response = await client.get(url, headers=headers)
        content = response.text
    if format == "markdown":
        content = md(content, heading_style="ATX")
    elif format == "text":
        # Strip HTML tags, putting each text node on its own line.
        content = BeautifulSoup(content, "lxml").get_text(separator="\n")
    # Any other format (e.g. "html") returns the raw markup.
    # The result is truncated to 50,000 characters.
    return {"content": content[:50_000], "statusCode": response.status_code}
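The function above fetches static HTML only; the Playwright fallback mentioned in the feature list is not shown. One way it might be wired in is sketched below. The heuristic (`needs_js_rendering`, with a 200-character threshold) and the helper `render_with_playwright` are assumptions made for illustration, not the scaffolded code: the idea is to re-render a page in a headless browser when the static HTML contains almost no visible text, as is typical for a single-page-app shell.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript> elements."""

    _SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def needs_js_rendering(html: str, min_chars: int = 200) -> bool:
    """Return True when the static HTML has fewer than `min_chars` of visible text."""
    parser = _VisibleText()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars


async def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the static path works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await browser.close()
```

A caller could check `needs_js_rendering(response.text)` after the httpx request and, only when it returns True, replace the content with `await render_with_playwright(url)` before converting to Markdown. The threshold is a guess and would need tuning against real pages.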
SSRF protection
All scrape requests are validated to prevent internal network access:
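import ipaddress
import socket
from urllib.parse import urlparse


def _check_ssrf(url: str) -> None:
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError("URL has no hostname")
    try:
        # gethostbyname resolves IPv4 only and returns a single address.
        ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    except socket.gaierror as exc:
        raise ValueError("Cannot resolve hostname") from exc
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        raise ValueError(f"Blocked: private/internal IP {ip}")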
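The check above looks at a single IPv4 address. A stricter variant, sketched here under assumptions, inspects every address the hostname resolves to (IPv4 and IPv6), rejects non-HTTP schemes, and unwraps IPv4-mapped IPv6 addresses. The function name `resolve_public_addresses` is hypothetical and this is not the scaffolded code.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def _is_internal(ip) -> bool:
    """True for addresses that should never be reachable from a scraper."""
    mapped = getattr(ip, "ipv4_mapped", None)  # e.g. ::ffff:127.0.0.1
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_public_addresses(url: str) -> list:
    """Resolve the URL's host and return its addresses, or raise ValueError if any is internal."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Blocked: unsupported scheme {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise ValueError("Cannot resolve hostname") from exc

    addresses = []
    for *_, sockaddr in infos:
        raw = sockaddr[0].split("%", 1)[0]  # drop any IPv6 scope id
        ip = ipaddress.ip_address(raw)
        if _is_internal(ip):
            raise ValueError(f"Blocked: private/internal IP {ip}")
        if raw not in addresses:
            addresses.append(raw)
    return addresses
```

Two caveats apply to any check of this kind. With `follow_redirects=True`, each redirect target needs the same validation, since a public URL can redirect to an internal one. And because DNS answers can change between the check and the request, connecting to one of the returned addresses directly is safer than resolving the hostname a second time.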