Changelog
This page is generated from the root CHANGELOG.md, which is maintained by release-please during releases.
:::note
The source of truth is the repository root changelog. Do not edit this docs page manually.
:::
All notable changes to CRW are documented here.
0.10.0 (2026-05-20)
Features
- detector: add vendor-specific anti-bot block markers (c88c508)
- renderer: add chrome_proxy as 4th fallback tier (b4da4f7)
- renderer: per-request country via CDP proxy auth (11b4d32)
Bug Fixes
- release: harden npm publish + fix mcp-registry verifier (9d4076f)
- renderer: detect CloudFront/WAF 403 as bot-wall (7e058b2)
- renderer: escalate JS tier on 4xx/5xx and vendor-detected blocks (648c372)
0.9.1 (2026-05-16)
Bug Fixes
- release: sync crw-cli internal dep versions with workspace (26c528e)
0.9.0 (2026-05-16)
Features
- cli: add AI extraction flags and crw setup --reset (912eea0)
0.8.3 (2026-05-15)
Features
- cli: two-phase auto-fallback for crw <url> scrape (a871e54)
- setup: make config.toml the canonical source for crw setup (b07c154)
Miscellaneous
0.8.2 (2026-05-15)
Features
- cli: two-phase auto-fallback for crw <url> scrape (a871e54)
- setup: make config.toml the canonical source for crw setup (b07c154)
Miscellaneous
- release 0.8.2 (38ae764)
0.8.2 (2026-05-14)
Bug Fixes
- release: move crw-cli to unpublished and update dep versions (7f121f6)
0.8.1 (2026-05-14)
Bug Fixes
- cli: mark crw-cli as publish=false to fix release (3104cc5)
0.8.0 (2026-05-14)
Features
- cli: add interactive setup wizard (a5613b9)
0.7.1 (2026-05-12)
Bug Fixes
0.7.0 (2026-05-12)
Features
0.6.4 (2026-05-12)
Features
0.6.3 (2026-05-12)
Features
0.6.2 (2026-05-10)
Features
- search: add /v1/search endpoint backed by bundled SearXNG sidecar (f4bd7f4)
Bug Fixes
- antibot: drop bare 'captcha'/'access denied' markers — false positives (fae6c09)
- crawl: drop redundant .into_iter() for clippy 1.95 (#39) (fb4032b)
- map: WordPress sitemap-index timeout (closes #33) (c3dfd6c)
- release: register crw-search crate in release manifest (9074761)
- search: codex iteration-1 hardening — error mapping, resource bounds, container (5acba7b)
- search: codex iteration-2 — error-body cap, per-source row budget, doc (a440d6e)
- search: codex iteration-3 — predicate-based well-formed filter (4b4df3a)
- search: use real SearXNG image tag and add fallback secret_key (be1f403)
0.6.1 (2026-05-09)
Features
- metrics: cdp_pending_requests, cdp_live_connections, (b5f7bec)
- renderer: live-connection registry + 60s telemetry sampler (b5f7bec)
- renderer: target lifecycle metric + leaked detection (b5f7bec)
- server: /ready endpoint with deep status code (b5f7bec)
Bug Fixes
- release: bulletproof publish pipeline and drop pdf feature (8fcf2f6)
- renderer: invalidate cached chrome WS URL on connect failure (b5f7bec)
0.6.0 (2026-05-09)
Features
- extract: scale recall to 63.74% on 1000-URL benchmark (5b85555)
- renderer: add browserless/chromium opt-in stealth profile (+2.5pt) (d2414c9)
- renderer: chrome-stealth wiring + CDP discovery improvements (6b2e77c)
- server,core,crawl: plumb tier timeouts and recall pipeline (7cbee43)
Miscellaneous
- release 0.6.0 (bd03a35)
0.5.0 (2026-05-04)
Features
- core: add deadline module and request/renderer config scaffolding (5a4e69a)
- core: thread end-to-end Deadline through scrape pipeline (5991986)
- crawl: key per-domain rate limiter by eTLD+1 (39c7954)
- crawl: per-host concurrency cap on the eTLD+1 limiter (274f462)
- renderer: add browserless/chromium opt-in stealth profile (236f626)
- renderer: chrome nav-budget cap + truncated/deadline_exceeded flags (c57cef8)
- renderer: chrome request-paused interception pump (T27) (13fcaa4)
- renderer: leak-through fallback when global breaker open & host clean (86a9e36)
- renderer: outcome-aware breaker + extraction and stealth fixes (86dd10f)
- renderer: own per-eTLD+1 host limiter in FallbackRenderer (0577516)
- renderer: recover FC-wins URLs (lifts truth-recall toward the canonical 3-way result; see bench/server-runs/RESULT_3WAY_1000_FULL.md) (ba12424)
Bug Fixes
- compose: auto-restart and bound memory for renderer containers (dd610cc)
- core: emit meaningful Timeout value when deadline already expired (607bb27)
- crawl: prioritize anti-bot detection over placeholder warning (05aa933)
- escalate to JS renderer on HTTP failure and empty markdown (9fc7934)
- mcp: apply per-endpoint timeouts to proxy client (741f1b2)
- renderer: enforce Deadline in HttpFetcher via tokio::time::timeout (b1c4058)
- renderer: keep larger thin-result HTML when stitching attempts (8147236)
- renderer: rescue 39 bench failures via UA, retry, and thin-content escalation (ddacb49)
- server: classify anti-bot challenges as anti_bot, not no-markdown (3ece4dd)
Performance
- renderer: drop fixed 2s JS wait, rely on SPA selector poll (cb043f7)
- renderer: tighten tier timeouts and bump LP retry threshold (3f93d60)
- renderer: widen breaker tolerance to 20 failures / 10s cooldown (6525a84)
Miscellaneous
- release 0.5.0 (3987de1)
0.4.2 (2026-04-29)
Features
- core: add render decision types and prometheus metrics scaffold (e08682b)
- renderer: add per-host renderer preference cache (21e41d1)
- renderer: track HTTP routing and warn on pinned-renderer failure (3208d27)
- renderer: wire host preferences, circuit breakers, and CF detection (0c53c64)
Bug Fixes
- core,renderer: surface render metadata and harden host normalization (ee4130b)
- renderer: correct failure classification and routing decisions (4d684bd)
- renderer: probe lifecycle, RAII guard, breaker counter (02044f5)
0.4.1 (2026-04-28)
Features
- add per-request renderer field for scrape and crawl APIs (#29) (f1e0b63)
- crw-browse: add interactive browser MCP server with phase-2 tools (e78879d)
- honor renderer mode and force_js in config (fixes #28) (b76e473)
Bug Fixes
- detect failed JS renders and fail over to next renderer (fca8fd5)
- docs: use absolute logo paths in site.config.js (c5c9321)
- docs: use absolute paths for logo and favicon assets (cdb1451)
0.4.0 (2026-04-22)
Features
- add crw-browse MCP server, SOCKS5 proxy, extract mcp-proto (9a53753)
Miscellaneous
- release 0.4.0 (e15fc74)
0.3.6 (2026-04-21)
Features
- ci: add Google Indexing API notification for docs changes (3b5a340)
- docs: generate static HTML pages for SEO indexability (7b321c0)
Bug Fixes
- ci: trigger release workflow after release-please creates tag (27f2b67)
- mcp: bump npm optionalDependencies from 0.3.0 to 0.3.5 (0e363e0)
- renderer: detect loading placeholders and poll for content stability (d3b642b)
0.3.5 (2026-04-09)
Features
- mcp: add crw_search tool for cloud/proxy mode (7fe4a8e)
0.3.4 (2026-04-09)
Bug Fixes
0.3.3 (2026-04-09)
Features
- add APT/Debian package distribution (c34b8e9)
- renderer: spawn all available browsers for multi-renderer fallback (f546437)
0.3.2 (2026-04-08)
Bug Fixes
- cli: auto-prepend https:// when no scheme provided (1050606)
0.3.1 (2026-04-08)
Features
- add llms.txt, SKILL.md, MCP init command, and docs UI improvements (1b22d19)
- add one-line install script with auto platform detection (6354f79)
- docs: add dark mode logo support and improve docs UI (047df7b)
- docs: align design with SaaS site and update branding (631d07c)
- docs: unify docs into docs.fastcrw.com with Mintlify-style design (4994998)
- docs: update URLs, dark mode, syntax highlighting, and benchmarks (0678cdf)
- release all 3 binaries, CLI auto-browser, README overhaul (aa2950d)
- update README banner with new logo (bcba1ad)
Bug Fixes
- crawl HTTP polling bug + SDK test suite + docs (#16) (b6d8983)
- remove internal implementation detail from roadmap (a5013f0)
0.3.0 (2026-04-02)
Features
- add search() method to Python SDK and docs (591e3fe)
0.2.2 (2026-04-02)
Bug Fixes
- renderer: escalate to JS renderer on HTTP 401/403 responses (f515caa)
- use GitHub latest release instead of pinned version for binary download (4afcb1a)
0.2.1 (2026-03-28)
Bug Fixes
- make crw-mcp npm wrapper executable (576a9eb)
- use latest tag in server.json OCI identifier (7ec3b82)
0.2.0 (2026-03-28)
Features
- add MCP Registry support for official server discovery (154b9f5)
0.1.2 (2026-03-27)
Bug Fixes
- vendor pdf-inspector as crw-pdf for crates.io publishability (3f7681d)
0.1.1 (2026-03-26)
Bug Fixes
- skip already-published crates without masking real errors (010649c)
0.1.0 (2026-03-26)
Features
- add PDF extraction support via pdf-inspector (06dd5bf)
0.0.14 (2026-03-25)
Features
- mcp: auto-download LightPanda binary for zero-config JS rendering (41f443b)
- mcp: auto-spawn headless Chrome for JS rendering in embedded mode (9a6b0ae)
Bug Fixes
- ci: move crw-mcp to Tier 4 in release workflow and add workflow_dispatch (d7584a8)
0.0.13 (2026-03-24)
Features
- mcp: add embedded mode — self-contained MCP server, no crw-server needed (75e5450)
Bug Fixes
- ci: switch release-please to simple type for Rust workspace support (51cd420)
v0.0.12
- Readability drill-down — when <main> or <article> wraps >90% of body, the extractor now searches inside for narrower content elements (.main-page-content, .article-content, .entry-content, etc.) instead of discarding. Fixes MDN pages returning 35 chars and StackOverflow returning only the question
- Base64 image stripping — data: URI images are removed in both HTML cleaning (lol_html) and markdown post-processing (regex safety net). Eliminates massive base64 blobs from Reddit and similar sites
- Select/dropdown removal — <select> elements removed in onlyMainContent mode; dropdown/city-selector/location-selector noise patterns added. Fixes Hürriyet city dropdown leaking into content
- Extended scored selectors — added .main-page-content, .js-post-body, .s-prose, #question, .page-content, #page-content, [role="article"] for better MDN, StackOverflow, and generic site coverage
- Smarter fallback chain — when primary extraction produces too-short markdown, both fallbacks (cleaned HTML and basic clean) are tried and the longer result is picked, instead of short-circuiting on non-empty but insufficient content
v0.0.11
- Stealth anti-bot bypass — automatic stealth JS injection via Page.addScriptToEvaluateOnNewDocument before every CDP navigation. Spoofs navigator.webdriver, Chrome runtime object, plugins array, languages, permissions API, iframe contentWindow, and toString() proxy to bypass Cloudflare, PerimeterX, and other bot detection systems
- Cloudflare challenge auto-retry — detects Cloudflare JS challenge pages ("Just a moment", cf-browser-verification, challenge-platform) after page load and polls up to 3 times at 3-second intervals for non-interactive challenges to auto-resolve
- HTTP → CDP auto-escalation — FallbackRenderer::fetch() in auto mode now checks HTTP responses for anti-bot challenge signatures and automatically escalates to JS rendering when detected, instead of returning the challenge HTML
- Chrome failover in Docker — full automatic failover chain: HTTP → LightPanda → Chrome. Added chromedp/headless-shell as a Docker Compose sidecar service with 2GB shared memory. If LightPanda crashes on complex SPAs (React, Angular), Chrome handles the render
- Chrome WS URL auto-discovery — CDP renderer resolves Chrome DevTools WebSocket URL via the /json/version HTTP endpoint with Host: localhost header (required for chromedp/headless-shell's socat proxy). Uses OnceCell for lazy one-time resolution
- Proxy configuration docs — expanded proxy config comments with examples for HTTP, SOCKS5, and residential proxy providers (IPRoyal, Oxylabs, Smartproxy)
- Raw string delimiter fix — fixed markdown.rs test that used r#"..."# with a string containing "#, changed to r##"..."##
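The raw string delimiter fix follows a general Rust rule: a raw string literal ends at the first double quote followed by as many hashes as opened it, so content containing a quote followed by one hash needs at least two hashes in the delimiter. A minimal illustration follows; the string content and function name are hypothetical, not the actual fixture from markdown.rs.

```rust
/// Returns a snippet that contains a double quote directly followed by `#`.
/// The content is a made-up example, not the real test fixture.
fn snippet_with_quote_hash() -> &'static str {
    // With r#"..."# the literal would end right after `href=`, because the
    // quote there is followed by a single hash. Two hashes avoid that.
    r##"<a href="#top">top</a>"##
}

fn main() {
    let s = snippet_with_quote_hash();
    assert!(s.contains("\"#"));
    println!("{s}");
}
```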
v0.0.10 / v0.0.9
- Crawl cancel endpoint — DELETE /v1/crawl/{id} cancels a running crawl job via AbortHandle and returns { success: true }
- API rate limiting — token-bucket rate limiter (configurable rate_limit_rps, default 10). Returns 429 with error_code: "rate_limited" when exceeded
- Machine-readable error codes — all error responses now include an error_code field (e.g. "invalid_url", "http_error", "rate_limited", "not_found")
- Map response envelope — /v1/map now returns { success, data: { links } } instead of { success, links } for consistency with other endpoints
- Fenced code blocks — indented code blocks (4-space) are post-processed into fenced (```) blocks for better LLM/RAG compatibility
- Sphinx footer cleanup — "footer" added to exact-token noise patterns, catching <div class="footer"> in Sphinx/documentation sites
- renderedWith: "http" — HTTP-only fetches now report rendered_with: "http" in metadata instead of null
- 405 JSON responses — all routes now have .fallback(method_not_allowed) returning structured JSON with error_code: "method_not_allowed" instead of empty bodies
- Anchor link cleanup — empty anchor links ([](#id), [¶](#id)) and pilcrow/section signs stripped from Markdown output
- role="contentinfo" cleanup — elements with ARIA roles contentinfo, navigation, banner, complementary removed during cleaning
- Tiny chunk merging — topic chunking merges heading-only chunks (<50 chars) with the next chunk to improve RAG embedding quality
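For readers unfamiliar with the token-bucket approach named in the rate limiting entry, here is a rough sketch of the idea. It is not CRW's implementation: the TokenBucket type, its methods, and the choice to use the rate as the burst capacity are assumptions made for illustration. Only the default of 10 requests per second and the 429 status come from the entry above.

```rust
/// Minimal token bucket. Elapsed time is passed in explicitly so the sketch
/// stays deterministic; a real server would read a monotonic clock instead.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
}

impl TokenBucket {
    /// `rps` plays the role of rate_limit_rps (default 10 per the changelog).
    /// Using `rps` as the burst capacity too is an assumption of this sketch.
    fn new(rps: f64) -> Self {
        Self { capacity: rps, tokens: rps, refill_per_sec: rps }
    }

    /// Credit tokens for `elapsed_secs` of wall time, capped at capacity.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec).min(self.capacity);
    }

    /// Returns the HTTP status to send: 200 if a token was available, else 429.
    fn check(&mut self) -> u16 {
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            200
        } else {
            429
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(10.0);
    // Eleven back-to-back requests: the first ten pass, the eleventh is limited.
    let statuses: Vec<u16> = (0..11).map(|_| bucket.check()).collect();
    println!("{statuses:?}");
    // Half a second later, five tokens have been credited back.
    bucket.refill(0.5);
    println!("{}", bucket.check());
}
```

In this sketch a burst of up to 10 requests is allowed before throttling starts; a stricter limiter would use a smaller capacity than the refill rate.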
v0.0.8
- Wikipedia / MediaWiki onlyMainContent fix — onlyMainContent: true now correctly extracts article text from Wikipedia pages (~49% size reduction). Previously the <html> element's class="vector-toc-available" matched the "toc" noise pattern via substring, removing the entire page
- 3-tier noise pattern matching — noise class/id matching now uses substring (long patterns), exact-token (short/ambiguous: toc, share, social, comment, related), and prefix (ad-, ads-) matching to avoid false positives
- Structural element guard — noise handler never removes <html>, <head>, <body>, or <main> elements
- Re-clean after readability — readability output is re-cleaned to strip residual noise (infobox, navbox, catlinks) that survives inside broad containers
- Wikipedia-aware readability — added .mw-parser-output, #mw-content-text, #bodyContent to scored selectors; priority/scored selectors that wrap >90% of body are skipped
- BYOK LLM extraction — per-request llmApiKey, llmProvider, llmModel fields for bring-your-own-key structured extraction without server config
- JSON format validation — formats: ["json"] without jsonSchema now returns a 400 error instead of a warning
- Block detection skip — pages >50 KB skip interstitial/block detection (no more false "blocked by anti-bot" on Wikipedia)
- Null byte URL rejection — URLs with %00 or null bytes rejected at validation
- Request timeout — default timeout bumped from 60s to 120s
- Dockerfile fix — corrected cargo build flags, added config.docker.toml
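The 3-tier matching described above can be sketched roughly as follows. This is an illustration of the idea only: the Pattern enum, the is_noise function, and the "sidebar-widget" pattern are hypothetical, and splitting the attribute on whitespace to get tokens is an assumption. The toc and ad- patterns and the vector-toc-available class come from the entries above.

```rust
/// How a noise pattern is compared against a class or id attribute value.
enum Pattern {
    /// Long, unambiguous patterns: match anywhere in the attribute value.
    Substring(&'static str),
    /// Short or ambiguous patterns: must equal a whole whitespace-separated token.
    ExactToken(&'static str),
    /// Patterns such as "ad-": some token must start with them.
    Prefix(&'static str),
}

/// Returns true if the attribute value matches any of the patterns.
fn is_noise(attr: &str, patterns: &[Pattern]) -> bool {
    patterns.iter().any(|p| match p {
        Pattern::Substring(s) => attr.contains(*s),
        Pattern::ExactToken(t) => attr.split_whitespace().any(|tok| tok == *t),
        Pattern::Prefix(pre) => attr.split_whitespace().any(|tok| tok.starts_with(*pre)),
    })
}

fn main() {
    let patterns = [
        Pattern::Substring("sidebar-widget"), // hypothetical long pattern
        Pattern::ExactToken("toc"),
        Pattern::Prefix("ad-"),
    ];
    // "vector-toc-available" contains "toc" but is not the token "toc",
    // so exact-token matching no longer flags it.
    for attr in ["vector-toc-available", "toc", "ad-banner", "header"] {
        println!("{attr}: {}", is_noise(attr, &patterns));
    }
}
```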
v0.0.7
- success: false on 4xx targets — scraping a 403/404/429 target with minimal body now correctly returns success: false with error details, instead of success: true with a warning. Targets with real content (custom error pages) still return success: true with a warning
- JS renderer fallback warning — when renderJs: true is requested but no CDP renderer is available, the response now includes rendered_with: "http_only_fallback" and a warning instead of silently falling back
- CDP health check — is_available() now runs a real Browser.getVersion command instead of just testing the WebSocket connection
- Specific error messages — unknown formats now return descriptive errors (e.g., "Unknown format 'extract'. Valid formats: ...") instead of generic 422
- "extract" format alias — formats: ["extract"] and formats: ["llm-extract"] are now accepted as aliases for "json" (Firecrawl compatibility)
- Chunk dedup by default — deduplication is now enabled by default for all chunking strategies; separator-only chunks (---, ***) are filtered out
- Chunk relevance scores — chunks now return { content, score, index } objects instead of plain strings when a query is provided
- Map timeout — /v1/map accepts a timeout parameter (default 120s, max 300s) to prevent 502s on large sites
- Stealth + JS rendering fix — stealth: true with renderJs: true no longer bypasses CDP; the shared renderer is used with stealth headers injected
- BM25 NaN guard — prevents NaN scores when all chunks are empty
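As background for the BM25 NaN guard entry: one plausible way NaN can appear is the length normalization, where an average chunk length of zero turns len / avg_len into 0/0. The sketch below shows a guarded per-term score under that assumption. The function name, its parameters, and the k1 = 1.2, b = 0.75 constants are conventional choices for illustration and are not taken from CRW's code.

```rust
/// BM25 contribution of one query term for one chunk.
/// `tf` = term frequency in the chunk, `len` = chunk length in tokens,
/// `avg_len` = average chunk length, `idf` = inverse document frequency.
/// K1 and B are conventional defaults, assumed for this sketch.
fn bm25_term(tf: f64, len: f64, avg_len: f64, idf: f64) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    // Guard: when every chunk is empty, avg_len is 0 and len / avg_len
    // would be 0/0 = NaN, which then poisons the whole score.
    if avg_len <= 0.0 {
        return 0.0;
    }
    let norm = 1.0 - B + B * (len / avg_len);
    idf * (tf * (K1 + 1.0)) / (tf + K1 * norm)
}

fn main() {
    // All chunks empty: the guard yields a zero score instead of NaN.
    println!("{}", bm25_term(0.0, 0.0, 0.0, 1.0));
    // A chunk of average length containing the term twice scores above zero.
    println!("{}", bm25_term(2.0, 100.0, 100.0, 1.0));
}
```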
v0.0.6
- Crate READMEs on crates.io — all 7 crates now have detailed README documentation visible on their crates.io pages, with usage examples, API docs, and installation instructions
v0.0.5
- crw-cli now on crates.io — install the standalone CLI with cargo install crw-cli and scrape URLs without running a server
- Parallelized release workflow — crate publishing uses tiered parallelism, cutting release time by ~2.25 minutes
- CLI and MCP install docs — README now includes cargo install instructions for both crw-cli and crw-mcp
v0.0.4
- Hardened rendering and warning semantics — improved reliability of the rendering pipeline and warning detection logic
- XPath output escaping — XPath extraction results are now properly escaped to prevent injection
- Broadened status warnings — expanded HTTP status code range that triggers warning metadata
- Capped interstitial scan — bounded interstitial page detection to avoid excessive scanning
- Clippy cleanup — simplified status code checks for cleaner, idiomatic Rust
v0.0.3
- Warning-aware target handling — 4xx and anti-bot targets now return success: true with warning and metadata.statusCode
- More reliable JS rendering — CDP navigation now waits for real page lifecycle completion before applying waitFor
- Stealth decompression fix — gzip and brotli responses decode cleanly instead of leaking garbled binary payloads
- Crawl compatibility — limit, maxPages, and max_pages now normalize to the same crawl cap
- XPath and chunking fixes — XPath returns all matches, chunk overlap/dedupe is supported, and scorer rank order is preserved
v0.0.2
- CSS selector & XPath — target specific DOM elements before Markdown conversion (cssSelector, xpath)
- Chunking strategies — split content into topic, sentence, or regex-delimited chunks for RAG pipelines (chunkStrategy)
- BM25 & cosine filtering — rank chunks by relevance to a query and return top-K results (filterMode, topK)
- Better Markdown — switched to htmd (Turndown.js port): tables, code block languages, nested lists all render correctly
- Stealth mode — rotate User-Agent from a built-in Chrome/Firefox/Safari pool and inject 12 browser-like headers (stealth: true)
- Per-request proxy — override the global proxy on a per-request basis (proxy: "http://...")
- Rate limit jitter — randomized delay between requests to avoid uniform traffic fingerprinting
- crw-server setup — one-command JS rendering setup: downloads LightPanda, creates config.local.toml
v0.0.1
- Firecrawl-compatible REST API — /v1/scrape, /v1/crawl, /v1/map with identical request/response format
- 6 output formats — markdown, HTML, cleaned HTML, raw HTML, plain text, links, structured JSON
- LLM structured extraction — JSON schema in, validated structured data out (Anthropic tool_use + OpenAI function calling)
- JS rendering — auto-detect SPAs via heuristics, render via LightPanda, Playwright, or Chrome (CDP)
- BFS crawler — async crawl with rate limiting, robots.txt, sitemap support, concurrent jobs
- MCP server — built-in stdio + HTTP transport for Claude Code and Claude Desktop
- SSRF protection — private IPs, cloud metadata, IPv6, dangerous URI filtering
- Docker ready — multi-stage build with LightPanda sidecar