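AI Skill Hub recommendation: Go web crawler (doc-scraper) is an open-source AI tool with an AI composite score of 7.5, a solid result among comparable tools. If you are looking for a reliable AI tooling solution, it is worth a closer look.
A web crawler written in Go that crawls documentation websites and converts their content into clean Markdown, making documentation easier to manage.
Go web crawler is an open-source tool written in Go and tagged with the topics web-scraper, go, and llm. As an open-source project on GitHub, it has community support and ongoing releases, its code is fully transparent and auditable, and it can be deployed locally to keep data private. It suits both personal use and integration into enterprise workflows.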
Method 1: go install (recommended)
go install github.com/Sriram-PR/doc-scraper/cmd/doc-scraper@latest
Method 2: build from source
git clone https://github.com/Sriram-PR/doc-scraper
cd doc-scraper
go build -o doc-scraper ./cmd/doc-scraper
Method 3: download the prebuilt binary for your platform from the Releases page
https://github.com/Sriram-PR/doc-scraper/releases
View help:
doc-scraper --help
Basic usage:
doc-scraper <command> [options]
For detailed usage, see the documentation:
https://github.com/Sriram-PR/doc-scraper
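doc-scraper configuration
View the configuration options:
doc-scraper --config-example > config.yaml
Common configuration keys:
output_base_dir: ./crawled_docs
num_workers: 4
The log level is set with the -loglevel flag (default: info).
Environment variable (overrides the config file):
export DOC_SCRAPER_CONFIG="/path/to/config.yaml"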
A configurable, concurrent, and resumable web crawler written in Go. Specifically designed to scrape technical documentation websites, extract core content, convert it cleanly to Markdown format suitable for ingestion by Large Language Models (LLMs), and save the results locally.
This project provides a powerful command-line tool to crawl documentation sites based on settings defined in a config.yaml file. It navigates the site structure, extracts content from specified HTML sections using CSS selectors, and converts it into clean Markdown files.
| Feature | Description |
|---|---|
| **Configurable Crawling** | Uses YAML for global and site-specific settings |
| **Scope Control** | Limits crawling by domain, path prefix, and disallowed path patterns (regex) |
| **Content Extraction** | Extracts main content using CSS selectors |
| **HTML-to-Markdown** | Converts extracted HTML to clean Markdown |
| **Image Handling** | Optional downloading and local rewriting of image links with domain and size filtering |
| **Link Rewriting** | Rewrites internal links to relative paths for local structure |
| **URL-to-File Mapping** | Optional TSV file logging saved file paths and their corresponding original URLs |
| **YAML Metadata Output** | Optional detailed YAML file per site with crawl stats and per-page metadata (including content hashes) |
| **Concurrency** | Configurable worker pools and semaphore-based request limits (global and per-host) |
| **Rate Limiting** | Configurable per-host delays with jitter |
| **Robots.txt & Sitemaps** | Respects robots.txt and processes discovered sitemaps |
| **State Persistence** | Uses BadgerDB for state; supports resuming crawls via resume subcommand |
| **Graceful Shutdown** | Handles SIGINT/SIGTERM with proper cleanup |
| **HTTP Retries** | Exponential backoff with jitter for transient errors |
| **Observability** | Structured logging (logrus); optional pprof endpoint (build with -tags pprof) |
| **Modular Code** | Organized into packages for clarity and maintainability |
| **CLI Utilities** | Built-in validate and list-sites commands for configuration management |
| **MCP Server Mode** | Expose as Model Context Protocol server for Claude Code/Cursor integration |
| **Auto Content Detection** | Automatic framework detection (Docusaurus, MkDocs, Sphinx, GitBook, ReadTheDocs) with readability fallback |
| **Parallel Site Crawling** | Crawl multiple sites concurrently with shared resource management |
| **Watch Mode** | Scheduled periodic re-crawling with state persistence |
Option 1: Direct Installation (Recommended)
Install the latest version directly from GitHub:
go install github.com/Sriram-PR/doc-scraper/cmd/doc-scraper@latest
This installs the doc-scraper binary to your GOPATH/bin directory (usually ~/go/bin or %USERPROFILE%\go\bin). Make sure this directory is in your PATH.
Option 2: Clone and Build
git clone https://github.com/Sriram-PR/doc-scraper.git
cd doc-scraper
go mod tidy
make build
# or: go build -o doc-scraper ./cmd/doc-scraper
This creates an executable named doc-scraper in the project root.
Create a config.yaml file (see Configuration section), then run:
./doc-scraper crawl -site your_site_key -loglevel info
Results are written to the ./crawled_docs/ directory.
Execute the compiled binary from the project root directory:
./doc-scraper <command> [options]
Basic Crawl:
./doc-scraper crawl -site tensorflow_docs -loglevel info
Resume a Large Crawl:
./doc-scraper resume -site pytorch_docs -loglevel info
Validate Configuration:
./doc-scraper validate -config config.yaml
./doc-scraper validate -site pytorch_docs # Validate specific site
List Available Sites:
./doc-scraper list-sites
High Performance Crawl with Profiling:
./doc-scraper crawl -site small_docs -loglevel warn -pprof localhost:6060
Debug Mode for Troubleshooting:
./doc-scraper crawl -site test_site -loglevel debug
Parallel Crawl of Multiple Sites:
./doc-scraper crawl -sites pytorch_docs,tensorflow_docs,langchain_docs
Crawl All Configured Sites:
./doc-scraper crawl --all-sites
Start MCP Server for Claude Desktop:
./doc-scraper mcp-server -config config.yaml
Start MCP Server with SSE Transport:
./doc-scraper mcp-server -config config.yaml -transport sse -port 8080
sites:
  pytorch_docs:
    start_urls:
      - "https://pytorch.org/docs/stable/"
    allowed_domain: "pytorch.org"
    allowed_path_prefix: "/docs/stable/"
    content_selector: "auto" # Auto-detect framework
    max_depth: 0
Example watch mode output:
```text
INFO Starting watch mode for 2 sites with interval 24h0m0s
INFO Watch schedule:
INFO pytorch_docs: last run 2024-01-15T10:30:00Z (success, 1500 pages), next run 2024-01-16T10:30:00Z
INFO tensorflow_docs: never run, will run immediately
INFO Running crawl for 1 due sites: [tensorflow_docs]
...
INFO Next crawl: pytorch_docs in 23h45m (at 10:30:00)
```
Stdio Transport (for Claude Desktop/Cursor):
./doc-scraper mcp-server -config config.yaml
SSE Transport (HTTP-based):
./doc-scraper mcp-server -config config.yaml -transport sse -port 8080
List available sites:
Tool: list_sites
Result: Returns all configured sites with their domains and crawl status
Fetch a single page:
Tool: get_page
Arguments: { "url": "https://docs.example.com/guide", "content_selector": "article" }
Result: Returns page content as markdown with metadata
Start a background crawl:
Tool: crawl_site
Arguments: { "site_key": "pytorch_docs", "incremental": true }
Result: Returns job ID for tracking progress
Check crawl progress:
Tool: get_job_status
Arguments: { "job_id": "abc-123-def" }
Result: Returns status, pages processed, and completion info
Search crawled content:
Tool: search_crawled
Arguments: { "query": "neural network", "site_key": "pytorch_docs", "max_results": 10 }
Result: Returns matching pages with snippets
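On the wire, Model Context Protocol tool invocations are JSON-RPC 2.0 requests with the method `tools/call`, carrying the tool name and its arguments. The sketch below builds such a request for the `crawl_site` tool using only the standard library. The struct names and the `newToolCall` helper are hypothetical, and a real client would first complete the MCP initialization handshake, which is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallParams is the params object of an MCP tools/call request.
type toolCallParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string         `json:"jsonrpc"`
	ID      int            `json:"id"`
	Method  string         `json:"method"`
	Params  toolCallParams `json:"params"`
}

// newToolCall serializes a tools/call request for the given tool.
func newToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolCallParams{Name: tool, Arguments: args},
	})
}

func main() {
	req, err := newToolCall(1, "crawl_site", map[string]any{
		"site_key":    "pytorch_docs",
		"incremental": true,
	})
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(req))
}
```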
A config.yaml file is required to run the crawler. Create this file in the project root or specify its path using the -config flag.
When configuring for LLM documentation processing, pay special attention to these settings:
- `sites.<your_site_key>.content_selector`: Define precisely to capture only relevant text
- `sites.<your_site_key>.allowed_domain` / `allowed_path_prefix`: Define scope accurately
- `skip_images`: Set to true globally or per-site if images aren't needed for the LLM

Example config.yaml:

```yaml
default_delay_per_host: 500ms
num_workers: 8
num_image_workers: 8
max_requests: 48
max_requests_per_host: 4
output_base_dir: "./crawled_docs"
state_dir: "./crawler_state"
max_retries: 4
initial_retry_delay: 1s
max_retry_delay: 30s
semaphore_acquire_timeout: 30s
global_crawl_timeout: 0s
skip_images: false # Set to true to skip images globally
max_image_size_bytes: 10485760 # 10 MiB
enable_output_mapping: true
output_mapping_filename: "global_url_map.tsv"
enable_metadata_yaml: true
metadata_yaml_filename: "crawl_meta.yaml"

http_client_settings:
  timeout: 45s
  max_idle_conns_per_host: 6

sites:
  # Key used with -site flag
  pytorch_docs:
    start_urls:
      - "https://pytorch.org/docs/stable/"
    allowed_domain: "pytorch.org"
    allowed_path_prefix: "/docs/stable/"
    content_selector: "article.pytorch-article .body"
    max_depth: 0 # 0 for unlimited depth
    skip_images: false
    # Override global mapping filename for this site
    output_mapping_filename: "pytorch_docs_map.txt"
    metadata_yaml_filename: "pytorch_metadata_output.yaml"
    disallowed_path_patterns:
      - '/docs/stable/.*/_modules/.*'
      - '/docs/stable/.*\.html#.*'

  tensorflow_docs:
    start_urls:
      - "https://www.tensorflow.org/guide"
      - "https://www.tensorflow.org/tutorials"
    allowed_domain: "www.tensorflow.org"
    allowed_path_prefix: "/"
    content_selector: ".devsite-article-body"
    max_depth: 0
    delay_per_host: 1s # Site-specific override
    # Disable mapping for this site, overriding global
    enable_output_mapping: false
    enable_metadata_yaml: false
    disallowed_path_patterns:
      - '/install/.*'
      - '/js/.*'
```
| Option | Type | Description | Default |
|---|---|---|---|
| `default_user_agent` | String | Default User-Agent header for requests | "" (Go default) |
| `default_delay_per_host` | Duration | Time to wait between requests to the same host | 0s (no delay) |
| `num_workers` | Integer | Number of concurrent crawl workers | 4 |
| `num_image_workers` | Integer | Number of concurrent image download workers | same as num_workers |
| `max_requests` | Integer | Maximum concurrent requests (global) | 10 |
| `max_requests_per_host` | Integer | Maximum concurrent requests per host | 2 |
| `output_base_dir` | String | Base directory for crawled content | "./crawled_docs" |
| `state_dir` | String | Directory for BadgerDB state data | "./crawler_state" |
| `max_retries` | Integer | Maximum retry attempts for HTTP requests | 3 |
| `initial_retry_delay` | Duration | Initial delay for retry backoff | 1s |
| `max_retry_delay` | Duration | Maximum delay for retry backoff | 30s |
| `semaphore_acquire_timeout` | Duration | Timeout for acquiring the global semaphore | 30s |
| `global_crawl_timeout` | Duration | Overall timeout for the entire crawl | 0s (no timeout) |
| `per_page_timeout` | Duration | Timeout for processing a single page | 0s (no timeout) |
| `skip_images` | Boolean | Whether to skip downloading images | false |
| `max_image_size_bytes` | Integer | Maximum allowed image size in bytes | 0 (unlimited) |
| `max_page_size_bytes` | Integer | Maximum HTML page body size in bytes | 52428800 (50 MiB) |
| `enable_output_mapping` | Boolean | Enable URL-to-file mapping log | false |
| `output_mapping_filename` | String | Filename for the URL-to-file mapping log | "url_to_file_map.tsv" |
| `enable_metadata_yaml` | Boolean | Enable detailed YAML metadata output file | false |
| `metadata_yaml_filename` | String | Filename for the YAML metadata output file | "metadata.yaml" |
| `enable_jsonl_output` | Boolean | Enable JSONL page output for RAG pipelines | false |
| `jsonl_output_filename` | String | Filename for JSONL output | "pages.jsonl" |
| `enable_incremental` | Boolean | Enable incremental crawling globally | false |
| `db_gc_interval` | Duration | BadgerDB garbage collection interval | 10m |
| `chunking.enabled` | Boolean | Enable token-aware content chunking | false |
| `chunking.max_chunk_size` | Integer | Max chunk size in tokens | 512 |
| `chunking.chunk_overlap` | Integer | Overlap between chunks in tokens | 50 |
| `chunking.output_filename` | String | Chunks output filename | "chunks.jsonl" |
| `http_client_settings` | Object | HTTP client configuration | *(see below)* |
| `sites` | Map | Site-specific configurations | *(required)* |
HTTP Client Settings: (These are global and cannot be overridden per site in the current structure)
- `timeout`: Overall request timeout (Default in code: 45s)
- `max_idle_conns`: Total idle connections (Default in code: 100)
- `max_idle_conns_per_host`: Idle connections per host (Default in code: 2)
- `idle_conn_timeout`: Timeout for idle connections (Default in code: 90s)
- `tls_handshake_timeout`: TLS handshake timeout (Default in code: 10s)
- `expect_continue_timeout`: "100 Continue" timeout (Default in code: 1s)
- `force_attempt_http2`: null (use Go default), true, or false. (Default in code: null)
- `dialer_timeout`: TCP connection timeout (Default in code: 15s)
- `dialer_keep_alive`: TCP keep-alive interval (Default in code: 30s)

Site-Specific Configuration Options:
- `start_urls`: Array of starting URLs for crawling (Required)
- `allowed_domain`: Restrict crawling to this domain (Required)
- `allowed_path_prefix`: Further restrict crawling to URLs with this prefix (Required)
- `content_selector`: CSS selector for main content extraction, or "auto" for automatic detection (Required)
- `max_depth`: Maximum crawl depth from start URLs (0 = unlimited)
- `delay_per_host`: Override global delay setting for this site
- `disallowed_path_patterns`: Array of regex patterns for URLs to skip
- `link_extraction_selectors`: Array of CSS selectors for additional link extraction areas
- `respect_nofollow`: Boolean. Whether to respect rel="nofollow" links
- `user_agent`: String. Override global user agent for this site
- `skip_images`: Override global image setting for this site
- `max_image_size_bytes`: Integer. Override global max image size for this site
- `allowed_image_domains`: Array of domains from which to download images
- `disallowed_image_domains`: Array of domains to block image downloads from
- `enable_output_mapping`: true or false. Override global URL-to-file mapping enablement for this site
- `output_mapping_filename`: String. Override global URL-to-file mapping filename for this site
- `enable_metadata_yaml`: true or false. Override global YAML metadata output enablement for this site
- `metadata_yaml_filename`: String. Override global YAML metadata filename for this site
- `enable_jsonl_output`: true or false. Override global JSONL output enablement for this site
- `jsonl_output_filename`: String. Override global JSONL output filename for this site
- `chunking.enabled`: true or false. Override global chunking enablement for this site
- `chunking.max_chunk_size`: Integer. Override global max chunk size for this site
- `chunking.chunk_overlap`: Integer. Override global chunk overlap for this site
- `chunking.output_filename`: String. Override global chunks output filename for this site

crawl / resume:
| Flag | Description | Default |
|---|---|---|
| `-config <path>` | Path to config file | config.yaml |
| `-site <key>` | Site key from config (single site) | - |
| `-sites <keys>` | Comma-separated site keys for parallel crawling | - |
| `--all-sites` | Crawl all configured sites in parallel | false |
| `-loglevel <level>` | Log level (debug, info, warn, error, fatal) | info |
| `-pprof <addr>` | pprof server address. Only effective in builds with -tags pprof; default builds log a warning and ignore the flag | "" (disabled) |
| `-incremental` | Enable incremental crawling (skip unchanged pages) | false |
| `-full` | Force full crawl (ignore incremental settings) | false |
Note: One of -site, -sites, or --all-sites is required.
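The `-incremental` flag skips unchanged pages, and the metadata output mentions per-page content hashes. The README does not spell out how changes are detected, but one common approach is to compare a hash of the fetched content with the hash stored from the previous crawl. The sketch below illustrates that idea only: `contentHash`, `hashStore`, and `changed` are hypothetical names, SHA-256 is an assumed choice of hash, and the in-memory map stands in for persistent state (the tool itself uses BadgerDB).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex-encoded SHA-256 digest of a page body.
func contentHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// hashStore maps a URL to the hash recorded on the previous crawl.
// In-memory stand-in for persistent crawl state.
type hashStore map[string]string

// changed reports whether body differs from what was last recorded for
// url, and records the new hash when it does.
func (s hashStore) changed(url string, body []byte) bool {
	h := contentHash(body)
	if s[url] == h {
		return false
	}
	s[url] = h
	return true
}

func main() {
	store := hashStore{}
	url := "https://docs.example.com/guide"

	fmt.Println(store.changed(url, []byte("v1"))) // first visit: page is new
	fmt.Println(store.changed(url, []byte("v1"))) // same content: can be skipped
	fmt.Println(store.changed(url, []byte("v2"))) // edited content: reprocess
}
```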
validate:
| Flag | Description | Default |
|---|---|---|
| `-config <path>` | Path to config file | config.yaml |
| `-site <key>` | Site key to validate (optional, validates all if empty) | - |
list-sites:
| Flag | Description | Default |
|---|---|---|
| `-config <path>` | Path to config file | config.yaml |
mcp-server:
| Flag | Description | Default |
|---|---|---|
| `-config <path>` | Path to config file | config.yaml |
| `-transport <type>` | Transport type (stdio, sse) | stdio |
| `-port <num>` | HTTP port (for SSE transport) | 8080 |
| `-loglevel <level>` | Log level (debug, info, warn, error) | info |
watch:
| Flag | Description | Default |
|---|---|---|
| `-config <path>` | Path to config file | config.yaml |
| `-site <key>` | Site key to watch (single site) | - |
| `-sites <keys>` | Comma-separated site keys to watch | - |
| `--all-sites` | Watch all configured sites | false |
| `-interval <duration>` | Crawl interval (e.g., 1h, 24h, 7d) | 24h |
| `-loglevel <level>` | Log level (debug, info, warn, error) | info |
Note: One of -site, -sites, or --all-sites is required.
./doc-scraper crawl --all-sites
./doc-scraper watch --all-sites -interval 7d
Add to your Claude Code configuration (claude_code_config.json):
{
  "mcpServers": {
    "doc-scraper": {
      "command": "/path/to/doc-scraper",
      "args": ["mcp-server", "-config", "/path/to/config.yaml"]
    }
  }
}
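The tool is written in Go and offers good performance and extensibility, which makes it suitable for large-scale documentation crawling. Be sure to respect the terms and policies of the sites you crawl.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out the necessary security assessment.
✅ Apache 2.0: a permissive open-source license. Commercial use is allowed, copyright notices and the NOTICE file must be retained, and the license includes a patent grant.
Overall, Go web crawler is a good-quality AI tool that is competitive among comparable tools. AI Skill Hub will continue to track its updates; consider bookmarking it and adopting it when it fits your use case.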
| Field | Value |
|---|---|
| Original name | doc-scraper |
| Original description | Open-source AI tool: Go web crawler to scrape documentation sites and convert content to clean Markdo. ⭐91 · Go |
| Topics | web-scraper, go, llm, data-preparation |
| GitHub | https://github.com/Sriram-PR/doc-scraper |
| License | Apache-2.0 |
| Language | Go |
Listed: 2026-05-21 · Updated: 2026-05-22 · License: Apache-2.0 · AI Skill Hub does not legally endorse the accuracy of third-party content.