Crawl API
Agent-driven web crawling with AI-powered classification, structured data extraction, and recursive link following.
Credits: 1/page (basic) | 2/page (with classify_documents, extraction_schema, classifiers, or intent) | FREE (summary)
POST /v1/crawls
Deploy an agent to crawl and extract from web sources.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async" or "sync" |
| classify_documents | bool | false | AI classification (triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars). Activates neural URL scoring and junk filtering. Triggers advanced pricing (2 credits/page). |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to existing session |
| dry_run | bool | false | Validate without executing |
Advanced Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| follow_links | bool | false | Recursive crawling (legacy, prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters: `{type: "url" \| ...}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Residential IPs |
| ultra_premium | bool | false | Mobile residential IPs (highest success) |
| retry_strategy | string | "auto" | "none", "auto" (escalate tiers), "premium_only" |
Processing Modes
| Mode | Behavior | Best For |
|---|---|---|
| async | Returns immediately, poll for results | Large crawls |
| sync | Waits for completion (max_pages <= 10) | Small crawls, real-time |
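In async mode a client polls GET /v1/crawls/{crawl_id} until the job settles. A minimal Python sketch, assuming the job object carries a `status` field with the values listed under GET /v1/crawls below; the polling interval and any response fields beyond `status` are assumptions:

```python
import json
import time
import urllib.request

API_BASE = "https://api.zipf.ai/v1"
API_KEY = "YOUR_ZIPF_API_KEY"  # placeholder credential

# Terminal statuses, per the status filter on GET /v1/crawls.
TERMINAL = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """True once a crawl job can no longer make progress."""
    return status in TERMINAL

def poll_crawl(crawl_id: str, interval: float = 5.0) -> dict:
    """Fetch GET /v1/crawls/{crawl_id} until the job finishes."""
    url = f"{API_BASE}/crawls/{crawl_id}"
    while True:
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {API_KEY}"}
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job
        time.sleep(interval)
```

Sync mode avoids the loop entirely but is capped at max_pages <= 10.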
Smart Crawl
When expansion is enabled, the agent prioritizes the crawl frontier using link-graph scoring:
- OPIC: Streaming PageRank approximation that updates as you crawl
- HITS: Hub and authority scoring for link aggregators and authoritative sources
- Anchor text relevance: Link text matching your intent gets priority
Efficiency: 135x average discovery ratio vs traditional BFS crawling.
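The OPIC idea above can be illustrated in a few lines: each page holds "cash"; crawling a page banks its cash into a running total (the importance estimate) and splits the cash evenly across its outlinks. A toy sketch of the streaming-PageRank idea, not the service's implementation:

```python
from collections import defaultdict

def opic_step(cash, history, page, outlinks):
    """One OPIC update: bank `page`'s cash, then split it over its outlinks.

    `cash` holds each URL's current cash; `history` accumulates cash ever
    received, which serves as the importance estimate.
    """
    amount = cash.pop(page, 0.0)
    history[page] += amount
    if not outlinks:
        return
    share = amount / len(outlinks)
    for url in outlinks:
        cash[url] += share

cash = defaultdict(float, {"a": 1.0})
history = defaultdict(float)
opic_step(cash, history, "a", ["b", "c"])
# "a" banks 1.0 into history; "b" and "c" each receive cash 0.5
```

Because each step touches only one page and its outlinks, the estimate updates as the crawl streams, which is what lets the agent reprioritize mid-crawl.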
Intent-Driven Crawling
With intent + expansion, the agent activates neural URL scoring that dramatically improves crawl efficiency on navigation-heavy sites:
- LLM-driven URL prioritization: Every candidate URL is scored for relevance before fetching
- Junk URL filtering: Known non-content URLs are filtered before they enter the queue
- Intent-driven scoring weights: Scoring shifts from graph-dominated to model-dominated
- Depth auto-adjustment: Default depth 1->2 (or 3 with "individual"/"specific"/"detail" keywords)
- Budget allocation: 30% hub discovery, 60% detail expansion, 10% exploration
Pricing: Intent triggers advanced pricing (2 credits/page instead of 1).
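The pricing rules stated in this doc (1 credit/page basic, 2 credits/page with any advanced feature, 50% on cache hits) can be checked locally before submitting. The function below is only a sketch of those rules; POST /v1/crawls/estimate remains the authoritative source:

```python
def estimate_credits(pages, *, classify=False, schema=None,
                     classifiers=None, intent=None, cached_pages=0):
    """Local credit estimate from the documented pricing rules.

    Any of classify_documents, extraction_schema, classifiers, or intent
    bumps the rate to 2 credits/page; cache hits are billed at 50%.
    """
    per_page = 2 if (classify or schema or classifiers or intent) else 1
    fresh = pages - cached_pages
    return fresh * per_page + cached_pages * per_page * 0.5
```

For example, a 5-page basic crawl costs 5 credits, while a 30-page intent-driven crawl costs 60.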
Examples
```bash
# Basic crawl (5 credits)
curl -X POST https://api.zipf.ai/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'
```

```bash
# Intent-driven crawl with extraction
curl -X POST https://api.zipf.ai/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/team"],
    "max_pages": 30,
    "expansion": "internal",
    "intent": "Find individual team member bio pages with background and role",
    "extraction_schema": {
      "name": "Extract the full name of the person",
      "role": "Extract job title or role"
    }
  }'
```
Extraction Schema
Extract structured data using AI. Triggers advanced pricing.
Format: `{"field_name": "Extraction instruction (min 10 chars)"}`. Field names must match `[a-zA-Z_][a-zA-Z0-9_]*`; max 20 fields.
Response: `extracted_data` (field -> value), `extraction_metadata` (`fields_extracted`, `confidence_scores`).
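The schema constraints above (field-name pattern, 20-field cap, 10-character minimum per instruction) can be validated client-side before a request spends credits. A small sketch:

```python
import re

# Field-name rule and limits as stated in this doc.
FIELD_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def validate_schema(schema: dict) -> bool:
    """Reject an invalid extraction_schema locally, before sending it."""
    if len(schema) > 20:
        raise ValueError("extraction_schema allows at most 20 fields")
    for name, instruction in schema.items():
        if not FIELD_NAME.fullmatch(name):
            raise ValueError(f"invalid field name: {name!r}")
        if len(instruction) < 10:
            raise ValueError(f"instruction for {name!r} must be >= 10 chars")
    return True
```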
GET /v1/crawls
List crawl jobs. Query: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
GET /v1/crawls/{crawl_id}
Get detailed crawl status, results, classifications, and summary.
Response includes: access_level (owner/org_viewer/public).
When Smart Crawl was active, includes smart_crawl section:
| Field | Description |
|---|---|
| smart_crawl.enabled | Whether Smart Crawl was active |
| smart_crawl.algorithm | Algorithm used (opic) |
| smart_crawl.stats | total_nodes, total_edges, avg_inlinks, graph_density |
| smart_crawl.top_pages[] | url, importance, inlinks |
| smart_crawl.domain_authority | Domain -> authority score map |
| smart_crawl.efficiency | coverage_ratio, importance_captured, avg_importance_per_page |
| smart_crawl.hubs[] | url, score |
| smart_crawl.authorities[] | url, score |
Each result also includes: importance_score, inlink_count, outlink_count, hub_score, authority_score, anchor_texts[].
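Since each result carries importance_score, a client can rank crawled pages locally without another API call. A trivial sketch (the result-list shape beyond the fields listed above is an assumption):

```python
def rank_by_importance(results, n=5):
    """Return the n results with the highest importance_score;
    results missing the field sort last."""
    return sorted(
        results,
        key=lambda r: r.get("importance_score", 0.0),
        reverse=True,
    )[:n]
```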
DELETE /v1/crawls/{crawl_id}
Cancel a running/scheduled crawl. Releases reserved credits.
POST /v1/crawls/{crawl_id}/abort
Abort a running crawl job and release reserved credits.
POST /v1/crawls/{crawl_id}/share
Toggle sharing. Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}
Crawls in sessions inherit org sharing from parent session.
POST /v1/crawls/estimate
Estimate crawl cost without executing.
Parameters: Same as POST /v1/crawls.
Response: cost_estimate, would_succeed, credits_balance
GET /v1/public/crawls/{crawl_id}
Access publicly shared crawl (no auth).
POST /v1/sessions/{session_id}/crawl
Crawl within a session context. Same parameters as POST /v1/crawls plus:
filter_seen_urls (bool, default: true) - Exclude URLs already seen in the session
See Sessions API for details.