Crawl API
Web crawling with AI-powered classification, structured data extraction, and recursive link following.
POST /api/v1/crawls
Create and execute a web crawl job.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async", "sync", or "webhook" |
| classify_documents | bool | false | AI classification (opt-in, triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced pricing) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars). Guides URL prioritization when expansion is enabled |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to existing session |
| dry_run | bool | false | Validate without executing |
| webhook_url | URL | - | Required if processing_mode="webhook" |
Advanced Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| follow_links | bool | false | Enable recursive crawling (legacy, prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters: `{type: "url" \| ...}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Enable global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Start at premium proxy tier (residential IPs) |
| ultra_premium | bool | false | Start at ultra premium tier (mobile residential IPs, highest success rate) |
| retry_strategy | string | "auto" | "none" (single attempt), "auto" (escalate standard → premium → ultra_premium), "premium_only" (start at premium, escalate to ultra) |
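As a sketch of how these options can be combined in one request (the URL patterns and page counts shown are illustrative, not required values):
# Cached crawl with link extraction limits and automatic proxy escalation
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/docs"],
    "max_pages": 25,
    "use_cache": true,
    "retry_strategy": "auto",
    "link_extraction_config": {"max_depth": 2, "url_patterns": ["/docs/*"], "exclude_patterns": ["/docs/archive/*"], "detect_pagination": true}
  }'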
Credits
| Mode | Cost |
|---|---|
| Basic (default - no AI features) | 1 credit/page |
| Advanced (classify_documents: true, extraction, or classifiers) | 2 credits/page |
| Cached (use_cache: true + cache hit) | 1 credit/page |
| Summary | FREE |
Advanced pricing triggered by: classify_documents: true, extraction_schema, or classifiers[]
Processing Modes
| Mode | Behavior | Best For |
|---|---|---|
| async | Returns immediately, poll for results | Large crawls, queued processing |
| sync | Waits for completion (use max_pages ≤ 10) | Small crawls, real-time needs |
| webhook | Returns immediately, POSTs results to URL | Event-driven architectures |
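A webhook-mode request looks like any other crawl, plus a reachable webhook_url; the receiver URL below is a placeholder for your own endpoint:
# Webhook crawl (results POSTed to your endpoint when finished)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 10, "processing_mode": "webhook", "webhook_url": "https://example.com/hooks/crawl"}'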
Smart Crawl
When expansion is enabled, URLs are automatically prioritized using intelligent algorithms. High-value pages are crawled first, maximizing information capture within your page budget.
Algorithms:
- OPIC (Online Page Importance Computation): Streaming PageRank approximation that updates as you crawl
- HITS: Hub and authority scoring to find link aggregators and authoritative sources
- Anchor text relevance: Uses link text to prioritize pages matching your intent
Efficiency: 135x average discovery ratio; target pages are found in 5-6 steps versus 20+ with traditional BFS crawling.
Intent-Driven Crawling
When you provide an intent parameter alongside expansion, the crawler uses your goal to guide URL prioritization:
- Depth auto-adjustment: Default depth increases from 1 to 2 (or 3 when intent mentions "individual", "specific", "detail" etc.)
- URL pattern boosting: URLs matching intent-related patterns get +0.3 score boost
- Budget allocation: Page budget is split into hub discovery (30%), detail expansion (60%), and exploration reserve (10%)
- Navigation penalty recovery: The -0.4 penalty on /about/ and /team/ paths is partially recovered (+0.2) when they match intent patterns
- Effectiveness telemetry: Metrics stored per crawl: extraction_success_rate, budget_efficiency, depth_to_first_target
Example: Without intent, a 30-page crawl of example.com/team typically reaches 0 individual partner pages (budget exhausted on navigation). With intent: "Find individual partner bio pages", the same crawl reaches 15+ partner pages.
No extra cost: Intent-driven prioritization is heuristic-based and does not consume additional credits.
Response Enhancements:
When Smart Crawl is active, the GET /api/v1/crawls/{id} response includes a smart_crawl section:
{
"smart_crawl": {
"enabled": true,
"algorithm": "opic",
"stats": {
"total_nodes": 150,
"total_edges": 420,
"avg_inlinks": 2.8,
"graph_density": 0.0187
},
"top_pages": [
{"url": "https://example.com/about", "importance": 0.0823, "inlinks": 12}
],
"domain_authority": {
"example.com": 0.85,
"blog.example.com": 0.12
},
"efficiency": {
"coverage_ratio": 0.75,
"importance_captured": 0.42,
"avg_importance_per_page": 0.021
},
"hubs": [{"url": "...", "score": 0.15}],
"authorities": [{"url": "...", "score": 0.22}],
"computed_at": "2026-01-27T10:30:00Z"
}
}
Each result in results[] also includes Smart Crawl data when available:
{
"url": "https://example.com/team",
"title": "Our Team",
"content": "...",
"smart_crawl": {
"importance_score": 0.0823,
"inlink_count": 12,
"outlink_count": 5,
"hub_score": 0.02,
"authority_score": 0.15,
"anchor_texts": ["team", "about us", "our people"]
}
}
Examples
# Basic crawl (5 credits: 5 × 1)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'
# Sync crawl with classification (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "processing_mode": "sync"}'
# Crawl with extraction (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/blog"],
"max_pages": 5,
"classify_documents": false,
"extraction_schema": {
"title": "Extract the main article title",
"author": "Extract the author name"
}
}'
# Internal link expansion with Smart Crawl
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 20, "expansion": "internal"}'
# Intent-driven crawl (reaches detail pages behind listings)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/team"],
"max_pages": 30,
"expansion": "internal",
"intent": "Find individual team member bio pages with background and role",
"extraction_schema": {
"name": "Extract the person full name",
"role": "Extract job title or role"
}
}'
# Dry run - estimate cost
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 50, "dry_run": true}'
Extraction Schema
Extract structured data using AI. Triggers advanced pricing.
Schema format:
{
"extraction_schema": {
"field_name": "Extraction instruction (min 10 chars)"
}
}
Validation: Field names must match [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.
Response includes:
{
"extracted_data": {"title": "Article Title", "author": "John Smith"},
"extraction_metadata": {
"fields_extracted": 2,
"confidence_scores": {"title": 0.98, "author": 0.95}
}
}
GET /api/v1/crawls/suggest-schema
Get information about the suggest-schema endpoint, including parameters and response format.
POST /api/v1/crawls/suggest-schema
Analyze a URL and suggest extraction fields. Cost: 2 credits
curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{"url": "https://example.com/product"}'
Response: detected_page_type, suggested_schema, field_metadata with confidence scores.
Supported page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About
GET /api/v1/crawls/preview-extraction
Get information about the preview-extraction endpoint, including parameters and response format.
POST /api/v1/crawls/preview-extraction
Test extraction schema on a single URL. Cost: 1 credit
curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{
"url": "https://example.com/article",
"extraction_schema": {"title": "Extract the article title"}
}'
GET /api/v1/crawls
List crawl jobs with pagination.
Query params: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
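For instance, to page through completed crawls (the query values are just examples):
# List the 20 most recent completed crawls
curl "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
  -H "Authorization: Bearer wvr_TOKEN"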
GET /api/v1/crawls/{id}
Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.
Response includes:
- access_level: Your access level (owner, org_viewer, or public)
- smart_crawl section (when expansion was used) with link graph stats, top pages by importance, domain authority, and efficiency metrics. See Smart Crawl section above.
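A typical status poll (the ID is a placeholder):
# Poll crawl status and results
curl https://www.zipf.ai/api/v1/crawls/CRAWL_ID \
  -H "Authorization: Bearer wvr_TOKEN"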
DELETE /api/v1/crawls/{id}
Cancel a running/scheduled crawl. Releases reserved credits.
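For example (placeholder ID):
# Cancel a running crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/CRAWL_ID \
  -H "Authorization: Bearer wvr_TOKEN"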
Link Graph API
Detailed link graph analysis for crawls and sessions. Cost: FREE
GET /api/v1/crawls/{id}/link-graph
Get paginated link graph data for a single crawl.
Query Parameters:
| Parameter | Default | Description |
|---|---|---|
| include_nodes | true | Include node data in response |
| include_edges | false | Include edge data (can be large) |
| node_limit | 100 | Max nodes to return (1-1000) |
| sort_by | pagerank | Sort nodes by: pagerank, inlinks, authority, hub |
| min_importance | 0 | Filter nodes by minimum PageRank score |
Response:
{
"crawl_id": "uuid",
"status": "completed",
"pages_crawled": 20,
"link_graph": {
"stats": {
"total_nodes": 150,
"total_edges": 420,
"avg_inlinks": 2.8,
"avg_outlinks": 3.1,
"max_inlinks": 15,
"max_outlinks": 25,
"graph_density": 0.0187,
"top_domains": [{"domain": "example.com", "count": 120}]
},
"domain_authority": {
"example.com": 0.85,
"docs.example.com": 0.08
},
"computed_at": "2026-01-27T10:30:00Z",
"nodes": [
{
"url": "https://example.com/about",
"pagerank": 0.0823,
"inlinks": 12,
"outlinks": 5,
"hub_score": 0.02,
"authority_score": 0.15,
"domain": "example.com"
}
],
"edges": [
{"from": "https://example.com", "to": "https://example.com/about", "anchor": "About Us"}
]
},
"pagination": {
"nodes_returned": 100,
"nodes_total": 150,
"edges_returned": 200,
"edges_total": 420,
"sort_by": "pagerank",
"min_importance": 0,
"node_limit": 100
}
}
Example:
# Get top 50 pages by PageRank
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?node_limit=50&sort_by=pagerank" \
-H "Authorization: Bearer wvr_TOKEN"
# Get pages with high authority scores and their edges
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?sort_by=authority&include_edges=true&min_importance=0.01" \
-H "Authorization: Bearer wvr_TOKEN"
Session-Level Link Graphs
For aggregated link graphs across all crawls in a session, see Sessions API - Link Graph.
GET /api/v1/crawls/{id}/execute
Get execution options and history for a crawl job.
Response: Available execution options, cancellable statuses, and execution history.
POST /api/v1/crawls/{id}/execute
Execute a scheduled crawl immediately.
Params: force, webhook_url, browser_automation, scraperapi
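A minimal sketch of forcing immediate execution (the ID is a placeholder; the other parameters are optional):
# Execute a scheduled crawl now
curl -X POST https://www.zipf.ai/api/v1/crawls/CRAWL_ID/execute \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'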
POST /api/v1/crawls/{id}/share
Toggle public or organization sharing.
Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}
Note: Crawls within sessions inherit org sharing from their parent session.
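For example, enabling a public share (organization sharing follows the same pattern with the body shown above):
# Make a crawl publicly viewable
curl -X POST https://www.zipf.ai/api/v1/crawls/CRAWL_ID/share \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"public_share": true}'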
GET /api/v1/public/crawl/{id}
Access publicly shared crawl (no auth required).
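Since no authentication is required, a plain GET is enough (placeholder ID):
# Fetch a publicly shared crawl
curl https://www.zipf.ai/api/v1/public/crawl/CRAWL_ID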
GET /api/v1/search/jobs/{id}/crawls
Get all crawls created from a search job's URLs (via "Crawl All" feature).
Response: Array of crawl jobs created from the search results.
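For example (placeholder search job ID):
# List crawls spawned from a search job
curl https://www.zipf.ai/api/v1/search/jobs/SEARCH_JOB_ID/crawls \
  -H "Authorization: Bearer wvr_TOKEN"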
Workflow Recrawl API
Importance-based recrawl prioritization for workflows. High-value pages are recrawled more frequently.
Prerequisites: Workflow with session, Smart Crawl enabled
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status |
| POST | /api/v1/workflows/{id}/recrawl | Enable/update recrawl |
| DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl |
| PATCH | /api/v1/workflows/{id}/recrawl | Trigger immediate recrawl |
POST Parameters
| Parameter | Default | Description |
|---|---|---|
strategy | "importance" | "importance", "time", "hybrid" |
min_interval_hours | 24 | Minimum hours between recrawls |
max_interval_hours | 720 | Maximum hours before recrawl |
importance_multiplier | 2.0 | How much importance affects frequency |
change_detection | true | Track content changes via SHA-256 |
priority_threshold | 0.0 | Only recrawl URLs above this score |
execute_now | false | Immediately execute after enabling |
Example
# Enable importance-based recrawl
curl -X POST https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{"strategy": "importance", "min_interval_hours": 12, "execute_now": true}'
# Check status
curl https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN"
# Trigger manual recrawl
curl -X PATCH https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN"
Recrawl Interval Formula
interval = max_interval - (importance × multiplier × (max_interval - min_interval))
With defaults: importance 1.0 → 24h, importance 0.5 → ~15 days, importance 0.1 → ~27 days
Smart Crawl Deep Dive
How OPIC Works
OPIC (Online Page Importance Computation) is a streaming approximation of PageRank that works incrementally as you crawl:
- Initialization: Each seed URL starts with equal "cash" (priority score)
- Cash Distribution: When a page is crawled, its cash moves to linked pages
- History Accumulation: Cash that passes through a node accumulates in "history" (converges to PageRank)
- Damping Factor: 85% of cash flows through links, 15% teleports randomly
Benefits over batch PageRank:
- No need to wait for full crawl to start prioritizing
- Works on infinite/streaming web data
- Handles dead-end pages automatically
HITS Algorithm
HITS (Hyperlink-Induced Topic Search) identifies two types of important pages:
- Hubs: Pages that link to many authoritative pages (e.g., resource lists, directories)
- Authorities: Pages that are linked by many hubs (e.g., official docs, primary sources)
Use sort_by=hub or sort_by=authority in the link-graph endpoint to find these.
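For example, to surface the strongest hub pages from a finished crawl (placeholder ID):
# Top 20 hub pages from the link graph
curl "https://www.zipf.ai/api/v1/crawls/CRAWL_ID/link-graph?sort_by=hub&node_limit=20" \
  -H "Authorization: Bearer wvr_TOKEN"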
Anchor Text Relevance
Links with descriptive anchor text are weighted higher for pages matching your crawl intent. For example, when crawling for "team information", a link with anchor text "Meet the Team" is prioritized over a link with anchor "Click here".
URL Category Scoring
Each discovered URL is classified into categories with different priority weights:
| Category | Description | Weight |
|---|---|---|
| content | Main content pages (articles, products, docs) | High |
| navigation | Site navigation, menus, sitemaps | Medium |
| functional | Login, search, cart pages | Low |
| resource | Images, PDFs, downloads | Low |
| external_content | High-value external links | Medium |
Efficiency Metrics
The efficiency object in Smart Crawl responses helps you understand crawl quality:
| Metric | Description |
|---|---|
| coverage_ratio | Pages crawled / max_pages budget |
| importance_captured | Sum of PageRank of crawled pages (0-1) |
| avg_importance_per_page | importance_captured / pages_crawled |
Interpretation:
- importance_captured > 0.5 = Excellent: captured the majority of site importance
- avg_importance_per_page > 0.02 = Good: each page is valuable
- High coverage + low importance = Consider increasing the page budget or refining seed URLs
Session-Level Aggregation
When using sessions with multiple crawls, link graphs are automatically merged:
- Cumulative PageRank: Recomputed across all crawl results
- Domain Authority: Aggregated across all domains seen
- Cross-Crawl Relationships: Links between pages from different crawls are captured
- Deduplication: Same page discovered in multiple crawls is tracked once
This enables powerful multi-step research workflows where each crawl builds on previous discoveries.
See Also
- Sessions API - Execute crawls within sessions
- Common Topics - Authentication, credits, errors