Crawl API
Web crawling with AI-powered classification, structured data extraction, and recursive link following.
POST /api/v1/crawls
Create and execute a web crawl job.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async", "sync", or "webhook" |
| classify_documents | bool | false | AI classification (opt-in, triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced pricing) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars). Guides URL prioritization when expansion is enabled |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to existing session |
| dry_run | bool | false | Validate without executing |
| webhook_url | URL | - | Required if processing_mode="webhook" |
Advanced Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| follow_links | bool | false | Enable recursive crawling (legacy, prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters: `{type: "url" \| ...}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Enable global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Start at premium proxy tier (residential IPs) |
| ultra_premium | bool | false | Start at ultra premium tier (mobile residential IPs, highest success rate) |
| retry_strategy | string | "auto" | "none" (single attempt), "auto" (escalate standard → premium → ultra_premium), "premium_only" (start at premium, escalate to ultra) |
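As a sketch of how these options can be combined in one request (the URL patterns and page counts shown are illustrative, not required values):
# Cached crawl with link extraction limits and automatic proxy escalation
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/docs"],
    "max_pages": 25,
    "use_cache": true,
    "retry_strategy": "auto",
    "link_extraction_config": {"max_depth": 2, "url_patterns": ["/docs/*"], "exclude_patterns": ["/docs/archive/*"], "detect_pagination": true}
  }'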
Credits
| Mode | Cost |
|---|---|
| Basic (default - no AI features) | 1 credit/page |
| Advanced (classify_documents: true, extraction, or classifiers) | 2 credits/page |
| Cached (use_cache: true + cache hit) | 1 credit/page |
| Summary | FREE |
Advanced pricing triggered by: classify_documents: true, extraction_schema, or classifiers[]
Processing Modes
| Mode | Behavior | Best For |
|---|---|---|
| async | Returns immediately, poll for results | Large crawls, queued processing |
| sync | Waits for completion (use max_pages ≤ 10) | Small crawls, real-time needs |
| webhook | Returns immediately, POSTs results to URL | Event-driven architectures |
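A webhook-mode request looks like any other crawl, plus a reachable webhook_url; the receiver URL below is a placeholder for your own endpoint:
# Webhook crawl (results POSTed to your endpoint when finished)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 10, "processing_mode": "webhook", "webhook_url": "https://example.com/hooks/crawl"}'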
Smart Crawl
When expansion is enabled, URLs are automatically prioritized using intelligent algorithms. High-value pages are crawled first, maximizing information capture within your page budget.
Algorithms:
- OPIC (Online Page Importance Computation): Streaming PageRank approximation that updates as you crawl
- HITS: Hub and authority scoring to find link aggregators and authoritative sources
- Anchor text relevance: Uses link text to prioritize pages matching your intent
Efficiency: 135x average discovery ratio; target pages are found in 5-6 steps versus 20+ with traditional BFS crawling.
Intent-Driven Crawling
When you provide an intent parameter alongside expansion, the crawler uses your goal to guide URL prioritization:
- Depth auto-adjustment: Default depth increases from 1 to 2 (or 3 when intent mentions "individual", "specific", "detail" etc.)
- URL pattern boosting: URLs matching intent-related patterns get +0.3 score boost
- Budget allocation: Page budget is split into hub discovery (30%), detail expansion (60%), and exploration reserve (10%)
- Navigation penalty recovery: The -0.4 penalty on /about/ and /team/ paths is partially recovered (+0.2) when they match intent patterns
- Effectiveness telemetry: Metrics stored per crawl: extraction_success_rate, budget_efficiency, depth_to_first_target
Example: Without intent, a 30-page crawl of example.com/team typically reaches 0 individual partner pages (budget exhausted on navigation). With intent: "Find individual partner bio pages", the same crawl reaches 15+ partner pages.
No extra cost: Intent-driven prioritization is heuristic-based and does not consume additional credits.
Response Enhancements:
When Smart Crawl is active, the GET /api/v1/crawls/{id} response includes a smart_crawl section:
{
"smart_crawl": {
"enabled": true,
"algorithm": "opic",
"stats": {
"total_nodes": 150,
"total_edges": 420,
"avg_inlinks": 2.8,
"graph_density": 0.0187
},
"top_pages": [
{"url": "https://example.com/about", "importance": 0.0823, "inlinks": 12}
],
"domain_authority": {
"example.com": 0.85,
"blog.example.com": 0.12
},
"efficiency": {
"coverage_ratio": 0.75,
"importance_captured": 0.42,
"avg_importance_per_page": 0.021
},
"hubs": [{"url": "...", "score": 0.15}],
"authorities": [{"url": "...", "score": 0.22}],
"computed_at": "2026-01-27T10:30:00Z"
}
}
Each result in results[] also includes Smart Crawl data when available:
{
"url": "https://example.com/team",
"title": "Our Team",
"content": "...",
"smart_crawl": {
"importance_score": 0.0823,
"inlink_count": 12,
"outlink_count": 5,
"hub_score": 0.02,
"authority_score": 0.15,
"anchor_texts": ["team", "about us", "our people"]
}
}
Examples
# Basic crawl (5 credits: 5 × 1)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'
# Sync crawl with classification (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "processing_mode": "sync"}'
# Crawl with extraction (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/blog"],
"max_pages": 5,
"classify_documents": false,
"extraction_schema": {
"title": "Extract the main article title",
"author": "Extract the author name"
}
}'
# Internal link expansion with Smart Crawl
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 20, "expansion": "internal"}'
# Intent-driven crawl (reaches detail pages behind listings)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/team"],
"max_pages": 30,
"expansion": "internal",
"intent": "Find individual team member bio pages with background and role",
"extraction_schema": {
"name": "Extract the person full name",
"role": "Extract job title or role"
}
}'
# Dry run - estimate cost
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer wvr_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 50, "dry_run": true}'
Extraction Schema
Extract structured data using AI. Triggers advanced pricing.
Schema format:
{
"extraction_schema": {
"field_name": "Extraction instruction (min 10 chars)"
}
}
Validation: Field names must match [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.
Response includes:
{
"extracted_data": {"title": "Article Title", "author": "John Smith"},
"extraction_metadata": {
"fields_extracted": 2,
"confidence_scores": {"title": 0.98, "author": 0.95}
}
}
GET /api/v1/crawls/suggest-schema
Get information about the suggest-schema endpoint, including parameters and response format.
POST /api/v1/crawls/suggest-schema
Analyze a URL and suggest extraction fields. Cost: 2 credits
curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{"url": "https://example.com/product"}'
Response: detected_page_type, suggested_schema, field_metadata with confidence scores.
Supported page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About
GET /api/v1/crawls/preview-extraction
Get information about the preview-extraction endpoint, including parameters and response format.
POST /api/v1/crawls/preview-extraction
Test extraction schema on a single URL. Cost: 1 credit
curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{
"url": "https://example.com/article",
"extraction_schema": {"title": "Extract the article title"}
}'
GET /api/v1/crawls
List crawl jobs with pagination.
Query params: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
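For instance, to page through completed crawls (the query values are just examples):
# List the 20 most recent completed crawls
curl "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
  -H "Authorization: Bearer wvr_TOKEN"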
GET /api/v1/crawls/{id}
Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.
Response includes:
- access_level: Your access level (owner, org_viewer, or public)
- smart_crawl section (when expansion was used) with link graph stats, top pages by importance, domain authority, and efficiency metrics. See Smart Crawl section above.
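A typical status poll (the ID is a placeholder):
# Poll crawl status and results
curl https://www.zipf.ai/api/v1/crawls/CRAWL_ID \
  -H "Authorization: Bearer wvr_TOKEN"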
DELETE /api/v1/crawls/{id}
Cancel a running/scheduled crawl. Releases reserved credits.
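For example (placeholder ID):
# Cancel a running crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/CRAWL_ID \
  -H "Authorization: Bearer wvr_TOKEN"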
Link Graph API
Detailed link graph analysis for crawls and sessions. Cost: FREE
GET /api/v1/crawls/{id}/link-graph
Get paginated link graph data for a single crawl.
Query Parameters:
| Parameter | Default | Description |
|---|---|---|
| include_nodes | true | Include node data in response |
| include_edges | false | Include edge data (can be large) |
| node_limit | 100 | Max nodes to return (1-1000) |
| sort_by | pagerank | Sort nodes by: pagerank, inlinks, authority, hub |
| min_importance | 0 | Filter nodes by minimum PageRank score |
Response:
{
"crawl_id": "uuid",
"status": "completed",
"pages_crawled": 20,
"link_graph": {
"stats": {
"total_nodes": 150,
"total_edges": 420,
"avg_inlinks": 2.8,
"avg_outlinks": 3.1,
"max_inlinks": 15,
"max_outlinks": 25,
"graph_density": 0.0187,
"top_domains": [{"domain": "example.com", "count": 120}]
},
"domain_authority": {
"example.com": 0.85,
"docs.example.com": 0.08
},
"computed_at": "2026-01-27T10:30:00Z",
"nodes": [
{
"url": "https://example.com/about",
"pagerank": 0.0823,
"inlinks": 12,
"outlinks": 5,
"hub_score": 0.02,
"authority_score": 0.15,
"domain": "example.com"
}
],
"edges": [
{"from": "https://example.com", "to": "https://example.com/about", "anchor": "About Us"}
]
},
"pagination": {
"nodes_returned": 100,
"nodes_total": 150,
"edges_returned": 200,
"edges_total": 420,
"sort_by": "pagerank",
"min_importance": 0,
"node_limit": 100
}
}
Example:
# Get top 50 pages by PageRank
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?node_limit=50&sort_by=pagerank" \
-H "Authorization: Bearer wvr_TOKEN"
# Get pages with high authority scores and their edges
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?sort_by=authority&include_edges=true&min_importance=0.01" \
-H "Authorization: Bearer wvr_TOKEN"
Session-Level Link Graphs
For aggregated link graphs across all crawls in a session, see Sessions API - Link Graph.
GET /api/v1/crawls/{id}/execute
Get execution options and history for a crawl job.
Response: Available execution options, cancellable statuses, and execution history.
POST /api/v1/crawls/{id}/execute
Execute a scheduled crawl immediately.
Params: force, webhook_url, browser_automation, scraperapi
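A minimal sketch of forcing immediate execution (the ID is a placeholder; the other parameters are optional):
# Execute a scheduled crawl now
curl -X POST https://www.zipf.ai/api/v1/crawls/CRAWL_ID/execute \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'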
POST /api/v1/crawls/{id}/share
Toggle public or organization sharing.
Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}
Note: Crawls within sessions inherit org sharing from their parent session.
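For example, enabling a public share (organization sharing follows the same pattern with the body shown above):
# Make a crawl publicly viewable
curl -X POST https://www.zipf.ai/api/v1/crawls/CRAWL_ID/share \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"public_share": true}'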
GET /api/v1/public/crawl/{id}
Access publicly shared crawl (no auth required).
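Since no authentication is required, a plain GET is enough (placeholder ID):
# Fetch a publicly shared crawl
curl https://www.zipf.ai/api/v1/public/crawl/CRAWL_ID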
GET /api/v1/search/jobs/{id}/crawls
Get all crawls created from a search job's URLs (via "Crawl All" feature).
Response: Array of crawl jobs created from the search results.
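For example (placeholder search job ID):
# List crawls spawned from a search job
curl https://www.zipf.ai/api/v1/search/jobs/SEARCH_JOB_ID/crawls \
  -H "Authorization: Bearer wvr_TOKEN"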
Workflow Recrawl API
Importance-based recrawl prioritization for workflows. High-value pages are recrawled more frequently.
Prerequisites: Workflow with session, Smart Crawl enabled
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status |
| POST | /api/v1/workflows/{id}/recrawl | Enable/update recrawl |
| DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl |
| PATCH | /api/v1/workflows/{id}/recrawl | Trigger immediate recrawl |
POST Parameters
| Parameter | Default | Description |
|---|---|---|
strategy | "importance" | "importance", "time", "hybrid" |
min_interval_hours | 24 | Minimum hours between recrawls |
max_interval_hours | 720 | Maximum hours before recrawl |
importance_multiplier | 2.0 | How much importance affects frequency |
change_detection | true | Track content changes via SHA-256 |
priority_threshold | 0.0 | Only recrawl URLs above this score |
execute_now | false | Immediately execute after enabling |
Example
# Enable importance-based recrawl
curl -X POST https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN" \
-d '{"strategy": "importance", "min_interval_hours": 12, "execute_now": true}'
# Check status
curl https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN"
# Trigger manual recrawl
curl -X PATCH https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
-H "Authorization: Bearer wvr_TOKEN"
Recrawl Interval Formula
interval = max_interval - (importance × multiplier × (max_interval - min_interval))
With defaults: importance 1.0 → 24h, importance 0.5 → ~15 days, importance 0.1 → ~27 days
Smart Crawl Deep Dive
How OPIC Works
OPIC (Online Page Importance Computation) is a streaming approximation of PageRank that works incrementally as you crawl:
- Initialization: Each seed URL starts with equal "cash" (priority score)
- Cash Distribution: When a page is crawled, its cash moves to linked pages
- History Accumulation: Cash that passes through a node accumulates in "history" (converges to PageRank)
- Damping Factor: 85% of cash flows through links, 15% teleports randomly
Benefits over batch PageRank:
- No need to wait for full crawl to start prioritizing
- Works on infinite/streaming web data
- Handles dead-end pages automatically
HITS Algorithm
HITS (Hyperlink-Induced Topic Search) identifies two types of important pages:
- Hubs: Pages that link to many authoritative pages (e.g., resource lists, directories)
- Authorities: Pages that are linked by many hubs (e.g., official docs, primary sources)
Use sort_by=hub or sort_by=authority in the link-graph endpoint to find these.
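For example, to surface the strongest hub pages from a finished crawl (placeholder ID):
# Top 20 hub pages from the link graph
curl "https://www.zipf.ai/api/v1/crawls/CRAWL_ID/link-graph?sort_by=hub&node_limit=20" \
  -H "Authorization: Bearer wvr_TOKEN"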
Anchor Text Relevance
Links with descriptive anchor text are weighted higher for pages matching your crawl intent. For example, when crawling for "team information", a link with anchor text "Meet the Team" is prioritized over a link with anchor "Click here".
URL Category Scoring
Each discovered URL is classified into categories with different priority weights:
| Category | Description | Weight |
|---|---|---|
| content | Main content pages (articles, products, docs) | High |
| navigation | Site navigation, menus, sitemaps | Medium |
| functional | Login, search, cart pages | Low |
| resource | Images, PDFs, downloads | Low |
| external_content | High-value external links | Medium |
Efficiency Metrics
The efficiency object in Smart Crawl responses helps you understand crawl quality:
| Metric | Description |
|---|---|
| coverage_ratio | Pages crawled / max_pages budget |
| importance_captured | Sum of PageRank of crawled pages (0-1) |
| avg_importance_per_page | importance_captured / pages_crawled |
Interpretation:
- importance_captured > 0.5 = Excellent: captured the majority of site importance
- avg_importance_per_page > 0.02 = Good: each page is valuable
- High coverage + low importance = Consider increasing the page budget or refining seed URLs
Session-Level Aggregation
When using sessions with multiple crawls, link graphs are automatically merged:
- Cumulative PageRank: Recomputed across all crawl results
- Domain Authority: Aggregated across all domains seen
- Cross-Crawl Relationships: Links between pages from different crawls are captured
- Deduplication: Same page discovered in multiple crawls is tracked once
This enables powerful multi-step research workflows where each crawl builds on previous discoveries.
See Also
- Sessions API - Execute crawls within sessions
- Common Topics - Authentication, credits, errors