Crawl API

Web crawling with AI-powered classification, structured data extraction, and recursive link following.

POST /api/v1/crawls

Create and execute a web crawl job.

Parameters

Parameter | Type | Default | Description
urls | string[] | required | 1-100 seed URLs to crawl
max_pages | int | 10 | Maximum pages to crawl (1-1000)
processing_mode | string | "async" | "async", "sync", or "webhook"
classify_documents | bool | false | AI classification (opt-in, triggers advanced pricing)
extraction_schema | object | - | Custom fields to extract (triggers advanced pricing)
generate_summary | bool | false | AI summary of crawled docs (FREE)
expansion | string | "none" | URL expansion: "internal", "external", "both", "none"
intent | string | - | Natural language crawl goal (10-2000 chars). Guides URL prioritization when expansion is enabled
auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions
session_id | UUID | - | Link to existing session
dry_run | bool | false | Validate without executing
webhook_url | URL | - | Required if processing_mode="webhook"

Advanced Parameters

Parameter | Type | Default | Description
follow_links | bool | false | Enable recursive crawling (legacy; prefer expansion)
link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination
classifiers[] | array | - | AI filters, e.g. {type: "url", …}
budget_config | object | - | max_pages, max_depth, max_credits
use_cache | bool | false | Enable global cache (50% savings on hits)
cache_max_age | int | 86400 | Cache TTL in seconds
premium_proxy | bool | false | Start at premium proxy tier (residential IPs)
ultra_premium | bool | false | Start at ultra premium tier (mobile residential IPs, highest success rate)
retry_strategy | string | "auto" | "none" (single attempt), "auto" (escalate standard → premium → ultra_premium), "premium_only" (start at premium, escalate to ultra)
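
For example, a recursive crawl that combines pattern filters, a hard budget, caching, and automatic proxy escalation might look like the sketch below. The pattern and budget values are illustrative, and it is assumed here that link_extraction_config and budget_config apply when expansion is enabled.

# Advanced crawl with link filters, budget, and caching (illustrative values)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "max_pages": 50,
    "expansion": "internal",
    "link_extraction_config": {
      "max_depth": 3,
      "url_patterns": ["/blog/"],
      "exclude_patterns": ["/tag/"],
      "detect_pagination": true
    },
    "budget_config": {"max_pages": 50, "max_depth": 3, "max_credits": 100},
    "use_cache": true,
    "retry_strategy": "auto"
  }'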

Credits

Mode | Cost
Basic (default, no AI features) | 1 credit/page
Advanced (classify_documents: true, extraction, or classifiers) | 2 credits/page
Cached (use_cache: true + cache hit) | 1 credit/page
Summary | FREE

Advanced pricing triggered by: classify_documents: true, extraction_schema, or classifiers[]

Processing Modes

Mode | Behavior | Best For
async | Returns immediately, poll for results | Large crawls, queued processing
sync | Waits for completion (use max_pages ≤ 10) | Small crawls, real-time needs
webhook | Returns immediately, POSTs results to URL | Event-driven architectures
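
For example, a webhook-mode crawl only needs the documented webhook_url parameter; the receiver URL below is a placeholder for your own endpoint:

# Webhook crawl: results are POSTed to webhook_url when the job finishes
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "max_pages": 20,
    "processing_mode": "webhook",
    "webhook_url": "https://your-app.example.com/hooks/crawl-complete"
  }'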

Smart Crawl

When expansion is enabled, URLs are automatically prioritized using intelligent algorithms. High-value pages are crawled first, maximizing information capture within your page budget.

Algorithms:

  • OPIC (Online Page Importance Computation): Streaming PageRank approximation that updates as you crawl
  • HITS: Hub and authority scoring to find link aggregators and authoritative sources
  • Anchor text relevance: Uses link text to prioritize pages matching your intent

Efficiency: 135x average discovery ratio; target pages are typically reached in 5-6 steps versus 20+ with traditional BFS crawling.

Intent-Driven Crawling

When you provide an intent parameter alongside expansion, the crawler uses your goal to guide URL prioritization:

  1. Depth auto-adjustment: Default depth increases from 1 to 2 (or 3 when intent mentions "individual", "specific", "detail" etc.)
  2. URL pattern boosting: URLs matching intent-related patterns get +0.3 score boost
  3. Budget allocation: Page budget is split into hub discovery (30%), detail expansion (60%), and exploration reserve (10%)
  4. Navigation penalty recovery: The -0.4 penalty on /about/, /team/ paths is partially recovered (+0.2) when matching intent patterns
  5. Effectiveness telemetry: Metrics stored per crawl: extraction_success_rate, budget_efficiency, depth_to_first_target

Example: Without intent, a 30-page crawl of example.com/team typically reaches 0 individual partner pages (budget exhausted on navigation). With intent: "Find individual partner bio pages", the same crawl reaches 15+ partner pages.

No extra cost: Intent-driven prioritization is heuristic-based and does not consume additional credits.

Response Enhancements:

When Smart Crawl is active, the GET /api/v1/crawls/{id} response includes a smart_crawl section:

{
  "smart_crawl": {
    "enabled": true,
    "algorithm": "opic",
    "stats": {
      "total_nodes": 150,
      "total_edges": 420,
      "avg_inlinks": 2.8,
      "graph_density": 0.0187
    },
    "top_pages": [
      {"url": "https://example.com/about", "importance": 0.0823, "inlinks": 12}
    ],
    "domain_authority": {
      "example.com": 0.85,
      "blog.example.com": 0.12
    },
    "efficiency": {
      "coverage_ratio": 0.75,
      "importance_captured": 0.42,
      "avg_importance_per_page": 0.021
    },
    "hubs": [{"url": "...", "score": 0.15}],
    "authorities": [{"url": "...", "score": 0.22}],
    "computed_at": "2026-01-27T10:30:00Z"
  }
}

Each result in results[] also includes Smart Crawl data when available:

{
  "url": "https://example.com/team",
  "title": "Our Team",
  "content": "...",
  "smart_crawl": {
    "importance_score": 0.0823,
    "inlink_count": 12,
    "outlink_count": 5,
    "hub_score": 0.02,
    "authority_score": 0.15,
    "anchor_texts": ["team", "about us", "our people"]
  }
}

Examples

# Basic crawl (5 credits: 5 × 1)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'

# Sync crawl with classification (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 5, "processing_mode": "sync", "classify_documents": true}'

# Crawl with extraction (10 credits: 5 × 2)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/blog"],
    "max_pages": 5,
    "classify_documents": false,
    "extraction_schema": {
      "title": "Extract the main article title",
      "author": "Extract the author name"
    }
  }'

# Internal link expansion with Smart Crawl
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 20, "expansion": "internal"}'

# Intent-driven crawl (reaches detail pages behind listings)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/team"],
    "max_pages": 30,
    "expansion": "internal",
    "intent": "Find individual team member bio pages with background and role",
    "extraction_schema": {
      "name": "Extract the person full name",
      "role": "Extract job title or role"
    }
  }'

# Dry run - estimate cost
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 50, "dry_run": true}'

Extraction Schema

Extract structured data using AI. Triggers advanced pricing.

Schema format:

{
  "extraction_schema": {
    "field_name": "Extraction instruction (min 10 chars)"
  }
}

Validation: Field names must match [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.

Response includes:

{
  "extracted_data": {"title": "Article Title", "author": "John Smith"},
  "extraction_metadata": {
    "fields_extracted": 2,
    "confidence_scores": {"title": 0.98, "author": 0.95}
  }
}

GET /api/v1/crawls/suggest-schema

Get information about the suggest-schema endpoint, including parameters and response format.

POST /api/v1/crawls/suggest-schema

Analyze a URL and suggest extraction fields. Cost: 2 credits

curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
  -H "Authorization: Bearer wvr_TOKEN" \
  -d '{"url": "https://example.com/product"}'

Response: detected_page_type, suggested_schema, field_metadata with confidence scores.

Supported page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About


GET /api/v1/crawls/preview-extraction

Get information about the preview-extraction endpoint, including parameters and response format.

POST /api/v1/crawls/preview-extraction

Test extraction schema on a single URL. Cost: 1 credit

curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
  -H "Authorization: Bearer wvr_TOKEN" \
  -d '{
    "url": "https://example.com/article",
    "extraction_schema": {"title": "Extract the article title"}
  }'

GET /api/v1/crawls

List crawl jobs with pagination.

Query params: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
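
For example (query values are illustrative):

# List up to 20 completed crawls
curl "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
  -H "Authorization: Bearer wvr_TOKEN"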

GET /api/v1/crawls/{id}

Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.

Response includes:

  • access_level - Your access level: owner, org_viewer, or public
  • smart_crawl section (when expansion was used) with link graph stats, top pages by importance, domain authority, and efficiency metrics. See Smart Crawl section above.
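
For example:

# Fetch full status and results for a crawl
curl https://www.zipf.ai/api/v1/crawls/{id} \
  -H "Authorization: Bearer wvr_TOKEN"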

DELETE /api/v1/crawls/{id}

Cancel a running/scheduled crawl. Releases reserved credits.
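
For example:

# Cancel a crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/{id} \
  -H "Authorization: Bearer wvr_TOKEN"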


GET /api/v1/crawls/{id}/link-graph

Detailed link graph analysis for crawls and sessions. Cost: FREE

Get paginated link graph data for a single crawl.

Query Parameters:

Parameter | Default | Description
include_nodes | true | Include node data in response
include_edges | false | Include edge data (can be large)
node_limit | 100 | Max nodes to return (1-1000)
sort_by | pagerank | Sort nodes by: pagerank, inlinks, authority, hub
min_importance | 0 | Filter nodes by minimum PageRank score

Response:

{
  "crawl_id": "uuid",
  "status": "completed",
  "pages_crawled": 20,
  "link_graph": {
    "stats": {
      "total_nodes": 150,
      "total_edges": 420,
      "avg_inlinks": 2.8,
      "avg_outlinks": 3.1,
      "max_inlinks": 15,
      "max_outlinks": 25,
      "graph_density": 0.0187,
      "top_domains": [{"domain": "example.com", "count": 120}]
    },
    "domain_authority": {
      "example.com": 0.85,
      "docs.example.com": 0.08
    },
    "computed_at": "2026-01-27T10:30:00Z",
    "nodes": [
      {
        "url": "https://example.com/about",
        "pagerank": 0.0823,
        "inlinks": 12,
        "outlinks": 5,
        "hub_score": 0.02,
        "authority_score": 0.15,
        "domain": "example.com"
      }
    ],
    "edges": [
      {"from": "https://example.com", "to": "https://example.com/about", "anchor": "About Us"}
    ]
  },
  "pagination": {
    "nodes_returned": 100,
    "nodes_total": 150,
    "edges_returned": 200,
    "edges_total": 420,
    "sort_by": "pagerank",
    "min_importance": 0,
    "node_limit": 100
  }
}

Example:

# Get top 50 pages by PageRank
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?node_limit=50&sort_by=pagerank" \
  -H "Authorization: Bearer wvr_TOKEN"

# Get pages with high authority scores and their edges
curl "https://www.zipf.ai/api/v1/crawls/{id}/link-graph?sort_by=authority&include_edges=true&min_importance=0.01" \
  -H "Authorization: Bearer wvr_TOKEN"

For aggregated link graphs across all crawls in a session, see Sessions API - Link Graph.


GET /api/v1/crawls/{id}/execute

Get execution options and history for a crawl job.

Response: Available execution options, cancellable statuses, and execution history.

POST /api/v1/crawls/{id}/execute

Execute a scheduled crawl immediately.

Params: force, webhook_url, browser_automation, scraperapi
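
A minimal sketch, assuming force takes a boolean value (parameter types are not listed above):

# Execute a scheduled crawl immediately
curl -X POST https://www.zipf.ai/api/v1/crawls/{id}/execute \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'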

POST /api/v1/crawls/{id}/share

Toggle public or organization sharing.

Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}

Note: Crawls within sessions inherit org sharing from their parent session.
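
For example, to enable public sharing:

# Make a crawl publicly accessible
curl -X POST https://www.zipf.ai/api/v1/crawls/{id}/share \
  -H "Authorization: Bearer wvr_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"public_share": true}'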

GET /api/v1/public/crawl/{id}

Access publicly shared crawl (no auth required).
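
For example:

# Read a publicly shared crawl (no Authorization header)
curl https://www.zipf.ai/api/v1/public/crawl/{id}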

GET /api/v1/search/jobs/{id}/crawls

Get all crawls created from a search job's URLs (via "Crawl All" feature).

Response: Array of crawl jobs created from the search results.
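
For example:

# List crawls created from a search job's result URLs
curl https://www.zipf.ai/api/v1/search/jobs/{id}/crawls \
  -H "Authorization: Bearer wvr_TOKEN"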


Workflow Recrawl API

Importance-based recrawl prioritization for workflows. High-value pages are recrawled more frequently.

Prerequisites: Workflow with session, Smart Crawl enabled

Endpoints

Method | Endpoint | Description
GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status
POST | /api/v1/workflows/{id}/recrawl | Enable or update recrawl
DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl
PATCH | /api/v1/workflows/{id}/recrawl | Trigger immediate recrawl

POST Parameters

Parameter | Default | Description
strategy | "importance" | "importance", "time", "hybrid"
min_interval_hours | 24 | Minimum hours between recrawls
max_interval_hours | 720 | Maximum hours before recrawl
importance_multiplier | 2.0 | How much importance affects frequency
change_detection | true | Track content changes via SHA-256
priority_threshold | 0.0 | Only recrawl URLs above this score
execute_now | false | Immediately execute after enabling

Example

# Enable importance-based recrawl
curl -X POST https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
  -H "Authorization: Bearer wvr_TOKEN" \
  -d '{"strategy": "importance", "min_interval_hours": 12, "execute_now": true}'

# Check status
curl https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
  -H "Authorization: Bearer wvr_TOKEN"

# Trigger manual recrawl
curl -X PATCH https://www.zipf.ai/api/v1/workflows/WORKFLOW_ID/recrawl \
  -H "Authorization: Bearer wvr_TOKEN"

Recrawl Interval Formula

interval = max_interval - (importance × multiplier × (max_interval - min_interval))

With defaults: importance 1.0 → 24h, importance 0.5 → ~15 days, importance 0.1 → ~27 days


Smart Crawl Deep Dive

How OPIC Works

OPIC (Online Page Importance Computation) is a streaming approximation of PageRank that works incrementally as you crawl:

  1. Initialization: Each seed URL starts with equal "cash" (priority score)
  2. Cash Distribution: When a page is crawled, its cash moves to linked pages
  3. History Accumulation: Cash that passes through a node accumulates in "history" (converges to PageRank)
  4. Damping Factor: 85% of cash flows through links, 15% teleports randomly
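
Read as per-page update rules, steps 2-4 correspond roughly to the following when page i with N_out outlinks is crawled (a sketch; the exact teleport handling is an assumption, not specified above):

history(i) ← history(i) + cash(i)
cash(j) ← cash(j) + 0.85 × cash(i) / N_out      for each outlink i → j
cash(k) ← cash(k) + 0.15 × cash(i) / N          teleport share spread over all N known pages k
cash(i) ← 0
importance(i) ≈ history(i) / Σ history          converges toward PageRank as the crawl proceeds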

Benefits over batch PageRank:

  • No need to wait for full crawl to start prioritizing
  • Works on infinite/streaming web data
  • Handles dead-end pages automatically

HITS Algorithm

HITS (Hyperlink-Induced Topic Search) identifies two types of important pages:

  • Hubs: Pages that link to many authoritative pages (e.g., resource lists, directories)
  • Authorities: Pages that are linked by many hubs (e.g., official docs, primary sources)

Use sort_by=hub or sort_by=authority in the link-graph endpoint to find these.

Anchor Text Relevance

Links with descriptive anchor text are weighted higher for pages matching your crawl intent. For example, when crawling for "team information", a link with anchor text "Meet the Team" is prioritized over a link with anchor "Click here".

URL Category Scoring

Each discovered URL is classified into categories with different priority weights:

Category | Description | Weight
content | Main content pages (articles, products, docs) | High
navigation | Site navigation, menus, sitemaps | Medium
functional | Login, search, cart pages | Low
resource | Images, PDFs, downloads | Low
external_content | High-value external links | Medium

Efficiency Metrics

The efficiency object in Smart Crawl responses helps you understand crawl quality:

Metric | Description
coverage_ratio | Pages crawled / max_pages budget
importance_captured | Sum of PageRank of crawled pages (0-1)
avg_importance_per_page | importance_captured / pages_crawled

Interpretation:

  • importance_captured > 0.5 = Excellent - captured majority of site importance
  • avg_importance_per_page > 0.02 = Good - each page is valuable
  • High coverage + low importance = Consider increasing page budget or refining seed URLs
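
To check these numbers for a finished crawl, you can pull just the efficiency block from the status response, for example with jq:

# Inspect Smart Crawl efficiency metrics (requires jq)
curl -s https://www.zipf.ai/api/v1/crawls/{id} \
  -H "Authorization: Bearer wvr_TOKEN" | jq '.smart_crawl.efficiency'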

Session-Level Aggregation

When using sessions with multiple crawls, link graphs are automatically merged:

  1. Cumulative PageRank: Recomputed across all crawl results
  2. Domain Authority: Aggregated across all domains seen
  3. Cross-Crawl Relationships: Links between pages from different crawls are captured
  4. Deduplication: Same page discovered in multiple crawls is tracked once

This enables powerful multi-step research workflows where each crawl builds on previous discoveries.

