
Crawl API

Agent-driven web crawling with AI-powered classification, structured data extraction, and recursive link following.

Credits: 1/page (basic) | 2/page (with classify_documents, extraction_schema, classifiers, or intent) | FREE (summary)
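These pricing rules are easy to check client-side before submitting a job. A minimal sketch (an illustrative helper, not part of any official SDK):

```python
def estimate_credits(pages, classify_documents=False, extraction_schema=None,
                     classifiers=None, intent=None):
    """Estimate crawl cost under the documented pricing rules:
    1 credit/page basic; 2 credits/page when any advanced feature
    (classify_documents, extraction_schema, classifiers, intent) is set.
    generate_summary is free and does not affect cost."""
    advanced = bool(classify_documents or extraction_schema or classifiers or intent)
    return pages * (2 if advanced else 1)
```

For an authoritative figure, use POST /v1/crawls/estimate instead.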

POST /v1/crawls

Deploy an agent to crawl and extract from web sources.

Parameters

Parameter | Type | Default | Description
urls | string[] | required | 1-100 seed URLs to crawl
max_pages | int | 10 | Maximum pages to crawl (1-1000)
processing_mode | string | "async" | "async" or "sync"
classify_documents | bool | false | AI classification (triggers advanced pricing)
extraction_schema | object | - | Custom fields to extract (triggers advanced pricing)
generate_summary | bool | false | AI summary of crawled docs (FREE)
expansion | string | "none" | URL expansion: "internal", "external", "both", "none"
intent | string | - | Natural language crawl goal (10-2000 chars). Activates neural URL scoring and junk filtering. Triggers advanced pricing (2 credits/page).
auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions
session_id | UUID | - | Link to existing session
dry_run | bool | false | Validate without executing

Advanced Parameters

Parameter | Type | Default | Description
follow_links | bool | false | Recursive crawling (legacy; prefer expansion)
link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination
classifiers[] | array | - | AI filters: {type: "url", ...}
budget_config | object | - | max_pages, max_depth, max_credits
use_cache | bool | false | Global cache (50% savings on hits)
cache_max_age | int | 86400 | Cache TTL in seconds
premium_proxy | bool | false | Residential IPs
ultra_premium | bool | false | Mobile residential IPs (highest success rate)
retry_strategy | string | "auto" | "none", "auto" (escalate tiers), "premium_only"
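The "auto" retry strategy escalates through proxy tiers on failure. A toy sketch of that escalation order; note the tier names below (including the "standard" baseline) are assumptions for illustration, not documented values:

```python
# Assumed escalation order: datacenter -> residential -> mobile residential.
PROXY_TIERS = ["standard", "premium_proxy", "ultra_premium"]

def next_tier(current):
    """Return the next proxy tier to retry with, or None if exhausted."""
    i = PROXY_TIERS.index(current)
    return PROXY_TIERS[i + 1] if i + 1 < len(PROXY_TIERS) else None
```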

Processing Modes

Mode | Behavior | Best For
async | Returns immediately; poll for results | Large crawls
sync | Waits for completion (max_pages <= 10) | Small crawls, real-time use
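In async mode you poll the crawl's status until it reaches a terminal state. One way to structure that loop, where `fetch_status` is a placeholder for your own wrapper around GET /v1/crawls/{crawl_id}:

```python
import time

def wait_for_crawl(fetch_status, interval=5.0, timeout=600.0):
    """Poll an async crawl until it reaches a terminal state.

    fetch_status: callable returning the crawl's current status string
    (status values per the list endpoint: pending/running/completed/
    failed/cancelled)."""
    terminal = {"completed", "failed", "cancelled"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish within timeout")
```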

Smart Crawl

When expansion is enabled, the agent prioritizes candidate URLs using link-graph scoring:

  • OPIC: Streaming PageRank approximation that updates as you crawl
  • HITS: Hub and authority scoring for link aggregators and authoritative sources
  • Anchor text relevance: Link text matching your intent gets priority

Efficiency: 135x average discovery ratio vs traditional BFS crawling.
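OPIC can be pictured as cash-flow accounting over the link graph: each crawled page spends its "cash" equally across its outlinks, and a page's accumulated spend approximates its importance. A toy sketch of one update step (illustrative only, not Zipf's implementation):

```python
def opic_step(cash, history, page, outlinks):
    """One OPIC update: spend `page`'s cash across its outlinks.

    cash and history map url -> float; history accumulates each page's
    total spend, which converges toward a PageRank-like importance."""
    spent = cash.pop(page, 0.0)
    history[page] = history.get(page, 0.0) + spent
    if outlinks:
        share = spent / len(outlinks)
        for link in outlinks:
            cash[link] = cash.get(link, 0.0) + share
    return cash, history
```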

Intent-Driven Crawling

With intent + expansion, the agent activates neural URL scoring that dramatically improves crawl efficiency on navigation-heavy sites:

  1. LLM-driven URL prioritization: Every candidate URL is scored for relevance before fetching
  2. Junk URL filtering: Known non-content URLs are filtered before they enter the queue
  3. Intent-driven scoring weights: Scoring shifts from graph-dominated to model-dominated
  4. Depth auto-adjustment: Default depth 1->2 (or 3 with "individual"/"specific"/"detail" keywords)
  5. Budget allocation: 30% hub discovery, 60% detail expansion, 10% exploration
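The 30/60/10 split in step 5, applied to a concrete page budget (illustrative arithmetic only; the actual allocator may round differently):

```python
def allocate_budget(max_pages):
    """Split a page budget per the documented 30/60/10 heuristic."""
    hub = round(max_pages * 0.30)
    detail = round(max_pages * 0.60)
    explore = max_pages - hub - detail  # remainder goes to exploration
    return {"hub_discovery": hub, "detail_expansion": detail, "exploration": explore}
```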

Pricing: Intent triggers advanced pricing (2 credits/page instead of 1).

Examples

# Basic crawl (5 credits)
curl -X POST https://api.zipf.ai/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'

# Intent-driven crawl with extraction
curl -X POST https://api.zipf.ai/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/team"],
    "max_pages": 30,
    "expansion": "internal",
    "intent": "Find individual team member bio pages with background and role",
    "extraction_schema": {
      "name": "Extract the person's full name",
      "role": "Extract job title or role"
    }
  }'

Extraction Schema

Extract structured data using AI. Triggers advanced pricing.

Format: {"field_name": "Extraction instruction (min 10 chars)"}. Field names: [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.

Response: extracted_data (field->value), extraction_metadata (fields_extracted, confidence_scores).
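These constraints can be enforced client-side before sending a request. A sketch (a hypothetical helper, not part of the API):

```python
import re

FIELD_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def validate_extraction_schema(schema):
    """Check the documented schema constraints: max 20 fields, field
    names matching [a-zA-Z_][a-zA-Z0-9_]*, instructions >= 10 chars."""
    if len(schema) > 20:
        raise ValueError("extraction_schema allows at most 20 fields")
    for name, instruction in schema.items():
        if not FIELD_NAME.match(name):
            raise ValueError(f"invalid field name: {name!r}")
        if len(instruction) < 10:
            raise ValueError(f"instruction for {name!r} must be at least 10 chars")
    return True
```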


GET /v1/crawls

List crawl jobs. Query: limit (1-100), offset, status (pending/running/completed/failed/cancelled)

GET /v1/crawls/{crawl_id}

Get detailed crawl status, results, classifications, and summary.

Response includes: access_level (owner/org_viewer/public).

When Smart Crawl was active, includes smart_crawl section:

Field | Description
smart_crawl.enabled | Whether Smart Crawl was active
smart_crawl.algorithm | Algorithm used (opic)
smart_crawl.stats | total_nodes, total_edges, avg_inlinks, graph_density
smart_crawl.top_pages[] | url, importance, inlinks
smart_crawl.domain_authority | Domain -> authority score map
smart_crawl.efficiency | coverage_ratio, importance_captured, avg_importance_per_page
smart_crawl.hubs[] | url, score
smart_crawl.authorities[] | url, score

Each result also includes: importance_score, inlink_count, outlink_count, hub_score, authority_score, anchor_texts[].
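These per-result metrics make post-crawl ranking straightforward, e.g. picking the most important pages from a completed crawl (illustrative):

```python
def top_pages(results, n=5):
    """Return the n results with the highest importance_score,
    breaking ties by inlink_count."""
    return sorted(
        results,
        key=lambda r: (r.get("importance_score", 0.0), r.get("inlink_count", 0)),
        reverse=True,
    )[:n]
```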

DELETE /v1/crawls/{crawl_id}

Cancel a running/scheduled crawl. Releases reserved credits.

POST /v1/crawls/{crawl_id}/abort

Abort a running crawl job and release reserved credits.

POST /v1/crawls/{crawl_id}/share

Toggle sharing. Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}

Crawls in sessions inherit org sharing from parent session.

POST /v1/crawls/estimate

Estimate crawl cost without executing.

Parameters: Same as POST /v1/crawls.

Response: cost_estimate, would_succeed, credits_balance

GET /v1/public/crawls/{crawl_id}

Access publicly shared crawl (no auth).


POST /v1/sessions/{session_id}/crawl

Crawl within a session context. Same parameters as POST /v1/crawls plus:

  • filter_seen_urls (bool, default: true) - Exclude URLs already seen in the session

See Sessions API for details.
