API Reference

The Spider API is based on REST. It is predictable, returns JSON-encoded responses, uses standard HTTP response codes, and requires authentication. The API supports bulk operations: you can work on multiple objects per request for the core endpoints.

Authentication

Include your API key in the Authorization header.

Authorization: Bearer sk-xxxx...

Response formats

Set the Content-Type header to select the response format:

  • application/json
  • application/xml
  • text/csv
  • application/jsonl

Prefix any path with v1 to lock the version. Requests on this page consume live credits.
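
For example, a minimal sketch (Python requests, mirroring the Crawl example further down) that pins the v1 path and selects JSONL output via the Content-Type header:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    # Content-Type selects the response format; application/jsonl streams one object per line
    'Content-Type': 'application/jsonl',
}

json_data = {"limit": 5, "url": "https://spider.cloud"}

# prefixing the path with /v1 pins the API version
response = requests.post('https://api.spider.cloud/v1/crawl',
  headers=headers, json=json_data)

for line in response.iter_lines():
    if line:
        print(line)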

Just getting started? Quickstart guide →

Not a developer? Use Spider's no-code options to get started without writing code.

Base URL
https://api.spider.cloud

Client libraries

  • OpenAPI Spec
  • llms.txt

Common Parameters

These parameters are shared across Crawl, Scrape, Unblocker, Search, Links, Screenshot, and Fetch. Click any parameter to jump to its full description in the Crawl section.

Advanced (35)
blacklist (array)

Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.

block_ads (boolean, default: true)

Block advertisements when running the request as chrome or smart. This can greatly increase performance.

block_analytics (boolean, default: true)

Block analytics when running the request as chrome or smart. This can greatly increase performance.

block_stylesheets (boolean, default: true)

Block stylesheets when running the request as chrome or smart. This can greatly increase performance.

budget (object)

Object that has paths with a counter for limiting the amount of pages. Use {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths to limit depth, e.g. { "/docs/colors": 10, "/docs/": 100 }.
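
A request-body sketch using budget (other fields as in the Crawl example below); the wildcard caps every route while a child path gets its own smaller counter:

json_data = {
    "url": "https://spider.cloud",
    # crawl at most 100 pages across all routes, but only 10 under /docs/colors
    "budget": {"*": 100, "/docs/colors": 10},
}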

chunking_alg (object)

Use a chunking algorithm to segment your content output. Pass an object like { "type": "bysentence", "value": 2 } to split the text into an array by every 2 sentences. Works well with markdown or text formats.
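
For example, to split text output into chunks of two sentences each:

json_data = {
    "url": "https://spider.cloud",
    "return_format": "text",
    # segment the returned text into an array, two sentences per chunk
    "chunking_alg": {"type": "bysentence", "value": 2},
}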

concurrency_limit (number)

Set the concurrency limit to help balance requests for slower websites. The default is unlimited.

crawl_timeout (object)

The crawl_timeout parameter allows you to put a maximum duration on the entire crawl. The default is 2 minutes.

The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
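
For example, to cap the entire crawl at five minutes:

json_data = {
    "url": "https://spider.cloud",
    # duration objects use the { secs, nanos } shape
    "crawl_timeout": {"secs": 300, "nanos": 0},
}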

data_connectors (object)

Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase. { s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool }
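
A sketch of an S3 connector that streams each page to a bucket as it is crawled; the bucket name and credentials are placeholders:

json_data = {
    "url": "https://spider.cloud",
    "data_connectors": {
        "s3": {
            "bucket": "my-crawl-bucket",        # placeholder
            "access_key_id": "AKIA...",         # placeholder
            "secret_access_key": "...",         # placeholder
            "region": "us-east-1",
            "prefix": "spider/",
        },
        # push results as pages are found
        "on_find": True,
    },
}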

depth (number, default: 25)

The crawl limit for maximum depth. If 0, no limit will be applied.

disable_intercept (boolean, default: false)

Disable request interception when running request as chrome or smart. This may help bypass pages that use third-party scripts or external domains.

event_tracker (object)

Track the request, response, and automation output when using browser rendering. Pass an object with the keys requests and responses to capture the network output of the page; automation sends detailed information, including a screenshot of each automation step used, under automation_scripts.
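
A sketch of event_tracker, assuming the keys named above (requests, responses, automation) are boolean toggles:

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",
    # assumed boolean flags for what to capture
    "event_tracker": {"requests": True, "responses": True, "automation": True},
}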

exclude_selector (string)

A CSS query selector to use for ignoring content from the markup of the response.

execution_scripts (object)

Run custom JavaScript on certain paths. Requires chrome or smart request mode. The values should be in the shape "/path_or_url": "custom js".
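
For example, to run a snippet on one path (keys are paths or URLs, values are the JavaScript to execute):

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",
    "execution_scripts": {
        # expand every collapsed <details> element before the content is captured
        "/docs": "document.querySelectorAll('details').forEach((d) => { d.open = true; });",
    },
}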

external_domains (array)

A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.

full_resources (boolean)

Crawl and download all the resources for a website.

max_credits_allowed (number)

Set the maximum number of credits to use per run. This returns a blocked-by-client status if the initial response is empty. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).

max_credits_per_page (number)

Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
metadata (boolean, default: false)

Collect metadata about the content found, such as the page title, description, and keywords. This can help improve AI interoperability.

preserve_host (boolean, default: false)

Preserve the default Host header for the client. This may help bypass pages that require a Host header, and when the TLS cannot be determined.

redirect_policy (string, default: Loose)

The network redirect policy to use when performing HTTP requests.

request (string, default: smart)

The request type to perform. Use smart to perform a plain HTTP request by default and switch to JavaScript rendering when it is needed for the HTML.

request_timeout (number, default: 60)

The timeout to use for requests. Timeouts can be from 5 to 255 seconds.

root_selector (string)

The root CSS query selector to use when extracting content from the markup for the response.

run_in_background (boolean, default: false)

Run the request in the background. Useful if you are storing data and want to trigger crawls that appear in the dashboard.

session (boolean, default: true)

Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.

sitemap (boolean, default: false)

Include the sitemap results in the crawl.

sitemap_only (boolean, default: false)

Only include the sitemap results in the crawl.

sitemap_path (string, default: sitemap.xml)

The sitemap URL to use when sitemap is enabled.

subdomains (boolean, default: false)

Allow subdomains to be included.

tld (boolean, default: false)

Allow TLDs to be included.

user_agent (string)

Add a custom HTTP user agent to the request. By default this is set to a random agent.

wait_for (object)

The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:

The key idle_network specifies the conditions to wait for the network request to be idle within a period. It can include an optional timeout value.

The key idle_network0 specifies the conditions to wait for the network request to be idle with a max timeout. It can include an optional timeout value.

The key almost_idle_network0 specifies the conditions to wait for the network request to be almost idle with a max timeout. It can include an optional timeout value.

The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.

The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.

The key delay specifies a delay to wait for, with an optional timeout value.

The key page_navigations, when set to true, waits for all page navigations to be handled.

If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at 60 seconds.

The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
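
A wait_for sketch combining a selector wait with an idle-network condition; the exact sub-object field names are inferred from the descriptions above, so treat them as illustrative:

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",
    "wait_for": {
        # wait up to 10 seconds for the main content selector to appear
        "selector": {"selector": "main#content", "timeout": {"secs": 10, "nanos": 0}},
        # and for the network to go idle, capped at 5 seconds
        "idle_network": {"timeout": {"secs": 5, "nanos": 0}},
    },
}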

webhooks (object)

Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
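
A webhooks sketch following the shape above; the destination URL is a placeholder for your own receiver:

json_data = {
    "url": "https://spider.cloud",
    "webhooks": {
        "destination": "https://example.com/spider-webhook",   # placeholder
        "on_find": True,                # notify as new pages are found
        "on_find_metadata": True,
        "on_credits_depleted": True,
        "on_credits_half_depleted": False,
        "on_website_status": False,
    },
}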

whitelist (array)

Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.

Core (5)
disable_hints (boolean)

Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

limit (number, default: 0)

The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

lite_mode (boolean)

Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

network_blacklist (string[])

Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested.

Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.

  • Good targets: googletagmanager.com, doubleclick.net, maps.googleapis.com
  • Prefer specific domains over broad substrings to avoid breaking essential assets.
network_whitelist (string[])

Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution.

Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.

  • Start with first-party: example.com, cdn.example.com
  • Add only what you observe you truly need (fonts/CDNs), then iterate.
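
A sketch combining both filters; when both are set, the whitelist takes precedence:

json_data = {
    "url": "https://example.com",
    # allow first-party assets only...
    "network_whitelist": ["example.com", "cdn.example.com"],
    # ...and explicitly drop heavy third-party trackers
    "network_blacklist": ["googletagmanager.com", "doubleclick.net"],
}
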
Output (16)
clean_html (boolean)

Clean the HTML of unwanted attributes.

css_extraction_map (object)

Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.

encoding (string)

The type of encoding to use, such as UTF-8 or SHIFT_JIS.

filter_images (boolean)

Filter image elements from the markup.

filter_output_images (boolean)

Filter the images from the output.

filter_output_main_only (boolean, default: true)

Filter the nav, aside, and footer from the output.

filter_output_svg (boolean)

Filter the svg tags from the output.

filter_svg (boolean)

Filter SVG elements from the markup.

link_rewrite (json)

Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another).

The value must be a JSON object with a type field. Supported types:

  • "replace" – simple substring replacement.
    Fields:
    • host?: string (optional) – only apply when the link's host matches this value (e.g. "blog.example.com").
    • find: string – substring to search for in the URL.
    • replace_with: string – replacement substring.
  • "regex" – regex-based rewrite with capture groups.
    Fields:
    • host?: string (optional) – only apply for this host.
    • pattern: string – regex applied to the full URL.
    • replace_with: string – replacement string supporting $1, $2, etc.

Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
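
Two link_rewrite sketches, one per supported type:

# simple substring replacement, scoped to one host
json_data = {
    "url": "https://example.com",
    "link_rewrite": {
        "type": "replace",
        "host": "blog.example.com",
        "find": "/amp/",
        "replace_with": "/",
    },
}

# regex rewrite with capture groups
json_data_regex = {
    "url": "https://example.com",
    "link_rewrite": {
        "type": "regex",
        "pattern": r"^https://m\.example\.com/(.*)$",
        "replace_with": r"https://www.example.com/$1",
    },
}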

readability (boolean, default: false)

Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.

return_cookies (boolean, default: false)

Return the HTTP response cookies with the results.

return_embeddings (boolean, default: false)

Include OpenAI embeddings for title and description. Requires metadata to be enabled.

return_format (string | array, default: raw)

The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc.

return_headers (boolean, default: false)

Return the HTTP response headers with the results.

return_json_data (boolean, default: false)

Return the JSON data found in scripts used for SSR.

return_page_links (boolean, default: false)

Return the links found on each page.

Config (7)
cookies (string)

Add HTTP cookies to use for the request.

fingerprint (boolean, default: true)

Use advanced fingerprint detection for chrome.

headers (object)

Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs.

proxy ('residential' | 'mobile' | 'isp')

Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other proxy_* shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.

proxy_enabled (boolean, default: false)

Enable premium high performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.

remote_proxy (string)

Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.

stealth (boolean, default: true)

Use stealth mode for headless chrome requests to help prevent being blocked.

Performance (5)
cache (boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }, default: true)

Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.

Accepts either:

  • true / false
  • A cache control object:
  • maxAge (ms) — freshness window (default: 172800000 = 2 days). Set 0 to always fetch fresh.
    • allowStale — serve cached results even if stale.
    • period — RFC3339 timestamp cutoff (overrides maxAge), e.g. "2025-11-29T12:00:00Z"
    • skipBrowser — skip browser entirely if cached HTML exists. Returns cached HTML directly without launching Chrome for instant responses.

Default behavior by route type:

  • Standard routes (/crawl, /scrape, /unblocker) — cache is true with skipBrowser enabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, set cache: false or { "skipBrowser": false }.
  • AI routes (/ai/crawl, /ai/scrape, etc.) — cache is true but skipBrowser is not enabled. AI routes always use the browser to ensure live page content for extraction.
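
For example, to force a fresh browser fetch on a standard route, or to accept anything cached after a cutoff timestamp:

# always launch the browser, even if cached HTML exists
json_data = {
    "url": "https://spider.cloud",
    "cache": {"skipBrowser": False},
}

# reuse results cached after the cutoff, even if stale by maxAge rules
json_data_cutoff = {
    "url": "https://spider.cloud",
    "cache": {"period": "2025-11-29T12:00:00Z", "allowStale": True},
}
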
delay (number, default: 0)

Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds.

respect_robots (boolean, default: true)

Respect the robots.txt file for crawling.

service_worker_enabled (boolean, default: true)

Allow the website to use Service Workers as needed.

skip_config_checks (boolean, default: true)

Skip checking the database for website configuration. This will increase performance for requests that use limit=1.

Automation (4)
automation_scripts (object)

Run custom web automated tasks on certain paths. Requires chrome or smart request mode.

Below are the available actions for web automation:
  • Evaluate: Runs custom JavaScript code.
    { "Evaluate": "console.log('Hello, World!');" }
  • Click: Clicks on an element identified by a CSS selector.
    { "Click": "button#submit" }
  • ClickAll: Clicks on all elements matching a CSS selector.
    { "ClickAll": "button.loadMore" }
  • ClickPoint: Clicks at the position x and y coordinates.
    { "ClickPoint": { "x": 120.5, "y": 340.25 } }
  • ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.).
    { "ClickAllClickable": true }
  • ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds.
    { "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } }
  • ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds.
    { "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } }
  • ClickDrag: Click-and-drag from one element to another (selector → selector) with optional modifier.
    { "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } }
  • ClickDragPoint: Click-and-drag from one point to another with optional modifier.
    { "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } }
  • Wait: Waits for a specified duration in milliseconds.
    { "Wait": 2000 }
  • WaitForNavigation: Waits for the next navigation event.
    { "WaitForNavigation": true }
  • WaitFor: Waits for an element to appear identified by a CSS selector.
    { "WaitFor": "div#content" }
  • WaitForWithTimeout: Waits for an element to appear with a timeout (ms).
    { "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } }
  • WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
    { "WaitForAndClick": "button#loadMore" }
  • WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body) with timeout (ms).
    { "WaitForDom": { "selector": "main", "timeout": 12000 } }
  • ScrollX: Scrolls the screen horizontally by a specified number of pixels.
    { "ScrollX": 100 }
  • ScrollY: Scrolls the screen vertically by a specified number of pixels.
    { "ScrollY": 200 }
  • Fill: Fills an input element with a specified value.
    { "Fill": { "selector": "input#name", "value": "John Doe" } }
  • Type: Type a key into the browser with an optional modifier.
    { "Type": { "value": "John Doe", "modifier": 0 } }
  • InfiniteScroll: Scrolls the page until the end, for a certain duration in milliseconds.
    { "InfiniteScroll": 3000 }
  • Screenshot: Perform a screenshot on the page.
    { "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } }
  • ValidateChain: Set this before a step to validate the prior action and break out of the chain if it fails.
    { "ValidateChain": true }
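
A sketch that chains several of the actions above for one path; it assumes each path maps to an ordered list of action steps, in the same path-keyed style as execution_scripts:

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",
    "automation_scripts": {
        # assumed shape: path -> ordered list of action objects
        "/": [
            {"WaitFor": "button#loadMore"},
            {"Click": "button#loadMore"},
            {"ScrollY": 1200},
            {"Wait": 2000},
            {"Screenshot": {"full_page": True, "omit_background": False, "output": "home.png"}},
        ],
    },
}
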
evaluate_on_new_document (string)

Set a custom script to evaluate on new document creation.

scroll (number)

Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the wait_for parameters. Requires chrome request mode.

viewport (object)

Configure the viewport for chrome.

Geolocation (2)
country_code (string)

Set an ISO country code for proxy connections. View the locations list for available countries.

locale (string)

The locale to use for the request, for example en-US.

Per-endpoint notes

Scrape & Unblocker are single-page endpoints; they exclude limit, depth, and delay.

Screenshot excludes request, return_format, and readability. Returns image data.

Every endpoint below includes these parameters in its own parameter tabs with full descriptions. This section is a quick-reference index.

Crawl

Details

Start crawling website(s) to collect resources. You can pass an array of objects for the request body.

POSThttps://api.spider.cloud/crawl
  • urlCrawl API -
    stringrequired

    The URI resource to crawl. This can be a comma split list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
  • limitCrawl API -
    number
    Default: 0

    The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
  • disable_hintsCrawl API -
    boolean

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Scrape

Details

Start scraping a single page on website(s) to collect resources. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.

POST https://api.spider.cloud/scrape
  • url (string, required)

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
  • disable_hints (boolean)

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
  • lite_mode (boolean)

    Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/scrape', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Unblocker

Details

Start unblocking challenging website(s) to collect data. You can pass an array of objects for the request body. Costs an additional 10-40 credits per success.

POST https://api.spider.cloud/unblocker
  • url (string, required)

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
  • disable_hints (boolean)

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
  • lite_mode (boolean)

    Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/unblocker', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "cookies": {
        "a": "something",
        "b": "something2"
    },
    "headers": {
        "x-id": 123,
        "x-cookie": 123
    },
    "costs": {
        "ai_cost": 0.001,
        "ai_cost_formatted": "0.0010",
        "bytes_transferred_cost": 3.1649999999999997e-9,
        "bytes_transferred_cost_formatted": "0.0000000031649999999999997240",
        "compute_cost": 0.0,
        "compute_cost_formatted": "0",
        "file_cost": 0.000029291250000000002,
        "file_cost_formatted": "0.0000292912499999999997868372",
        "total_cost": 0.0010292944150000001,
        "total_cost_formatted": "0.0010292944149999999997865612",
        "transform_cost": 0.0,
        "transform_cost_formatted": "0"
    },
    "content": "<html>...</html>",
    "error": null
  },
  // more content...
]

Search

Details

Perform a Google search to gather a list of websites for crawling and resource collection, including fallback options if the query yields no results. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.

POST https://api.spider.cloud/search
  • limit (number, default: 0)

    The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"search":"sports news today","search_limit":3,"limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/search', 
  headers=headers, json=json_data)

print(response.json())
Response
{
  "content": [
      {
          "description": "Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.",
          "title": "ESPN - Serving Sports Fans. Anytime. Anywhere.",
          "url": "https://www.espn.com/"
      },
      {
          "description": "Sports Illustrated, SI.com provides sports news, expert analysis, highlights, stats and scores for the NFL, NBA, MLB, NHL, college football, soccer,&nbsp;...",
          "title": "Sports Illustrated",
          "url": "https://www.si.com/"
      },
      {
          "description": "CBS Sports features live scoring, news, stats, and player info for NFL football, MLB baseball, NBA basketball, NHL hockey, college basketball and football.",
          "title": "CBS Sports - News, Live Scores, Schedules, Fantasy ...",
          "url": "https://www.cbssports.com/"
      },
      {
          "description": "Sport is a form of physical activity or game. Often competitive and organized, sports use, maintain, or improve physical ability and skills.",
          "title": "Sport",
          "url": "https://en.wikipedia.org/wiki/Sport"
      },
      {
          "description": "Watch FOX Sports and view live scores, odds, team news, player news, streams, videos, stats, standings &amp; schedules covering NFL, MLB, NASCAR, WWE, NBA, NHL,&nbsp;...",
          "title": "FOX Sports News, Scores, Schedules, Odds, Shows, Streams ...",
          "url": "https://www.foxsports.com/"
      },
      {
          "description": "Founded in 1974 by tennis legend, Billie Jean King, the Women's Sports Foundation is dedicated to creating leaders by providing girls access to sports.",
          "title": "Women's Sports Foundation: Home",
          "url": "https://www.womenssportsfoundation.org/"
      },
      {
          "description": "List of sports · Running. Marathon · Sprint · Mascot race · Airsoft · Laser tag · Paintball · Bobsleigh · Jack jumping · Luge · Shovel racing · Card stacking&nbsp;...",
          "title": "List of sports",
          "url": "https://en.wikipedia.org/wiki/List_of_sports"
      },
      {
          "description": "Stay up-to-date with the latest sports news and scores from NBC Sports.",
          "title": "NBC Sports - news, scores, stats, rumors, videos, and more",
          "url": "https://www.nbcsports.com/"
      },
      {
          "description": "r/sports: Sports News and Highlights from the NFL, NBA, NHL, MLB, MLS, and leagues around the world.",
          "title": "r/sports",
          "url": "https://www.reddit.com/r/sports/"
      },
      {
          "description": "The A-Z of sports covered by the BBC Sport team. Find all the latest live sports coverage, breaking news, results, scores, fixtures, tables,&nbsp;...",
          "title": "AZ Sport",
          "url": "https://www.bbc.com/sport/all-sports"
      }
  ]
}

Links

Details

Start crawling a website(s) to collect links found. You can pass an array of objects for the request body. This endpoint can save on latency if you only need to index the content URLs. Also available via Proxy-Mode.

POST https://api.spider.cloud/links
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "duration_elapsed_ms": 112,
    "error": null
  },
  // more content...
]

Screenshot

Details

Take screenshots of a website to base64 or binary encoding. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.

POST https://api.spider.cloud/screenshot
  • url (string, required)

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
  • limit (number, default: 0)

    The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
  • disable_hints (boolean)

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Transform HTML

Details

Transform HTML into Markdown or plain text quickly. Each HTML transformation starts at 0.1 credits, while PDF transformations can cost up to 10 credits per page. You can submit up to 10 MB of data per request. The Transform API is also integrated into the /crawl endpoint via the return_format parameter.

POST https://api.spider.cloud/transform
  • data (object, required)

    A list of HTML data to transform. Each object in the list takes the keys html and url. The url key is optional and only used when readability is enabled.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","data":[{"html":"<html><body>\n<h1>Example Website</h1>\n<p>This is some example markup to use to test the transform function.</p>\n<p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</body></html>","url":"https://example.com"}]}

response = requests.post('https://api.spider.cloud/transform', 
  headers=headers, json=json_data)

print(response.json())
Response
{
    "content": [
      "# Example Website
This is some example markup to use to test the transform function.
[Guides](https://spider.cloud/guides)"
    ],
    "cost": {
        "ai_cost": 0,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0.0001
    },
    "error": null,
    "status": 200
  }

Proxy-Mode

Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard proxy, with the option to use high-performance and residential proxies at up to 10 GB/s. Take a look at all of our proxy locations to see if we support the country.

Proxy-Mode works with all core endpoints: Crawl, Scrape, Screenshot, Search, and Links. Pass API parameters in the password field to configure rendering, proxies, and more.

**HTTP address**: proxy.spider.cloud:80
**HTTPS address**: proxy.spider.cloud:443
**Username**: YOUR-API-KEY
**Password**: PARAMETERS
  • Residential — real-user IPs across 100+ countries. High anonymity, up to 1 GB/s. $1–4/GB
  • ISP — stable datacenter IPs with ISP-grade routing. Highest performance, up to 10 GB/s. $1/GB
  • Mobile — real 4G/5G device IPs for maximum stealth. $2/GB

Use country_code to set geolocation and proxy to select the pool type.

Proxy Type | Price | Multiplier | Description
residential | $2.00/GB | ×2–×4 | Entry-level residential pool
mobile | $2.00/GB | ×2 | 4G/5G mobile proxies for stealth
isp | $1.00/GB | ×1 | ISP-grade residential routing
Example proxy request
import requests, os


# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:[email protected]:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:[email protected]:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
     get_via_proxy("https://www.example.com")
     get_via_proxy("https://www.example.com/community")

Browser

Spider Browser is a Rust-based cloud browser for automation, scraping, and AI extraction. Connect via the browser.spider.cloud WebSocket endpoint using any Playwright or Puppeteer compatible client, or use the spider-browser TypeScript library for a higher-level API with built-in AI actions.

**WebSocket endpoint**: wss://browser.spider.cloud/v1/browser
**Authentication**: ?token=YOUR-API-KEY
**Protocol**: CDP / WebDriver BiDi
  • AI extraction & actions — extract structured data or perform actions with natural language. Vision models handle complex pages.
  • Stealth & proxies — automatic fingerprint rotation, residential proxies, and a retry engine that recovers sessions on its own.
  • 100 concurrent browsers — per user on all plans. Pass stealth, browser, and country query params to configure each session.

Sessions can be recorded and replayed from the dashboard. See the spider-browser repo for full documentation and examples.

Basic usage — AI extract & act
import { SpiderBrowser } from "spider-browser"

const browser = new SpiderBrowser({
  apiKey: process.env.SPIDER_API_KEY!,
})
await browser.init()
await browser.page.goto("https://example.com")

// extract structured data with AI
const prices = await browser.extract("Get all product prices")

// perform actions with natural language
await browser.act("Add the cheapest item to the cart")

// take a screenshot
const screenshot = await browser.page.screenshot()

await browser.close()
Scrape & interact
import { SpiderBrowser } from "spider-browser"

const browser = new SpiderBrowser({
  apiKey: process.env.SPIDER_API_KEY!,
})
await browser.init()

// navigate and interact with the page
await browser.page.goto("https://example.com/search")
await browser.page.fill("input[name=q]", "web scraping")
await browser.page.press("Enter")
await browser.page.waitForSelector(".results")

// extract structured fields from the DOM
const data = await browser.page.extractFields({
  title: "h1",
  description: ".description",
  image: { selector: "img.hero", attribute: "src" },
})

await browser.close()
Session recording
import { SpiderBrowser } from "spider-browser"

// Enable session recording — replay later in the dashboard
const browser = new SpiderBrowser({
  apiKey: process.env.SPIDER_API_KEY!,
  record: true, // screencast + interaction capture
})
await browser.init()

await browser.page.goto("https://example.com")
await browser.act("Click the login button")
await browser.act("Fill in the email field with [email protected]")

// Recording is automatically saved when the session ends
await browser.close()
// View recordings at spider.cloud/account/recordings

Queries

Query the data that you collect during crawling and scraping. Add dynamic filters for extracting exactly what is needed.

Logs

Get the last 24 hours of logs.

GET https://api.spider.cloud/data/crawl_logs
  • url (string)

    Filter a single url record.

  • limit (string)

    The limit of records to get.

  • domain (string)

    Filter a single domain record.

  • page (number)

    The current page to get.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "UI",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}

Credits

Get the remaining credits available.

GET https://api.spider.cloud/data/credits
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/credits?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "credits": 53334,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  }
}

Scraper Configs Alpha

Browse optimized scraper configs for popular websites. Each config defines extraction rules (selectors, AI prompts, stealth settings, and more) curated for the best results out of the box.

Scraper Directory Alpha

Browse optimized scraper configs for popular websites. Filter by domain, category, or search term. Each config is curated to deliver the best extraction results out of the box. No authentication required.

GET https://api.spider.cloud/data/scraper-directory
  • url (string)

    Filter a single url record.

  • limit (string)

    The limit of records to get.

  • domain (string)

    Filter a single domain record.

  • page (number)

    The current page to get.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/scraper-directory?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "domain": "example.com",
      "path_pattern": "/blog/*",
      "display_name": "Example Blog Scraper",
      "description": "Extracts blog posts with title, author, and content.",
      "category": "news",
      "tags": ["blog", "articles"],
      "confidence_score": 0.95,
      "validation_count": 12,
      "slug": "example-com-blog",
      "created_at": "2025-12-01T10:00:00+00:00",
      "updated_at": "2026-01-15T08:30:00+00:00"
    }
  ],
  "total": 1,
  "page": 1,
  "limit": 20,
  "total_pages": 1
}

Fetch API Alpha

Per-website scraper endpoints that auto-configure themselves. POST /fetch/{domain}/{path} — AI discovers optimal CSS selectors, extraction schemas, and request settings on the first request, then caches and reuses them for fast, consistent structured data. Full documentation →

POST https://api.spider.cloud/fetch/example.com/
  • url (string, required)

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
  • limit (number, default: 0)

    The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
  • disable_hints (boolean)

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/fetch/example.com/', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://example.com/",
    "status": 200,
    "content": "{\n  \"title\": \"Example Domain\",\n  \"description\": \"This domain is for use in illustrative examples.\",\n  \"links\": [\"https://www.iana.org/domains/example\"]\n}",
    "error": null,
    "costs": {
      "total_cost": 0.001,
      "total_cost_formatted": "0.0010"
    }
  }
]