Smith Scraper API

Professional Web Scraping REST API

Version 3.0.0

📚 Getting Started

Welcome to Smith Scraper API v3.0

Professional web scraping API with curl-impersonate, SOCKS5 proxies, intelligent user-agent rotation, and anti-bot evasion.

🌐 Base URL

https://smith.urbistat.com/api/v1

✨ Key Features

curl-impersonate

Perfect TLS fingerprinting matching real browsers

SOCKS5 Proxies

Rotating IPs with Lightning proxy (IT region)

Mobile Emulation

Automatic mobile/desktop UA rotation (90-98% success)

Bulk Upload

Process up to 900,000 URLs per job via file

Smart Retry

Automatic retry for 403 Forbidden errors

XLSX Export

Download results with full HTML content

📊 Success Rates

Configuration                  Success Rate  Speed   Use Case
Mobile + 50 threads + Proxy    95%+          Fast    Maximum success ⭐ RECOMMENDED
Mobile + 150 threads + Proxy   70-75%        Medium  -
Mobile + No proxy              10-20%        Fast    ❌ DO NOT USE

🚦 HTTP Status Codes

200  OK                     Success
201  Created                Resource created
400  Bad Request            Invalid parameters
401  Unauthorized           Authentication failed
404  Not Found              Resource not found
422  Unprocessable Entity   Validation error
500  Internal Server Error  Internal error

🔐 Authentication

Bearer Token (Recommended)

Include your API key as a Bearer token in the Authorization header.

Headers Required

Authorization: Bearer sk_your_api_key_here
Content-Type: application/json

Alternative: X-API-Key Header

X-API-Key: sk_your_api_key_here
Content-Type: application/json

Security: Never share your API key. Keep it private and secure.
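
For example, the Quick Start test call below can be sent with the alternative header instead of a Bearer token (a minimal sketch; substitute your own key):

curl -X POST "https://smith.urbistat.com/api/v1/scraper_tests/test" \
  -H "X-API-Key: sk_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"target_url": "https://example.com"}'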

🚀 Quick Start Guide

1๏ธโƒฃ Test Single URL (Scraper Tests)

Quick validation before production runs

curl -X POST "https://smith.urbistat.com/api/v1/scraper_tests/test" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "target_url": "https://example.com",
    "mobile": true,
    "debug": true
  }'

2๏ธโƒฃ Create Production Job (File Upload - RECOMMENDED)

For 10+ URLs, use file upload for best performance

curl -X POST "https://smith.urbistat.com/api/v1/scraper" \
  -H "Authorization: Bearer sk_your_key" \
  -F "urls_file=@urls.txt" \
  -F "name=My Scraping Job" \
  -F "mobile=true" \
  -F "threads=2" \
  -F "auto_start=true"

3๏ธโƒฃ Check Job Status

curl -X GET "https://smith.urbistat.com/api/v1/scraper/{job_id}/status" \
  -H "Authorization: Bearer sk_your_key"

4๏ธโƒฃ Download Results

Generate the export file, then download it

# Step 1: Generate export
curl -X GET "https://smith.urbistat.com/api/v1/scraper/{job_id}/export_xlsx" \
  -H "Authorization: Bearer sk_your_key"

# Step 2: Download file (no auth required)
curl "https://smith.urbistat.com/api/v1/scraper/download/{token}" \
  -o results.xlsx
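
Optionally, step 3 can be wrapped in a simple polling loop. A minimal sketch, assuming the data.status field shown in the status responses later in this guide, and that jq is installed:

# Poll the job every 30 seconds until it reports "completed"
JOB_ID="a3bb189e-8bf9-3888-9912-ace4e6543002"   # example id; use your own job_id
while true; do
  STATUS=$(curl -s "https://smith.urbistat.com/api/v1/scraper/$JOB_ID/status" \
    -H "Authorization: Bearer sk_your_key" | jq -r '.data.status')
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  sleep 30
done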

๐ŸŒ List Proxies

GET

List All Proxies

GET /proxies

Get a list of all configured proxies with details.

Note: Password fields are always masked (********) in responses for security.
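
cURL Example (Bearer auth as described above):

curl -X GET "https://smith.urbistat.com/api/v1/proxies" \
  -H "Authorization: Bearer sk_your_key"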

Response Example

{
  "success": true,
  "message": "Proxies retrieved",
  "data": {
    "proxies": [
      {
        "id": 1,
        "host": "res-eu.lightningproxies.net",
        "port": 9999,
        "username": "user-zone-lightning-region-it",
        "password": "********",
        "protocol": "socks5",
        "active": true,
        "requires_auth": true,
        "primary": true,
        "debug": false,
        "created_at": "2024-01-15T10:30:00Z"
      }
    ],
    "total": 1
  }
}

⭐ Primary Proxy

GET

Get Primary Proxy

GET /proxies/primary

Get the currently configured primary proxy. This is the proxy used by default when proxy is enabled (proxy: true / proxy_enabled: true) without a specific proxy_id.

Use this endpoint to verify which proxy your jobs will use by default.

Response Example

{
  "success": true,
  "message": "Primary proxy retrieved",
  "data": {
    "id": 1,
    "host": "res-eu.lightningproxies.net",
    "port": 9999,
    "username": "user-zone-lightning-region-it",
    "password": "********",
    "protocol": "socks5",
    "requires_auth": true,
    "active": true,
    "primary": true
  }
}

POST

Set Proxy as Primary

POST /proxies/{id}/make_primary

Set a proxy as the primary. This removes the primary flag from all other proxies.
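
For example, promoting the proxy with id 2 (a sketch; any existing proxy id works):

curl -X POST "https://smith.urbistat.com/api/v1/proxies/2/make_primary" \
  -H "Authorization: Bearer sk_your_key"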

➕ Add Proxy

POST

Add New Proxy

POST /proxies

Add a new proxy configuration.

Request Parameters

Parameter       Type     Required  Description
proxy_host      string   Required  Proxy hostname or IP
proxy_port      integer  Required  Proxy port number
proxy_username  string   Optional  Authentication username
proxy_password  string   Optional  Authentication password
proxy_protocol  string   Optional  socks5 (default), http, https
active          boolean  Optional  Enable proxy (default: true)
requires_auth   boolean  Optional  Proxy requires authentication (default: true)

Request Example

{
  "proxy_host": "res-eu.lightningproxies.net",
  "proxy_port": 9999,
  "proxy_username": "user-zone-lightning-region-it",
  "proxy_password": "your_password",
  "proxy_protocol": "socks5",
  "active": true,
  "requires_auth": true
}
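
The same payload sent with curl (a sketch; replace host and credentials with your own):

curl -X POST "https://smith.urbistat.com/api/v1/proxies" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "proxy_host": "res-eu.lightningproxies.net",
    "proxy_port": 9999,
    "proxy_username": "user-zone-lightning-region-it",
    "proxy_password": "your_password",
    "proxy_protocol": "socks5",
    "active": true,
    "requires_auth": true
  }'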

🧪 Test Proxy

POST

Test Proxy Connection

POST /proxies/{id}/test

Test proxy connectivity and get exit IP information.

Tests against multiple IP detection services (api.ipify.org, ifconfig.me, icanhazip.com).
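
A typical call (sketch, testing the proxy with id 1 from the response below):

curl -X POST "https://smith.urbistat.com/api/v1/proxies/1/test" \
  -H "Authorization: Bearer sk_your_key"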

Response Example - Success

{
  "success": true,
  "message": "Proxy test completed",
  "data": {
    "proxy_id": 1,
    "test_results": [
      {
        "success": true,
        "ip": "185.123.45.67",
        "country": "IT",
        "response_time": 0.45,
        "error": null
      }
    ],
    "all_passed": true
  }
}

🧪 Create Test Job

POST

Create Scraper Test

POST /scraper_tests/test

Create a quick test job for URL validation with detailed debug info.

Request Parameters

Parameter   Type     Default  Description
target_url  string   -        Single URL or comma-separated URLs (alternative to file)
urls_file   file     -        Text file with URLs (multipart/form-data)
mobile      boolean  true     Use mobile user agents (RECOMMENDED)
threads     integer  1        Parallel threads (1-10)
debug       boolean  false    Enable detailed debug output
proxy       boolean  true     Enable proxy (ALWAYS RECOMMENDED)

JSON Request Example

{
  "target_url": "https://www.immobiliare.it/annunci/122249824/",
  "mobile": true,
  "debug": true,
  "threads": 1
}
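
The same request sent with curl (sketch):

curl -X POST "https://smith.urbistat.com/api/v1/scraper_tests/test" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "target_url": "https://www.immobiliare.it/annunci/122249824/",
    "mobile": true,
    "debug": true,
    "threads": 1
  }'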

File Upload Example (RECOMMENDED for multiple URLs)

curl -X POST "https://smith.urbistat.com/api/v1/scraper_tests/test" \
  -H "Authorization: Bearer sk_your_key" \
  -F "urls_file=@test_urls.txt" \
  -F "mobile=true" \
  -F "debug=true" \
  -F "threads=2"

✅ Check Test Results

GET

Get Test Job Results

GET /scraper_tests/{job_id}/check

Get test job execution status and results with detailed debug information.

Query Parameters

Parameter      Type     Default  Description
detailed       boolean  false    Include full URL results with body previews
include_debug  boolean  true     Include debug information
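
For example, fetching full results with debug info using the query parameters above (sketch; {job_id} comes from the create response):

curl -X GET "https://smith.urbistat.com/api/v1/scraper_tests/{job_id}/check?detailed=true&include_debug=true" \
  -H "Authorization: Bearer sk_your_key"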

Response Example

{
  "success": true,
  "data": {
    "id": "a3bb189e-8bf9-3888-9912-ace4e6543002",
    "status": "completed",
    "total_urls": 1,
    "processed_urls": 1,
    "successful": 1,
    "errors_403": 0,
    "errors_404": 0,
    "total_time_spent": 2.45,
    "avg_time_per_url": 2.45
  }
}

📥 Export Test Results

GET

Export Test to XLSX

GET /scraper_tests/{job_id}/export_xlsx

Download test results as Excel file with full HTML body content.

Direct download (no token required for test jobs)

cURL Example

curl -X GET "https://smith.urbistat.com/api/v1/scraper_tests/{job_id}/export_xlsx" \
  -H "Authorization: Bearer sk_your_key" \
  -o test_results.xlsx

➕ Create Production Job

POST

Create Scraping Job

POST /scraper

Create a production scraping job with queue management, retry logic, and export features.

IMPORTANT: Use "auto_start": true for immediate processing (RECOMMENDED). Otherwise the job stays in the pending state until you start it manually with POST /scraper/{job_id}/start.

Request Parameters

Parameter      Type     Default           Description
urls           string   -                 Comma-separated URLs (for JSON requests)
urls_file      file     -                 Text file with URLs (RECOMMENDED for 10+ URLs)
name           string   null              Job name for identification
description    string   null              Job description
method         string   curl-impersonate  Scraping method (curl-impersonate RECOMMENDED)
threads        integer  1                 Parallel threads (1-10, recommended: 1-3)
mobile         boolean  true              Use mobile user agents (RECOMMENDED)
proxy_enabled  boolean  true              Enable proxy (ALWAYS RECOMMENDED)
proxy_id       integer  null              Specific proxy ID (uses primary if not specified)
auto_start     boolean  false             Start job immediately (RECOMMENDED)
debug          boolean  false             Enable debug mode

JSON Request Example (Small Lists)

{
  "urls": "https://site1.com,https://site2.com,https://site3.com",
  "name": "My Scraping Job",
  "mobile": true,
  "threads": 2,
  "auto_start": true
}
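
The same payload sent as JSON with curl (sketch):

curl -X POST "https://smith.urbistat.com/api/v1/scraper" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": "https://site1.com,https://site2.com,https://site3.com",
    "name": "My Scraping Job",
    "mobile": true,
    "threads": 2,
    "auto_start": true
  }'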

File Upload Example (RECOMMENDED for 10+ URLs)

curl -X POST "https://smith.urbistat.com/api/v1/scraper" \
  -H "Authorization: Bearer sk_your_key" \
  -F "urls_file=@urls.txt" \
  -F "name=Bulk Scraping Job" \
  -F "description=Processing 50,000 URLs" \
  -F "mobile=true" \
  -F "threads=3" \
  -F "auto_start=true"

Response Example - Auto Started

{
  "success": true,
  "message": "Job created",
  "data": {
    "job_id": "a3bb189e-8bf9-3888-9912-ace4e6543002",
    "queue_id": "job_a3bb189e_1704467400",
    "status": "pending",
    "total_urls": 150,
    "method": "curl-impersonate",
    "threads": 2,
    "mobile": true,
    "proxy_enabled": true,
    "auto_started": true,
    "message": "Job created and queued for processing"
  }
}

📋 List Jobs

GET

List All Jobs

GET /scraper/all

Get a paginated list of all jobs regardless of status.

GET

List Queued Jobs

GET /scraper/queue

Get jobs in pending/queued state.

GET

List Running Jobs

GET /scraper/running

Get currently running jobs.

GET

List Completed Jobs

GET /scraper/completed

Get completed jobs.

GET

List Failed Jobs

GET /scraper/failed

Get failed or errored jobs.

ℹ️ Job Information & Statistics

GET

Get Job Info

GET /scraper/{job_id}/info

Get complete job configuration and setup details.

GET

Get Job Status

GET /scraper/{job_id}/status

Get real-time job execution status with progress information.

Response Example

{
  "success": true,
  "data": {
    "id": "b4cc289f-9cga-4999-0023-bdf5f7654113",
    "status": "running",
    "total_urls": 150,
    "processed": 45,
    "pending": 105,
    "successful": 35,
    "failed": 10,
    "last_processed_url": "https://example.com/page45",
    "can_start": false,
    "can_stop": true
  }
}

GET

Get Job Statistics

GET /scraper/{job_id}/stats

Get detailed statistics with error breakdown.

Error Types:
• 403 Forbidden: Retryable (use POST /retry)
• 404 Not Found: Final error (page doesn't exist)
• 5xx Server: Final error (server problem)
• Other: Final error (network/timeout)

Response Example

{
  "success": true,
  "data": {
    "id": "b4cc289f-9cga-4999-0023-bdf5f7654113",
    "status": "running",
    "total_urls": 150,
    "processed": 45,
    "successful": 35,
    "errors_403": 8,
    "errors_404": 1,
    "errors_5xx": 1,
    "errors_other": 0,
    "in_retry_queue": 3,
    "total_failed": 10,
    "success_rate": 77.78,
    "avg_time_per_url": 2.68
  }
}

🎮 Job Control

POST

Start Job

POST /scraper/{job_id}/start

Queue and start a job. Works for both pending and stopped jobs.

RECOMMENDED: Use this endpoint for both starting pending jobs AND resuming stopped jobs.

POST

Stop Job

POST /scraper/{job_id}/stop

Stop a currently running job. Worker completes current URL first, then stops. Can be resumed later with POST /start.
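
For example, pausing a job and resuming it later (sketch; {job_id} is a placeholder):

# Pause: the worker finishes its current URL, then stops
curl -X POST "https://smith.urbistat.com/api/v1/scraper/{job_id}/stop" \
  -H "Authorization: Bearer sk_your_key"

# Resume later via the start endpoint
curl -X POST "https://smith.urbistat.com/api/v1/scraper/{job_id}/start" \
  -H "Authorization: Bearer sk_your_key"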

POST

Resume Job

DEPRECATED

POST /scraper/{job_id}/resume

DEPRECATED: This endpoint will be removed in a future version. Use POST /start instead; it works for both pending and stopped jobs.

POST

Retry Failed URLs

POST /scraper/{job_id}/retry

Retry all failed URLs with 403 Forbidden status.

IMPORTANT: Only 403 Forbidden errors are retried. 404 and other errors cannot be retried.
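
A sketch of the call:

curl -X POST "https://smith.urbistat.com/api/v1/scraper/{job_id}/retry" \
  -H "Authorization: Bearer sk_your_key"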

Response Example

{
  "success": true,
  "message": "Retry initiated",
  "data": {
    "job_id": "b4cc289f-9cga-4999-0023-bdf5f7654113",
    "retried_urls": 8,
    "retry_details": {
      "error_403": 8,
      "note": "Only 403 Forbidden errors are retried. 404 and other errors are final."
    }
  }
}

POST

Cancel Job

POST /scraper/{job_id}/cancel

Cancel a job. Unlike stop, a cancelled job cannot be resumed.

DELETE

Delete Job

DELETE /scraper/{job_id}

Delete job and all associated data permanently.

WARNING: This action is IRREVERSIBLE! All job data will be permanently deleted.

🔗 URL Management

POST

Add Single URL

POST /scraper/{job_id}/add_url

Request Example

{
  "url": "https://newsite.com/page"
}

POST

Remove Single URL

POST /scraper/{job_id}/remove_url

Request Example

{
  "url": "https://example.com/remove"
}

POST

Bulk Add URLs

POST /scraper/{job_id}/bulk_add_urls

Request Example

{
  "urls": [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
  ]
}
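
Sent with curl (sketch):

curl -X POST "https://smith.urbistat.com/api/v1/scraper/{job_id}/bulk_add_urls" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://site1.com", "https://site2.com", "https://site3.com"]}'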

POST

Bulk Remove URLs

POST /scraper/{job_id}/bulk_remove_urls

Request Example

{
  "urls": [
    "https://remove1.com",
    "https://remove2.com"
  ]
}

📊 Get Results

GET

Get All Results

GET /scraper/{job_id}/results

Get a paginated list of all URLs with their status and results.

GET

Get Successful URLs

GET /scraper/{job_id}/successful

Get a paginated list of successfully scraped URLs.

GET

Get Failed URLs

GET /scraper/{job_id}/errors

Get a paginated list of failed URLs with error details.
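
For example, pulling the failed URLs for inspection (sketch; pagination query parameters are not documented here):

curl -X GET "https://smith.urbistat.com/api/v1/scraper/{job_id}/errors" \
  -H "Authorization: Bearer sk_your_key"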

💾 Export Data (XLSX Only)

Two-Step Export Process

Step 1: Call export_xlsx to generate file and get download URL
Step 2: Download file using the URL (no auth required, expires in 60 minutes)

GET

Step 1: Generate Export

GET /scraper/{job_id}/export_xlsx

Generate an Excel export of all successful results. Returns a download URL.

Response Example

{
  "success": true,
  "message": "Export generated",
  "data": {
    "job_id": "a3bb189e-8bf9-3888-9912-ace4e6543002",
    "export_type": "xlsx",
    "download_url": "https://smith.urbistat.com/api/v1/scraper/download/abc123def456",
    "download_token": "abc123def456",
    "file_size": "15.2 MB",
    "expires_at": "2024-01-15T11:30:00Z",
    "expires_in": "60 minutes"
  }
}

GET

Step 2: Download File

GET /scraper/download/{token}

Download a previously generated export file using its token.

No authentication required for download (token-based access)

cURL Example

# Step 1: Generate export
curl -X GET "https://smith.urbistat.com/api/v1/scraper/{job_id}/export_xlsx" \
  -H "Authorization: Bearer sk_your_key"

# Step 2: Download file (no auth required)
curl "https://smith.urbistat.com/api/v1/scraper/download/abc123" \
  -o results.xlsx

Note: Download links expire after 60 minutes. Generate a new export if expired.

📋 Export File Contents

Excel files include the following columns:

  • URL - Target URL
  • HTML BODY - Full HTML content
  • STATUS_CODE - HTTP response code
  • SCRAPED_AT - Timestamp

Features:
  • Streaming export for large datasets (100k+ rows)
  • Full HTML body content preserved
  • UTF-8 encoding for international characters
  • Memory efficient processing