Scraper on a Dedicated Server: Process, Proxy, Disk

Move only after the small host is the named bottleneck

A dedicated server earns its cost when you can point at the specific failure: Chromium contexts getting OOM-killed, workers running into swap, logs filling the disk, or queue timing varying 3x between runs on the same VPS. Until the host itself is the variable, the bottleneck is somewhere else.

Prove the parser on a small batch before scaling anything. If the target blocks the job after ten requests, a larger machine gives the bad run more CPU and more disk to waste. Use allowed pages, public APIs, RSS feeds, data exports, or partner access first.

Do not tune the machine and the proxy route at the same time. Change one variable, keep the log, then decide whether the bottleneck was CPU, memory, network, or the proxy exit.

Browser workers hit memory before CPU

Requests-only crawlers (see Python Requests proxy docs) can be cheap and boring on dedicated hardware. Playwright and Chromium are less forgiving. Each browser context carries memory overhead, failed runs leave profile directories behind, and a single heavy page can stall worker timing for the whole pool.

On a 16 GB host, start with four Playwright contexts, not twenty. Watch RAM, swap, open-file count, CPU, queue depth, and accepted rows together. Raise the worker count only if all of those, not just the CPU graph, stay within limits you set in advance. A blank benchmark page tells you almost nothing about the actual target.
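The "raise only when everything holds" rule can be written as a small gate. This is a minimal sketch, assuming the scheduler already collects one snapshot of host signals per interval; the field names and every threshold below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class HostSnapshot:
    """One reading of the signals watched together (names are illustrative)."""
    ram_used_fraction: float    # 0.0 to 1.0
    swap_used_mb: float
    open_files_fraction: float  # open files divided by the ulimit
    cpu_used_fraction: float    # 0.0 to 1.0
    queue_depth_delta: int      # change in queue depth since the last reading
    accepted_rows_delta: int    # accepted rows since the last reading


def next_worker_count(current: int, snap: HostSnapshot,
                      step: int = 1, ceiling: int = 20) -> int:
    """Return current + step only when every signal is healthy; otherwise hold."""
    healthy = (
        snap.ram_used_fraction < 0.75          # assumed limit
        and snap.swap_used_mb == 0             # any swap use blocks a raise
        and snap.open_files_fraction < 0.5     # assumed limit
        and snap.cpu_used_fraction < 0.8       # assumed limit
        and snap.queue_depth_delta <= 0        # queue is not growing
        and snap.accepted_rows_delta > 0       # the job still yields useful rows
    )
    return min(current + step, ceiling) if healthy else current
```

Starting from four contexts, the count moves up one step per healthy interval and holds the moment any single signal slips, which keeps a quiet CPU graph from justifying a raise on its own.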

Signal	What it means	Action
Swap in use	Workers' combined memory exceeds physical RAM	Cut workers, not page timeout
Open files near ulimit	Browser profiles not cleaning up	Close contexts explicitly; add profile cleanup on exit
Queue depth rising	Workers slower than intake rate	Check per-page latency before adding workers
CPU idle, throughput flat	I/O or network is the wall, not CPU	Profile at the network level before adding cores
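The table can be turned into a triage function that a monitoring loop calls on each reading. This is a sketch under stated assumptions: the parameter names, the 80% open-file margin, and the 50% idle cutoff are all invented for illustration and should be replaced with values measured on the real job.

```python
from typing import List


def triage(swap_used_mb: float, open_files: int, ulimit_nofile: int,
           queue_depth_delta: int, cpu_idle_fraction: float,
           throughput_delta: float) -> List[str]:
    """Map the four signals in the table to their actions (thresholds assumed)."""
    actions = []
    if swap_used_mb > 0:
        actions.append("cut workers, not page timeout")
    if open_files >= 0.8 * ulimit_nofile:       # "near ulimit", assumed as 80%
        actions.append("close contexts explicitly; clean profiles on exit")
    if queue_depth_delta > 0:                   # queue grew since the last reading
        actions.append("check per-page latency before adding workers")
    if cpu_idle_fraction > 0.5 and throughput_delta <= 0:
        actions.append("profile the network before adding cores")
    return actions
```

An empty list means none of the four signals fired; it does not mean the host is healthy in every other respect.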

Proxy credentials stay out of logs and source

Put the proxy string in an environment file or a secret store, not in source code, crash output, or screenshots. On a Proxynade pool, the expanded username carries routing in the credential itself: base user plus plan token (volume, premium, or datacenter), an optional country code, and an optional lifetime-<minutes> rotation window. A 30-minute sticky session for a browser job uses a username like rt97db6958d9-plan-premium-country-us-lifetime-30; the gateway is http://proxynade.net:2555.
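Composing the expanded username in one place avoids hand-edited strings drifting between jobs. The sketch below assumes the segment order and hyphen separators seen in the single example above; the function name is hypothetical, and the exact grammar should be checked against the provider's own documentation.

```python
from typing import Optional

# Plan tokens named in the text above.
ALLOWED_PLANS = {"volume", "premium", "datacenter"}


def build_proxy_username(base_user: str, plan: str,
                         country: Optional[str] = None,
                         lifetime_minutes: Optional[int] = None) -> str:
    """Join base user, plan token, optional country, optional lifetime window."""
    if plan not in ALLOWED_PLANS:
        raise ValueError(f"unknown plan token: {plan!r}")
    parts = [base_user, "plan", plan]
    if country is not None:
        parts += ["country", country]
    if lifetime_minutes is not None:
        if lifetime_minutes <= 0:
            raise ValueError("lifetime_minutes must be positive")
        parts += ["lifetime", str(lifetime_minutes)]
    return "-".join(parts)


# Reproduces the username from the example in the text.
username = build_proxy_username("rt97db6958d9", "premium", "us", 30)
```

The password and the final proxy URL are left out on purpose: they belong in the environment file or secret store, not in a helper that might get logged.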

Name the environment variable after the job class, not the credential. A failed run can then log which route it used without printing the full string. The Proxynade dashboard network logs show host, outcome, latency, and byte totals; export usage as CSV for billing reconciliation.
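One way to keep the credential out of logs is to treat the variable name as the route label. This is a minimal sketch; the `PROXY_URL_<JOB_CLASS>` naming scheme and both function names are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple
from urllib.parse import urlsplit


def load_route(job_class: str,
               env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (route_label, proxy_url); only the label is safe to log."""
    env = os.environ if env is None else env
    label = f"PROXY_URL_{job_class.upper()}"   # hypothetical naming scheme
    return label, env[label]                   # KeyError if the route is missing


def redact(proxy_url: str) -> str:
    """Drop the user:password part, keeping scheme, host, and port."""
    parts = urlsplit(proxy_url)
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://***@{parts.hostname or ''}{port}"
```

A failed run would log the label (for example `PROXY_URL_BROWSER_US`) and, if the gateway itself matters for debugging, the redacted form; the full string never reaches the log handler.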

Bytes per accepted row is the number worth tracking. A worker dashboard may report 80,000 pages processed while the useful count is 19,000 accepted rows after duplicates, soft blocks, empty pages, retries, and parser rejects. The proxy meter counts redirects, failed TLS attempts, images, and scripts that the app never credits. Both numbers are real; the gap between them is where the job cost lives.
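The arithmetic is simple enough to compute at the end of every run. In the sketch below the page and row counts come from the example above, while the 38 GB proxy total is an invented figure used only to make the numbers concrete.

```python
def job_cost(pages_processed: int, accepted_rows: int, proxy_bytes: int) -> dict:
    """Relate what the proxy metered to what the job actually kept."""
    if pages_processed <= 0 or accepted_rows <= 0:
        raise ValueError("need at least one page and one accepted row")
    return {
        "accept_ratio": accepted_rows / pages_processed,
        "bytes_per_page": proxy_bytes / pages_processed,
        "bytes_per_accepted_row": proxy_bytes / accepted_rows,
    }


# 80,000 pages and 19,000 accepted rows from the text; 38 GB is hypothetical.
report = job_cost(80_000, 19_000, 38_000_000_000)
```

With these inputs the per-page figure is 475,000 bytes but the per-accepted-row figure is 2,000,000 bytes, roughly four times higher, and the second number is the one the invoice reflects.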

Supervise workers so failures surface at 2 a.m.

Run workers under systemd, PM2, or supervisord, whichever your team already knows how to restart at 2 a.m. The tool matters less than what it logs. Before any overnight run, confirm these are visible somewhere: queue depth, worker count, exit code, last target host, proxy route label, accepted rows, rejected rows, retry reason, and disk free space.
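Whatever supervisor runs the workers, the fields listed above are easier to find at 2 a.m. if each worker emits them as one structured line. This is a sketch, assuming JSON lines on stdout are collected by the supervisor; the class and field names are illustrative, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class WorkerStatus:
    """The nine fields worth having visible before an overnight run."""
    queue_depth: int
    worker_count: int
    exit_code: Optional[int]       # None while the worker is still running
    last_target_host: str
    proxy_route_label: str         # a label, never the credential string
    accepted_rows: int
    rejected_rows: int
    retry_reason: Optional[str]
    disk_free_bytes: int


def status_line(status: WorkerStatus) -> str:
    """Serialize one status record as a single JSON line with stable key order."""
    return json.dumps(asdict(status), sort_keys=True)


line = status_line(WorkerStatus(
    queue_depth=1200, worker_count=4, exit_code=None,
    last_target_host="example.com", proxy_route_label="PROXY_URL_BROWSER_US",
    accepted_rows=310, rejected_rows=42, retry_reason="timeout",
    disk_free_bytes=52_000_000_000,
))
```

One line per interval is enough; the point is that the record is complete and machine-readable, so a single search answers which host, which route, and why the retry happened.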

A non-root worker user, pinned language dependencies, and a bootstrap script you can rerun on a clean host are the baseline. Package names, Node versions, and Playwright browser installs change; the runbook belongs with the scraper, not in a blog post.

Disk fills silently while CPU looks healthy

Logs, screenshots, HTML snapshots, failed-response bodies, downloaded assets, and browser profile directories accumulate while the CPU graph stays flat. Rotate logs before launch. Cap screenshot and raw HTML capture. If you need evidence for parser bugs, sample failed pages rather than saving every broken response forever.
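Sampling failed pages under a hard cap can be sketched as follows. The class name, the 5% rate, and the cap of 200 are assumptions chosen for illustration; hashing the URL makes the decision repeatable, so the same page is either always or never a candidate across reruns.

```python
import hashlib


class FailureSampler:
    """Keep a deterministic sample of failed pages, up to a fixed cap."""

    def __init__(self, sample_rate: float = 0.05, max_saved: int = 200):
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        # Compare in integer space to avoid float rounding at the edges.
        self._threshold = int(sample_rate * 2**64)
        self.max_saved = max_saved
        self.saved = 0

    def should_save(self, url: str) -> bool:
        """Return True if this failed response should be written to disk."""
        if self.saved >= self.max_saved:
            return False                      # cap reached: save nothing more
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big")
        if bucket < self._threshold:
            self.saved += 1
            return True
        return False
```

The cap bounds disk use regardless of how badly the run goes, while the sample still leaves enough evidence to reproduce a parser bug.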

Set a disk-space alert at a threshold that gives you time to act, not a post-mortem. The jobs that look fine at midnight and ruin the morning are almost always disk, not compute.
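A disk check needs nothing beyond the standard library. This is a minimal sketch; the 15% free-space threshold and the function names are assumptions, and the right threshold depends on how fast the job writes and how long it takes someone to respond.

```python
import shutil
from typing import Tuple


def is_low(total_bytes: int, free_bytes: int,
           min_free_fraction: float = 0.15) -> bool:
    """True when free space has dropped below the alert threshold."""
    return free_bytes / total_bytes < min_free_fraction


def check_path(path: str = "/",
               min_free_fraction: float = 0.15) -> Tuple[bool, float]:
    """Read real usage for the filesystem holding path: (alert, free fraction)."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return is_low(usage.total, usage.free, min_free_fraction), free_fraction
```

Calling `check_path` from the worker loop or a timer, against the volume that holds logs and captures, turns a silent fill into an alert while there is still room to rotate or prune.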

Dedicated server scraper FAQ

When does a scraper justify a dedicated server? When the host is the named bottleneck: Chromium contexts being OOM-killed, workers hitting swap, or disk filling mid-run. Not before.

How many browser contexts can a dedicated server run? It depends on the page, not the spec sheet. Start with four Playwright contexts on a 16 GB host, watch RAM, swap, and open-file counts, then raise the limit only if headroom remains.

Where should proxy credentials go on a server? In an environment file or secret store, not in source code, log output, or crash dumps. The scraper can log a route label; it should not log the credential string.

Why does disk fill up even when CPU looks fine? Logs, screenshots, HTML snapshots, failed-response bodies, and browser profile directories accumulate silently. Rotate logs before the job starts and cap raw HTML captures.

What should a worker log at minimum? Queue depth, worker count, exit code, last target host, proxy route label, accepted rows, rejected rows, retry reason, and disk free space.

Pre-launch checklist

Job verified on a small batch before any overnight run.
Proxy credentials in an environment file or secret store, never in source or logs.
Workers running under a supervisor with restart-on-exit.
Log rotation configured before first run.
Disk alert set at a threshold that leaves time to act.
Worker count proven against real target pages, not a blank benchmark.
