The run sheet comes before proxy config
A directory run looks healthy until the export gets sorted. Then the same agency, clinic, or law office shows up under three category URLs, two city filters, and one legacy profile path. Nothing is wrong in the proxy logs: the crawler treated every URL variant as a new profile.
Before touching concurrency or route mode, write down the real target: allowed path patterns, approved fields, stop rule, and dedupe key. In one directory the dedupe key is a normalized profile URL. In another it is host plus profile ID, because category paths are noisy. The rule is the same in both cases: prove a page is new before sending the next request.
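A minimal sketch of both dedupe keys, assuming Python and the standard library only. The `/profile/<digits>` pattern, the function names, and the in-memory `seen` set are illustrative assumptions, not taken from any particular directory; a real run would match the directory's actual profile paths and probably persist the set.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical pattern: assumes profile URLs end in a numeric ID segment,
# e.g. /dentists/boston/profile/48213. Adjust to the real directory.
PROFILE_ID_RE = re.compile(r"/profile/(\d+)/?$")


def normalized_url_key(url: str) -> str:
    """Dedupe key for directories whose profile URLs are already clean."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    # Query string and fragment are dropped so that city filters and
    # tracking parameters do not make a known profile look new.
    return f"{host}{path}"


def host_id_key(url: str) -> Optional[str]:
    """Dedupe key for directories whose category paths are noisy."""
    parts = urlsplit(url)
    match = PROFILE_ID_RE.search(parts.path)
    if match is None:
        return None  # not a profile page under the assumed pattern
    return f"{(parts.hostname or '').lower()}#{match.group(1)}"


seen = set()


def is_new(key: str) -> bool:
    """Return True the first time a key is offered, False afterwards."""
    if key in seen:
        return False
    seen.add(key)
    return True
```

The key is computed from the URL alone, so the check can run before the request is sent rather than after the page is parsed.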
Route mode matches the session shape, not the target name
Broad public listing pages can rotate freely. A search-to-profile walk is easier with a sticky session, because the route stays steady while the crawler moves through one public flow. On a Proxynade pool, stickiness lives in the expanded username: add lifetime-<minutes> to hold the exit for that long. A flow that needs twenty minutes on one exit uses lifetime-20. Datacenter lines skip the lifetime token.
Proxy type follows the same logic. If the directory is open and forgiving, datacenter is the cheapest first test. Volume Residential ($0.89/GB) covers regional visibility or reputation requirements. Premium Residential ($5.00/GB) is for the slice of the run that is actually burning retries. Static ISP applies only when a stable public route is part of the design. It is pay-per-IP with unlimited bandwidth, so the math only works for long-running fixed-exit tasks.
| Session pattern | Route mode | Proxy type to try first |
|---|---|---|
| Public listing pages, full rotation | Rotating (no lifetime token) | Datacenter or Volume Residential |
| Search-to-profile walk | Sticky (lifetime-20 or similar) | Volume Residential |
| High-retry segment | Sticky or rotating depending on target | Premium Residential |
| Fixed public route, long-running | Static | Static ISP |
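The table can be kept next to the crawler as a small lookup, so the route choice is written down rather than remembered. This is only a sketch: the pattern keys, the `RoutePlan` type, and the choice of rotating for the high-retry row are assumptions for illustration. It produces the lifetime token alone; how that token is joined to the rest of the username is not specified here and should be taken from the provider's documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutePlan:
    route_mode: str                # "rotating", "sticky", or "static"
    proxy_type: str                # first proxy type to try
    sticky_minutes: Optional[int]  # None means no lifetime token


# Keys are hypothetical labels for the session patterns in the table.
ROUTE_PLANS = {
    "public_listing": RoutePlan("rotating", "datacenter", None),
    "search_to_profile": RoutePlan("sticky", "volume_residential", 20),
    # The table allows sticky or rotating here; rotating is an arbitrary pick.
    "high_retry": RoutePlan("rotating", "premium_residential", None),
    "fixed_route": RoutePlan("static", "static_isp", None),
}


def lifetime_token(plan: RoutePlan) -> Optional[str]:
    """Return the lifetime token for a plan, or None when none applies."""
    if plan.route_mode != "sticky" or plan.sticky_minutes is None:
        return None
    if plan.proxy_type == "datacenter":
        return None  # datacenter lines skip the lifetime token
    return f"lifetime-{plan.sticky_minutes}"
```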
App counters and proxy meters count different things
The application counter tracks accepted profiles and sometimes accepted pages, so cost per profile looks tidy. The proxy meter also counts redirects, filtered profiles, blocked pages, duplicate pages, retries, map tiles, image CDNs, and trackers. The dashboard network logs show host, outcome, latency, and byte totals per request, and usage logs export as CSV.
When people say the app is lying, this is the gap they mean. Compare both after each batch, not at the end of the run.
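One way to make the per-batch comparison routine is a small report function. The sketch below assumes the app can report the bytes it attributes to accepted pages and that the proxy meter total is read from the dashboard or usage log; the field names are hypothetical, and the decimal-gigabyte constant is an assumption that should be checked against how the provider actually bills.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_PER_GB = 1_000_000_000  # assumption: decimal GB; confirm with the provider


@dataclass
class BatchTotals:
    accepted_profiles: int  # from the application counter
    app_bytes: int          # bytes the app recorded for accepted pages
    proxy_bytes: int        # bytes reported by the proxy meter


def batch_gap(batch: BatchTotals, price_per_gb: float) -> dict:
    """Compare the app counter with the proxy meter for one batch."""
    overhead = batch.proxy_bytes - batch.app_bytes
    share = overhead / batch.proxy_bytes if batch.proxy_bytes else 0.0
    cost = batch.proxy_bytes / BYTES_PER_GB * price_per_gb
    per_profile: Optional[float] = (
        cost / batch.accepted_profiles if batch.accepted_profiles else None
    )
    return {
        "overhead_bytes": overhead,
        "overhead_share": share,
        "metered_cost": cost,
        "cost_per_accepted_profile": per_profile,
    }
```

As a made-up illustration, a batch with 500 accepted profiles, 0.4 GB seen by the app, and 1 GB on the meter at $0.89/GB has a 60% overhead share and costs $0.89, or about $0.0018 per accepted profile. A rising overhead share between batches is the signal to look at before touching concurrency.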
A domain blocklist saves more bandwidth than raising rate limits
If the research never uses map tiles, tracking hosts, widget scripts, or image galleries, block those hosts before raising concurrency. A crawler can behave politely on requests per second while spending most of each page load on files the parser discards.
Build the blocklist from the first batch. Export the usage log CSV, sort hosts by bytes, and add to the blocklist the hosts that never appear in accepted records. That list does not change much between runs on the same directory.
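A rough sketch of that first-batch pass, using only the standard library. The column names `host` and `bytes` are assumptions about the export, not a documented schema, so they are parameters; the function also assumes every row has an integer byte count.

```python
import csv
from collections import Counter
from typing import Iterable, List, Tuple


def blocklist_candidates(
    usage_csv: Iterable[str],
    accepted_hosts: Iterable[str],
    host_col: str = "host",    # assumed column name
    bytes_col: str = "bytes",  # assumed column name
) -> List[Tuple[str, int]]:
    """Return (host, total_bytes) for hosts absent from accepted records,
    heaviest first."""
    totals: Counter = Counter()
    for row in csv.DictReader(usage_csv):
        totals[row[host_col].strip().lower()] += int(row[bytes_col])
    keep = {h.strip().lower() for h in accepted_hosts}
    # most_common() sorts by total bytes, descending.
    return [(host, n) for host, n in totals.most_common() if host not in keep]
```

Passing an open file works, for example `blocklist_candidates(open("usage.csv", newline=""), {"directory.example"})`, where both the filename and the host are placeholders. The output is a candidate list to review by hand, since a host that looks unused may still serve something the parser needs.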
Public scope is a run-sheet decision, not a proxy setting
Logins, private profiles, private messages, paywalls, and robots restrictions are boundaries on the run sheet, not routing problems. If a directory marks a section as not available for automated collection, that section stays off the path list. No proxy configuration changes what the run is permitted to collect.
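The scope rule can be enforced in the crawler itself, before any request is queued. The sketch below assumes the run sheet stores allowed paths as regular expressions and that the directory's robots.txt text has already been fetched; the helper name and parameters are hypothetical. It relies on the standard library's `urllib.robotparser`.

```python
import re
from typing import Callable, Iterable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_scope_check(
    target_host: str,
    allowed_path_patterns: Iterable[str],
    robots_txt: str,
    user_agent: str,
) -> Callable[[str], bool]:
    """Build a predicate that says whether a URL is on the run sheet."""
    allowed = [re.compile(p) for p in allowed_path_patterns]
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    def in_scope(url: str) -> bool:
        parts = urlsplit(url)
        if (parts.hostname or "").lower() != target_host.lower():
            return False  # wrong host
        if not any(p.fullmatch(parts.path) for p in allowed):
            return False  # path is not on the run sheet
        # robots.txt can veto a path even if the run sheet lists it.
        return robots.can_fetch(user_agent, url)

    return in_scope
```

Because the check sits in front of the request queue, switching proxy type or route mode has no effect on what it lets through.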
What to keep in the run folder
- Run sheet: target host, allowed path patterns, approved fields, stop rule.
- Dedupe key and deduplication method.
- Route mode and proxy type used.
- Rate limit set at the start.
- App bytes and proxy meter bytes after each batch.
- Accepted profile count per batch.
If numbers look odd after the run, that folder is enough to see whether the crawler drifted, the directory changed structure, or the proxy layer carried exactly the work it was given.
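For the per-batch items on that list, an append-only JSON Lines file is usually enough. This is a sketch under assumptions: the field names and the `batches.jsonl` filename are made up for illustration, and nothing here is tied to a specific crawler.

```python
import json
from pathlib import Path


def batch_line(batch_no: int, accepted_profiles: int,
               app_bytes: int, proxy_bytes: int) -> str:
    """Serialize one batch record as a single JSON line."""
    record = {
        "batch": batch_no,
        "accepted_profiles": accepted_profiles,
        "app_bytes": app_bytes,
        "proxy_bytes": proxy_bytes,
        "overhead_bytes": proxy_bytes - app_bytes,
    }
    return json.dumps(record, sort_keys=True)


def append_batch(run_folder: str, line: str,
                 filename: str = "batches.jsonl") -> None:
    """Append one record to the run folder, creating the folder if needed."""
    folder = Path(run_folder)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / filename, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Appending after every batch, rather than writing one summary at the end, keeps the record intact if the run stops early.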