Directory research with proxy controls

The dedupe key comes before concurrency, and route mode comes after that. Most cost problems come from neither; they come from traffic the crawler never needed.

Field notes | Setup checks | Updated 2026-06-12

The run sheet comes before proxy config

A directory run looks healthy until the export gets sorted. Then the same agency, clinic, or law office shows up under three category URLs, two city filters, and one legacy profile path. Nothing is wrong in the proxy logs: the crawler simply treated every URL variant as a new profile.

Before touching concurrency or route mode, write down the real target: allowed path patterns, approved fields, stop rule, and dedupe key. In one directory the dedupe key is a normalized profile URL. In another it is host plus profile ID, because category paths are noisy. The rule is the same in both cases: prove a page is new before sending the next request.
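The two dedupe keys described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `/profile/<id>` path shape, the function names, and the `SeenSet` class are all assumptions made for the example, and a real directory will need its own pattern.

```python
import re
from typing import Callable, Optional
from urllib.parse import urlsplit

# Assumption: profile pages carry a numeric ID after "/profile/".
PROFILE_ID = re.compile(r"/profile/(\d+)")


def url_key(url: str) -> str:
    """Normalized profile URL: lowercase host, no query, fragment, or trailing slash."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return f"{host}{path}"


def id_key(url: str) -> Optional[str]:
    """Host plus profile ID, for directories where category paths are noisy."""
    parts = urlsplit(url)
    match = PROFILE_ID.search(parts.path)
    if match is None:
        return None  # not a profile URL under the assumed pattern
    return f"{(parts.hostname or '').lower()}#{match.group(1)}"


class SeenSet:
    """Answers 'is this page new?' before the next request is sent."""

    def __init__(self, key_fn: Callable[[str], Optional[str]]) -> None:
        self._key_fn = key_fn
        self._seen = set()

    def is_new(self, url: str) -> bool:
        key = self._key_fn(url)
        if key is None or key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With `SeenSet(id_key)`, a profile reached through a category path and the same profile reached through a city filter collapse to one key, so only the first is fetched.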

Route mode matches the session shape, not the target name

Broad public listing pages can rotate freely. A search-to-profile walk is easier with a sticky session, because the route stays steady while the crawler moves through one public flow. On a Proxynade pool, stickiness lives in the expanded username: add lifetime-<minutes> to hold the exit for that long. A flow that needs twenty minutes on one exit uses lifetime-20. Datacenter lines skip the lifetime token.
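As a rough sketch, the lifetime token could be attached with a small helper like the one below. Only the `lifetime-<minutes>` token itself comes from the notes above; the base username, the joining character, and the function name are assumptions for illustration, so check the provider's own documentation for the real username layout.

```python
from typing import Optional


def expanded_username(
    base: str,
    sticky_minutes: Optional[int] = None,
    is_datacenter: bool = False,
    joiner: str = "-",  # assumption: the real separator may differ
) -> str:
    """Return the proxy username, with a lifetime token for sticky sessions."""
    # Rotating sessions and datacenter lines carry no lifetime token.
    if is_datacenter or sticky_minutes is None:
        return base
    if sticky_minutes <= 0:
        raise ValueError("sticky_minutes must be a positive number of minutes")
    return f"{base}{joiner}lifetime-{sticky_minutes}"


# A flow that needs twenty minutes on one exit:
print(expanded_username("acct-user", sticky_minutes=20))
```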

Proxy type follows the same logic. If the directory is open and forgiving, datacenter is the first cheap test. Volume Residential ($0.89/GB) covers regional visibility or reputation requirements. Premium Residential ($5.00/GB) is for the slice of the run that is actually burning retries. Static ISP applies only when a stable public route is part of the design. It is pay-per-IP with unlimited bandwidth, so the math only works for long-running fixed-exit tasks.

Session pattern                     | Route mode                             | Proxy type to try first
Public listing pages, full rotation | Rotating (no lifetime token)           | Datacenter or Volume Residential
Search-to-profile walk              | Sticky (lifetime-20 or similar)        | Volume Residential
High-retry segment                  | Sticky or rotating depending on target | Premium Residential
Fixed public route, long-running    | Static                                 | Static ISP

App counters and proxy meters count different things

The application counter tracks accepted profiles and maybe accepted pages, so cost per profile looks tidy. The proxy meter also counts redirects, filtered profiles, blocked pages, duplicate pages, retries, map tiles, image CDNs, and trackers. The dashboard network logs show host, outcome, latency, and byte totals per request, and usage logs export as CSV.

When people say the app is lying, this is the gap they mean. Compare both after each batch, not at the end of the run.
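A per-batch comparison might look something like this sketch. The `Batch` fields, the `reconcile` function, and the decimal-gigabyte billing unit are assumptions for illustration; the per-GB rate in the usage line is the Volume Residential figure quoted earlier.

```python
from dataclasses import dataclass

BYTES_PER_GB = 1_000_000_000  # assumption: billing uses decimal gigabytes


@dataclass
class Batch:
    accepted_profiles: int
    app_bytes: int    # bytes the application counted
    proxy_bytes: int  # bytes the proxy meter counted


def reconcile(batch: Batch, usd_per_gb: float) -> dict:
    """Show the gap between the two counters and what it does to cost per profile."""
    cost = batch.proxy_bytes / BYTES_PER_GB * usd_per_gb
    return {
        # Traffic the meter billed that the app never saw as accepted work.
        "uncounted_bytes": batch.proxy_bytes - batch.app_bytes,
        "meter_to_app_ratio": (
            batch.proxy_bytes / batch.app_bytes if batch.app_bytes else None
        ),
        "cost_usd": cost,
        "usd_per_profile": (
            cost / batch.accepted_profiles if batch.accepted_profiles else None
        ),
    }


report = reconcile(Batch(500, 400_000_000, 1_000_000_000), usd_per_gb=0.89)
print(report)
```

In this made-up batch the meter carried 2.5 times what the app counted, which is the kind of ratio worth catching after the first batch rather than on the invoice.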

A domain blocklist saves more bandwidth than raising rate limits

If the research never uses map tiles, tracking hosts, widget scripts, or image galleries, block those hosts before raising concurrency. A crawler can behave politely on requests per second while spending most of each page load on files the parser discards.

Build the blocklist from the first batch. Export the usage log CSV, sort by bytes, and drop the hosts that never appear in accepted records. That list does not change much between runs on the same directory.
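The sort-and-drop step can be sketched in a few lines. The CSV column names (`host`, `bytes`) and the function name are assumptions here, since the actual export layout is not described above; adjust them to match the real file.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, Set


def build_blocklist(usage_csv: Iterable[str], accepted_hosts: Set[str]) -> List[str]:
    """Hosts that never appear in accepted records, heaviest byte totals first."""
    totals = Counter()
    for row in csv.DictReader(usage_csv):
        totals[row["host"].lower()] += int(row["bytes"])
    # most_common() yields hosts sorted by total bytes, descending.
    return [host for host, _ in totals.most_common() if host not in accepted_hosts]


# Made-up sample rows standing in for an exported usage log.
sample = io.StringIO(
    "host,bytes\n"
    "dir.example,1000\n"
    "tiles.maps.example,5000\n"
    "cdn.images.example,3000\n"
    "tiles.maps.example,2000\n"
)
print(build_blocklist(sample, accepted_hosts={"dir.example"}))
# → ['tiles.maps.example', 'cdn.images.example']
```

Review the result by hand before applying it: a host that serves both discarded assets and data the parser needs should stay off the list.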

Public scope is a run-sheet decision, not a proxy setting

Logins, private profiles, private messages, paywalls, and robots restrictions are boundaries on the run sheet, not routing problems. If a directory marks a section as not available for automated collection, that section stays off the path list. No proxy configuration changes what the run is permitted to collect.
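One way to enforce that boundary in the crawler itself is a scope check that runs before any request is queued. The sketch below combines the run sheet's allowed path patterns with the site's robots.txt using Python's standard `urllib.robotparser`; the pattern syntax (shell-style globs), the user agent string, and the function name are assumptions for the example.

```python
from fnmatch import fnmatchcase
from typing import Callable, List
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_scope_check(
    allowed_patterns: List[str],
    robots_txt: str,
    user_agent: str = "research-crawler",  # hypothetical agent name
) -> Callable[[str], bool]:
    """Build a function that says whether a URL is on the run sheet's path list."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    def in_scope(url: str) -> bool:
        path = urlsplit(url).path
        # The run sheet decides first; robots.txt can only narrow it further.
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns):
            return False
        return robots.can_fetch(user_agent, url)

    return in_scope


check = make_scope_check(
    ["/profile/*", "/listing/*"],
    "User-agent: *\nDisallow: /profile/private/\n",
)
print(check("https://dir.example/profile/42"))          # True
print(check("https://dir.example/profile/private/42"))  # False
```

The check takes no proxy arguments at all, which mirrors the point above: scope is decided before routing is.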

What to keep in the run folder

  • Run sheet: target host, allowed path patterns, approved fields, stop rule.
  • Dedupe key and deduplication method.
  • Route mode and proxy type used.
  • Rate limit set at the start.
  • App bytes and proxy meter bytes after each batch.
  • Accepted profile count per batch.
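The items above could be kept in a single JSON file per run, along the lines of this sketch. The field names, the `run_sheet.json` filename, and the `RunSheet` class are illustrative assumptions, not a required layout.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunSheet:
    target_host: str
    allowed_path_patterns: list
    approved_fields: list
    stop_rule: str
    dedupe_key: str
    dedupe_method: str
    route_mode: str
    proxy_type: str
    rate_limit_rps: float  # rate limit set at the start of the run
    batches: list = field(default_factory=list)

    def add_batch(self, accepted_profiles: int, app_bytes: int, proxy_bytes: int) -> None:
        """Record both byte counters and the accepted count after each batch."""
        self.batches.append(
            {
                "accepted_profiles": accepted_profiles,
                "app_bytes": app_bytes,
                "proxy_bytes": proxy_bytes,
            }
        )

    def save(self, folder: str) -> Path:
        """Write the sheet to <folder>/run_sheet.json, creating the folder if needed."""
        path = Path(folder) / "run_sheet.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2))
        return path
```

Calling `add_batch` and `save` after every batch keeps the file current even if the run stops early.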

If numbers look odd after the run, that folder is enough to see whether the crawler drifted, the directory changed structure, or the proxy layer carried exactly the work it was given.