Session Quality and Scrape Results: 200s That Lie

200 is not the right success metric

I had a listing scrape where the session looked healthy by the wrong measure. Same target list, same parser, same schedule. Most responses were 200, latency was not alarming, and the worker queue was calm. I spent forty minutes diffing saved HTML against the Playwright trace before realizing the parser was not the problem.

The saved pages were broken in small ways: right shell, no listings. CAPTCHA bodies with normal-looking status codes. Duplicate rows because the target served the same fallback page for different listing URLs. There was no dramatic proxy outage, just sessions that connected and produced nothing usable.

This is a trimmed version of that log:

14:12:08 w3 sess=rot host=target-a url=/search?...p=4 status=200 len=11840 grid=0 trace=plw_7f2 note=shell
14:12:11 w1 sess=rot host=target-a url=/search?...p=5 status=200 len=11902 grid=0 trace=plw_7f9 note=same shell
14:12:14 w2 sess=rot host=target-a url=/item/8841 status=200 len=46211 fields=ok bytes=48902
14:17:42 w3 sess=sticky-a7 host=target-a url=/search?...p=6 status=200 len=58764 grid=42 note=ok-ish

The useful signal was not the status code. It was the mismatch between grid=0 on the rotating lines and grid=42 on a sticky line hitting the same search path, one page further along. A session is good when it returns the expected page, inside the collector's timing tolerance, with required fields present. A 200 OK with no grid rows is a failed scrape.
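A minimal sketch of that check, assuming the log keeps the key=value layout shown above. The helper names parse_log_line and is_thin_200 are made up for illustration; they are not part of any tool mentioned here.

```python
SAMPLE_LOG = """\
14:12:08 w3 sess=rot host=target-a url=/search?...p=4 status=200 len=11840 grid=0 trace=plw_7f2 note=shell
14:12:11 w1 sess=rot host=target-a url=/search?...p=5 status=200 len=11902 grid=0 trace=plw_7f9 note=same shell
14:12:14 w2 sess=rot host=target-a url=/item/8841 status=200 len=46211 fields=ok bytes=48902
14:17:42 w3 sess=sticky-a7 host=target-a url=/search?...p=6 status=200 len=58764 grid=42 note=ok-ish
"""


def parse_log_line(line):
    """Turn one log line into a dict. The first two tokens are time and worker."""
    tokens = line.split()
    row = {"time": tokens[0], "worker": tokens[1]}
    last_key = None
    for token in tokens[2:]:
        # Split on the first "=" only, so url=/search?...p=4 keeps its query intact.
        key, sep, value = token.partition("=")
        if sep:
            row[key] = value
            last_key = key
        elif last_key is not None:
            # A bare word continues the previous value, e.g. note=same shell.
            row[last_key] += " " + token
    return row


def is_thin_200(row):
    """A 200 that carried a listing grid with zero rows counts as a failed scrape."""
    return row.get("status") == "200" and row.get("grid") == "0"


rows = [parse_log_line(line) for line in SAMPLE_LOG.splitlines()]
thin_urls = [row["url"] for row in rows if is_thin_200(row)]
print(thin_urls)
```

The detail-page line has no grid field, so it is not flagged; a real check would need a separate rule for detail pages, such as the fields=ok marker.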

Log one row per request with all the fields that matter

When the proxy dashboard, scraper output, and metrics tab are separate, people guess. A single log row per request that carries host, plan, session mode, HTTP status, latency, bytes, accepted fields, and CAPTCHA flag makes the source of the problem visible without cross-referencing three tools.

The fields that matter most are the ones that survive joining the proxy usage log against scraper output: host, plan, protocol, outcome, HTTP status, latency, bytes billed, accepted record count, duplicate rate, and CAPTCHA rate. If those are in one place, the question of whether the problem is routing, parser logic, target behavior, worker pressure, or geography usually answers itself.
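One way to hold that row is sketched below. The field names, types, and the rates_by_mode helper are assumptions chosen to mirror the list above, not an export format from any dashboard. Duplicate and CAPTCHA are stored as per-request flags so the rates can be computed per session mode afterwards.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestRow:
    # Hypothetical schema: one row per request, every field in one place.
    host: str
    plan: str
    session_mode: str
    protocol: str
    outcome: str
    http_status: int
    latency_ms: int
    bytes_billed: int
    accepted_records: int
    duplicate: bool
    captcha: bool

    def to_json_line(self):
        """Serialize the row as a single JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


def rates_by_mode(rows):
    """Return the CAPTCHA and duplicate rates for each session mode."""
    totals = {}
    for row in rows:
        bucket = totals.setdefault(
            row.session_mode, {"requests": 0, "captcha": 0, "duplicate": 0}
        )
        bucket["requests"] += 1
        bucket["captcha"] += int(row.captcha)
        bucket["duplicate"] += int(row.duplicate)
    return {
        mode: {
            "captcha_rate": b["captcha"] / b["requests"],
            "duplicate_rate": b["duplicate"] / b["requests"],
        }
        for mode, b in totals.items()
    }


# Illustrative values only; they are not taken from the run described here.
example_rows = [
    RequestRow("target-a", "volume", "rotating", "https", "ok", 200, 840, 11840, 0, True, False),
    RequestRow("target-a", "volume", "rotating", "https", "ok", 200, 910, 11902, 0, False, True),
    RequestRow("target-a", "volume", "sticky", "https", "ok", 200, 870, 58764, 42, False, False),
]
print(rates_by_mode(example_rows))
```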

Rotating is a cheap first test, not a permanent answer

Rotating changes the exit IP on every request. It is correct for broad independent URL lists and cheap to set up. For repeated checks on one permitted target, sticky sessions look more ordinary to the target: it sees the same IP across a sequence, and the fallback-page rate drops. For multi-step flows where state must persist across requests, hard sessions are required.

In that listing run, switching to sticky did not improve speed. It reduced the rate of shell pages and duplicates, so the run returned more usable rows, and that was what mattered.

Mode	When to use	Main failure pattern
Rotating	Broad independent URL lists	Fallback pages, duplicate rows when target flags churn
Sticky	Repeated checks on one target, short flows	Session expiry mid-flow; timer must match `lifetime-<minutes>`
Hard	Multi-step flows where state matters	IP dies mid-flow; need retry at a process boundary

On Proxynade's pool, the session mode lives in the username. A sticky assignment uses a lifetime-<minutes> token in the expanded username. For example, rt97db6958d9-plan-volume-country-us-lifetime-30 holds the same exit for 30 minutes. Rotating omits that token. On datacenter, lifetime works only on a sticky assignment.
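A small sketch of building and reading that token. The only format confirmed here is the single example username above; appending the token at the end, and the helper names with_lifetime and sticky_minutes, are assumptions for illustration.

```python
import re

# Matches a lifetime-<minutes> token anywhere in a dash-separated username.
_LIFETIME_TOKEN = re.compile(r"(?:^|-)lifetime-(\d+)(?:-|$)")


def with_lifetime(base_username, minutes):
    """Append a lifetime-<minutes> token to request a sticky assignment."""
    if minutes <= 0:
        raise ValueError("lifetime must be a positive number of minutes")
    if _LIFETIME_TOKEN.search(base_username):
        raise ValueError("username already carries a lifetime token")
    return f"{base_username}-lifetime-{minutes}"


def sticky_minutes(username):
    """Return the lifetime in minutes, or None when the username is rotating."""
    match = _LIFETIME_TOKEN.search(username)
    return int(match.group(1)) if match else None


rotating_user = "rt97db6958d9-plan-volume-country-us"
sticky_user = with_lifetime(rotating_user, 30)
print(sticky_user, sticky_minutes(sticky_user), sticky_minutes(rotating_user))
```

Reading the token back is useful for the failure pattern in the table: the collector's own session timer can be compared against sticky_minutes rather than a hard-coded value.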

A fixed test set isolates one variable at a time

Four URLs are enough for a session mode comparison: one simple page, one listing page, one detail page, one known-missing page. Run the same four across rotating and sticky with the same parser version and the same worker shape. If the session mode and the parser change at the same time, the result is ambiguous.
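The comparison can be laid out as a small run matrix. This is a sketch under stated assumptions: the URLs are placeholders on example.com, and the parser_version and worker_shape labels are hypothetical values a collector might record.

```python
from itertools import product

# Placeholder URLs; substitute four pages from the permitted target.
TEST_URLS = {
    "simple": "https://example.com/about",
    "listing": "https://example.com/search?p=1",
    "detail": "https://example.com/item/1",
    "missing": "https://example.com/item/does-not-exist",
}
SESSION_MODES = ("rotating", "sticky")


def build_runs(parser_version, worker_shape):
    """Cross every test URL with every session mode, holding the rest fixed."""
    return [
        {
            "page_kind": kind,
            "url": url,
            "session_mode": mode,
            "parser_version": parser_version,
            "worker_shape": worker_shape,
        }
        for (kind, url), mode in product(TEST_URLS.items(), SESSION_MODES)
    ]


runs = build_runs(parser_version="v1", worker_shape="4x1")
for run in runs:
    print(run["session_mode"], run["page_kind"], run["url"])
```

Because parser_version and worker_shape are arguments to one call, a run set cannot mix two parser versions by accident.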

Protocol belongs in the same test. HTTP, HTTPS, and SOCKS5 are available through the pool. SOCKS5, defined in RFC 1928, sits as a shim between the application layer and the transport layer, so it relays the connection without interpreting the HTTP inside it. The client library still matters, since a stack that behaves cleanly over HTTPS can fail differently under SOCKS5. Confirm the library behavior before blaming the pool.

Blocked waste domains reduce noise in the byte count

Analytics endpoints, ad pixels, tracking calls, preview media, and accidental CDN side-loads inflate byte counts and latency without contributing to scrape output. Block them at the request level before comparing plans. Blocked rows should show a blocked outcome and zero bytes in the usage log.

One caution: some targets embed real JSON inside paths that look like CDN assets. If blocking changes accepted field count, remove the rule for that path.
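A request-level filter along those lines might look like the sketch below. The blocked host suffixes and the allowed path prefix are invented examples, and should_block is a hypothetical helper; the real lists come from looking at the usage log.

```python
from urllib.parse import urlsplit

# Hypothetical waste domains; build the real list from the usage log.
BLOCKED_HOST_SUFFIXES = ("analytics.example", "ads.example", "cdn.example")

# Paths that look like CDN assets but carry data the parser needs.
ALLOWED_PATH_PREFIXES = ("/assets/data/",)


def should_block(url):
    """Return True when a request goes to a waste domain and is not exempted."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_blocked_host = any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOCKED_HOST_SUFFIXES
    )
    if not on_blocked_host:
        return False
    # The exemption wins: removing a rule must not cost accepted fields.
    return not any(parts.path.startswith(p) for p in ALLOWED_PATH_PREFIXES)


print(should_block("https://px.ads.example/pixel.gif"))
print(should_block("https://cdn.example/assets/data/listing-42.json"))
```

In a Playwright-based collector, a predicate like this could sit inside a page.route handler that calls route.abort() for blocked URLs and route.continue_() for the rest. Comparing accepted field counts with and without each rule is what decides whether an exemption is needed.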

Plan choice follows accepted rows per GB, not plan name

Volume Residential is $0.89/GB. Premium Residential is $5.00/GB. Premium adds ISP, ASN, and adblock routing on top of country, region, and city. The right question is which plan delivers more accepted rows per GB after retries, duplicates, CAPTCHA pages, and thin 200s are counted, not which plan sounds stronger.

Export the usage log by host from the Proxynade dashboard and put it next to the scraper's accepted-row count. Sometimes Volume produces the same accepted records for a fifth of the cost, and the answer is to leave the fancier route alone.
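The arithmetic is short enough to sketch. The two prices are the ones quoted above; the row and byte counts are invented for illustration, and treating a billed gigabyte as 10^9 bytes is an assumption to check against the invoice.

```python
BYTES_PER_GB = 1_000_000_000  # assumption: decimal gigabytes

PRICE_PER_GB = {"volume": 0.89, "premium": 5.00}


def accepted_rows_per_gb(accepted_rows, billed_bytes):
    """Accepted records per billed gigabyte for one job on one plan."""
    if billed_bytes <= 0:
        raise ValueError("billed_bytes must be positive")
    return accepted_rows / (billed_bytes / BYTES_PER_GB)


def cost_per_thousand_rows(price_per_gb, rows_per_gb):
    """Dollars spent per 1,000 accepted records at the given yield."""
    if rows_per_gb <= 0:
        raise ValueError("rows_per_gb must be positive")
    return price_per_gb / rows_per_gb * 1000


# Invented totals: accepted rows after retries, duplicates, and thin 200s.
jobs = {
    "volume": {"accepted_rows": 1200, "billed_bytes": 2_000_000_000},
    "premium": {"accepted_rows": 1500, "billed_bytes": 2_000_000_000},
}

for plan, job in jobs.items():
    yield_per_gb = accepted_rows_per_gb(job["accepted_rows"], job["billed_bytes"])
    cost = cost_per_thousand_rows(PRICE_PER_GB[plan], yield_per_gb)
    print(f"{plan}: {yield_per_gb:.0f} rows/GB, ${cost:.2f} per 1,000 rows")
```

With these made-up numbers Premium yields more rows per GB but still costs more per accepted row, which is the case where leaving the fancier route alone is the right call.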

Session quality FAQ

What is session quality in scraping? A session is good when it returns the expected page from the expected host within the collector's timeout, with required fields present. A TCP connect or a 200 status is not enough if the page is thin, challenged, redirected, or missing fields.

When should I use sticky sessions instead of rotating? Use sticky for repeated checks on one target and for short flows where the target should see the same IP across requests. Rotating works for broad independent URL lists, and multi-step flows where state must persist need hard sessions. The difference comes down to state consistency, not speed.

How do I measure accepted rows per GB? Export the proxy usage log as CSV and join it to the scraper's output. Divide accepted record count by billed gigabytes after subtracting blocked-domain bytes. That ratio lets you compare session modes and plans on the same job.

What does a 200 with grid=0 mean? The proxy connected and the target responded, but the page returned no listing data. That is usually a shell page, a CAPTCHA body, or a fallback page the target serves for sessions it has flagged. The session quality problem is real even though the status code looks clean.

When is Premium Residential worth the cost over Volume? When accepted rows per GB after retries and duplicate filtering is higher on Premium than on Volume by enough to cover the price gap, which is about 5.6x at $5.00 versus $0.89 per GB. If the ratio is the same, Volume at $0.89/GB is the correct answer. Run the comparison on the same target before switching.

Checks before calling a session good

Status code and accepted field count are both logged per request.
Duplicate row rate is measured, not assumed zero.
Blocking rules are tested against required field coverage.
Session mode matches the flow type: rotating for broad, sticky for repeated, hard for stateful.
Plan choice is validated by accepted rows per GB on the actual target, not by plan tier.
