Collect Public Data with Measured Proxy Usage: Cost per Record

The app counter and the proxy meter count different things

The collector saved enough rows and the parser did not crash, so nobody notices a problem until the proxy usage looks too high for what landed in storage. Most of the waste is redirects, soft blocks, challenge pages, auth misses, and retries that the parser discards. The number that matters after a run is not "requests sent"; it is what the proxy traffic cost for the records that actually landed.

Start with one route. If the result is bad, scaling only hides which part failed and makes the provider meter harder to explain.

Apps undercount because they measure what the app cared about: saved rows, parsed bodies, successful responses. The proxy meter counts what the provider carried: redirects, retries, block pages, failed attempts, auth misses, challenge bodies, edge bytes. A crawler can report 50,000 saved rows while the Usage Logs show that a large share of the bytes never became usable data.
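
One way to see that gap is to split the metered bytes by whether the collector kept the response. The following is a minimal sketch, assuming the usage export has `outcome` and `bytes` columns and that an outcome of `ok` marks a response the parser kept. The outcome values and the sample rows are hypothetical, not the provider's actual vocabulary.

```python
from collections import defaultdict

# Hypothetical usage rows; real column names and outcome values depend on the export.
usage_rows = [
    {"outcome": "ok", "status": 200, "bytes": 48_000},
    {"outcome": "redirect", "status": 301, "bytes": 2_000},
    {"outcome": "blocked", "status": 403, "bytes": 30_000},
    {"outcome": "ok", "status": 200, "bytes": 52_000},
    {"outcome": "retry", "status": 429, "bytes": 18_000},
]

KEPT_OUTCOMES = {"ok"}  # assumption: only these outcomes turn into saved rows


def split_bytes(rows):
    """Return (kept_bytes, discarded_bytes, bytes_by_outcome)."""
    by_outcome = defaultdict(int)
    for row in rows:
        by_outcome[row["outcome"]] += row["bytes"]
    kept = sum(b for o, b in by_outcome.items() if o in KEPT_OUTCOMES)
    total = sum(by_outcome.values())
    return kept, total - kept, dict(by_outcome)


kept, discarded, by_outcome = split_bytes(usage_rows)
share = discarded / (kept + discarded)
print(f"kept={kept} discarded={discarded} discarded_share={share:.0%}")
```

The app only ever sees the `kept` side; the meter bills for both.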

Join two exports, not a new dashboard project

Export two files and put them next to each other. The collector export needs run_id, host, started_at, finished_at, and saved_records. The Usage Logs export from the dashboard gives the proxy side: timestamp, host, plan, protocol, outcome, status, latency, and provider-metered bytes.

The dashboard does not need to know your run_id. If the collector has a clean time window, host plus time is enough to join the two sides and surface expensive domains. Filter the usage CSV to the run window first — joining a full month of proxy rows is unnecessary.

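-- One row per run and host: proxy bytes metered inside the run window.
-- Assumes collector_runs holds one row per (run_id, host).
select
  c.run_id,
  c.host,
  sum(u.bytes) as proxy_bytes,
  -- saved_records repeats on every joined usage row, so take it once
  -- instead of summing it (summing would multiply it by the row count).
  max(c.saved_records) as saved_records
from collector_runs c
join usage_export u
  on u.host = c.host
 and u.timestamp between c.started_at and c.finished_at
-- Limit the usage side to the run window before aggregating.
where u.timestamp >= '2026-04-26T10:00:00Z'
  and u.timestamp <  '2026-04-26T10:30:00Z'
group by c.run_id, c.host;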
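
The same host-plus-time join can be sketched without a database, working straight from the two exports. This is an illustration under assumptions: the rows are shown as already-loaded dicts (for example via `csv.DictReader` with `bytes` converted to int), the column names follow the lists above, and the sample values are made up.

```python
from datetime import datetime


def parse_ts(value):
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Hypothetical collector export: one row per run and host.
collector_runs = [
    {"run_id": "catalog-0426-a", "host": "example.com",
     "started_at": "2026-04-26T10:00:00Z",
     "finished_at": "2026-04-26T10:30:00Z",
     "saved_records": 80},
]

# Hypothetical usage export rows.
usage_export = [
    {"timestamp": "2026-04-26T10:05:00Z", "host": "example.com", "bytes": 612_000},
    {"timestamp": "2026-04-26T10:20:00Z", "host": "example.com", "bytes": 400_000},
    {"timestamp": "2026-04-26T11:00:00Z", "host": "example.com", "bytes": 900_000},   # outside the window
    {"timestamp": "2026-04-26T10:10:00Z", "host": "other.example", "bytes": 100_000},  # different host
]


def join_runs(runs, usage):
    """Sum proxy bytes per run, matching on host and a half-open time window."""
    results = []
    for run in runs:
        start = parse_ts(run["started_at"])
        end = parse_ts(run["finished_at"])
        proxy_bytes = sum(
            row["bytes"]
            for row in usage
            if row["host"] == run["host"] and start <= parse_ts(row["timestamp"]) < end
        )
        results.append({
            "run_id": run["run_id"],
            "host": run["host"],
            "proxy_bytes": proxy_bytes,
            "saved_records": run["saved_records"],  # read once per run, never summed
        })
    return results


joined = join_runs(collector_runs, usage_export)
print(joined)
```

Reading `saved_records` once per run, rather than adding it up across matching usage rows, keeps a run with many proxy rows from inflating its own record count.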

A row worth stopping on: run=catalog-0426-a host=example.com plan=Volume status=403 bytes=612KB saved=0. One row is noise. Thousands of them are a billing problem: a 403 Forbidden response still costs bytes even though no usable record came back. That same review can show when Premium Residential is worth testing for a difficult host, when Datacenter is fine for a plain public endpoint, or when the fix is just cutting retry depth.
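
To tell one noisy row from thousands, roll the usage rows up by host and status and sort by bytes. A rough sketch, assuming `host`, `status`, and `bytes` columns; the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical usage rows already filtered to the run window.
usage_rows = [
    {"host": "example.com", "status": 403, "bytes": 612_000},
    {"host": "example.com", "status": 403, "bytes": 598_000},
    {"host": "example.com", "status": 200, "bytes": 45_000},
    {"host": "other.example", "status": 200, "bytes": 30_000},
]


def rollup(rows):
    """Return (host, status, row_count, total_bytes) tuples, largest byte total first."""
    counts = Counter()
    total_bytes = Counter()
    for row in rows:
        key = (row["host"], row["status"])
        counts[key] += 1
        total_bytes[key] += row["bytes"]
    return [
        (host, status, counts[(host, status)], nbytes)
        for (host, status), nbytes in total_bytes.most_common()
    ]


report = rollup(usage_rows)
for host, status, n, nbytes in report:
    print(f"{host} status={status} rows={n} bytes={nbytes}")
```

A non-2xx status at the top of this list, with a high row count, is the pattern worth investigating before changing plans.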

Price per host before arguing about the run total

Volume Residential is $0.89/GB. If one host burns 1.2 GB to save 80 records, that host costs about 1.3 cents per kept record (1.2 GB × $0.89 ≈ $1.07, divided by 80). Another host burns 300 MB and saves 600 records, which is about 0.04 cents per record, roughly one-thirtieth of that. Same plan, same collector, very different economics. Domain rollups make that difference visible before it buries itself in a monthly total.
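
The per-host arithmetic is simple enough to script. A minimal sketch, assuming decimal gigabytes (1 GB = 1,000,000,000 bytes; the provider's meter may define it differently) and using hypothetical host names for the two cases above.

```python
PRICE_PER_GB = 0.89           # Volume Residential price quoted above
BYTES_PER_GB = 1_000_000_000  # assumption: decimal GB


def cost_per_record(proxy_bytes, saved_records, price_per_gb=PRICE_PER_GB):
    """Dollars of proxy traffic per kept record, or None when nothing was saved."""
    if saved_records == 0:
        return None  # all spend, no records: report it separately
    return proxy_bytes / BYTES_PER_GB * price_per_gb / saved_records


# Hypothetical host names; byte and record counts are the two cases from the text.
hosts = {
    "host-a.example": (1_200_000_000, 80),
    "host-b.example": (300_000_000, 600),
}

costs = {host: cost_per_record(b, n) for host, (b, n) in hosts.items()}
for host, dollars in costs.items():
    print(f"{host}: {dollars * 100:.3f} cents per kept record")
```

Returning `None` for zero saved records keeps hosts that produced nothing out of the averages, so they can be listed on their own as pure spend.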

Run this audit when a host starts eating balance, when a retry change looks innocent but the meter disagrees, or when the app reports a cheap job and the provider says otherwise. The fix is often moving one host to Premium Residential, moving another to Datacenter, or doing the least exciting thing: lowering retries and testing again.
