Log the row before you write any alert logic
This came out of a boring bug. I spent most of an afternoon chasing what looked like a massive competitor discount in Spain. The page returned 200, the product title was there, the stock badge was there, and the price block was empty. The first parser treated that blank as a zero-price hit. The alert looked urgent until we opened the saved HTML.
The accepted record now keeps: current price, previous price, currency, region, SKU or product ID, seller, timestamp, source URL, and crawl status. If a Spanish URL silently redirects to a US page, the row says that. Yesterday's good EUR price does not get overwritten by whatever the redirect returned.
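As a rough sketch, the accepted record could be modeled like this. The field names are my own shorthand for the list above, not a fixed schema, and `Decimal` is an assumption to keep prices free of float rounding. The example values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PriceRow:
    """One accepted observation. Field names are illustrative."""
    product_id: str                    # SKU or product ID
    seller: str
    region: str                        # region requested, e.g. "ES"
    currency: str                      # currency expected for that region
    current_price: Optional[Decimal]   # None when the fetch gave no usable price
    previous_price: Optional[Decimal]  # last good price, carried forward
    source_url: str
    crawl_status: str                  # e.g. "ok", "empty_price", "redirect_geo"
    fetched_at: datetime


# Hypothetical example: an ES URL that bounced to a US page.
# The row records the redirect and keeps yesterday's EUR price untouched.
row = PriceRow(
    product_id="SKU-123",
    seller="example-seller",
    region="ES",
    currency="EUR",
    current_price=None,
    previous_price=Decimal("19.99"),
    source_url="https://shop.example/es/p/123",
    crawl_status="redirect_geo",
    fetched_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```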
Discovery and refresh are different jobs
One worker trying to do both does both badly. Discovery wanders through product URLs, seller pages, variants, and competitor coverage. Refresh visits known URLs and should be boring. Mixing them makes the crawler bursty and the bandwidth bill harder to audit, because half the requests are not price checks.
The proxy line comes from the dashboard
Pull it fresh each time rather than copying from an old Slack paste. The connection string follows the standard Python Requests proxy URL format.
http://USERNAME:PASSWORD@proxynade.net:2555
The username carries routing options. A Volume Residential request targeting Spain looks like rt97db6958d9-plan-volume-country-es-lifetime-30: the base token, the plan (volume, premium, or datacenter), an optional lowercase country code, and an optional rotation window in minutes. Datacenter lines omit the lifetime token.
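A small helper can assemble that username so nobody hand-edits it. This is a sketch inferred from the single example above: the token order and the `plan`/`country`/`lifetime` keys are taken from that example, and `build_username` and `proxies_for` are hypothetical names. The returned dictionary is the shape the Requests `proxies=` argument expects.

```python
from typing import Optional
from urllib.parse import quote

PLANS = {"volume", "premium", "datacenter"}
PROXY_HOST = "proxynade.net"
PROXY_PORT = 2555


def build_username(
    base_token: str,
    plan: str,
    country: Optional[str] = None,
    lifetime_minutes: Optional[int] = None,
) -> str:
    """Join routing options in the order shown in the example username."""
    if plan not in PLANS:
        raise ValueError(f"unknown plan: {plan!r}")
    parts = [base_token, "plan", plan]
    if country:
        parts += ["country", country.lower()]
    # Datacenter lines omit the lifetime token.
    if lifetime_minutes is not None and plan != "datacenter":
        parts += ["lifetime", str(lifetime_minutes)]
    return "-".join(parts)


def proxies_for(username: str, password: str) -> dict:
    """Build a proxies mapping; credentials are percent-encoded for the URL."""
    user = quote(username, safe="")
    pw = quote(password, safe="")
    url = f"http://{user}:{pw}@{PROXY_HOST}:{PROXY_PORT}"
    return {"http": url, "https": url}
```

With Requests installed, the result would be passed as `requests.get(url, proxies=proxies_for(username, password), timeout=30)`.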
For each fetch, log target host, plan, protocol, status, latency, and bytes. When one currency check pass showed ES pages bouncing to USD and one 200 response was only 14 KB, the logs made the problem obvious without re-running the crawl. The app counted 31 MB for that run; the dashboard showed 39 MB, about 26% more. That gap was enough to make cost-per-row look different from what the spreadsheet said.
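One way to keep those fields together is a single JSON line per fetch, plus a quick check of the app counter against the dashboard. This is a minimal sketch: the key names and both function names are assumptions, not the format of any particular logger.

```python
import json
from urllib.parse import urlsplit


def fetch_log_line(
    url: str, plan: str, protocol: str, status: int, latency_ms: float, body_bytes: int
) -> str:
    """Serialize the per-fetch fields as one JSON line."""
    record = {
        "host": urlsplit(url).hostname,
        "plan": plan,
        "protocol": protocol,
        "status": status,
        "latency_ms": latency_ms,
        "bytes": body_bytes,
    }
    return json.dumps(record, sort_keys=True)


def byte_gap_pct(app_mb: float, dashboard_mb: float) -> float:
    """How much more the dashboard counted than the app, in percent."""
    return (dashboard_mb - app_mb) / app_mb * 100


# The run described above: 31 MB in the app, 39 MB on the dashboard.
gap = byte_gap_pct(31, 39)
```

A suspiciously small 200 response, like the 14 KB one, stands out immediately when `bytes` sits next to `status` on the same line.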
Crawl status handles empty and wrong-currency responses
Empty price blocks, wrong currency, and geo mismatches get a distinct crawl status rather than overwriting the last good price. The previous value stays in place until a clean fetch replaces it.
| Status | Condition | Action |
|---|---|---|
| ok | Price and currency parsed, geo matches | Write row |
| empty_price | 200 but price block absent or blank | Keep previous, flag for retry |
| currency_mismatch | EUR requested, USD returned | Keep previous, check geo route |
| redirect_geo | URL resolved to different region | Keep previous, fix username country code |
| blocked | 403, CAPTCHA page, or near-empty body | Keep previous, step up proxy plan |
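
The table can be read as a short classifier. The sketch below is one possible ordering of the checks, not the only sensible one. `FetchResult`, `classify`, `accepted_price`, and the 2,000-byte "near-empty" threshold are all assumptions for illustration, and HTTP errors other than 403 are outside the table and not handled here.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

MIN_BODY_BYTES = 2_000  # assumed cutoff for a "near-empty" body; tune per site


@dataclass
class FetchResult:
    http_status: int
    body_bytes: int
    requested_region: str
    resolved_region: str
    expected_currency: str
    parsed_currency: Optional[str]
    parsed_price: Optional[Decimal]   # None when the price block is absent or blank
    captcha: bool = False


def classify(r: FetchResult) -> str:
    """Map a fetch to one of the crawl statuses in the table."""
    if r.http_status == 403 or r.captcha or r.body_bytes < MIN_BODY_BYTES:
        return "blocked"
    if r.resolved_region != r.requested_region:
        return "redirect_geo"
    if r.parsed_price is None:
        return "empty_price"
    if r.parsed_currency != r.expected_currency:
        return "currency_mismatch"
    return "ok"


def accepted_price(previous: Optional[Decimal], r: FetchResult) -> Optional[Decimal]:
    """Only a clean fetch replaces the stored price; anything else keeps it."""
    return r.parsed_price if classify(r) == "ok" else previous
```

A blank price block never reaches the row as zero: `parsed_price` is `None`, the status is `empty_price`, and the previous value survives.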
Proxy plan selection starts with the cheapest route the target tolerates
Datacenter proxies work fine for public product pages on sites that do not filter by IP reputation. Volume Residential ($0.89/GB) is the next step when the target needs to see a residential IP. Premium Residential ($5.00/GB) is justified only when the cleaner route eliminates enough bad rows to offset the cost difference. Calculate cost-per-accepted-row, not cost-per-request.
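The comparison is simple arithmetic. In the sketch below the per-GB prices come from the paragraph above, while the 10 GB run and the accepted-row counts are invented purely to show the calculation; the function names are hypothetical.

```python
VOLUME_USD_PER_GB = 0.89
PREMIUM_USD_PER_GB = 5.00


def cost_per_accepted_row(gb_transferred: float, usd_per_gb: float, accepted_rows: int) -> float:
    """Bandwidth cost divided by rows that passed validation."""
    if accepted_rows <= 0:
        return float("inf")  # paid for bytes, got nothing usable
    return gb_transferred * usd_per_gb / accepted_rows


def break_even_row_ratio(cheap_usd_per_gb: float, pricey_usd_per_gb: float) -> float:
    """At equal bytes, how many times more accepted rows the pricier plan must yield."""
    return pricey_usd_per_gb / cheap_usd_per_gb


# Hypothetical run: the same 10 GB on both plans, different acceptance.
volume = cost_per_accepted_row(10, VOLUME_USD_PER_GB, accepted_rows=6_000)
premium = cost_per_accepted_row(10, PREMIUM_USD_PER_GB, accepted_rows=9_000)
ratio = break_even_row_ratio(VOLUME_USD_PER_GB, PREMIUM_USD_PER_GB)
```

In this made-up run Premium accepts 50% more rows and still costs more per accepted row. At equal bytes it would need roughly 5.6 times the accepted rows to break even, unless it also cuts the retries and wasted transfer behind each row.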
1,000 products tracked at each of 10 competitors is 10,000 URLs. Refreshed every 30 minutes, which is 48 times a day, that is 480,000 attempts per day before retries. Then a browser renderer adds scripts, images, recommendations, trackers, and media. The napkin number stops being useful fast.
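The napkin number is worth writing down as code so the inputs can be changed. The attempt count follows directly from the figures above; the 200 KB average transfer per attempt is an assumed placeholder, not a measurement.

```python
def attempts_per_day(products: int, competitors: int, refresh_minutes: int) -> int:
    """URLs to refresh multiplied by refreshes per day, before retries."""
    refreshes_per_day = (24 * 60) // refresh_minutes
    return products * competitors * refreshes_per_day


def daily_gb(attempts: int, avg_kb_per_attempt: float) -> float:
    """Rough daily transfer in decimal GB (1 GB = 1,000,000 KB)."""
    return attempts * avg_kb_per_attempt / 1_000_000


attempts = attempts_per_day(products=1_000, competitors=10, refresh_minutes=30)
estimate_gb = daily_gb(attempts, avg_kb_per_attempt=200)  # 200 KB is an assumption
```

Swap in the average bytes per attempt from the fetch logs and the estimate becomes something to check against the dashboard.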
Block analytics, media, font, recommendation, video, and ad hosts when the parser does not need them. Then open a few saved pages manually. A page that screenshots cleanly can still produce a wrong row if the asset block cut something the parser depends on.
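A block decision can be kept as a plain function so it is easy to test apart from the browser. This is a sketch: the host suffixes are placeholders to replace with whatever the saved pages actually load, and the resource type strings are assumed to be the ones the renderer reports (browser automation tools such as Playwright use values like "media" and "font").

```python
from urllib.parse import urlsplit

# Resource types the parser does not need (assumed naming).
BLOCKED_RESOURCE_TYPES = {"media", "font"}

# Placeholder suffixes for analytics, recommendation, video, and ad hosts.
BLOCKED_HOST_SUFFIXES = (
    "analytics.example",
    "recs.example",
    "video.example",
    "ads.example",
)


def should_block(url: str, resource_type: str) -> bool:
    """Return True when a request can be dropped without touching the price parser."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    host = (urlsplit(url).hostname or "").lower()
    # Match the suffix itself or any subdomain of it, never a lookalike host.
    return any(host == s or host.endswith("." + s) for s in BLOCKED_HOST_SUFFIXES)
```

In a Playwright-based renderer this would typically sit inside a `page.route("**/*", handler)` callback that calls `route.abort()` or `route.continue_()`. After changing the lists, re-open a few saved pages and confirm the price block still parses.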
The app counter is not the bill
Redirects, failed pages, blocked browser loads, and discarded assets all transfer bytes. The dashboard network logs show host, outcome, latency, and byte totals; usage logs export as CSV. Track cost per accepted SKU row. Attempt counts are easy to produce and just as easy to be misled by.
Production checks
- Log currency and region on every row, not just price.
- Mark crawl status — never silently overwrite on empty or wrong-currency responses.
- Separate discovery workers from refresh workers.
- Pull proxy credentials from the dashboard, not shared message history.
- Compare app-level byte counts against the dashboard before scaling up.
- Choose proxy plan by cost-per-accepted-row, not cost-per-GB alone.