Scrapy Proxy Rotation: Gateway Setup, Middleware, 407 Debugging

Q: Where does the proxy URL go in Scrapy?

Set request.meta["proxy"] inside a downloader middleware. Keep the URL in an environment variable, not in spider code.

Gateway rotation, not list rotation

Provider-side rotation means the spider talks to one fixed gateway URL and the provider changes the exit IP behind it. That is different from keeping a list of proxy hosts in a file and having Scrapy pick one before each request. With a gateway, the proxy URL in request.meta["proxy"] never changes; the exit changes on the provider side.

On a Proxynade pool, the exit behavior lives in the username. A rotating line uses an expanded username such as rt97db6958d9-plan-volume-country-us: base username, required plan token (volume, premium, or datacenter), optional country code. To hold one exit across a short flow, add lifetime-<minutes> to the username, for example rt97db6958d9-plan-volume-country-us-lifetime-10. Datacenter takes a lifetime token only when the session is sticky.
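The username layout above can be assembled in code rather than typed by hand. This is a minimal sketch: build_proxy_username is a hypothetical helper, not part of any provider SDK, and the token order is taken only from the two example usernames in the text.

```python
# Hypothetical helper; the token layout mirrors the username examples in the text.
VALID_PLANS = {"volume", "premium", "datacenter"}


def build_proxy_username(base, plan, country=None, lifetime_minutes=None):
    """Compose an expanded gateway username from its parts."""
    if plan not in VALID_PLANS:
        raise ValueError(f"unknown plan token: {plan!r}")
    parts = [base, "plan", plan]
    if country:
        parts += ["country", country.lower()]
    if lifetime_minutes is not None:
        if lifetime_minutes <= 0:
            raise ValueError("lifetime must be a positive number of minutes")
        # A lifetime token is what turns a rotating line into a sticky one.
        parts += ["lifetime", str(lifetime_minutes)]
    return "-".join(parts)


rotating = build_proxy_username("rt97db6958d9", "volume", country="us")
sticky = build_proxy_username("rt97db6958d9", "volume", country="us", lifetime_minutes=10)
print(rotating)  # rt97db6958d9-plan-volume-country-us
print(sticky)    # rt97db6958d9-plan-volume-country-us-lifetime-10
```

Building the username in one place makes it harder to ship a lifetime token by accident when a rotating line was intended.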

The most common first bug is using a sticky endpoint when you expected a rotating one. The crawl starts without error, the IP echo comes back the same five times in a row, and the problem is the URL in the environment variable, not Scrapy.

Attach the proxy in middleware, not in the spider

Keep credentials out of spider code. Put the proxy URL in an environment variable or deployment config, then have a small downloader middleware copy it into request.meta["proxy"]. Scrapy's built-in HttpProxyMiddleware picks the value up from there and routes the request through the proxy. That lets different jobs use different routes without touching spider logic.

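# settings.py
import os

# Read the gateway URL from the environment; a missing variable raises
# KeyError at startup instead of silently crawling without a proxy.
SCRAPY_PROXY_URL = os.environ["SCRAPY_PROXY_URL"]
DOWNLOADER_MIDDLEWARES = {
    # 350 runs before the built-in HttpProxyMiddleware (750), which reads
    # request.meta["proxy"].
    "myproject.middlewares.ProxyMiddleware": 350,
}
DOWNLOAD_TIMEOUT = 30  # seconds
# Replaces the default list; 403 and 407 are retried in addition to 429 and 5xx.
RETRY_HTTP_CODES = [403, 407, 429, 500, 502, 503, 504]

# middlewares.py
class ProxyMiddleware:
    """Attach the gateway proxy URL from settings to every request."""

    def process_request(self, request, spider):
        proxy_url = spider.settings.get("SCRAPY_PROXY_URL")
        if proxy_url:
            request.meta["proxy"] = proxy_url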

The proxy URL for a Proxynade gateway is http://proxynade.net:2555. Scrapy takes credentials embedded in the URL, in the standard http://user:pass@host:port form; the built-in HttpProxyMiddleware strips them out of the URL and sends them as a Proxy-Authorization header. Percent-encode any reserved characters in the username or password (such as @, :, or /) so the URL still parses.
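A password containing @, :, or / breaks the user:pass@host:port form unless it is percent-encoded first. The sketch below shows one way to build the URL and to mask the password before logging it. Both function names are hypothetical, and the password is made up for illustration.

```python
from urllib.parse import quote, urlsplit


def build_proxy_url(username, password, host="proxynade.net", port=2555):
    """Return http://user:pass@host:port with credentials percent-encoded."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


def redact_proxy_url(proxy_url):
    """Return the proxy URL with the password masked, for safe logging."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        return proxy_url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{parts.username}:***@{host}"


# Made-up credentials, for illustration only.
url = build_proxy_url("rt97db6958d9-plan-volume-country-us", "p@ss:word/1")
print(redact_proxy_url(url))
# http://rt97db6958d9-plan-volume-country-us:***@proxynade.net:2555
```

The encoded URL is what would go into the environment variable; the redacted form is the one to print in logs.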

An http:// proxy URL is still correct for HTTPS targets. Scrapy sends a CONNECT request to the proxy, which tunnels TLS to the target site. There is no separate HTTPS proxy. Switch to SOCKS only when your downloader setup explicitly supports it and you need DNS to resolve through the proxy; otherwise it adds one more moving part for no gain.

Separate 407 from target blocking

A 407 Proxy Authentication Required comes from the proxy, not the target. A 403 or an empty result page comes from the target. Keeping those two sources separate saves more debug time than anything else in this guide.

Result	Source	Next check
`407`	Proxy rejected credentials	Check the username, password, and account balance, and look for invisible whitespace.
Connection timeout	Proxy route or target	Try an IP echo endpoint before loading the real target.
`403`	Target refused the request	Proxy connected. Change headers, pacing, or request shape.
Same IP on every request	Sticky endpoint in use	Re-check the proxy URL in the environment variable.
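The table can be written down as a small triage function, which is handy in a debugging notebook or a log post-processor. This is a sketch only: diagnose and its parameters are hypothetical names, and the returned strings restate the table rows rather than anything Scrapy reports itself.

```python
def diagnose(status=None, timed_out=False, distinct_exit_ips=None):
    """Map an observed result to its likely source and the next check."""
    if status == 407:
        # The proxy answered, so the request never reached the target.
        return ("proxy", "check username, password, balance, hidden whitespace")
    if timed_out:
        return ("proxy route or target", "try an IP echo endpoint first")
    if status == 403:
        # The target answered, so the proxy connection itself worked.
        return ("target", "change headers, pacing, or request shape")
    if distinct_exit_ips == 1:
        return ("sticky endpoint", "re-check the proxy URL in the environment variable")
    return ("unclassified", "inspect the response body and logs")


print(diagnose(status=407))
print(diagnose(status=200, distinct_exit_ips=1))
```

The order of the checks matters: a 407 is tested first because it rules out the target entirely.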

The Scrapy log for a 407 retry looks like this:

[scrapy.core.engine] DEBUG: Crawled (407) <GET https://example.com/> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://example.com/> (failed 1 times): 407 Proxy Authentication Required

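If 407 appears on every retry, the credentials are wrong. Pass the same proxy URL to curl with its -x (--proxy) option and confirm the response code before touching the spider. With an https:// target the proxy rejects the CONNECT step itself, so depending on the Scrapy version the failure may surface as a tunnel error in the log rather than as a crawled 407 response.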
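Invisible whitespace is hard to spot by eye, so it helps to scan the credential string programmatically before blaming the provider. The following is a minimal sketch using only the standard library; find_hidden_characters is a hypothetical helper name.

```python
import unicodedata


def find_hidden_characters(value):
    """Return (index, name) pairs for whitespace or invisible characters."""
    hits = []
    for index, char in enumerate(value):
        category = unicodedata.category(char)
        # Zs/Zl/Zp are separators, Cc covers tab and newline,
        # Cf covers format characters such as zero-width space and BOM.
        if char.isspace() or category in {"Cc", "Cf", "Zs", "Zl", "Zp"}:
            name = unicodedata.name(char, f"U+{ord(char):04X}")
            hits.append((index, name))
    return hits


# A trailing newline is a typical result of copying from a dashboard or file.
print(find_hidden_characters("secret\n"))       # [(6, 'U+000A')]
print(find_hidden_characters("sec\u200bret"))   # [(3, 'ZERO WIDTH SPACE')]
```

An empty list means the string contains no whitespace or invisible characters; it says nothing about whether the credentials themselves are valid.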

Test order before scaling

Run three checks before setting concurrency above one. First, an IP echo endpoint such as httpbin.org/ip confirms the route is being used and the exit is changing. Second, the real target at low concurrency tells you whether the route survives the page you actually care about. Third, one request with a deliberately wrong password gives you a known 407 to compare against later.
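For the first check, the echo responses can be summarized rather than eyeballed. The sketch below assumes the echo body has the shape httpbin.org/ip returns, a JSON object with an "origin" field; the function names are hypothetical and the addresses come from a documentation-only range.

```python
import json
from collections import Counter


def parse_echo_ip(body):
    """Extract the exit IP from an httpbin-style body such as {"origin": "..."}."""
    return json.loads(body)["origin"]


def exit_rotation_report(observed_ips):
    """Summarize how many distinct exits a run of echo requests saw."""
    counts = Counter(observed_ips)
    total = len(observed_ips)
    return {
        "requests": total,
        "distinct_exits": len(counts),
        # One exit across several requests points at a sticky line.
        "looks_sticky": total > 1 and len(counts) == 1,
        "most_common": counts.most_common(1)[0] if counts else None,
    }


bodies = ['{"origin": "203.0.113.5"}'] * 5
report = exit_rotation_report([parse_echo_ip(b) for b in bodies])
print(report["looks_sticky"])  # True
```

A rotating line can still repeat an exit occasionally, so treat looks_sticky as a prompt to re-check the URL, not as proof.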

Name proxy settings by job and target, not by number. CATALOG_US_PROXY_URL is easier to debug than PROXY_URL_3 when bandwidth jumps and nobody remembers which spider owned that route.
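One way to enforce that naming is to derive the variable name from the job and target, so a spider cannot quietly fall back to a generic route. This is a sketch under the assumption that names follow the JOB_TARGET_PROXY_URL pattern of the example above; both helpers are hypothetical.

```python
import os


def proxy_env_name(job, target):
    """Build a variable name such as CATALOG_US_PROXY_URL."""
    return f"{job}_{target}_PROXY_URL".upper().replace("-", "_")


def proxy_url_for(job, target, env=os.environ):
    """Look up the proxy URL for one job and target, failing loudly if unset."""
    name = proxy_env_name(job, target)
    try:
        return env[name]
    except KeyError:
        raise KeyError(f"{name} is not set") from None


# A plain dict stands in for the process environment in this example.
fake_env = {"CATALOG_US_PROXY_URL": "http://proxynade.net:2555"}
print(proxy_env_name("catalog", "us"))           # CATALOG_US_PROXY_URL
print(proxy_url_for("catalog", "us", fake_env))  # http://proxynade.net:2555
```

Passing the environment as a parameter keeps the lookup easy to test without touching real credentials.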

Keep retries capped while tuning. A retry is another proxy request and another request against the target. If retries climb faster than accepted rows, stop and inspect response bodies before raising concurrency.
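The stop condition can be checked against two snapshots of Scrapy's stats. This sketch assumes that "accepted rows" corresponds to the item_scraped_count stat and that retries are read from retry/count; the function name and the sample numbers are illustrative only.

```python
# Conservative values for settings.py while tuning (assumed, adjust per job).
RETRY_TIMES = 1
CONCURRENT_REQUESTS = 1


def retries_outpacing_items(previous, current):
    """Return True when retries grew faster than scraped items between snapshots."""
    retry_delta = current.get("retry/count", 0) - previous.get("retry/count", 0)
    item_delta = (
        current.get("item_scraped_count", 0) - previous.get("item_scraped_count", 0)
    )
    return retry_delta > item_delta


# Illustrative snapshots, shaped like the dict returned by crawler.stats.get_stats().
earlier = {"retry/count": 4, "item_scraped_count": 40}
later = {"retry/count": 30, "item_scraped_count": 52}
print(retries_outpacing_items(earlier, later))  # True
```

When the function returns True, that is the cue to look at response bodies instead of raising concurrency.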

Provider bytes do not match Scrapy counters

Scrapy counters reflect what spider callbacks received. Provider usage logs count everything at the TCP level: redirect chains, retries, CONNECT overhead, failed responses, and response bodies that pipelines downloaded and discarded. The gap is normal, and it widens when retries are high or when item pipelines download large files.

The Proxynade dashboard network logs show host, outcome, latency, and byte totals per request. Usage logs export as CSV. If bytes per accepted item looks wrong, pull the CSV, filter on your spider's target host, and check how many rows are retries or non-200 outcomes before buying more bandwidth.
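The CSV check can be scripted. The export's real column names are not given here, so this sketch assumes columns called host, outcome, and bytes, with outcome holding an HTTP status code; rename them to match the actual file. The sample rows are invented.

```python
import csv
import io


def summarize_usage(csv_text, target_host):
    """Summarize rows for one host; column names are assumptions, see above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if row["host"] == target_host]
    non_200 = [row for row in rows if row["outcome"] != "200"]
    return {
        "rows": len(rows),
        "non_200_rows": len(non_200),
        "total_bytes": sum(int(row["bytes"]) for row in rows),
        "non_200_bytes": sum(int(row["bytes"]) for row in non_200),
    }


# Invented sample data in the assumed layout.
sample = """host,outcome,latency_ms,bytes
example.com,200,120,1000
example.com,403,90,200
example.com,407,15,50
other.test,200,100,9999
"""
summary = summarize_usage(sample, "example.com")
print(summary)
```

If non_200_bytes is a large share of total_bytes, the bandwidth is going to failed requests, and more bandwidth will not fix that.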

Start the real run at low concurrency: one target, one proxy route, Scrapy logs open next to the provider usage view. Once saved rows rise faster than retries and bytes stabilize, increase workers. If rotating mode keeps returning the same IP, re-check the proxy URL. If everything is 407, copy the URL again and look for bad credentials or invisible whitespace before changing the spider.

Rotating mode has limits

Rotating exits do not make Scrapy look like a browser. If a target cares about TLS fingerprints, user-agent headers, JavaScript execution, or cookie state, changing the exit IP will not help. Rotating mode fits public product pages, listing pages, documentation, search result pages, and other stateless read-only requests where each request can stand on its own.

Sticky mode, set by a fixed lifetime in the username, is for short flows where a few consecutive requests need the same exit because cookies or server-side session state tie the responses to one IP. If the target offers an official API, feed, sitemap, or partner export route, use that instead of either mode.

Scrapy proxy FAQ

Where does the proxy URL go in Scrapy? Set request.meta["proxy"] inside a downloader middleware. Keep the URL in an environment variable, not in spider code.

What does a 407 in Scrapy mean? A 407 comes from the proxy, not the target. Check the username, password, and account balance, and look for invisible whitespace in the credentials.

Why does rotating mode keep returning the same IP? The proxy URL likely points at a sticky endpoint. Check the URL in the environment variable against what the provider dashboard shows as the rotating line.

Do provider byte counts match Scrapy counters? No. Providers count redirects, retries, CONNECT overhead, and failed responses. Scrapy counters reflect only what spider callbacks received.

When should rotating mode be replaced with sticky mode? Use sticky mode when a short flow needs the same exit across several requests, such as a paginated session where server-side state ties responses to a single IP.

Production checks

Keep credentials out of spider code and CLI history.
Name proxy environment variables by job and target.
Confirm exit changes with an IP echo before the real run.
Cap retries while tuning; inspect bodies before raising concurrency.
Compare provider CSV export against Scrapy counters when bytes look wrong.
