Testing Methodology

v2.3.4

Every score on ProxyStats is generated by automated, reproducible tests run from our own infrastructure across multiple geographic regions. No provider pays for placement; rankings are based purely on performance data. See Limitations & Disclosures for what this benchmark can and cannot tell you.

Scoring Formula

Each provider receives a composite score from 0 to 100, calculated as a weighted sum of five independently measured dimensions. One of them, geo integrity, currently carries 0% weight, so only four terms appear in the formula. The weights below are a design choice that reflects "general purpose" scraping use. For use-case-specific rankings (Google SERP, Amazon, Social Media, etc.) with re-weighted formulas, see our /best/ pages.

score_v2.1 = core_network × 0.35 + session_reliability × 0.30
+ neutral_reach × 0.15 + target_reach × 0.20

v2.2 (2026-06-26): geo measurement now ships — controlled US + DE exits with vantage↔geo pairing, headline blended across regions. geo_integrity is measured (requested vs. observed country) but stays at 0% weight until trusted. Full methodology changelog below.
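
As a rough illustration of the formula above, the sketch below computes the v2.1 composite from four dimension scores and then blends per-region composites into one headline by plain averaging, as the v2.2 note describes. Only the weights come from the published formula; the function names and the example inputs are made up for illustration and are not measured data.

```python
from statistics import mean

# Weights from the published v2.1 formula; geo_integrity is measured but weighted 0.
WEIGHTS = {
    "core_network": 0.35,
    "session_reliability": 0.30,
    "neutral_reach": 0.15,
    "target_reach": 0.20,
}

def composite(dimensions):
    """Weighted sum of 0-100 dimension scores for one region."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)

def headline(per_region):
    """Plain average of per-region composites (no bonus for breadth)."""
    return mean(composite(d) for d in per_region.values())

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example = {
    "US": {"core_network": 80, "session_reliability": 90,
           "neutral_reach": 100, "target_reach": 70},
    "DE": {"core_network": 90, "session_reliability": 90,
           "neutral_reach": 100, "target_reach": 60},
}
print(round(composite(example["US"]), 2), round(headline(example), 2))
```

With these inputs the US region works out to 28 + 27 + 15 + 14 = 84, the DE region to 85.5, and the blended headline to 84.75.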

Core Network Performance

Plane A, weight 35%

Measures fundamental proxy connectivity: TCP connect time, TLS handshake duration, time to first byte (TTFB), and timeout rate. Probes run every 15 minutes. As of v2.3 (2026-06-29), the headline latency (TTFB) is measured to the NEAREST Cloudflare edge (the proxy's closest CDN on-ramp), so it isolates the proxy's own network quality and regionalizes per vantage instead of every proxy reaching one fixed server. This is the same clean network metric independent benchmarks like Proxyway use; good residential reads sub-second, mid-tier 0.8–2s. As of v2.1, this score uses ONLY Plane A (controlled-plane) data.

Metrics: connect_p50, connect_p95, tls_p50, tls_p95, ttfb_p50, ttfb_p95, timeout_rate, controlled-plane success_rate.
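
To make the p50/p95 and timeout_rate metrics concrete, here is a minimal sketch of how one timing series could be summarized. It assumes a timed-out probe is recorded as None and uses linear-interpolation percentiles from the standard library; the actual percentile method used by the benchmark is not stated, so treat that choice as an assumption.

```python
from statistics import quantiles
from typing import Optional

def summarize(samples_ms: list[Optional[float]]) -> dict[str, float]:
    """Summarize one timing metric; None marks a probe that timed out (assumed encoding)."""
    ok = [s for s in samples_ms if s is not None]
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(ok, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "timeout_rate": (len(samples_ms) - len(ok)) / len(samples_ms),
    }

# Hypothetical TTFB samples in milliseconds, one of which timed out.
stats = summarize([100.0, 200.0, 300.0, 400.0, 500.0, None])
print(stats)
```

The same helper would apply unchanged to connect, TLS, and TTFB series.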

Session Reliability

Plane A, weight 30%

Tests how well sticky sessions maintain the same exit IP over time. We open a session, make 2–4 requests over the session TTL (60s in current version; longer TTLs ship in Phase 43.7), and check if the IP rotates unexpectedly. High unexpected rotation indicates an unstable or overloaded IP pool.

Metrics: session_survival_rate, unexpected_rotation_rate. Heavy penalty (×3) for unexpected mid-session IP changes; score floors at 0 when rotation ≥ 33%. burst_failure_rate parameter retained but not measured until Phase 43.7.
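
One way to read the ×3 penalty and the floor at 33% is sketched below. The text does not say how session_survival_rate and the penalty are combined, so the subtraction used here is an assumption; it was chosen because it reproduces the stated floor (3 × 1/3 = 1, which cancels even a perfect survival rate).

```python
def session_reliability_score(session_survival_rate: float,
                              unexpected_rotation_rate: float,
                              burst_failure_rate: float = 0.0) -> float:
    """Hypothetical scoring sketch on a 0-100 scale.

    burst_failure_rate is accepted only to mirror the retained parameter;
    it is not measured yet and is ignored here.
    """
    penalty = 3.0 * unexpected_rotation_rate  # heavy (x3) rotation penalty
    raw = session_survival_rate - penalty     # assumed combination rule
    return 100.0 * max(0.0, min(1.0, raw))    # clamp to [0, 100]

print(session_reliability_score(0.95, 0.05))  # mild rotation
print(session_reliability_score(1.00, 0.40))  # rotation above one third
```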

Geo Integrity

Plane A, weight 0% (measured, not yet scored)

As of v2.2 (2026-06-26) this is MEASURED again, but deliberately kept at 0% weight. We pair each test vantage to a controlled exit country (US vantage → US exit, DE vantage → DE exit) and compare the requested exit country against the country actually observed at the exit IP. Current match rate is 100% on geo-paired providers. We keep it out of the composite until we trust it across more providers and a longer window — an unproven metric should not move rankings. Per-region performance (latency / success / session) is the practical payoff and already feeds the blended score.

Measured now: requested-vs-observed country match per region (US + DE). Planned: more exit countries (UK/PL/…), city & ASN consistency, then a non-zero composite weight once the data is trusted.
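
The requested-vs-observed comparison reduces to a simple ratio. The sketch below assumes each probe yields a pair of two-letter country codes; the function name and the sample pairs are illustrative only.

```python
def country_match_rate(probes: list[tuple[str, str]]) -> float:
    """Share of probes whose observed exit country equals the requested one.

    Each probe is a (requested_country, observed_country) pair of country codes.
    """
    if not probes:
        raise ValueError("no probes to score")
    matches = sum(1 for requested, observed in probes
                  if requested.upper() == observed.upper())
    return matches / len(probes)

# Hypothetical probes: three matches and one DE request that exited in NL.
sample = [("US", "US"), ("DE", "DE"), ("DE", "NL"), ("US", "us")]
print(country_match_rate(sample))
```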

Neutral Reachability

Plane B, weight 15%

Tests ability to reach neutral third-party services like httpbin.org and Cloudflare speed test. These targets don't actively block proxies, so failures here indicate fundamental connectivity issues rather than anti-bot detection. Weight raised from 10% to 15% in v2.1 (redistribution after geo removal).

Metrics: neutral_success_rate, neutral_avg_ttfb_ms. Bonus for TTFB < 500ms, penalty for > 2000ms.

Target Reachability

Plane C, weight 20%

Periodically samples high-defense targets (Google, Amazon) with strict safety rules. As of 2026-05-27, response bodies are scanned for CAPTCHA / challenge / interstitial markers; an HTTP 200 with a soft-block page is classified as challenge_or_interstitial, not success. Weight raised from 10% to 20% in v2.1, since this is the most user-relevant dimension for anti-bot-protected scraping.

Metrics: target_deny_rate, target_challenge_rate. Runs every 6 hours with 24h per-IP cooldowns. ~120 samples/month/provider. Body inspected for ~15 known anti-bot markers; only first 50 KB scanned; body content never stored or logged.
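
The soft-block rule can be sketched as a small classifier. The three markers below are hypothetical stand-ins (the real list of about 15 is not published here), and treating 403/429 as a deny is likewise an assumption; only the 50 KB scan limit and the challenge_or_interstitial label come from the text.

```python
SCAN_LIMIT = 50 * 1024  # only the first 50 KB of the body is inspected

# Hypothetical markers for illustration; not the benchmark's actual list.
MARKERS = (b"captcha", b"unusual traffic", b"verify you are human")

def classify(status: int, body: bytes) -> str:
    """Classify one Plane C response without storing or logging the body."""
    head = body[:SCAN_LIMIT].lower()
    if any(marker in head for marker in MARKERS):
        # An HTTP 200 carrying a challenge page is not a success.
        return "challenge_or_interstitial"
    if status in (403, 429):  # assumed deny statuses
        return "deny"
    if 200 <= status < 300:
        return "success"
    return "error"

print(classify(200, b"<html>Please solve the CAPTCHA to continue</html>"))
print(classify(200, b"<html>search results</html>"))
```

Note that a marker appearing after the first 50 KB goes unseen in this sketch, which is one concrete way a subtle variant could slip through.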

Star Rating Scale

Badges and provider pages show a 5-star rating derived from the trailing 30-day average composite score through the published bands below (linear between anchors). Our composite is a deliberately harsh scale — a top provider lands in the low 80s, not at 100 — so a naive score÷20 would misread excellent performance as mediocre. The bands recalibrate the display without ever changing provider order; the exact composite score is always shown alongside.

Composite	40	50	60	70	80	90	100
Stars	2.5	3.2	3.8	4.2	4.6	4.9	5.0

New providers: with under ~two weeks of data the rating is pulled toward a neutral 3.8★ (IMDb-style weighted rating, m=14 days) and converges to the earned value as history accumulates — so early noise never shows as a confident rating.
Dimension gate: ratings of 4.5★ and above additionally require every composite dimension (core network, session reliability, neutral and target reachability, 30-day averages) to clear 60/100 — one failing dimension caps the display at 4.4★.
Stars are display only: rankings always sort by the composite score, never by the star rating.
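
The three rules above (bands, newcomer shrinkage, dimension gate) can be sketched as follows. The anchor table, m=14, the neutral 3.8★, the 60/100 gate and the 4.4★ cap come from the text; clamping below 40, applying shrinkage before the gate, treating "clear" as ≥ 60, and rounding to one decimal are assumptions of this sketch.

```python
ANCHORS = [(40, 2.5), (50, 3.2), (60, 3.8), (70, 4.2), (80, 4.6), (90, 4.9), (100, 5.0)]

def band_stars(composite: float) -> float:
    """Piecewise-linear interpolation between the published anchors."""
    if composite <= ANCHORS[0][0]:
        return ANCHORS[0][1]  # assumption: clamp below the lowest anchor
    if composite >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= composite <= x1:
            return y0 + (y1 - y0) * (composite - x0) / (x1 - x0)

def shrunk_stars(stars: float, days: float, m: float = 14, neutral: float = 3.8) -> float:
    """IMDb-style weighted rating: pulled toward `neutral` while history is short."""
    return (days * stars + m * neutral) / (days + m)

def display_stars(composite: float, days: float, dimensions: list[float]) -> float:
    stars = shrunk_stars(band_stars(composite), days)
    if stars >= 4.5 and min(dimensions) < 60:
        stars = 4.4  # one failing dimension caps the display
    return round(stars, 1)

print(round(band_stars(82), 2))                     # between the 80 and 90 anchors
print(display_stars(95, 10_000, [90, 90, 55, 90]))  # gated by the 55
```

For a composite of 82 the band value is 4.6 + 0.3 × 0.2 = 4.66, compared with 4.1 under the old score÷20 rule.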

Three-Plane Testing Architecture

Unlike traditional benchmarks that hammer target sites directly, we use a layered approach that separates controlled testing from real-world reachability measurement.

Controlled

Every 15 minutes

Targets: Our own endpoints (/healthz, /reflect, /tiny, /download)

Purpose: Core latency, TLS timing, session tests, geo verification

Neutral

Every 15 minutes

Targets: httpbin.org, Cloudflare speed test

Purpose: External reachability without anti-bot interference

High-Defense

Every 6 hours

Targets: Google, Amazon

Purpose: Anti-bot bypass capability (safe sampling only)

Test Infrastructure

Vantage points	2 probe regions: EU-Central (Frankfurt) + US-East. Storage on a separate main server (no probe traffic).
Plane A frequency	Every 15 minutes (~96 cycles/day); probes write immediately to probe_runs table
Plane C frequency	Every 6 hours with 24h per-IP cooldown (stop-after-deny)
Composite recompute	Once daily via rollup task at 00:05 UTC, so dashboard values change once per day
Worker isolation	Dedicated Docker containers with egress-only networking
SSRF protection	DNS pre-resolution, CIDR blocklisting, network segmentation
Result integrity	HMAC-SHA256 on all numeric metrics
Data retention	90 days of granular probe data, daily rollup aggregation
Confidence levels	Low (<10 probes), Medium (10–50), High (>50 probes/day)
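
The confidence thresholds in the last row map directly onto a small function. This is a sketch only; the function name is illustrative, and assigning exactly 10 and exactly 50 probes to Medium follows the "10–50" range as written.

```python
def confidence_level(probes_per_day: int) -> str:
    """Map a daily probe count to the published confidence badge."""
    if probes_per_day < 10:
        return "Low"
    if probes_per_day <= 50:
        return "Medium"
    return "High"

# A full day of Plane A probes at one every 15 minutes is 96 cycles.
print(confidence_level(96))
```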

Data Pipeline

probe_runs → daily rollup → leaderboard

Raw probe results are stored individually as they arrive (every 15 minutes for Plane A/B, every 6 hours for Plane C). The rollup task runs once daily at 00:05 UTC and aggregates the last 24 hours of probes per provider per region into one row in rollups_daily. The composite score on the leaderboard is recomputed at rollup time, so dashboard values change once per day, not every 15 minutes. Each rollup includes a confidence badge based on probe count.
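
A minimal sketch of the rollup step described above: take the last 24 hours of probes and collapse them into one row per provider per region. The record fields (provider, region, ts, ok, ttfb_ms) and the output columns are hypothetical; the real probe_runs and rollups_daily schemas are not shown on this page.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median

def daily_rollup(probes, now):
    """Aggregate the last 24 hours of probes into one row per (provider, region)."""
    cutoff = now - timedelta(hours=24)
    groups = defaultdict(list)
    for probe in probes:
        if cutoff <= probe["ts"] < now:
            groups[(probe["provider"], probe["region"])].append(probe)

    rows = []
    for (provider, region), items in sorted(groups.items()):
        ok = [p for p in items if p["ok"]]
        rows.append({
            "provider": provider,
            "region": region,
            "probe_count": len(items),  # basis for the confidence badge
            "success_rate": len(ok) / len(items),
            "ttfb_p50": median(p["ttfb_ms"] for p in ok) if ok else None,
        })
    return rows

# Hypothetical run at the 00:05 UTC rollup time; the third probe is too old to count.
run_at = datetime(2026, 7, 18, 0, 5, tzinfo=timezone.utc)
demo = [
    {"provider": "A", "region": "US", "ts": run_at - timedelta(hours=1), "ok": True, "ttfb_ms": 900.0},
    {"provider": "A", "region": "US", "ts": run_at - timedelta(hours=2), "ok": False, "ttfb_ms": None},
    {"provider": "A", "region": "US", "ts": run_at - timedelta(hours=30), "ok": True, "ttfb_ms": 100.0},
]
rows = daily_rollup(demo, run_at)
print(rows)
```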

Limitations & Disclosures

What this benchmark can and cannot tell you. We publish this section explicitly because most "best proxy" lists hide their methodology gaps.

⚠

Vantage point bias. Probes originate from EU-Central (Frankfurt) and US-East. We do not test Asia or LATAM exit performance, so a provider with a strong APAC pool may underperform on our metrics through no fault of their own.

⚠

Limited provider coverage. We currently benchmark 5 active providers (Maskify, Aceproxies, FleetProxy, KindProxy, Proxyon), plus GonzoProxy shown with paused historical data. Larger providers (Bright Data, Oxylabs, Smartproxy) are not yet tested, pending business relationship / budget. We do not extrapolate scores to untested providers, and a provider's first days carry a low-confidence badge until data accumulates.

⚠

Plane C sample size. Anti-bot target probes run every 6 hours = ~120 samples per provider per month. Single-digit percentage differences in Google success rate may be statistical noise. Treat the ranking as more informative than the absolute number.

⚠

HTTP 200 ≠ usable success (now mitigated). As of 2026-05-27, Plane C responses are scanned for CAPTCHA / challenge / interstitial markers; soft-blocked HTTP 200 responses are classified as challenge_or_interstitial and excluded from success rate. Marker list is conservative, so some subtle anti-bot variants may still slip through. Pre-2026-05-27 rollups used the old definition; comparing across that date is not apples-to-apples.

⚠

Session TTL coverage. Session reliability tests use a 60-second TTL window. Real-world session lifetimes (5–30 minutes for login flows) are not yet directly tested. Longer-TTL probes are planned.

⚠

IP uniqueness window. Uniqueness ratios are computed within the daily rollup window. A pool that rotates every few hours may show high uniqueness in our metric but have a smaller effective pool than the number suggests.

⚠

Composite weights are subjective. The 35/30/15/20 split (v2.1) reflects our judgment for "general purpose" use, not a universal optimum. If your workflow is SERP-heavy, use the Google-specific score on the /best/ pages instead.
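
The Plane C sample-size caveat above can be made concrete with a back-of-the-envelope margin of error. The sketch uses the normal approximation to a binomial proportion and a hypothetical 80% success rate; neither is taken from the benchmark's own statistics.

```python
from math import sqrt

def margin_95(success_rate: float, samples: int) -> float:
    """Approximate 95% margin of error for a success rate (normal approximation)."""
    return 1.96 * sqrt(success_rate * (1 - success_rate) / samples)

# Hypothetical provider at 80% success over one month (~120 samples).
print(f"{margin_95(0.80, 120):.3f}")  # about 0.072, i.e. roughly +/- 7 points
```

At roughly ±7 percentage points, two providers a few points apart on a single month of Plane C data cannot be reliably separated.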

What we explicitly do NOT test. We do not probe LinkedIn, X, Instagram, Booking, Expedia, Zillow, Realtor, or Redfin. These targets prohibit automated access in their ToS, and concentrated probing from our infrastructure could fingerprint and flag every provider we test simultaneously (JA4 fingerprint cascade). See our Path C commitment for the full rationale.

Score Methodology Changelog

Public versioning of every change to the scoring formula. We publish these the day a change ships, with rationale and impact. If you ever wonder "why did this score move?", this is the answer.

v2.3.4 (2026-07-17, current)

We added a rotation-depth reference metric. From our own periodic testing we sample how many unique exit IPs a provider's pool yields over a fixed request budget, measured the same way for every provider. It answers a question the composite could not see: at benchmark probe volume a shallow pool never runs out, so a large-scale user's exhaustion risk stayed invisible in the score.

→
Reference only. Zero weight in the global composite, so provider rankings do not move. It surfaces on provider pages as "Rotation Depth" with a large-scale suitability flag, and the only ranking it factors into is the large-scale web-crawling use-case, where pool depth genuinely is quality.
→
Aggregate only. We store counts of unique IPs and /24 subnets, never IP identities, and we never claim a provider's absolute pool size, only what our fixed sample reaches.

Observable impact: a new reference row on provider pages and a re-weighted large-scale crawling use-case. The composite score, its weights and every other ranking are unchanged.
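
A rough sketch of the rotation-depth idea: count distinct exit IPs and distinct /24 subnets seen over a fixed request budget, and keep only the counts. The function name and output keys are illustrative, and the sketch handles IPv4 subnets only; the addresses in the example are documentation-range placeholders.

```python
import ipaddress

def rotation_depth(exit_ips: list[str]) -> dict[str, int]:
    """Return aggregate counts only; no IP identities leave this function."""
    unique_ips = set()
    subnets = set()
    for raw in exit_ips:
        ip = ipaddress.ip_address(raw)
        unique_ips.add(ip)
        if ip.version == 4:
            # strict=False masks the host bits, giving the enclosing /24.
            subnets.add(ipaddress.ip_network(f"{ip}/24", strict=False))
    return {
        "requests": len(exit_ips),
        "unique_ips": len(unique_ips),
        "unique_slash24": len(subnets),
    }

# Hypothetical exit IPs observed over a four-request budget.
depth = rotation_depth(["203.0.113.5", "203.0.113.9", "203.0.113.5", "198.51.100.7"])
print(depth)
```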

v2.3.3 (2026-07-17)

Two display-layer changes. The composite score, weights and rankings are untouched — both affect how existing numbers are presented, not how they are computed.

→
Clean-IP threshold now matches AbuseIPDB's semantics. "Clean" previously meant an abuse-confidence score of exactly 0, which counted the 1–24 noise band (one or two uncorroborated reports — routine for shared residential IPs) as dirty and showed 64–89% for pools whose average abuse score was 1.5/100. The threshold is now AbuseIPDB's own "suspicious" line (≥25); across 3,800 scored exit IPs only 27 sit above it. Displayed clean rates moved to 94–100% accordingly.
→
Published star-rating scale (see Star Rating Scale). Stars were previously score÷20, which misread our deliberately harsh composite (a market-leading 82 displayed as 4.1★). Ratings now go through published calibration bands with newcomer shrinkage and a per-dimension gate. Stars remain display-only; rankings sort by the composite.

Observable impact: higher clean-IP percentages and higher star ratings across the board, with unchanged provider order. Star badges on provider sites temporarily show the verified (no-number) design while we collect provider feedback on the scale.
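
The effect of the threshold change can be sketched with a tiny function. The threshold of 25 comes from the entry above; the function name and the sample scores are hypothetical.

```python
SUSPICIOUS_THRESHOLD = 25  # abuse-confidence scores at or above this count as not clean

def clean_ip_rate(abuse_scores: list[int], threshold: int = SUSPICIOUS_THRESHOLD) -> float:
    """Share of sampled exit IPs whose abuse-confidence score is below the threshold."""
    clean = sum(1 for score in abuse_scores if score < threshold)
    return clean / len(abuse_scores)

# Hypothetical sample: most IPs sit in the 0-24 noise band.
scores = [0, 0, 3, 12, 24, 25, 80]
print(clean_ip_rate(scores))               # new definition (score < 25)
print(clean_ip_rate(scores, threshold=1))  # old definition (score exactly 0)
```

Applied to the fleet-wide figures quoted above, 27 of 3,800 scored IPs above the line leaves 3,773 clean, or about 99.3%.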

v2.3.2 (2026-07-06)

Google and Amazon success rates are now measured separately. We've probed each target individually for weeks, but the daily rollup was averaging both into one number, so the Google and Amazon figures on the use-case pages were identical. They now come from each target's own probes.

→
Per-target split. /best/residential-proxy-for-google-serp-analysis and the Amazon page now reflect that target's real reachability, not a blended average. We recomputed history so the split applies across the whole window, not just going forward.

Observable impact: Google and Amazon success figures diverge where the pools actually behave differently. Composite score and weights are unchanged — this only splits a display metric that was already demoted from use-case ranking in v2.3.

v2.3.1 (2026-07-02)

Two honesty upgrades: the use-case rankings got a formal, gated formula, and the dashboard trend charts now plot real measured history. Composite weights unchanged.

→
Use-case scoring formalized (2026-06-30). Each /best/ page now ranks on a weighted sum of five orthogonal, independently measured axes — latency, session stability, neutral reachability, target reachability, and pool cleanliness — with fixed per-use-case weights, so no signal is double-counted. Clean IP Rate uses small-sample shrinkage (a provider with few sampled IPs is pulled toward the fleet average rather than trusted at face value). "Best for" badges are gated: a provider needs ≥14 days of data and a clear margin over the runner-up, so badges can't flip on day-to-day noise.
→
Trend charts now plot real daily history (2026-07-02). The dashboard's latency / success / score trend charts previously rendered a projected series derived from current values — a prototype leftover a provider's admin rightly called out. They now plot only actual measured days from the daily rollups: a new provider's line starts at its onboarding date (marked "Added"), a paused provider's line ends at its last measured day (marked "Paused"), and gaps stay visible instead of being interpolated. The latency trend begins 2026-06-30, when edge measurement started — earlier latency isn't comparable and isn't shown. Chart success rate is transport success; per-target rates live on provider pages.

Observable impact: /best rankings reflect the gated five-axis formula (a use-case winner needs history, not one good day). Charts show shorter, honest lines for new providers. No change to the composite score or its weights.

v2.3 (2026-06-29)

We measured what actually separates quality residential providers — and what doesn't. Anti-bot success rate is saturated industry-wide (~99% for every credible provider), so we stopped ranking on it; we added the dimension that does differentiate (IP reputation) and made latency measure the proxy, not our server's location.

→
Anti-bot success demoted & corrected. The Google/Amazon "success rate" previously showed ~100% for everyone because it ignored CAPTCHA/WAF challenges. It is now fixed to count them (real ~56–85%) and removed as a use-case ranking basis, because success is saturated across all credible residential pools and so can't tell them apart. The /best/ pages now rank by speed, reliability, and IP cleanliness. We keep a coarse "reaches Google/Amazon" liveness check but no longer rank on it.
→
IP reputation added (Clean IP Rate). We sample each provider's exit IPs and check them against AbuseIPDB. "Clean IP Rate" — the share of the pool with no abuse history — is a strong differentiator (43%-89% across our providers) that the industry uses and we were missing. Measured and shown, weight 0% until trusted across a longer window (same policy as geo).
→
Edge latency (regionalized). Headline latency (TTFB) is now measured to the nearest Cloudflare edge, the proxy's closest CDN on-ramp, the same clean network measurement independent benchmarks like Proxyway use. Each vantage measures its own region locally instead of every proxy reaching one fixed Frankfurt server. The previous probe under-reported real network latency, so headline figures now read higher, especially US residential at roughly 0.9–1.4s (mid-tier for US residential in independent tests). Latency before/after this date is not directly comparable.

Observable impact: US latency reads higher but honest (the old fixed-origin probe under-reported it); color thresholds recalibrated to the edge-benchmark scale (sub-second green, 0.8–2s amber, over 2s red). Use-case rankings shift to speed/reliability/cleanliness. Composite weights unchanged (35/30/0/15/20).

v2.2 (2026-06-26)

Geo de-muddying. Until now both of our test vantages (Germany and the US) probed each provider's default/global exit, so the test's own geography leaked into the number. We now pair each vantage to a controlled exit country and measure performance per region.

→
Vantage ↔ exit-geo pairing. The US vantage now tests US exits and the German vantage tests DE exits, so the measured latency isolates the proxy rather than the distance to the test server.
→
Controlled exit countries (US + DE). Replaces each provider's uncontrolled global exit. Providers without geo-routing (fixed-IP pools) are measured and labeled in the region(s) they actually offer.
→
Headline = blend across regions. The single score is the average of a provider's per-region scores — no bonus for breadth. Per-region detail is shown where available.
→
Geo integrity now measured (still 0% weight). We compare requested vs. observed exit country (currently 100% on geo-paired providers). It stays out of the composite until we trust it across more providers and time — an unproven metric should not move rankings.

Observable impact: providers previously tested on random global exits shift to reflect controlled US + DE performance, so scores before and after this date are not directly comparable. The composite weights (35/30/0/15/20) are unchanged.

v2.1 (2026-05-29)

Honest self-audit while preparing for our Reddit launch. Found three structural issues in the composite formula and fixed them publicly before going live.

→
Fixed double-counting in core_network. Previously, success_rate in core_network_score was computed across all planes, so the same probe could feed into both core_network AND neutral_reachability dimensions. Now uses controlled-plane (Plane A) data only.
→
Removed geo_integrity from composite. country_match_rate was 0% for all providers due to incomplete geo enrichment and missing declared-country data. Returns in v2.2 with multi-country probes (Phase 43.5.7).
→
Reweighted composite from 30/30/20/10/10 to 35/30/0/15/20 after geo removal. Target Reachability (Plane C anti-bot) gained the most, since it's the most user-relevant dimension.
→
Cleaned session_reliability signature. burst_failure_rate parameter was a placeholder, never measured. Documented and defaulted to 0 until Phase 43.7 ships extended-TTL session probes.
→
Historical rollups recomputed for the last 30 days so Score History charts reflect the new formula across the whole period. Pre-v2.1 backup retained.

Observable impact: composite scores went up across all active providers by ~14–16 points (correcting for the broken geo penalty). Ranking unchanged.

v2.0 (2026-04)

Initial Architecture 2.0 composite: five-dimension weighted score (30/30/20/10/10), three-plane testing architecture (Controlled / Neutral / High-Defense), HMAC-signed worker→backend pipeline.

White-Hat Ethical Framework

✓

No pay-to-play. Providers cannot pay for better rankings or placement. Scores are generated purely from automated tests.

✓

Affiliate links never affect rankings. Where a provider runs an affiliate program, our outbound links may be affiliate links and we may earn a commission. This has no bearing on scores or placement, which come only from measured performance.

✓

Safe target sampling. We do not aggressively scrape Google or Amazon. Plane C uses strict cooldowns (6h initial, 24h escalated) and stop-after-deny to prevent IP reputation damage.

✓

Per-IP cooldowns. Each exit IP that receives a deny or challenge is placed on cooldown, preventing repeated probing of the same IP on protected targets.

✓

Full transparency. Our scoring formula, weights, and methodology are published publicly. The three-plane architecture ensures we measure real proxy quality without causing harm.

✓

Tamper-proof results. Each probe is signed with HMAC-SHA256. Any modification to stored results is detectable.

✓

Multi-region fairness. All providers are tested from identical infrastructure in EU and US regions, so differences in test location do not skew comparisons between providers.
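
The HMAC-SHA256 signing mentioned under "Tamper-proof results" could look roughly like the sketch below. The canonical JSON serialization, the function names, and the demo key are assumptions; the page does not describe the real message format or key handling.

```python
import hashlib
import hmac
import json

def sign_metrics(metrics: dict, key: bytes) -> str:
    """Sign a metrics dict; sorted keys give a stable byte representation."""
    payload = json.dumps(metrics, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_metrics(metrics: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison, so verification does not leak timing information."""
    return hmac.compare_digest(sign_metrics(metrics, key), signature)

demo_key = b"demo-key-not-a-real-secret"  # placeholder key for the example
result = {"ttfb_p50": 912.4, "timeout_rate": 0.01}
signature = sign_metrics(result, demo_key)

tampered = dict(result, ttfb_p50=412.4)
print(verify_metrics(result, signature, demo_key))    # True
print(verify_metrics(tampered, signature, demo_key))  # False
```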

Questions about our methodology?

Reach out to us on X (Twitter) or Telegram. We're happy to explain any aspect of our testing process in detail.

Methodology Updates


v2.3.4 (2026-07-17, current)

Added a rotation-depth reference metric (unique exit IPs per fixed request budget) from our own periodic testing. Reference only, 0 weight in the composite; it does factor into the large-scale web-crawling use-case ranking.

v2.3.3 (2026-07-17)

Display layer: clean-IP threshold now uses AbuseIPDB's own ≥ 25 'suspicious' line; star ratings move from score÷20 to published calibration bands. Composite untouched.

v2.3.2 (2026-07-06)

Google and Amazon success rates split into per-target measurements (previously blended into one number on use-case pages).

v2.3 (2026-06-29)

Anti-bot demoted (saturated). IP reputation added (Clean IP Rate). Latency now measured at the nearest CDN edge, not a fixed origin.

v2.2 (2026-06-26)

Geo de-muddying: controlled US + DE exits, vantage↔geo pairing, headline blended across regions. geo_integrity measured (0% weight).

v2.1 (2026-05-29)

Fixed double-counting in core_network. Removed broken geo metric. Reweighted to 35/30/0/15/20.

v2.0 (2026-04)

Initial five-dimension composite. Three-plane architecture.