Methodology v2.1: I Audited My Own Composite Formula. Here's What I Fixed

A few days ago I was about to post proxystats.io to r/webscraping. My cursor was hovering over the "submit" button. And then I stopped.

If you build an independent benchmark and post it to a technical community, the first thing experienced readers will do is read your methodology line by line. They'll find the parts that don't match. They'll ask the hard questions in the comments. And if I can't answer those questions cleanly, the credibility I'm trying to build dies right there.

So before clicking submit, I sat down and re-read my own composite formula like an outside reviewer would. I found three things I had to fix.

This post documents all three, what the impact was, and what I learned. Every change is live, the historical scores are recomputed, and the whole audit trail is public.

Fix 1: Double-counting between core_network and other dimensions

The composite score has five dimensions:

Core Network Performance (latency, TCP connect, TLS, basic success)
Session Reliability (sticky session survival, mid-session rotation)
Geo Integrity (does exit IP match declared country)
Neutral Reachability (success against httpbin / Cloudflare speed test)
Target Reachability (success against high-defense targets — Google, Amazon)

Each dimension has its own weight. The intent is clean separation: every probe contributes to exactly one dimension based on what it's measuring.

The bug: in the rollup code, success_rate for the core_network dimension was computed across all planes (controlled + neutral + high-defense) instead of just the controlled plane (Plane A). The same probe was counted twice: once in core_network and once in its own dimension's score.

# What the code was doing — counts every probe regardless of plane
total_count += 1
if outcome == "http_success":
    success_count += 1

# success_rate = success_count / total_count
# core_score = (success_rate * 0.6) + ...

The semantic intent of core_network is "Plane A controlled-endpoint performance." But the math was averaging in neutral and high-defense plane results. A successful probe on httpbin.org was bumping both core_network AND neutral_reachability scores. A successful Google probe was bumping both core_network AND target_reachability.

This isn't catastrophic, since it doesn't reverse rankings, but it makes the composite weights mean something different from what the methodology says. Neutral-plane and high-defense-plane success each counted toward two dimensions, so their effective weight was higher than the published 30% / 30% / 20% / 10% / 10% split (Core / Session / Geo / Neutral / Target) suggests.

The fix: separate counters per plane. core_network_score now uses controlled-plane data only. Documented in code, in methodology, and in this post.
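
As an illustration of the per-plane counters (a minimal sketch, not the production rollup code), assume each probe arrives as a (plane, outcome) pair. The plane labels and the "blocked" outcome below are placeholder names; only "http_success" comes from the snippet above.

```python
from collections import defaultdict

# Placeholder plane labels; the post only names the controlled plane "Plane A".
CONTROLLED, NEUTRAL, HIGH_DEFENSE = "controlled", "neutral", "high_defense"


def success_rates_by_plane(probes):
    """Return {plane: success_rate}, keeping one pair of counters per plane."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for plane, outcome in probes:
        totals[plane] += 1
        if outcome == "http_success":
            successes[plane] += 1
    return {plane: successes[plane] / totals[plane] for plane in totals}


# Made-up sample: perfect on the controlled plane, blocked on high-defense.
probes = [
    (CONTROLLED, "http_success"),
    (CONTROLLED, "http_success"),
    (NEUTRAL, "http_success"),
    (HIGH_DEFENSE, "blocked"),
    (HIGH_DEFENSE, "blocked"),
]

rates = success_rates_by_plane(probes)

# Fixed behaviour: core_network reads the controlled plane only.
core_success_rate = rates[CONTROLLED]

# Old behaviour: one pooled counter across every plane.
pooled_rate = sum(outcome == "http_success" for _, outcome in probes) / len(probes)

print(core_success_rate, pooled_rate)
```

In this made-up sample the pooled rate drags core_network down with high-defense failures that are already counted under target reachability, which is the double-counting pattern described above.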

Fix 2: Geo integrity was 0% across all providers — and crushing scores

This one was structural.

The geo dimension was supposed to measure: "when a provider claims to deliver IPs in country X, does the exit IP actually resolve to country X?" A 20% weight in the composite — the third-largest dimension.

But the metric was broken in two ways simultaneously:

The geo enrichment lookup (country_observed) was being called only in one probe type, not all of them. So out of ~3000 probes per provider per month, only ~300 had geo data populated.
The provider records had empty available_countries lists, so the match-rate computation had nothing to match against.

Result: country_match_rate was 0.0% for every provider, every day, since launch. And because geo had 20% weight, every provider was silently losing ~17 composite points to a metric that wasn't actually measuring anything.
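
A hedged sketch of how those two defects can combine into a constant 0.0, and of one possible guard. The function names, the list-of-country-codes input shape, and the choice to return None for "not measured" are all assumptions for illustration, not the actual rollup code.

```python
def naive_match_rate(observed, available_countries):
    """Failure mode: an empty declared list silently yields 0.0."""
    declared = set(available_countries)
    matches = sum(c in declared for c in observed if c is not None)
    return matches / max(len(observed), 1)


def country_match_rate(observed, available_countries):
    """Match rate over probes that have geo data, or None if unmeasurable.

    observed: observed country codes, None where enrichment never ran.
    available_countries: countries declared on the provider record.
    """
    with_geo = [c for c in observed if c is not None]
    if not with_geo or not available_countries:
        return None  # nothing to compare: report "not measured", not 0.0
    declared = set(available_countries)
    return sum(c in declared for c in with_geo) / len(with_geo)


# Roughly the shape described above: ~300 of ~3000 probes carry geo data.
observed = ["DE"] * 300 + [None] * 2700

print(naive_match_rate(observed, []))              # empty declared list
print(country_match_rate(observed, []))            # guarded version
print(country_match_rate(observed, ["DE", "US"]))  # once the list is populated
```

Returning None instead of 0.0 lets the composite skip an unmeasured dimension rather than score it as a total failure.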

The honest options:

A) Backfill available_countries with whatever the provider advertises on their website → metric works but only measures "is the observed country in the provider's marketing copy", which is weak.
B) Implement multi-country probes (request US, verify US; request DE, verify DE) → real measurement but requires multi-country provider plans we don't all have.
C) Remove the broken metric from the composite until we can measure it properly.

I went with C — for now. Multi-country probes for Maskify (their plan covers it) ship in the next phase. When that data is solid, geo returns to the composite with a real measurement.

Reweighted the composite:

v2.0:  Core 30% · Session 30% · Geo 20% · Neutral 10% · Target 10%
v2.1:  Core 35% · Session 30% · Geo 0%  · Neutral 15% · Target 20%

The biggest shift is Target Reachability going from 10% to 20%. That's the high-defense plane (Google, Amazon) — the most user-relevant dimension for anyone scraping anti-bot-protected sites. After we shipped content validation a few days ago (HTTP 200 + CAPTCHA page is no longer counted as success), this dimension is now reliable enough to carry more weight.
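
To make the reweighting concrete, here is a small sketch that treats the composite as a plain weighted sum of per-dimension scores on a 0-100 scale. That functional form is an assumption, and the dimension scores in the example are invented; only the two weight sets come from the lines above.

```python
# Weights in whole percent, copied from the v2.0 and v2.1 lines above.
WEIGHTS_V2_0 = {"core": 30, "session": 30, "geo": 20, "neutral": 10, "target": 10}
WEIGHTS_V2_1 = {"core": 35, "session": 30, "geo": 0, "neutral": 15, "target": 20}


def composite(scores, weights):
    """Weighted sum of 0-100 dimension scores; weights must total 100 percent."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(weights[dim] * scores[dim] for dim in weights) / 100


# Invented dimension scores, with geo stuck at 0 as in the broken metric.
example_scores = {"core": 90, "session": 80, "geo": 0, "neutral": 95, "target": 70}

v2_0 = composite(example_scores, WEIGHTS_V2_0)
v2_1 = composite(example_scores, WEIGHTS_V2_1)
print(v2_0, v2_1)
```

Keeping the weights as integers that must total 100 makes a mistyped reweighting fail loudly instead of quietly shifting every score.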

Fix 3: A parameter that was never measured

The session reliability score function takes three inputs:

def calculate_session_reliability_score(
    sticky_survival_rate,
    unexpected_rotation_rate,
    burst_failure_rate,   # ← always passed 0.0
):
    ...

burst_failure_rate was a placeholder from when this score was designed. It was supposed to track mid-session TCP connection drops — different from "unexpected rotation" (provider gave you a new IP) and different from "session expired" (TTL ran out).

It was always passed 0.0 because we never wrote the detection code. So the third penalty term in the formula was always zero.

Not a correctness bug exactly — the math still works with 0.0 in there — but it's misleading. Someone reading the code thinks we're measuring something we're not.

The fix: defaulted the parameter to 0, documented its real status, and noted that it returns when we ship extended-TTL session probes (Phase 43.7 — testing 5- and 15-minute sessions instead of just 60-second sessions).

While I was in there, I also documented the ×3 penalty multiplier on unexpected rotation. This was undocumented behavior that creates a hard floor in the score: once unexpected rotation reaches one third (about 33.3%), the score bottoms out at 0, because the penalty unexpected_rotation × 3 × 100 then equals or exceeds the 100-point base. That's intentional (high mid-session rotation breaks stateful workflows so thoroughly that further differentiation is meaningless), but it should be explicit, not hidden in the code.
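
The floor is easy to see in isolation. The sketch below models only the rotation term (100-point base, ×3 multiplier, clamp at 0) as described above; the function name is hypothetical, and the other inputs of calculate_session_reliability_score are deliberately left out because their formula isn't shown here.

```python
def rotation_only_score(unexpected_rotation_rate, base=100.0, multiplier=3.0):
    """Base score minus the x3 rotation penalty, floored at 0.

    Models one term only; not the full session reliability formula.
    """
    penalty = unexpected_rotation_rate * multiplier * 100.0
    return max(0.0, base - penalty)


for rate in (0.0, 0.10, 0.33, 1 / 3, 0.50):
    print(f"{rate:.3f} -> {rotation_only_score(rate):.1f}")
```

Under these assumptions the score falls linearly until the rate reaches one third, then stays at 0 for anything higher.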

Impact on scores

After all three fixes plus a recompute of the last 30 days of historical rollups:

Provider             v2.0 score   v2.1 score   Change
Maskify (DE/US)      77           93           +16
Aceproxies (DE/US)   62           77           +15
GonzoProxy (DE/US)   53           65           +12

Every provider went up. Ranking didn't change. The change came from removing a broken metric that was equally penalizing all of them (geo, −17 points), partially offset by the rebalancing.

This is the kind of result you want from an integrity fix: not a leaderboard reshuffle, just more accurate absolute numbers for everyone.
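
A quick way to check both claims (every score rose, ranking unchanged) against the table, written as a throwaway sketch using only the published numbers:

```python
# (v2.0 score, v2.1 score) per provider, copied from the table above.
scores = {
    "Maskify": (77, 93),
    "Aceproxies": (62, 77),
    "GonzoProxy": (53, 65),
}


def ranking(version_index):
    """Providers ordered best-first for the given column (0 = v2.0, 1 = v2.1)."""
    return sorted(scores, key=lambda p: scores[p][version_index], reverse=True)


changes = {provider: new - old for provider, (old, new) in scores.items()}
ranking_preserved = ranking(0) == ranking(1)

print(changes)
print(ranking_preserved)
```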

Score History graphs reflect the new formula across the whole 30-day window because we recomputed historical rollups. Pre-v2.1 numbers are preserved in a backup; if you ever want to compare the two formulas on the same data, that's recoverable.

What I learned

I'm publishing this for two reasons.

One — if someone runs into the same kind of bug in their own benchmark code (cross-plane double counting is genuinely subtle), this is documentation that the pattern exists and how to spot it.

Two — the only credibility move available to an independent benchmark is to publicly audit your own work before a critic does. The alternative is hoping nobody notices, which is a strategy with a one-way door: the moment somebody does notice and writes a callout post, the entire project's credibility takes the hit.

Doing this audit before launching on Reddit means that if someone in the comments asks "how do I trust your numbers?", I can point to this post and say: "here's exactly what I got wrong, here's what I fixed, here's the timestamp."

I'd rather lose two days to a self-audit than lose a year of credibility to an avoidable callout.

What's next

The biggest thing geo removal leaves on the table is a real geo-targeting measurement. That ships next — multi-country probes on Maskify (they have the widest plan to support it), testing 5 countries (US, DE, FR, JP, BR), then expanding as budget allows.

A few other improvements are in the pipeline:

Extended-TTL session probes (5- and 15-minute sessions, not just 60s)
Per-dimension drill-down UI on provider pages — see exactly why a score is what it is
Use Case Calculator — pick your own weights, get a custom composite for your specific workflow

Plus the things further out: confidence intervals on metrics, cross-correlation analysis between dimensions, anchor providers in the testing set (Smartproxy / Decodo first).

The methodology page now has a Changelog section — every meaningful formula change gets a public entry with date, rationale, and impact. If you ever wonder "did the score move because the provider changed or because the formula changed?", that page is the answer.

Found a methodology bug? Tell me. Methodology PRs are welcome.