This article is the technical companion to [The State of B2B Email Tracking in 2026](/blog/state-of-email-tracking-2026). That piece documented what broke. This piece documents the framework that fixes it.
The premise is straightforward. Raw open counts in 2026 are inflated by 30 to 50 percentage points on most B2B lists, because Apple Mail Privacy Protection pre-fetches every pixel on delivery[1], Gmail's image proxy refetches every pixel server-side[3], and corporate scanners (Microsoft Defender for Office 365, Proofpoint, Mimecast) pre-fetch every link and image during security scanning[4][5][6]. The aggregate "open rate" metric is broken. The per-prospect signal is still recoverable, but only if you grade every pixel fire by the likelihood it was a real human read.
That grading is what confidence scoring does. The five-tier framework below is what Outsolvi runs on every open, and a close approximation of what Microsoft Sales Copilot calls "verified vs preview" opens, what Yesware is starting to expose in 2026, and what HubSpot is hinting at in their roadmap.
The five tiers
| Tier | Confidence | What it means | What to do |
|---|---|---|---|
| 1 | 80 to 100 percent | High-confidence human read | Count as open. Surface to rep. |
| 2 | 60 to 80 percent | Likely human read | Count as open. Show tier badge. |
| 3 | 40 to 60 percent | Uncertain | Exclude from open count by default. Show in diagnostic view. |
| 4 | 20 to 40 percent | Likely proxy or scanner | Exclude. Show in diagnostic view. |
| 5 | 0 to 20 percent | Known machine | Exclude. Roll up in inflation-rate metric. |
The threshold for "count as an open" varies by team and tracker. The Outsolvi default is 25 percent: anything below that drops out of the open count entirely but stays visible in the inflation-rate diagnostic. Some teams lift this to 40 percent for stricter filtering. Some hold-out teams want to keep the raw count for backward compatibility with old benchmarks; they set the threshold to zero and rely on the tier breakdown to interpret the number.
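As a minimal sketch of the table and threshold above, the mapping from a confidence score to a tier, and the separate "does it count" decision, might look like the following. The function names are hypothetical, and assigning a score that sits exactly on a band edge (for example 80) to the higher-confidence tier is an assumption, since the table's bands share their endpoints.

```python
# Tier bands from the table, as (tier, lower bound of confidence in percent).
# Assumption: a score exactly on a boundary belongs to the higher-confidence tier.
TIER_FLOORS = [(1, 80), (2, 60), (3, 40), (4, 20), (5, 0)]


def tier_for(confidence: float) -> int:
    """Map a 0-100 confidence score to a tier (1 = most likely a human read)."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    for tier, floor in TIER_FLOORS:
        if confidence >= floor:
            return tier
    raise AssertionError("unreachable: the last floor is 0")


def counts_as_open(confidence: float, threshold: float = 25.0) -> bool:
    """Apply the configurable cut-off; 25 is the default quoted in the text."""
    return confidence >= threshold


print(tier_for(72), counts_as_open(72))                # a Tier 2 fetch, counted
print(tier_for(30), counts_as_open(30, threshold=40))  # Tier 4, dropped at 40
```

Keeping the tier lookup and the threshold check as two functions means a team can move the threshold (to 40, or to zero) without changing how fetches are labelled in the diagnostic view.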
Signal one: IP reputation
The dominant signal in confidence scoring is the IP address that fetched the pixel. Apple publishes the IP ranges its Mail Privacy Protection proxy operates from, and although the ranges rotate on a privacy-preserving schedule[8], the rotation pool is finite and trackers maintain it.
Google's image proxy infrastructure is centered on the 66.249.x.x, 64.233.x.x, and 209.85.x.x blocks, with several adjacent blocks in active use. A pixel fetched from any of these IPs is almost certainly a Gmail proxy fetch, not the recipient's own device.
Microsoft Defender for Office 365 fetches through Microsoft's Azure IP space, which is enormous and not exclusively used for security scanning. The fingerprint that disambiguates here is the User-Agent (covered below): a Microsoft IP plus a Defender-specific User-Agent classifies as Tier 5.
Proofpoint and Mimecast both publish customer-facing allowlists of their outbound URL-rewrite IP ranges. Trackers cross-reference these. A fetch from one of these IPs is a scanner hit, not a human read.
Where the IP signal gets ambiguous is the broader corporate-VPN scenario. An employee at a financial-services firm reading email from home through the corporate VPN routes the pixel fetch through the company's edge. The edge IP is often in the same range the corporate scanner uses, because both come from the corporate egress firewall. The disambiguation here is timing (the scanner hits within seconds of delivery, while the human VPN-routed fetch arrives later) and User-Agent, because the scanner's User-Agent is product-specific while the human's is the recipient's actual mail client.
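A range lookup of this kind can be sketched with Python's standard `ipaddress` module. The sketch treats the three Google blocks named above as /16 networks, which is an assumption about what "66.249.x.x" means, and the scanner entries are documentation-reserved placeholder ranges rather than real vendor data. A production tracker would load the published lists instead.

```python
import ipaddress

# Illustrative data only. The Google blocks are the ones named in the text,
# read as /16s (an assumption). The "scanner" entries are RFC 5737
# documentation ranges standing in for vendor-published allowlists.
KNOWN_RANGES = {
    "gmail_proxy": ["66.249.0.0/16", "64.233.0.0/16", "209.85.0.0/16"],
    "scanner": ["192.0.2.0/24", "198.51.100.0/24"],
}

# Parse the CIDR strings once, up front, rather than on every pixel fetch.
_COMPILED = [
    (label, ipaddress.ip_network(cidr))
    for label, cidrs in KNOWN_RANGES.items()
    for cidr in cidrs
]


def classify_ip(address: str) -> str:
    """Return the label of the first known range containing the address."""
    ip = ipaddress.ip_address(address)
    for label, network in _COMPILED:
        if ip in network:  # False when IP versions differ, so IPv6 is safe here
            return label
    return "unknown"


print(classify_ip("66.249.84.12"))  # gmail_proxy
print(classify_ip("203.0.113.5"))   # unknown
```

An "unknown" result is not a human verdict on its own; it only means the IP signal has nothing to say and the other signals have to carry the classification.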
Signal two: User-Agent fingerprinting
The User-Agent header on the HTTP request that fetches the tracking pixel is the second signal, and sometimes the most decisive one[7].
Apple Mail's image-proxy fetcher sends a distinctive User-Agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko)` with no Safari version identifier. No human browser sends exactly that string. A request with this User-Agent classifies as Tier 5 regardless of IP.
Google Image Proxy identifies itself with strings containing `via ggpht.com` or, for newer proxy nodes, an explicit `GoogleImageProxy` identifier. Same treatment: Tier 5 on User-Agent alone.
Microsoft Defender for Office 365 sends User-Agents containing the substring `BingPreview` or product-specific identifiers depending on whether Safe Attachments or Safe Links is the active scanner. Proofpoint's URL Defense sends `Mozilla/4.0 (compatible; MSIE 6.0; ProofPoint URL Defense)` on its rewriting hits. Mimecast sends `Mimecast-Inspect-Image-Proxy` on its image scans.
Each of these User-Agents is a hard Tier 5 classification. The User-Agent signal is more reliable than IP because while an enterprise VPN might mask the corporate IP, the scanner's User-Agent is set by the scanner regardless of egress IP. A confidence-scoring tracker that misses these User-Agents is misclassifying the most easily-classifiable cases.
The harder problem is the long tail of unknown User-Agents from older mail clients, custom corporate scanners, and bespoke security tools. The default for an unrecognized User-Agent is Tier 3 (uncertain): neither counted nor excluded with strong confidence, pending review.
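The User-Agent rules in this section can be sketched as a small lookup. The machine substrings are the ones quoted above; the list of "recognized client" markers is a hypothetical stand-in for whatever catalogue a real tracker maintains, and returning `None` to mean "no verdict, defer to the other signals" is a design assumption of this sketch.

```python
from typing import Optional

# Substrings quoted in the text; any match is a hard Tier 5.
MACHINE_UA_SUBSTRINGS = [
    "via ggpht.com",
    "googleimageproxy",
    "bingpreview",
    "proofpoint url defense",
    "mimecast-inspect-image-proxy",
]

# The Apple proxy string from the text: note the missing Safari version token.
APPLE_PROXY_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko)"
)

# Hypothetical markers for mail clients and browsers a tracker already knows.
KNOWN_CLIENT_MARKERS = ["Safari/", "Chrome/", "Firefox/", "Edg/", "Thunderbird/"]


def classify_user_agent(user_agent: str) -> Optional[int]:
    """Return 5 for known machines, 3 for unrecognized strings, None otherwise."""
    ua = user_agent.strip()
    lowered = ua.lower()
    # Machine checks run first: some scanner strings also contain browser tokens.
    if ua == APPLE_PROXY_UA:
        return 5
    if any(marker in lowered for marker in MACHINE_UA_SUBSTRINGS):
        return 5
    if any(marker in ua for marker in KNOWN_CLIENT_MARKERS):
        return None  # recognized client: let IP and timing decide the tier
    return 3  # long-tail or empty User-Agent: uncertain, pending review


print(classify_user_agent("Mozilla/4.0 (compatible; MSIE 6.0; ProofPoint URL Defense)"))
print(classify_user_agent("AcmeCorpScanner/1.2"))
```

Checking the machine patterns before the client markers matters because a preview or proxy User-Agent can embed ordinary browser tokens alongside its own identifier.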
Signal three: timing relative to send
Time elapsed between the email's send timestamp and the pixel-fetch timestamp is the third signal.
Empirically, on B2B sends:
| Time since send | What it usually is |
|---|---|
| 0 to 90 seconds | Apple MPP pre-fetch, corporate scanner, Gmail proxy pre-fetch |
| 90 seconds to 4 hours | Likely human read |
| 4 hours to 24 hours | Almost certainly human read |
| Over 24 hours | Mostly human read, some Gmail cache refetches |
The first bucket is the strongest negative signal in the framework. No human at scale reads an email and engages with it within 90 seconds of delivery; even mobile push notification flows take longer than that, because Apple's MPP pre-fetch beats the notification. A pixel fetch inside the first 90 seconds is a proxy or scanner roughly 9 times out of 10. Confidence-scoring trackers apply a Tier 3 or Tier 4 penalty to any fetch in this window, even if the IP and User-Agent look human.
The opposite bucket, fetches more than 24 hours after send, has its own dynamic. Gmail's image proxy occasionally refetches pixels server-side hours or days after the original delivery, when the user revisits the thread. These late refetches look human in timing but reveal themselves in User-Agent. Trackers that only use timing will overcount; trackers that combine timing with User-Agent handle this correctly.
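The timing table can be expressed as a simple bucketing function. This is a sketch under stated assumptions: the bucket names are hypothetical labels, and a fetch landing exactly on a boundary (90 seconds, 4 hours, 24 hours) is placed in the earlier bucket, which the table does not specify.

```python
from datetime import datetime, timedelta, timezone


def timing_bucket(sent_at: datetime, fetched_at: datetime) -> str:
    """Bucket a pixel fetch by the time elapsed since the send timestamp."""
    elapsed = fetched_at - sent_at
    if elapsed < timedelta(0):
        raise ValueError("fetch timestamp precedes send timestamp")
    if elapsed <= timedelta(seconds=90):
        return "likely_machine"          # MPP pre-fetch, scanner, proxy pre-fetch
    if elapsed <= timedelta(hours=4):
        return "likely_human"
    if elapsed <= timedelta(hours=24):
        return "almost_certainly_human"
    return "mostly_human_check_ua"       # late Gmail refetches only show in the UA


sent = datetime(2026, 1, 13, 15, 0, tzinfo=timezone.utc)
print(timing_bucket(sent, sent + timedelta(seconds=12)))
print(timing_bucket(sent, sent + timedelta(hours=30)))
```

The last bucket's name is a reminder that timing alone cannot separate a late human read from a late proxy refetch; that call belongs to the User-Agent signal.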
Signal four: device and screen fingerprinting
The fourth signal is the weakest but the most distinguishing when present. When the tracking pixel is implemented as a small image with JavaScript callback (rather than a pure image tag), the recipient's browser exposes screen size, color depth, language preference, and viewport dimensions. These do not get sent through Apple's MPP proxy or through corporate scanners, because the proxy fetches the image at HTTP level and never executes the JS callback.
A pixel fetch that includes valid device fingerprint data is therefore strongly likely to be a human read on the recipient's own device. The signal is only available on roughly 40 percent of opens (Outlook on Desktop blocks JS callbacks, and Apple Mail blocks remote script execution by default), so it cannot be the primary signal, but where present it lifts a Tier 2 classification to Tier 1.
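The "lift" behaviour is small enough to sketch directly. The field names below are hypothetical, chosen to mirror the screen size, color depth, and language values mentioned above, and the validity check (positive integers, non-empty language) is an assumed minimum rather than a documented rule.

```python
from typing import Mapping, Optional

NUMERIC_FIELDS = ("screen_width", "screen_height", "color_depth")  # hypothetical names


def has_valid_fingerprint(fp: Optional[Mapping[str, object]]) -> bool:
    """True when the callback payload carries plausible device data."""
    if not fp:
        return False
    for field in NUMERIC_FIELDS:
        value = fp.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            return False
    language = fp.get("language")
    return isinstance(language, str) and language != ""


def apply_fingerprint_lift(tier: int, fp: Optional[Mapping[str, object]]) -> int:
    """Lift Tier 2 to Tier 1 when a valid fingerprint is present; else unchanged."""
    return 1 if tier == 2 and has_valid_fingerprint(fp) else tier


payload = {"screen_width": 1440, "screen_height": 900, "color_depth": 24, "language": "en-US"}
print(apply_fingerprint_lift(2, payload))  # lifted
print(apply_fingerprint_lift(4, payload))  # a fingerprint does not rescue Tier 4
```

Restricting the lift to Tier 2 keeps the fingerprint in its supporting role: it confirms a fetch that already looks human and never overrides a machine verdict.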
How the signals combine
The four signals do not combine through a simple weighted sum. The combination logic is roughly:
- If the User-Agent is a known proxy or scanner, hard-classify as Tier 5. (Override everything else.)
- If the IP is a known proxy or scanner range AND the User-Agent doesn't disambiguate, classify Tier 4 or 5 depending on which range.
- If the IP is unknown AND the timing is within the first 90 seconds, classify Tier 3 with a leaning toward 4.
- If the IP is residential or commercial-non-scanner AND the timing is after 90 seconds AND the User-Agent is a real browser, classify Tier 1 or Tier 2 depending on device fingerprint availability.
- If signals conflict (e.g., good IP, bad timing, real User-Agent), default to Tier 3 and surface for human review.
The hard-classify behavior on User-Agent matters because the scanner User-Agents are deterministic. A confidence-scoring system that does weighted averaging across signals will misclassify a clearly-machine fetch when other signals look ambiguous. Hard overrides on the deterministic signals prevent this.
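The five rules above can be sketched as an ordered chain of checks, with the deterministic override first. Everything named here is hypothetical: the enum members, the `Verdict` type, and in particular the split between "dedicated" proxy ranges (Tier 5) and "shared" ranges (Tier 4), which is one assumed reading of "depending on which range". The sketch covers only the five listed rules and encodes the "leaning toward 4" case simply as Tier 3.

```python
from dataclasses import dataclass
from enum import Enum, auto


class UAClass(Enum):
    MACHINE = auto()      # known proxy or scanner User-Agent
    REAL_CLIENT = auto()  # recognized browser or mail client
    UNKNOWN = auto()


class IPClass(Enum):
    DEDICATED_PROXY = auto()  # range used only by a proxy or scanner (assumed -> Tier 5)
    SHARED_RANGE = auto()     # range shared with ordinary traffic (assumed -> Tier 4)
    CLEAN = auto()            # residential or commercial non-scanner
    UNKNOWN = auto()


@dataclass(frozen=True)
class Verdict:
    tier: int
    needs_review: bool = False


def combine(ua: UAClass, ip: IPClass, seconds_since_send: float,
            has_fingerprint: bool) -> Verdict:
    """Apply the rules in order; the first matching rule wins."""
    early = seconds_since_send <= 90
    if ua is UAClass.MACHINE:  # rule 1: hard override on the deterministic signal
        return Verdict(5)
    if ip in (IPClass.DEDICATED_PROXY, IPClass.SHARED_RANGE) and ua is UAClass.UNKNOWN:
        return Verdict(5 if ip is IPClass.DEDICATED_PROXY else 4)  # rule 2
    if ip is IPClass.UNKNOWN and early:  # rule 3: Tier 3, leaning toward 4
        return Verdict(3)
    if ip is IPClass.CLEAN and not early and ua is UAClass.REAL_CLIENT:
        return Verdict(1 if has_fingerprint else 2)  # rule 4
    return Verdict(3, needs_review=True)  # rule 5: conflicting signals


# Good IP, bad timing, real User-Agent: the conflict example from the list.
print(combine(UAClass.REAL_CLIENT, IPClass.CLEAN, 30, False))
```

Writing the logic as ordered early returns, rather than a weighted sum, is what makes the override "hard": once the User-Agent identifies a machine, no combination of the remaining signals can pull the fetch back above Tier 5.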
Edge cases
A few edge cases routinely break naive confidence-scoring implementations.
Outlook on Web through Microsoft Defender. The recipient reads the email in Outlook Web Access, which fetches the pixel from a Microsoft Azure IP with a Microsoft Edge User-Agent. Defender pre-fetched the pixel separately five minutes earlier from the same Azure IP space with a different User-Agent. A naive tracker counts both as opens. A confidence-scoring tracker recognizes the Defender User-Agent on the first hit (Tier 5) and the human Edge User-Agent on the second hit (Tier 1).
BYOD on a corporate scanner network. The recipient checks email on their personal phone through the corporate Wi-Fi VPN. The pixel fetch routes through the corporate egress, with a consumer Apple iPhone User-Agent. The IP looks like a scanner range but the User-Agent is clearly a phone. Classify Tier 2 with a note. This is the false-positive case the framework handles least well; the tier band of 60 to 80 percent confidence reflects the genuine uncertainty.
Apple Mail with MPP disabled. The recipient is on Apple Mail but turned MPP off. The pixel fetches directly from the recipient's IP with the real Apple Mail User-Agent. No proxy interposes. This is genuinely Tier 1, but a tracker that hard-classifies all "Apple Mail User-Agent" fetches as Tier 5 will misclassify it. The disambiguation is the IP. Apple's MPP proxy IPs are distinct from residential IPs.
Litmus and SendForensics opening for QA. Email QA platforms like Litmus and SendForensics render emails for testing purposes, and their pixel fetches look like real opens. Both publish their IP ranges. Trackers should classify these as Tier 5 (known machine) and surface them in the diagnostic view tagged as "QA tools" so the team can confirm the email rendered correctly without inflating the human-read count.
What to do in your dashboard
Confidence-scoring data is only useful if the rep workflow surfaces it correctly. A pattern that consistently works:
The default "Opens" column on the rep's deal view shows Tier 1 plus Tier 2 counts only. The Tier 3, 4, and 5 fetches are excluded by default. Next to each shown open, a small confidence badge (a green dot for Tier 1, a yellow dot for Tier 2) signals the underlying confidence without overwhelming the UI.
A separate "Diagnostic view" or "Open audit" screen shows the full tier breakdown, useful for ops and for the rep who wants to understand why "5 opens on the proposal" became "2 opens" after the tier filter. The diagnostic view shows the inflation rate at the top ("your raw open count is inflated 38 percent above your human-read count this month"), which becomes a useful trend metric on its own.
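The numbers behind both views can come from one small roll-up. This sketch assumes the inflation rate is defined as the excess of raw fetches over Tier 1 plus Tier 2 fetches, relative to the Tier 1 plus Tier 2 count, which is one reading of "inflated 38 percent above your human-read count". The function and key names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Union

Summary = Dict[str, Union[int, Optional[float], Dict[int, int]]]


def open_summary(tiers: Iterable[int]) -> Summary:
    """Roll per-fetch tiers up into the default count and the diagnostic view."""
    counts = Counter(tiers)
    raw = sum(counts.values())
    shown = counts[1] + counts[2]  # the default "Opens" column: Tier 1 plus Tier 2
    # Inflation is undefined when no fetch is classified as a human read.
    inflation = None if shown == 0 else (raw - shown) * 100 / shown
    return {
        "raw": raw,
        "shown": shown,
        "by_tier": {tier: counts[tier] for tier in range(1, 6)},
        "inflation_pct": inflation,
    }


# "5 opens on the proposal" becoming "2 opens" after the tier filter.
print(open_summary([1, 2, 4, 5, 5]))
```

With 100 human-read fetches and 38 machine fetches in a month, the same function reports an inflation rate of 38 percent, matching the banner wording quoted above.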
Don't hide the tier system. Reps who can see the tier on each open develop intuition for what the badges mean ("Tier 2 open from a residential IP in Boston at 11am Tuesday" reads as a likely real read) and start trusting the filtered count. Reps who only see the post-filter number with no transparency tend to lose trust in the tracker when one of their hot leads "drops" an open after a tier reclassification.
What this changes about email tracking
The shift from "raw open count" to "tier-classified open count" is the structural answer to the post-MPP measurement problem. The old metric is broken because the underlying behavior changed: pixels fire for reasons that have nothing to do with human reads. The new metric is durable because it grades the request, not the count.
The framework above is what Outsolvi runs. Microsoft Sales Copilot ships a two-tier version of the same idea (verified vs preview)[5]. Yesware is starting to expose tier badges in 2026. HubSpot's product team has discussed a tier model on their public roadmap. The category is converging on this approach because the underlying problem is real and the framework is the right shape for it.
If you want to see your own list's tier breakdown (what fraction of your reported opens are Tier 1 versus Tier 4), [try Outsolvi free for 14 days](https://my.outsolvi.com/signup). The diagnostic view is on by default during the trial. Most teams find their inflation rate is between 30 and 50 percent, which lines up with the broader B2B average documented in the [State of B2B Email Tracking](/blog/state-of-email-tracking-2026) piece.
The takeaway
Email tracking did not stop working in 2021. The metric did. Confidence scoring is what restores the per-prospect signal that raw counts used to carry. The five-tier framework is the operational shape that signal takes: IP reputation, User-Agent fingerprinting, timing, and device fingerprint, combined through hard overrides on the deterministic signals and weighted classification on the ambiguous ones.
Teams that adopt confidence scoring stop chasing phantom opens and start routing on the signal that actually predicts pipeline outcomes. Teams that keep optimizing on raw counts will keep wondering why their dashboards say "high engagement" while their reply rates stay flat.