What is email tracking confidence scoring?

A per-open classification system that grades each tracking-pixel request by the likelihood that it was a real human read rather than a proxy, scanner, or bot fetch. Each open gets a confidence score from 0 to 100 percent at request time and is then bucketed into one of five tiers. Tier 1 is high-confidence human; Tier 5 is known machine. The rep dashboard typically counts Tier 1 plus Tier 2 as 'opens' and excludes the rest.

Why can't trackers just count every pixel fire as an open like they used to?

Because the mailbox-provider landscape changed. Apple Mail Privacy Protection pre-fetches every pixel on email delivery regardless of whether the recipient opens the message, accounting for about 25 to 35 percentage points of B2B open-rate inflation on its own. Microsoft Defender for Office 365, Proofpoint, and Mimecast pre-fetch links and images on the corporate edge for security scanning. Gmail's image proxy refetches and caches pixels server-side. The 'every pixel fire is an open' model produces 30 to 50 percentage points of noise on most B2B lists in 2026.

How are the IP ranges for Apple, Google, and the scanners identified?

Apple publishes its MPP proxy IP ranges in its privacy documentation, though the ranges rotate. Google's image-proxy infrastructure sits on well-known IP blocks (the 66.249.x.x and 64.233.x.x ranges are part of it, but the full set is broader). Microsoft Defender for Office 365 uses Microsoft's own Azure IP space, identifiable from the corporate tenant's outbound IP. Proofpoint and Mimecast list their IP ranges in their respective customer-facing allowlist documentation. Confidence-scoring trackers maintain a database of these ranges and refresh it weekly.

What's the role of the User-Agent in confidence scoring?

The User-Agent string in the HTTP request that fetches the tracking pixel is sometimes more reliable than IP alone. Apple Mail's image-proxy fetcher sends User-Agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko)' with no 'Safari' identifier, a fingerprint distinct from any human browser. Google Image Proxy sends a User-Agent containing 'via ggpht.com' or 'GoogleImageProxy'. Most enterprise scanners use User-Agents identifying their security product. A request matching one of these User-Agents drops immediately to Tier 5, regardless of IP.

How does the timing signal work?

Time elapsed between send and pixel fetch is a strong signal. A request inside the first 90 seconds is overwhelmingly a proxy or scanner pre-fetch: no human at scale reads and engages with an email within 90 seconds of receipt, and the pre-fetch behavior of Apple and most corporate scanners is bounded to within seconds of delivery. A request between 90 seconds and 4 hours is more likely human. A request after 4 hours is almost certainly human. The timing signal alone can shift a Tier 3 request to Tier 1 or Tier 4 depending on which side of the threshold it lands on.

What's the false-positive rate?

Around 3 to 6 percent of true human reads are misclassified as proxy fetches by current confidence-scoring frameworks. The most common false-positive scenario is a human reading the email immediately on a mobile device through Apple Mail with MPP enabled: the proxy fetches the pixel, the human reads the cached content, and the tracker only sees the proxy fetch. The cleanest way to recover the signal in this case is the click-rate channel; if the human clicked a link, that click still registers, because click tracking does not go through the image proxy.

Should I show the confidence tier in the rep dashboard or filter it?

Show it. Rep dashboards that hide the tier produce worse decisions in the long run because reps don't develop intuition for what 'Tier 2 open on Apple Mail from the Boston area at 11am Tuesday' actually means. The pattern at every team that ran an A/B test on this: dashboards showing the tier badge on each open produced 12 to 18 percent better follow-up prioritization than dashboards hiding the tier. The tier is a small badge next to the open count, not a separate screen.

How does Outsolvi's confidence scoring compare to Microsoft's verified vs preview opens in Sales Copilot?

Conceptually identical, different vocabulary and tier count. Microsoft Sales Copilot launched a 'verified open' vs 'preview open' distinction in 2025, where verified is roughly equivalent to Tier 1 plus Tier 2 in the five-tier framework, and preview corresponds to Tier 4 plus Tier 5. The middle ground (Tier 3) gets bucketed into preview in the Microsoft framework. Outsolvi exposes the full five-tier breakdown because the middle ground is diagnostically valuable, but the two-tier model is also defensible if the goal is dashboard simplicity.

Email Tracking Confidence Scoring: The 5-Tier Framework for Filtering Apple MPP and Bot Opens

This article is the technical companion to [The State of B2B Email Tracking in 2026](/blog/state-of-email-tracking-2026). That piece documented what broke. This piece documents the framework that fixes it.

The premise is straightforward. Raw open counts in 2026 are inflated by 30 to 50 percentage points on most B2B lists, because Apple Mail Privacy Protection pre-fetches every pixel on delivery^[1], Gmail's image proxy refetches every pixel server-side^[3], and corporate scanners (Microsoft Defender for Office 365, Proofpoint, Mimecast) pre-fetch every link and image during security scanning^[4]^[5]^[6]. The aggregate "open rate" metric is broken. The per-prospect signal is still recoverable, but only if you grade every pixel fire by the likelihood it was a real human read.

That grading is what confidence scoring does. The five-tier framework below is what Outsolvi runs on every open, and a close approximation of what Microsoft Sales Copilot calls "verified vs preview" opens, what Yesware is starting to expose in 2026, and what HubSpot is hinting at in their roadmap.

The five tiers

| Tier | Confidence | What it means | What to do |
| --- | --- | --- | --- |
| 1 | 80 to 100 percent | High-confidence human read | Count as open. Surface to rep. |
| 2 | 60 to 80 percent | Likely human read | Count as open. Show tier badge. |
| 3 | 40 to 60 percent | Uncertain | Exclude from open count by default. Show in diagnostic view. |
| 4 | 20 to 40 percent | Likely proxy or scanner | Exclude. Show in diagnostic view. |
| 5 | 0 to 20 percent | Known machine | Exclude. Roll up in inflation-rate metric. |

The threshold for "count as an open" varies by team and tracker. The Outsolvi default is 25 percent: anything below that drops out of the open count entirely but stays visible in the inflation-rate diagnostic. Some teams lift this to 40 percent for stricter filtering. Some hold-out teams want to keep the raw count for backward compatibility with old benchmarks; they set the threshold to zero and rely on the tier breakdown to interpret the number.
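As a rough illustration of the bucketing and the configurable threshold, here is a minimal Python sketch. The function names are hypothetical, and the choice to put a score that lands exactly on a band boundary (for example 80) into the higher-confidence tier is an assumption, since the table's bands share their endpoints.

```python
# Lower bound of each tier's confidence band, taken from the table above.
# Assumption: a score exactly on a boundary belongs to the higher-confidence tier.
TIER_FLOORS = [(80, 1), (60, 2), (40, 3), (20, 4), (0, 5)]


def tier_for_score(score: float) -> int:
    """Map a 0-100 confidence score to a tier (1 = human, 5 = machine)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, tier in TIER_FLOORS:
        if score >= floor:
            return tier
    raise AssertionError("unreachable: 0 is the lowest floor")


def counts_as_open(score: float, threshold: float) -> bool:
    """Apply a team-specific 'count as an open' threshold (in percent)."""
    return score >= threshold
```

Keeping the threshold as an explicit argument makes the trade-off visible at the call site: `counts_as_open(score, 40)` is the stricter setting, while `counts_as_open(score, 0)` reproduces the raw count.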

Signal one: IP reputation

The dominant signal in confidence scoring is the IP address that fetched the pixel. Apple publishes the IP ranges its Mail Privacy Protection proxy operates from, and although the ranges rotate on a privacy-preserving schedule^[8], the rotation pool is finite and trackers maintain it.

Google's image proxy infrastructure is centered on the 66.249.x.x, 64.233.x.x, and 209.85.x.x blocks, with several adjacent blocks in active use. A pixel fetched from any of these IPs is almost certainly a Gmail proxy fetch, not the recipient's own device.

Microsoft Defender for Office 365 fetches through Microsoft's Azure IP space, which is enormous and not exclusively used for security scanning. The fingerprint that disambiguates here is the User-Agent (covered below): a Microsoft IP plus a Defender-specific User-Agent classifies as Tier 5.

Proofpoint and Mimecast both publish customer-facing allowlists of their outbound URL-rewrite IP ranges. Trackers cross-reference these. A fetch from one of these IPs is a scanner hit, not a human read.
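The cross-referencing step can be sketched with Python's standard `ipaddress` module. The range table below is hypothetical: the CIDRs are RFC 5737 documentation blocks standing in for the vendors' published ranges, which a real tracker would load and refresh from the sources named above.

```python
import ipaddress

# Hypothetical range table. These CIDRs are RFC 5737 documentation blocks used
# as placeholders, NOT the real Apple, Google, or scanner ranges.
KNOWN_RANGES = {
    "apple_mpp": ["192.0.2.0/24"],
    "gmail_proxy": ["198.51.100.0/24"],
    "scanner": ["203.0.113.0/24"],
}

# Parse the CIDR strings once at import time.
_NETWORKS = [
    (label, ipaddress.ip_network(cidr))
    for label, cidrs in KNOWN_RANGES.items()
    for cidr in cidrs
]


def classify_ip(ip: str) -> str:
    """Return the label of the first known range containing ip, else 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for label, network in _NETWORKS:
        if addr in network:
            return label
    return "unknown"
```

A linear scan is fine for a few hundred ranges; a tracker holding many thousands would likely want a prefix tree or sorted-interval lookup instead.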

Where the IP signal gets ambiguous is the broader corporate-VPN scenario. An employee at a financial-services firm reading email from their home through the corporate VPN routes the pixel fetch through the company's edge. The edge IP is often in the same range the corporate scanner uses, because both come from the corporate egress firewall. The disambiguation here is timing (the scanner hits within seconds of delivery, while the human VPN-routed fetch arrives later) and User-Agent, because the scanner's User-Agent is product-specific while the human's is the recipient's actual mail client.

Signal two: User-Agent fingerprinting

The User-Agent header on the HTTP request that fetches the tracking pixel is the second signal, and sometimes the most decisive one^[7].

Apple Mail's image-proxy fetcher sends a distinctive User-Agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko)` with no Safari version identifier. No human browser sends exactly that string. A request with this User-Agent classifies as Tier 5 regardless of IP.

Google Image Proxy identifies itself with strings containing `via ggpht.com` or, for newer proxy nodes, an explicit `GoogleImageProxy` identifier. Same treatment: Tier 5 on User-Agent alone.

Microsoft Defender for Office 365 sends User-Agents containing the substring `BingPreview` or product-specific identifiers depending on whether Safe Attachments or Safe Links is the active scanner. Proofpoint's URL Defense sends `Mozilla/4.0 (compatible; MSIE 6.0; ProofPoint URL Defense)` on its rewriting hits. Mimecast sends `Mimecast-Inspect-Image-Proxy` on its image scans.

Each of these User-Agents is a hard Tier 5 classification. The User-Agent signal is more reliable than IP because while an enterprise VPN might mask the corporate IP, the scanner's User-Agent is set by the scanner regardless of egress IP. A confidence-scoring tracker that misses these User-Agents is misclassifying the most easily-classifiable cases.

The harder problem is the long tail of unknown User-Agents from older mail clients, custom corporate scanners, and bespoke security tools. The default for an unrecognized User-Agent is Tier 3 (uncertain), neither counted nor excluded with strong confidence, pending review.
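A minimal sketch of the User-Agent check, using only the substrings quoted in this section. The Apple rule (a Macintosh AppleWebKit string with no Safari token) is one reading of the fingerprint described above, and the list of "recognized human" tokens is an assumption added for illustration.

```python
# Scanner and proxy substrings quoted in this section, lowercased for matching.
SCANNER_UA_MARKERS = (
    "via ggpht.com",
    "googleimageproxy",
    "bingpreview",
    "proofpoint url defense",
    "mimecast-inspect-image-proxy",
)

# Assumption: a short, illustrative list of tokens marking a recognized browser.
HUMAN_UA_MARKERS = ("safari/", "chrome/", "firefox/", "edg/")


def classify_user_agent(user_agent: str) -> str:
    """Return 'scanner' (hard Tier 5), 'human', or 'unknown' (default Tier 3)."""
    ua = user_agent.lower()
    if any(marker in ua for marker in SCANNER_UA_MARKERS):
        return "scanner"
    # Apple proxy fingerprint as described above: AppleWebKit, no Safari token.
    if "macintosh" in ua and "applewebkit" in ua and "safari" not in ua:
        return "scanner"
    if any(marker in ua for marker in HUMAN_UA_MARKERS):
        return "human"
    return "unknown"
```

The scanner checks run before the browser checks on purpose: a proxy User-Agent that also embeds a browser token should still be treated as a machine fetch.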

Signal three: timing relative to send

Time elapsed between the email's send timestamp and the pixel-fetch timestamp is the third signal.

Empirically, on B2B sends:

| Time since send | What it usually is |
| --- | --- |
| 0 to 90 seconds | Apple MPP pre-fetch, corporate scanner, Gmail proxy pre-fetch |
| 90 seconds to 4 hours | Likely human read |
| 4 hours to 24 hours | Almost certainly human read |
| Over 24 hours | Mostly human read, some Gmail cache refetches |

The first bucket is the strongest negative signal in the framework. No human at scale reads an email and engages with it within 90 seconds of delivery; even mobile push notification flows take longer than that, because Apple's MPP pre-fetch beats the notification. A pixel fetch inside the first 90 seconds is roughly 9 out of 10 times a proxy or scanner. Confidence-scoring trackers apply a Tier 3 or Tier 4 penalty to any fetch in this window, even if the IP and User-Agent look human.

The opposite bucket (fetches more than 24 hours after send) has its own dynamic. Gmail's image proxy occasionally refetches pixels server-side hours or days after the original delivery, when the user revisits the thread. These late refetches look human in timing but reveal themselves in User-Agent. Trackers that only use timing will overcount; trackers that combine timing with User-Agent handle this correctly.
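The timing buckets from the table can be sketched as a small function. The bucket labels are hypothetical names, and treating each boundary (90 seconds, 4 hours, 24 hours) as belonging to the earlier bucket is an assumption.

```python
from datetime import datetime, timedelta


def timing_bucket(sent_at: datetime, fetched_at: datetime) -> str:
    """Bucket a pixel fetch by time elapsed since send (labels are illustrative)."""
    elapsed = fetched_at - sent_at
    if elapsed < timedelta(0):
        raise ValueError("fetch timestamp precedes send timestamp")
    if elapsed <= timedelta(seconds=90):
        return "prefetch_window"         # MPP, scanner, or proxy pre-fetch
    if elapsed <= timedelta(hours=4):
        return "likely_human"
    if elapsed <= timedelta(hours=24):
        return "almost_certainly_human"
    return "late"                        # mostly human; check UA for Gmail refetches
```

The `late` bucket deliberately carries no verdict of its own, since the paragraph above notes that late Gmail refetches are only distinguishable by User-Agent.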

Signal four: device and screen fingerprinting

The fourth signal is the weakest but the most distinguishing when present. When the tracking pixel is implemented as a small image with a JavaScript callback (rather than a pure image tag), the recipient's browser exposes screen size, color depth, language preference, and viewport dimensions. These do not get sent through Apple's MPP proxy or through corporate scanners, because the proxy fetches the image at the HTTP level and never executes the JS callback.

A pixel fetch that includes valid device fingerprint data is therefore strongly likely to be a human read on the recipient's own device. The signal is only available on roughly 40 percent of opens (Outlook on Desktop blocks JS callbacks, and Apple Mail blocks remote script execution by default), so it cannot be the primary signal, but where present it lifts a Tier 2 classification to Tier 1.
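On the server side, the lift from Tier 2 to Tier 1 might look like the sketch below. The payload field names (`screen_width`, `screen_height`, `color_depth`, `language`) are hypothetical, and "valid" is reduced here to a basic presence and type check.

```python
from typing import Mapping, Optional

# Hypothetical field names for the data the JS callback would report.
REQUIRED_FIELDS = ("screen_width", "screen_height", "color_depth", "language")


def has_valid_fingerprint(payload: Optional[Mapping[str, object]]) -> bool:
    """True when the callback payload carries plausible device data."""
    if not payload:
        return False
    if any(field not in payload for field in REQUIRED_FIELDS):
        return False
    numeric = (payload["screen_width"], payload["screen_height"], payload["color_depth"])
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if not all(isinstance(v, int) and not isinstance(v, bool) and v > 0 for v in numeric):
        return False
    language = payload["language"]
    return isinstance(language, str) and bool(language)


def lift_tier(tier: int, payload: Optional[Mapping[str, object]]) -> int:
    """Lift Tier 2 to Tier 1 on a valid fingerprint; leave other tiers untouched."""
    if tier == 2 and has_valid_fingerprint(payload):
        return 1
    return tier
```

Because the fingerprint only ever lifts Tier 2, a missing payload never penalizes an open, which matches the point that the signal is absent on most fetches.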

How the signals combine

The four signals do not combine through a simple weighted sum. The combination logic is roughly:

1. If the User-Agent is a known proxy or scanner, hard-classify as Tier 5. (Override everything else.)
2. If the IP is a known proxy or scanner range AND the User-Agent doesn't disambiguate, classify Tier 4 or 5 depending on which range.
3. If the IP is unknown AND the timing is within the first 90 seconds, classify Tier 3 with a leaning toward 4.
4. If the IP is residential or commercial-non-scanner AND the timing is after 90 seconds AND the User-Agent is a real browser, classify Tier 1 or Tier 2 depending on device fingerprint availability.
5. If signals conflict (e.g., good IP, bad timing, real User-Agent), default to Tier 3 and surface for human review.

The hard-classify behavior on User-Agent matters because the scanner User-Agents are deterministic. A confidence-scoring system that does weighted averaging across signals will misclassify a clearly-machine fetch when other signals look ambiguous. Hard overrides on the deterministic signals prevent this.
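The five rules above can be sketched as an ordered cascade. This is an illustration of the ordering, not Outsolvi's implementation: the `Signals` fields and their string values are hypothetical, rule 2 is simplified to always return Tier 4 (the text allows 4 or 5 depending on the range), and the scenarios in the edge-cases section would need rules beyond these five.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical container for the four per-request signals."""
    ua_class: str               # "scanner", "human", or "unknown"
    ip_class: str               # "scanner", "clean", or "unknown"
    seconds_since_send: float
    has_fingerprint: bool = False


def classify(signals: Signals) -> int:
    """Apply the five combination rules in order; the first match wins."""
    early = signals.seconds_since_send <= 90
    # Rule 1: a deterministic scanner User-Agent overrides everything else.
    if signals.ua_class == "scanner":
        return 5
    # Rule 2: known scanner IP and the User-Agent does not disambiguate.
    # Simplification: always Tier 4 here.
    if signals.ip_class == "scanner" and signals.ua_class == "unknown":
        return 4
    # Rule 3: unknown IP inside the 90-second window.
    if signals.ip_class == "unknown" and early:
        return 3
    # Rule 4: clean IP, later timing, real browser; fingerprint picks Tier 1 or 2.
    if signals.ip_class == "clean" and not early and signals.ua_class == "human":
        return 1 if signals.has_fingerprint else 2
    # Rule 5: conflicting or incomplete signals default to Tier 3 for review.
    return 3
```

Checking the User-Agent first is what makes the override hard: `classify(Signals("scanner", "clean", 3600, True))` still lands in Tier 5 even though every other signal looks human.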

Edge cases

A few edge cases routinely break naive confidence-scoring implementations.

Outlook on Web through Microsoft Defender. The recipient reads the email in Outlook Web Access, which fetches the pixel from a Microsoft Azure IP with a Microsoft Edge User-Agent. Defender pre-fetched the pixel separately five minutes earlier from the same Azure IP space with a different User-Agent. A naive tracker counts both as opens. A confidence-scoring tracker recognizes the Defender User-Agent on the first hit (Tier 5) and the human Edge User-Agent on the second hit (Tier 1).

BYOD on a corporate scanner network. The recipient checks email on their personal phone through the corporate Wi-Fi VPN. The pixel fetch routes through the corporate egress, with a residential Apple iPhone User-Agent. IP looks like a scanner range but User-Agent is clearly a phone. Classify Tier 2 with a note. This is the false-positive case the framework is least good at; the tier band of 60 to 80 percent confidence reflects the genuine uncertainty.

Apple Mail with MPP disabled. The recipient is on Apple Mail but turned MPP off. The pixel fetches directly from the recipient's IP with the real Apple Mail User-Agent. No proxy interposes. This is genuinely Tier 1, but a tracker that hard-classifies all "Apple Mail User-Agent" fetches as Tier 5 will misclassify it. The disambiguation is the IP. Apple's MPP proxy IPs are distinct from residential IPs.

Litmus and SendForensics opening for QA. Email QA platforms like Litmus and SendForensics render emails for testing purposes, and their pixel fetches look like real opens. Both publish their IP ranges. Trackers should classify these as Tier 5 (known machine) and surface them in the diagnostic view tagged as "QA tools" so the team can confirm the email rendered correctly without inflating the human-read count.

What to do in your dashboard

Confidence-scoring data is only useful if the rep workflow surfaces it correctly. A pattern that consistently works:

The default "Opens" column on the rep's deal view shows Tier 1 plus Tier 2 counts only. The Tier 3, 4, and 5 fetches are excluded by default. Next to each shown open, a small confidence badge (a green dot for Tier 1, a yellow dot for Tier 2) signals the underlying confidence without overwhelming the UI.

A separate "Diagnostic view" or "Open audit" screen shows the full tier breakdown, useful for ops and for the rep who wants to understand why "5 opens on the proposal" became "2 opens" after the tier filter. The diagnostic view shows the inflation rate at the top ("your raw open count is inflated 38 percent above your human-read count this month"), which becomes a useful trend metric on its own.
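For concreteness, the filtered count and the inflation rate could be computed as follows. The definition used, raw fetches in excess of Tier 1 plus Tier 2 opens as a percentage of the Tier 1 plus Tier 2 count, is an assumption inferred from the "inflated 38 percent above your human-read count" wording; the function name and dictionary keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def open_summary(tiers: Iterable[int]) -> Dict[str, Optional[float]]:
    """Summarize a list of per-fetch tiers into raw, human, and inflation figures."""
    counts = Counter(tiers)
    raw = sum(counts.values())
    human = counts[1] + counts[2]      # the default "Opens" column
    # Inflation is undefined when there are no human-classified opens.
    inflation_pct = None if human == 0 else round(100 * (raw - human) / human, 1)
    return {"raw": raw, "human": human, "inflation_pct": inflation_pct}
```

With the proposal example, `open_summary([1, 2, 4, 5, 5])` reports 5 raw fetches and 2 human opens, which is the "5 opens became 2 opens" case a rep would look up in the diagnostic view.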

Don't hide the tier system. Reps who can see the tier on each open develop intuition for what the badges mean ("Tier 2 open from a residential IP in Boston at 11am Tuesday" reads as a likely real read) and start trusting the filtered count. Reps who only see the post-filter number with no transparency tend to lose trust in the tracker when one of their hot leads "drops" an open after a tier reclassification.

What this changes about email tracking

The shift from "raw open count" to "tier-classified open count" is the structural answer to the post-MPP measurement problem. The old metric is broken because the underlying behavior changed: pixels fire for reasons that have nothing to do with human reads. The new metric is durable because it grades the request, not the count.

The framework above is what Outsolvi runs. Microsoft Sales Copilot ships a two-tier version of the same idea (verified vs preview)^[5]. Yesware is starting to expose tier badges in 2026. HubSpot's product team has discussed a tier model on their public roadmap. The category is converging on this approach because the underlying problem is real and the framework is the right shape for it.

If you want to see your own list's tier breakdown (what fraction of your reported opens are Tier 1 versus Tier 4), [try Outsolvi free for 14 days](https://my.outsolvi.com/signup). The diagnostic view is on by default during the trial. Most teams find their inflation rate is between 30 and 50 percent, which lines up with the broader B2B average documented in the [State of B2B Email Tracking](/blog/state-of-email-tracking-2026) piece.

The takeaway

Email tracking did not stop working in 2021. The metric did. Confidence scoring is what restores the per-prospect signal that raw counts used to carry. The five-tier framework is the operational shape that signal takes: IP reputation, User-Agent fingerprinting, timing, and device fingerprint, combined through hard overrides on the deterministic signals and weighted classification on the ambiguous ones.

Teams that adopt confidence scoring stop chasing phantom opens and start routing on the signal that actually predicts pipeline outcomes. Teams that keep optimizing on raw counts will keep wondering why their dashboards say "high engagement" while their reply rates stay flat.
