BPE Tokenization for Secret Detection: 98.6% Recall vs. 70.4% for Entropy
Secret detection has traditionally relied on two approaches: regular expression patterns that match known credential formats, and Shannon entropy analysis that flags high-randomness strings as potential secrets. The first approach is precise but limited to known formats. The second catches unknown formats but generates excessive false positives. At Netallion AI Assurance, we developed a third approach based on byte-pair encoding (BPE) tokenization that combines the precision of pattern matching with the coverage of statistical analysis, achieving 98.6% recall compared to 70.4% for entropy-only methods.
The Problem with Entropy Analysis
Shannon entropy measures the randomness of a string by calculating the distribution of characters. A string like "password123" has low entropy because the character distribution is predictable. A string like "aK7mN2pQ9rS4tV6wX8yZ0bD3eF5gH" has high entropy because the characters are well-distributed. Entropy analysis uses a threshold to classify high-entropy strings as potential secrets.
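The per-character Shannon entropy such a detector computes can be sketched in a few lines (a minimal illustration, not a production scanner):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, from the character frequency distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Predictable strings score low; well-distributed strings score high.
low = shannon_entropy("password123")
high = shannon_entropy("aK7mN2pQ9rS4tV6wX8yZ0bD3eF5gH")
```

An entropy-only detector then simply compares the score against a fixed threshold, which is where the false positive problem described below begins.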
The problem is that many legitimate strings also have high entropy. UUIDs ("550e8400-e29b-41d4-a716-446655440000"), content hashes, base64-encoded non-secret data, and even some natural language produce entropy values above typical thresholds. In our benchmarks, entropy-only analysis with a threshold tuned for reasonable recall flagged 15 to 25 percent of all strings in a typical codebase as potential secrets. This makes entropy analysis impractical as a primary detection mechanism without extensive suppression tuning.
Raising the entropy threshold reduces false positives but also reduces recall. Our benchmarks showed that constraining entropy-only analysis to a 5% false positive rate reduced overall recall to 70.4%. Nearly one in three real secrets was missed.
What BPE Tokenization Brings to the Table
Byte-pair encoding is a subword tokenization algorithm originally developed for text compression and later adopted by large language models for vocabulary construction. BPE iteratively merges the most frequent pairs of bytes (or characters) in a corpus to build a vocabulary of subword tokens. The key insight for secret detection is that the tokenization pattern of a string reveals its structural properties in ways that entropy alone cannot capture.
When BPE processes a real secret like an API key, the tokenization tends to produce a characteristic pattern: relatively few merges, many single-character tokens, and a token count close to the character count (a token-to-character ratio near one). This is because secrets are designed to be random and do not contain the repeated subword patterns that BPE learns from natural text or structured data.
When BPE processes a non-secret high-entropy string like a UUID or content hash, the tokenization is different. UUIDs have consistent structural patterns (groups of hexadecimal characters separated by hyphens) that BPE recognizes and merges efficiently. Base64-encoded data contains recurring character pairs that map to common byte patterns. These structural regularities produce higher merge rates and different token distribution profiles than true random secrets.
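This behavior can be demonstrated with a toy character-level BPE — a deliberately simplified stand-in for a trained production model, shown only to make the merge mechanics concrete:

```python
from collections import Counter

def merge_pair(tokens: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)

def train_bpe(corpus: list, num_merges: int) -> list:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    words = [tuple(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges

def tokenize(s: str, merges: list) -> tuple:
    """Apply learned merges in training order to a new string."""
    tokens = tuple(s)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens

def token_char_ratio(s: str, merges: list) -> float:
    return len(tokenize(s, merges)) / len(s)
```

A string made of patterns seen in training compresses to few tokens (low ratio), while a string of characters the vocabulary has never paired stays near one token per character — the structural signal that separates hashes and UUIDs from true random secrets.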
The Netallion AI Assurance Detection Pipeline
Netallion AI Assurance uses a three-stage detection pipeline that combines regex patterns, BPE tokenization analysis, and live verification.
In the first stage, 467 regular expression patterns scan for known secret formats. This catches the majority of secrets with specific, well-defined structures: AWS access keys, GitHub personal access tokens, Stripe API keys, and hundreds more. Regex matching is fast, precise, and produces minimal false positives for well-defined formats.
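At its core, stage one reduces to pattern lookup. A minimal sketch with two illustrative patterns — not Netallion's actual 467-rule set — shows the shape of the scanner:

```python
import re

# Illustrative patterns for two well-known credential formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

def scan_known_formats(text: str) -> list:
    """Return (pattern_name, matched_string) pairs for every known-format hit."""
    findings = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((name, m.group()))
    return findings
```

Strings that fall through this stage with no match become candidates for the BPE analysis described next.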
In the second stage, strings that do not match any known pattern are analyzed using BPE tokenization. The BPE model was trained on a corpus of confirmed secrets and confirmed non-secrets, learning the tokenization characteristics that distinguish the two categories. The model produces a confidence score for each candidate string. Strings above the threshold are classified as potential generic secrets.
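A stage-two scorer of this kind maps tokenization features to a confidence value. The sketch below uses hand-picked illustrative weights and a logistic squashing function — the feature choice and weights are assumptions for illustration, not Netallion's trained model:

```python
import math

def secret_confidence(tokens: list, text: str) -> float:
    """Hypothetical scorer: high token-to-character ratio and a high
    fraction of single-character tokens both push the score toward 1."""
    ratio = len(tokens) / max(len(text), 1)
    single_frac = sum(1 for t in tokens if len(t) == 1) / max(len(tokens), 1)
    z = 6.0 * ratio + 4.0 * single_frac - 6.0   # illustrative weights only
    return 1.0 / (1.0 + math.exp(-z))           # logistic -> [0, 1]
```

A string that tokenizes into one token per character scores near 1; a string the vocabulary compresses heavily scores near 0, and a threshold on this score yields the classification.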
In the third stage, high-confidence findings from both stages are validated against 20 live verifiers. The verifiers make API calls to the respective service providers (AWS STS, GitHub API, Stripe API, and others) to determine whether the detected credential is still active. Verified-active findings are escalated to critical severity. Verified-inactive findings are downgraded. Unverifiable findings retain their confidence-based severity.
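The triage logic of stage three can be sketched with the live verifier calls stubbed out as callables (real implementations would call AWS STS, the GitHub API, and so on; the structure here is an assumption for illustration):

```python
def triage(finding: dict, verifiers: dict) -> str:
    """Adjust a finding's severity based on live verification.

    `verifiers` maps a secret type to a callable returning True if the
    credential is still active. Types with no verifier keep their
    confidence-based severity.
    """
    check = verifiers.get(finding["type"])
    if check is None:
        return finding["severity"]          # unverifiable: unchanged
    return "critical" if check(finding["secret"]) else "low"
```

Separating the routing logic from the provider-specific API calls keeps each of the 20 verifiers independently testable.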
Benchmark Results
We evaluated the Netallion AI Assurance detection engine against a curated dataset of 10,000 strings including 2,500 confirmed secrets (both known-format and generic), 2,500 high-entropy non-secrets (UUIDs, hashes, base64), 2,500 medium-entropy code strings, and 2,500 low-entropy text strings. Each approach was evaluated on recall (percentage of real secrets detected) and precision (percentage of detections that are real secrets).
Approach        Recall    Precision
Entropy-only    70.4%     78.2%
Regex + BPE     98.6%     95.1%
The Regex + BPE approach achieved 98.6% recall with 95.1% precision, compared to 70.4% recall with 78.2% precision for entropy-only. The improvement was most dramatic for generic secrets, which are strings that do not match any known regex pattern but are confirmed credentials. BPE tokenization detected 94.2% of generic secrets, compared to 41.8% for entropy-only at the same false positive threshold.
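The metrics above follow the standard definitions; for readers reproducing this kind of benchmark, a small sketch computing them from labeled output:

```python
def precision_recall(predicted: list, actual: list) -> tuple:
    """Precision and recall from parallel boolean lists:
    predicted[i] = detector flagged string i; actual[i] = string i is a secret."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Recall answers "what fraction of real secrets did we catch?"; precision answers "what fraction of alerts were real?" — the two axes on which the approaches above are compared.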
Implications for Enterprise Security
The 28-point recall gap between BPE tokenization and entropy-only analysis translates directly to missed secrets in production environments. For an organization with 1,000 secrets exposed across its infrastructure, entropy-only analysis would miss approximately 296 of them. BPE tokenization would miss approximately 14. Those 282 additional detected secrets represent real credentials that an attacker could exploit.
The precision improvement is equally important for enterprise adoption. A tool that generates excessive false positives is a tool that security teams learn to ignore. Netallion AI Assurance's 95.1% precision means that 19 out of 20 findings are real secrets, making every alert actionable. Combined with live verification that brings the effective false positive rate for verified types to near zero, the detection engine maintains the trust of security teams handling hundreds of findings per week.
What Comes Next
We are continuing to invest in the BPE tokenization approach. Our current roadmap includes expanding the training corpus with additional secret types, tuning the model for specific data sources (log entries have different characteristics than code files), and developing domain-specific BPE vocabularies for cloud provider credential formats. We believe that statistical structural analysis will become the standard approach for generic secret detection over the next two to three years, complementing rather than replacing regex pattern matching for known formats.
If you want to see the BPE detection engine in action on your own data, Netallion AI Assurance offers a 14-day free trial with full Professional tier access. Connect your Azure Monitor workspaces or GitHub repositories and see the difference that 98.6% recall makes for your secret detection coverage.
Experience 98.6% recall on your own data
14-day free trial. No credit card required. All 467 patterns + BPE tokenization.
Start Free Trial