Methodology
This page explains exactly how the Ordica Prompt Analyzer works. Read it in full before trusting the numbers it returns. We assume you are an AI engineer who will verify our claims independently — and we want to make that verification easy.
1. Tokenization
Each model uses its real, official tokenizer wherever one exists; we do not approximate GPT-4o, Claude, or Gemini token counts. The one exception is Grok: xAI does not publish a tokenizer, so we use cl100k_base as a proxy (noted in the table below).
| Model | Tokenizer | Source | Where it runs |
|---|---|---|---|
| GPT-4o | tiktoken cl100k_base | Official OpenAI tokenizer | Local (this server) |
| Claude Sonnet 4.5 | Anthropic count_tokens | Official Anthropic API | Anthropic's servers |
| Grok-4 | cl100k_base (approximation) | OpenAI tokenizer used as proxy | Local |
| Gemini 2.5 Flash | countTokens REST API | Official Google API | Google's servers |
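The practical consequence of the "Where it runs" column is whether your prompt ever leaves this server during token counting. A minimal sketch of that routing decision, with an illustrative route table and function name (these are assumptions, not Ordica's real internals):

```python
# Sketch of the "Where it runs" column above. Route table and names are
# illustrative assumptions, not Ordica's actual internals.
LOCAL, REMOTE = "local", "remote"

TOKENIZER_ROUTES = {
    "gpt-4o":            ("tiktoken cl100k_base", LOCAL),
    "grok-4":            ("tiktoken cl100k_base (proxy)", LOCAL),
    "claude-sonnet-4.5": ("Anthropic count_tokens API", REMOTE),
    "gemini-2.5-flash":  ("Google countTokens API", REMOTE),
}

def prompt_leaves_server(model: str) -> bool:
    """True if counting tokens for this model sends the prompt off-box."""
    _, where = TOKENIZER_ROUTES[model]
    return where == REMOTE
```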
2. Section parsing
We try two paths to identify the structural sections of your prompt (system instructions, few-shot examples, RAG context, tool definitions, user queries, conversation history).
- Structured path: if your prompt parses as JSON and matches the OpenAI or Anthropic message-list format, we use the `role` field on each message to classify exactly. This is the most reliable path.
- Heuristic path: if your prompt is raw text, we use regex matching against well-known structural markers (`"System:"`, `"<<SYS>>"`, `"Document N:"`, `"Example:"`, etc.) to slice the text into sections.
Each analysis reports the parsing method used and a confidence score from 0 to 1. If your prompt has ambiguous structure, the confidence will be lower and the cohort match may fall back to the conservative "combined" cohort.
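The two paths above can be sketched in a few lines. The marker list and confidence scoring here are illustrative assumptions, not Ordica's production parser; only the structure (JSON message list first, regex fallback second, confidence attached to the result) comes from the description above:

```python
import json
import re

# Illustrative markers; the real analyzer's regex set is larger.
MARKERS = {
    "system_instructions": re.compile(r"^(System:|<<SYS>>)", re.M),
    "rag_context":         re.compile(r"^Document \d+:", re.M),
    "few_shot_examples":   re.compile(r"^Example:", re.M),
}

def parse_sections(prompt: str) -> tuple[dict, float]:
    """Return ({section_name: count}, confidence in [0, 1])."""
    # Structured path: OpenAI/Anthropic message lists classify by `role`.
    try:
        messages = json.loads(prompt)
        if isinstance(messages, list) and all(
            isinstance(m, dict) and "role" in m for m in messages
        ):
            sections: dict = {}
            for m in messages:
                sections[m["role"]] = sections.get(m["role"], 0) + 1
            return sections, 1.0  # exact classification
    except json.JSONDecodeError:
        pass
    # Heuristic path: regex against known markers, lower confidence.
    sections = {name: len(rx.findall(prompt)) for name, rx in MARKERS.items()}
    confidence = 0.8 if sum(sections.values()) else 0.3  # illustrative scoring
    return sections, confidence
```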
3. Cohort matching
Once we know your prompt's structure, we match it to a cohort in our blind-judged benchmark and return that cohort's percentile range (p25 / median / p75) as the savings estimate.
The benchmark
- Sample size: 200 unique prompts × 4 providers = 800 quality validations
- Categories: instruction-heavy, conversation history, mixed structure, RAG-heavy
- Judge: LLM-as-judge (GPT-4o), blind to which response was the compressed version
- Metrics: token savings percentage, response equivalence (1-5), per-response quality (1-5)
- Public data: view the full cohort table
Per-cohort statistics
| Cohort | Savings range | Quality (avg eq) | Sample |
|---|---|---|---|
| Instruction-heavy | 21.9% – 32.9% (median 28.0%) | 4.22 / 5 | 50 prompts × 4 |
| Conversation history | 31.3% – 39.7% (median 35.5%) | 3.53 / 5 | 50 prompts × 4 |
| Mixed structure | 7.0% – 10.3% (median 8.7%) | 4.16 / 5 | 50 prompts × 4 |
| RAG-heavy | data pending | data pending | benchmark in progress |
A caveat on the quality scores: a single judge (GPT-4o) scores every provider's responses, so per-provider quality differences could reflect judge bias rather than genuine quality loss; we are not certain which. Cross-judge validation (Claude judging the OpenAI-compressed sample, and vice versa) is on the roadmap. Until that data lands, treat per-provider quality differences with appropriate skepticism. The savings numbers are not affected by this caveat: they are deterministic per compressed prompt.
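What the analyzer returns is exactly a lookup into this table. The numbers below are copied from the per-cohort statistics above; the lookup function is an illustrative sketch, not Ordica's real code:

```python
# Percentile ranges copied from the per-cohort table above.
COHORT_SAVINGS_PCT = {
    "instruction_heavy":    {"p25": 21.9, "median": 28.0, "p75": 32.9},
    "conversation_history": {"p25": 31.3, "median": 35.5, "p75": 39.7},
    "mixed_structure":      {"p25": 7.0,  "median": 8.7,  "p75": 10.3},
    # RAG-heavy: data pending. The real analyzer falls back to a
    # conservative "combined" cohort whose numbers are not reproduced here.
}

def savings_estimate(cohort: str) -> dict:
    # A lookup, not a measurement of your specific prompt (see section 4).
    return COHORT_SAVINGS_PCT[cohort]
```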
4. The compression engine never runs in this flow
This is the most important architectural decision. The Ordica Prompt Analyzer never invokes our compression engine on your input. Every savings number you see comes from looking up your prompt's structure in the benchmark cohort table — not from compressing your prompt and measuring the result.
There are two reasons:
- IP protection. Our compression methodology stays behind the paid product. This page exists to demonstrate competence, not to give away the engine.
- Honesty about cohort estimates. A "we ran our engine on your specific prompt and it saved exactly 28.4%" claim would be misleading. What we can honestly say is "prompts that look like yours have historically saved 22-30% in our blind-judged benchmark."
5. Privacy contract
What we log
- Timestamp (UTC)
- Salted+hashed IP, first 16 hex chars (salt rotates daily)
- Models you requested
- Total token count
- Matched cohort name
- Whether the fallback was used
- Parser method and confidence score
- Request duration in milliseconds
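For the IP field specifically, here is a sketch of what "salted+hashed, first 16 hex chars, salt rotates daily" can look like. The exact salt scheme is an assumption; only the stated properties (hashing, truncation, daily rotation) come from the list above:

```python
import hashlib
from datetime import datetime, timezone

def logged_ip_field(ip: str, secret_salt: str) -> str:
    # Hypothetical scheme: folding the UTC date into the salt rotates it daily,
    # so hashes are not linkable across days.
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    digest = hashlib.sha256(f"{secret_salt}:{day}:{ip}".encode()).hexdigest()
    return digest[:16]  # first 16 hex chars only; the raw IP is never stored
```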
What we never log
- Your prompt content
- Section text
- Your identity, raw IP, cookies, session tokens
- Any field that could be used to reconstruct what you pasted
What leaves this server
- OpenAI / Grok tokenization: entirely local (tiktoken). Your prompt does not leave this server for these counts.
- Claude tokenization: sends your prompt to Anthropic's `count_tokens` API for tokenization only. Anthropic's data policy applies.
- Gemini tokenization: sends your prompt to Google's `countTokens` REST API for tokenization only. Google's data policy applies.
- Ordica's compression engine: never invoked in this flow. Stays behind the paid product.
Verify with the audit endpoint — it shows the policy and the last 20 (content-free) request entries.
6. Limitations
- RAG cohort data is incomplete; RAG-classified prompts use the combined cohort as a conservative proxy.
- Quality preservation varies by provider; lower-fidelity categories (conversation history) should be validated before production use.
- Estimates are directional. Real savings on a specific prompt depend on its exact structure and content.
- Tokenization for Grok uses cl100k_base as an approximation; xAI does not publish a tokenizer.
- Section parsing is heuristic and may misclassify prompts that use unusual or custom markers. The structure score reports parser confidence.
7. Performance and out-of-scope
Typical analyzer latency: 200-800 ms end-to-end. The local tokenizers (OpenAI, Grok) return in under 1 ms. The remote tokenizers (Claude, Gemini) typically respond in 150-700 ms depending on prompt size and round-trip time. Total cost to you: zero.
Out of scope: This tool does not measure KV cache impact or end-to-end inference latency. Provider-side prompt caching (Anthropic, OpenAI) interacts with compression in ways the analyzer cannot evaluate from a single request. If KV cache economics are central to your workload, validate compression in a Shadow Mode deployment before adopting.
8. Verify the claims
Don't trust this page. Verify it.
- The full cohort data: /api/analyze/cohorts
- The methodology metadata: /api/analyze/methodology
- The audit log policy and recent entries: /api/analyze/audit
- OpenAI tokenizer: `pip install tiktoken`, then `tiktoken.get_encoding("cl100k_base").encode("…")`
- Anthropic count_tokens: official docs
- Gemini countTokens: official docs
Pull the cohort data yourself
```sh
curl -s https://ordica.ai/api/analyze/cohorts | jq .

# Or grab just one cohort's stats:
curl -s https://ordica.ai/api/analyze/cohorts \
  | jq '.cohorts.instruction_compression.savings_pct'
```
The cohort table is the source of truth. Everything the analyzer returns is derived from it. If you find a number on this site that doesn't match the cohort table, file a bug.
The benchmark math
"200 prompts × 4 providers" expands to: 200 unique prompts get compressed once, producing 200 deterministic savings measurements (the compressed prompt is the same regardless of which provider receives it). Each compressed prompt is then sent to all 4 providers and judged blindly against the original, producing 800 quality validations. So the math is 200 savings measurements + 800 quality measurements = 1,000 total data points.
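The expansion above can be checked in a few lines:

```python
unique_prompts = 200
providers = 4

savings_measurements = unique_prompts             # compression is deterministic per prompt
quality_validations = unique_prompts * providers  # each compressed prompt judged on all 4
total_data_points = savings_measurements + quality_validations
print(total_data_points)  # → 1000
```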