Methodology

This page explains exactly how the Ordica Prompt Analyzer works. Read it in full before trusting the numbers it returns. We assume you are an AI engineer who will verify our claims independently — and we want to make that verification easy.

1. Tokenization

Wherever one is available, each model uses its real, official tokenizer. We do not approximate GPT-4o, Claude, or Gemini token counts; Grok is the one documented exception, covered below.

Model             | Tokenizer                   | Source                     | Where it runs
GPT-4o            | tiktoken cl100k_base        | Official OpenAI tokenizer  | Local (this server)
Claude Sonnet 4.5 | Anthropic count_tokens      | Official Anthropic API     | Anthropic's servers
Grok-4            | cl100k_base (approximation) | OpenAI tokenizer as proxy  | Local
Gemini 2.5 Flash  | countTokens REST API        | Official Google API        | Google's servers
Grok approximation honesty. xAI does not publish a tokenizer or expose a count_tokens endpoint. We use cl100k_base as a proxy because Grok models are GPT-architecture-derived. Real Grok counts typically differ from our reported number by 1-3%. If exact accuracy matters for your Grok workload, validate with a live API call.

2. Section parsing

We try two paths to identify the structural sections of your prompt (system instructions, few-shot examples, RAG context, tool definitions, user queries, conversation history).

Each analysis reports the parsing method used and a confidence score from 0 to 1. If your prompt has ambiguous structure, the confidence will be lower and the cohort match may fall back to the conservative "combined" cohort.
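A minimal sketch of that fallback behavior, assuming a fixed confidence threshold. The 0.6 floor and the cohort names here are illustrative, not the analyzer's actual values:

```python
CONFIDENCE_FLOOR = 0.6  # assumed cutoff; the real threshold is not published

def select_cohort(detected: str, confidence: float) -> str:
    """Fall back to the conservative 'combined' cohort when structure is ambiguous."""
    if confidence < CONFIDENCE_FLOOR:
        return "combined"
    return detected
```

So a prompt classified as instruction-heavy at confidence 0.35 would be matched against the "combined" cohort instead.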

3. Cohort matching

Once we know your prompt's structure, we match it to a cohort in our blind-judged benchmark and return that cohort's percentile range (p25 / median / p75) as the savings estimate.
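Mechanically, the estimate is a static table lookup scaled by your prompt's token count. This sketch uses the published numbers from the per-cohort table below; the dict layout and function are illustrative, not our implementation:

```python
# Percentile savings rates copied from the per-cohort benchmark table; structure assumed.
COHORT_STATS = {
    "instruction_heavy":    {"p25": 21.9, "median": 28.0, "p75": 32.9},
    "conversation_history": {"p25": 31.3, "median": 35.5, "p75": 39.7},
    "mixed":                {"p25": 7.0,  "median": 8.7,  "p75": 10.3},
}

def estimate_saved_tokens(prompt_tokens: int, cohort: str) -> dict:
    """Scale the cohort's percentile savings rates by the prompt's token count."""
    return {k: round(prompt_tokens * pct / 100)
            for k, pct in COHORT_STATS[cohort].items()}
```

For a 10,000-token instruction-heavy prompt this yields roughly 2,190 / 2,800 / 3,290 tokens saved at p25 / median / p75.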

The benchmark

Per-cohort statistics

Cohort               | Savings range                | Quality (avg eq) | Sample
Instruction-heavy    | 21.9% – 32.9% (median 28.0%) | 4.22 / 5         | 50 prompts × 4
Conversation history | 31.3% – 39.7% (median 35.5%) | 3.53 / 5         | 50 prompts × 4
Mixed structure      | 7.0% – 10.3% (median 8.7%)   | 4.16 / 5         | 50 prompts × 4
RAG-heavy            | data pending                 | data pending     | benchmark in progress
RAG cohort honesty. Our RAG benchmark batch is being expanded. Until that data lands, RAG-classified prompts fall back to the "combined" cohort as a conservative proxy. Real RAG savings are typically higher than what we show. We will update this page when the new benchmark batch ships.
Judge bias honesty. Our current judge is GPT-4o evaluating responses from all four providers. The OpenAI cohort scores higher on equivalence (mean 4.78/5) than other providers — this could reflect real quality differences, OR it could reflect a same-family judging bias where GPT-4o favors GPT-4o-shaped output.

We are not certain which. Cross-judge validation (Claude judging the OpenAI-compressed sample, and vice versa) is on the roadmap. Until that data lands, treat per-provider quality differences with appropriate skepticism. The savings numbers are not affected by this — they are deterministic per compressed prompt.

4. The compression engine never runs in this flow

This is the most important architectural decision. The Ordica Prompt Analyzer never invokes our compression engine on your input. Every savings number you see comes from looking up your prompt's structure in the benchmark cohort table — not from compressing your prompt and measuring the result.

There are two reasons:

5. Privacy contract

What we log

What we never log

What leaves this server

Verify with the audit endpoint — it shows the policy and the last 20 (content-free) request entries.

6. Limitations

7. Performance and out-of-scope

Typical analyzer latency: 200-800 ms end-to-end. The local tokenizers (OpenAI, Grok) return in under 1 ms. The remote tokenizers (Claude, Gemini) typically respond in 150-700 ms depending on prompt size and round-trip time. Total cost to you: zero.

Out of scope: This tool does not measure KV cache impact or end-to-end inference latency. Provider-side prompt caching (Anthropic, OpenAI) interacts with compression in ways the analyzer cannot evaluate from a single request. If KV cache economics are central to your workload, validate compression in a Shadow Mode deployment before adopting.

8. Verify the claims

Don't trust this page. Verify it.

Pull the cohort data yourself

curl -s https://ordica.ai/api/analyze/cohorts | jq .

# Or grab just one cohort's stats:
curl -s https://ordica.ai/api/analyze/cohorts \
  | jq '.cohorts.instruction_compression.savings_pct'

The cohort table is the source of truth. Everything the analyzer returns is derived from it. If you find a number on this site that doesn't match the cohort table, file a bug.

The benchmark math

"200 prompts × 4 providers" expands to: 200 unique prompts get compressed once, producing 200 deterministic savings measurements (the compressed prompt is the same regardless of which provider receives it). Each compressed prompt is then sent to all 4 providers and judged blindly against the original, producing 800 quality validations. So the math is 200 savings measurements + 800 quality measurements = 1,000 total data points.
