Methodology

This page explains exactly how the Ordica Prompt Analyzer works. Read it in full before trusting the numbers it returns. We assume you are an AI engineer who will verify our claims independently — and we want to make that verification easy.

1. Tokenization

Wherever one is available, each model uses its real, official tokenizer. We do not approximate GPT-4o, Claude, or Gemini token counts; Grok is the one documented exception, covered below.

Model             | Tokenizer                   | Source                     | Where it runs
GPT-4o            | tiktoken cl100k_base        | Official OpenAI tokenizer  | Local (this server)
Claude Sonnet 4.5 | Anthropic count_tokens      | Official Anthropic API     | Anthropic's servers
Grok-4            | cl100k_base (approximation) | OpenAI tokenizer as proxy  | Local
Gemini 2.5 Flash  | countTokens REST API        | Official Google API        | Google's servers
Grok approximation honesty. xAI does not publish a tokenizer or expose a count_tokens endpoint. We use cl100k_base as a proxy because Grok models are GPT-architecture-derived. Real Grok counts typically differ from our reported number by 1-3%. If exact accuracy matters for your Grok workload, validate with a live API call.

2. Section parsing

We try two paths to identify the structural sections of your prompt (system instructions, few-shot examples, RAG context, tool definitions, user queries, conversation history).

Each analysis reports the parsing method used and a confidence score from 0 to 1. If your prompt has ambiguous structure, the confidence will be lower and the cohort match may fall back to the conservative "combined" cohort.
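A minimal sketch of that fallback behavior, assuming a fixed confidence threshold. The 0.6 floor and the cohort names here are illustrative, not the analyzer's actual values:

```python
CONFIDENCE_FLOOR = 0.6  # assumed cutoff; the real threshold is not published

def select_cohort(detected: str, confidence: float) -> str:
    """Fall back to the conservative 'combined' cohort when structure is ambiguous."""
    if confidence < CONFIDENCE_FLOOR:
        return "combined"
    return detected
```

So a prompt classified as instruction-heavy at confidence 0.35 would be matched against the "combined" cohort instead.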

3. Cohort matching

Once we know your prompt's structure, we match it to a cohort in our blind-judged benchmark and return that cohort's percentile range (p25 / median / p75) as the savings estimate.
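Mechanically, the estimate is a static table lookup scaled by your prompt's token count. This sketch uses the published numbers from the per-cohort table below; the dict layout and function are illustrative, not our implementation:

```python
# Percentile savings rates copied from the per-cohort benchmark table; structure assumed.
COHORT_STATS = {
    "instruction_heavy":    {"p25": 21.9, "median": 28.0, "p75": 32.9},
    "conversation_history": {"p25": 31.3, "median": 35.5, "p75": 39.7},
    "mixed":                {"p25": 7.0,  "median": 8.7,  "p75": 10.3},
}

def estimate_saved_tokens(prompt_tokens: int, cohort: str) -> dict:
    """Scale the cohort's percentile savings rates by the prompt's token count."""
    return {k: round(prompt_tokens * pct / 100)
            for k, pct in COHORT_STATS[cohort].items()}
```

For a 10,000-token instruction-heavy prompt this yields roughly 2,190 / 2,800 / 3,290 tokens saved at p25 / median / p75.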

The benchmark

Per-cohort statistics

Cohort               | Savings range                | Quality (avg eq) | Sample
Instruction-heavy    | 21.9% – 32.9% (median 28.0%) | 4.22 / 5         | 50 prompts × 4
Conversation history | 31.3% – 39.7% (median 35.5%) | 3.53 / 5         | 50 prompts × 4
Mixed structure      | 7.0% – 10.3% (median 8.7%)   | 4.16 / 5         | 50 prompts × 4
RAG-heavy            | data pending                 | data pending     | benchmark in progress
RAG cohort honesty. Our RAG benchmark batch is being expanded. Until that data lands, RAG-classified prompts fall back to the "combined" cohort as a conservative proxy. Real RAG savings are typically higher than what we show. We will update this page when the new benchmark batch ships.
Judge bias honesty. Our current judge is GPT-4o evaluating responses from all four providers. The OpenAI cohort scores higher on equivalence (mean 4.78/5) than other providers — this could reflect real quality differences, OR it could reflect a same-family judging bias where GPT-4o favors GPT-4o-shaped output.

We are not certain which. Cross-judge validation (Claude judging the OpenAI-compressed sample, and vice versa) is on the roadmap. Until that data lands, treat per-provider quality differences with appropriate skepticism. The savings numbers are not affected by this — they are deterministic per compressed prompt.

4. The compression engine never runs in this flow

This is the most important architectural decision. The Ordica Prompt Analyzer never invokes our compression engine on your input. Every savings number you see comes from looking up your prompt's structure in the benchmark cohort table — not from compressing your prompt and measuring the result.

There are two reasons:

5. Privacy contract

What we log

What we never log

What leaves this server

Verify with the audit endpoint — it shows the policy and the last 20 (content-free) request entries.

6. Limitations

7. Performance and out-of-scope

Typical analyzer latency: 200-800 ms end-to-end. The local tokenizers (OpenAI, Grok) return in under 1 ms. The remote tokenizers (Claude, Gemini) typically respond in 150-700 ms depending on prompt size and round-trip time. Total cost to you: zero.

Out of scope: This tool does not measure KV cache impact or end-to-end inference latency. Provider-side prompt caching (Anthropic, OpenAI) interacts with compression in ways the analyzer cannot evaluate from a single request. If KV cache economics are central to your workload, validate compression in a Shadow Mode deployment before adopting.

8. Verify the claims

Don't trust this page. Verify it.

Pull the cohort data yourself

curl -s https://ordica.ai/api/analyze/cohorts | jq .

# Or grab just one cohort's stats:
curl -s https://ordica.ai/api/analyze/cohorts \
  | jq '.cohorts.instruction_compression.savings_pct'

The cohort table is the source of truth. Everything the analyzer returns is derived from it. If you find a number on this site that doesn't match the cohort table, file a bug.

The benchmark math

"200 prompts × 4 providers" expands to: 200 unique prompts get compressed once, producing 200 deterministic savings measurements (the compressed prompt is the same regardless of which provider receives it). Each compressed prompt is then sent to all 4 providers and judged blindly against the original, producing 800 quality validations. So the math is 200 savings measurements + 800 quality measurements = 1,000 total data points.
