Conversation history: median 35.5%, max 39.7%. Long instructions: median 28.0%, max 32.9%. RAG retrieval: median 49.8%, max 55.3%. Mixed production traffic: 7–10%. Blind-tested across 200 prompts × 4 providers — ranges published by workload type, not averaged into a single marketing number.
median 35.5%, max 39.7%
Multi-turn chat histories carry substantial redundancy. Once a conversation grows past 2K tokens, compression gains climb to among the highest of any workload class we measure.
median 28.0%, max 32.9%
Long system prompts, detailed role definitions, and few-shot exemplars compress well. Typical result for production AI applications with heavy prompt scaffolding.
7–10%
When your prompts are already dense — short JSON payloads, minimal instructions, tight function-calling — savings are modest. We publish this. Your worst case is still a positive number.
median 49.8%, max 55.3%
⚠ Savings only. Blind quality judging for the RAG cohort is in progress — equivalence scores pending.
Retrieved document blocks show the highest savings of any workload class we measure. Savings validated on 50 prompts × 3 non-Gemini providers.
Token savings measured across 150 unique prompts (one deterministic savings number per prompt) plus 600 blind quality judgments across 3 completed cohorts; the in-progress RAG cohort will bring the totals to 200 / 800 / 1,000. Response quality was blind-judged on a 1–5 scale by an independent AI judge; mean equivalence is 4.38 on OpenAI, 3.96 on Anthropic, 3.78 on Gemini, and 3.76 on Grok. Full methodology →
Set one environment variable
OPENAI_BASE_URL or ANTHROPIC_BASE_URL
= https://api.ordica.ai
Keep your existing provider API key
We forward it to your chosen provider
and don't retain it after the request
Compressed when safe, passed through when not
Quality tracked against blind-judged equivalence scores; fewer tokens whenever we can compress safely
xAI users: the OpenAI SDK already works — same env var, your Grok API key. Gemini users: pass http_options={"base_url":"https://api.ordica.ai/gemini"} to the google-genai client, or contact us for deeper integration support.
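The setup above can be sketched in a few lines of Python. The SDK calls are shown as comments because they require the provider packages and a live API key; the environment variable names and the Gemini base path are the ones documented on this page.

```python
import os

# Point your existing SDK at the Ordica proxy. In production, set these
# in your shell or deployment config rather than in code.
os.environ["OPENAI_BASE_URL"] = "https://api.ordica.ai"
# or, for the Anthropic SDK:
os.environ["ANTHROPIC_BASE_URL"] = "https://api.ordica.ai"

# Your provider key stays exactly as it is today, e.g.:
# os.environ["OPENAI_API_KEY"] = "sk-..."

# The official SDKs pick the base URL up automatically:
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY

# Gemini users pass the base URL explicitly to google-genai:
# from google import genai
# client = genai.Client(
#     http_options={"base_url": "https://api.ordica.ai/gemini"}
# )
```

xAI users need no extra step: the OpenAI SDK lines above work unchanged with a Grok API key.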
Your prompts transit the Ordica proxy in-memory, get compressed, and are forwarded to the AI provider you chose. We do not store them, log them, read them, or train on them. The only data retained is anonymous counts: tokens sent, tokens saved, and which provider you used.
Billing runs on those counts alone. Prompt content is not persisted beyond the compression step — our audit logs contain metadata only: token counts, timestamps, billing meters. Retention and sub-processor details are enumerated in our Data Processing Agreement.
Optimized for each provider. Switch between them freely.
Every message is a real-world test. A separate judge model scores compressed vs. original responses without knowing which is which.
⚠ A single judge model (Claude Sonnet 4.5) scored all four target families. Same-family bias could inflate one family's scores relative to the others. Read the per-provider gaps with that caveat in mind — see methodology: judge bias honesty. Cross-judge validation is on the roadmap.
Blind-tested across 200 prompts × 4 providers: a separate AI judged compressed vs. original responses without knowing which was which. Equivalence is scored on a 1–5 scale, where 5 means indistinguishable and 4 means "substantively equivalent — a customer would accept either response." A mean above 3.5 on this scale indicates the compressed response reliably passes blind substitution. Scores cluster in the 3.5–4.6 range across providers because compression is probabilistic, not lossless; we publish the real distribution rather than round up. Quality judging now covers all four cohorts (instruction, history, mixed, RAG): the RAG cohort shows a cross-provider mean of 4.35 / 5.0 across 175 validations to date (Gemini ran at n=25 due to an API response parser fix tracked in our engineering backlog). Savings are consistent and repeatable for the same prompt under equivalent configuration. Grok token counts in our benchmark runs used a documented approximation — see methodology for per-model details.
All tests run on current production models. No cherry-picking. Full methodology and the per-prompt judge output: analyze/methodology.html.
Savings are consistent and repeatable for the same prompt under equivalent configuration. Grok token counts in our benchmark runs used a documented approximation — see methodology for per-model details. Figures below use the benchmark median of 33.8% across four cohorts (RAG-inclusive; blind quality judging for the RAG cohort is still in progress). The three-cohort median, which excludes RAG, is a more conservative reference, and the calculator above defaults to the single-cohort Conservative figure of 8.7% for a lower-bound estimate; toggle the Blended view to apply the 33.8% median to your spend. Actual savings depend on workload mix: RAG-heavy prompts save more, mixed-structure prompts save less.
RAG cohort quality judging in progress — the median includes RAG savings, but blind equivalence scoring for that cohort is still pending.
| Monthly API spend | Conservative annual savings (p25, 14.6%) | Median annual savings (33.8%) | Optimistic annual savings (p75, 47.5%) |
|---|---|---|---|
| $1,000 /mo | $1,752 | $4,056 | $5,700 |
| $5,000 /mo | $8,760 | $20,280 | $28,500 |
| $25,000 /mo | $43,800 | $101,400 | $142,500 |
| $100,000 /mo | $175,200 | $405,600 | $570,000 |
| Savings rate | 14.6% (blended p25) | 33.8% (median) | 47.5% (blended p75) |
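Each cell in the table is an annual savings figure: monthly spend × savings rate × 12 months. A minimal sketch that reproduces the first row:

```python
def annual_savings(monthly_spend: float, rate: float) -> float:
    """Annual dollar savings: monthly API spend times savings rate times 12 months."""
    return monthly_spend * rate * 12

# Reproducing the $1,000/mo row of the table:
conservative = annual_savings(1_000, 0.146)  # rounds to 1752, the $1,752 cell
median       = annual_savings(1_000, 0.338)  # rounds to 4056, the $4,056 cell
optimistic   = annual_savings(1_000, 0.475)  # rounds to 5700, the $5,700 cell
```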
Results across 150 unique prompts (one deterministic savings number per prompt, 4 providers verifying each) with 600 blind-judged quality validations across 3 completed cohorts. RAG cohort in progress will bring totals to 200 prompts / 800 quality judgments.
Conversation history: 31.3–39.7% (median 35.5%). Instruction-heavy workloads: 21.9–32.9% (median 28.0%). RAG retrieval context: 42.8–55.3% (median 49.8%) — savings validated, blind quality judging in progress. Dense structured traffic: 7.0–10.3% (median 8.7%). Savings depend on your workload mix and are consistent and repeatable for the same prompt under equivalent configuration. For Grok, our benchmark runs used a documented approximation — see methodology for per-model details. Quality preservation varies by provider.
⚠ RAG quality disclosure: The RAG cohort savings figures above are validated on 50 prompts × 3 non-Gemini providers. Blind-judged quality equivalence scoring for the RAG cohort is still in progress. Quality parity with the other three cohorts has not yet been independently confirmed.
Your billing dashboard shows how many tokens you sent, how many tokens we compressed them to, and what that saved you in dollars. Your bill is 30% of that dollar savings on Pro, 20% on Enterprise, zero on Free. No compression, no fee.
No. Your messages pass through the proxy to reach the AI provider. Our audit and billing logs contain metadata only — token counts, timestamps, provider, and billing meters. Prompt and response content is not persisted beyond the compression step. Retention windows, sub-processors, and data-subject rights are enumerated in our Data Processing Agreement.
No. Ordica reduces token count without changing what your prompt asks the AI to do. The compressed version produces equivalent responses — validated blind across 200 prompts × 4 providers with a separate AI judging equivalence on a 1–5 scale. Three-cohort means: 4.38 on OpenAI, 3.96 on Claude, 3.78 on Gemini, 3.76 on Grok. The RAG cohort cross-provider mean is 4.35 / 5.0 across 175 validations to date (OpenAI 4.22, Claude 4.22, Gemini 4.40 at n=25, Grok 4.60). If the engine cannot compress safely, your prompt goes through untouched — you never get a worse response. Per-provider quality scores are published on the methodology page.
The compression adds a few milliseconds — you won't notice it. Whether you're using GPT-4o, Claude, Gemini, or Grok, the provider's response time is what you feel, and that's unchanged. In some cases, shorter prompts actually get faster responses because the AI has less to process.
Ordica's compression is deterministic on our tested model corpus — the same input produces byte-identical output across repeated runs under equivalent configuration, so provider-side cache keys stay stable. In our verified test harness, a compressed prompt produced a full read-hit on the second call (the second call's Anthropic cache_read_input_tokens matched the first call's cache_creation_input_tokens), confirming no interference on that path. Results on other prompt shapes are expected to hold but are measured per workload rather than universally asserted. If you're already using provider-side caching, you keep your cache hits.
For customers whose prompts already include cache_control markers: Ordica leaves those markers untouched. Your existing caching setup works exactly as before.
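For illustration, here is the shape of an Anthropic Messages request carrying a cache_control marker (the model id and text are placeholders, not values from this page); as described above, the marker is forwarded to the provider unchanged.

```python
# Illustrative Anthropic Messages API payload (shape only, never sent).
# The cache_control annotation is Anthropic's standard prompt-caching marker;
# Ordica passes it through untouched.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for ...",  # long, cacheable scaffold
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```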
The system is fail-safe by design. If compression cannot be applied confidently, your prompt passes through untouched — you just don't save tokens on that message. Under normal operation the compressed prompt reaches the model and responses should match the original within the equivalence bounds we publish; if the engine can't compress safely, you get the uncompressed path rather than a degraded one. Worst case is zero savings, not worse quality.
Billing runs on token counts, not estimates. For every request, we record the token count of the prompt you sent and the token count of the prompt we forwarded to the provider — both measured against the provider's tokenizer reference. The difference is your savings; your bill is a fixed percentage of the dollar value of that difference. Your Stripe receipt shows the metered total for the billing period, and the savings analytics in your account show the per-request counts that produced it. Anything we cannot measure, we do not charge for: if compression is skipped on a request (pass-through), that request contributes zero to your bill. Enterprise customers can request a full billing audit export covering any billing period.
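The per-request accounting described above can be sketched as follows. The function name and the $2.50-per-million input price are illustrative assumptions for the sketch, not Ordica's actual schema or any provider's current pricing.

```python
def request_savings(tokens_sent: int, tokens_forwarded: int,
                    usd_per_million_input_tokens: float) -> float:
    """Dollar savings for one request: the token delta priced at the
    provider's input rate. A pass-through request has no delta and
    contributes zero."""
    saved_tokens = max(tokens_sent - tokens_forwarded, 0)
    return saved_tokens * usd_per_million_input_tokens / 1_000_000

# A 10,000-token prompt compressed to 6,620 tokens at a hypothetical $2.50/M
# saves 3,380 tokens, i.e. a bit under a cent on this single request:
one_request = request_savings(10_000, 6_620, 2.50)

# Compression skipped (pass-through): zero savings, zero billed.
skipped = request_savings(500, 500, 2.50)  # 0.0
```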
GPT-4o is a strong all-rounder — great for general questions, writing, and brainstorming. Claude is known for natural writing and careful, thoughtful responses. Gemini has a large context window and strong reasoning on structured problems. Grok is fast and conversational with less filtering. Try each and see which fits your workload.
Yes. The Free tier gives you the drop-in proxy endpoint — point your existing OpenAI or Anthropic SDK at Ordica with one environment variable — 10,000 requests per month, and a savings dashboard. No credit card required. You get the same compression engine as paid tiers. When you're ready for higher limits and the full optimization profile, Pro and Enterprise are there.
We charge a percentage of your measured savings — 30% on Pro, 20% on Enterprise. If compression saves you $100, you keep $70 (Pro) or $80 (Enterprise). If it saves you nothing, you pay nothing. Your dashboard shows every dollar in real time. No flat fees, no minimums, no surprises.
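The revenue-share math above, as a minimal sketch (function name and tier labels are illustrative, not an Ordica API):

```python
def ordica_fee(dollar_savings: float, tier: str) -> float:
    """Fee is a fixed share of measured dollar savings; no savings, no fee."""
    rates = {"free": 0.0, "pro": 0.30, "enterprise": 0.20}
    return dollar_savings * rates[tier]

saved = 100.00
fee = round(ordica_fee(saved, "pro"), 2)  # 30.0 on Pro
kept = round(saved - fee, 2)              # 70.0 stays with you
```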
Your provider API key transits the Ordica proxy in-memory. We use it to authenticate with the provider you chose — OpenAI, Anthropic, xAI, or Google — on your behalf for that single request, then discard it when the request completes. We do not retain it, log it, store it in a database, or share it. The key is used only to reach the provider endpoint you designated; it never leaves that path.
Honest limitation: while your request is in flight, the key is held in process memory on our proxy layer — that's unavoidable for any middleware that forwards authenticated traffic. Ordica is not a secrets vault or KMS and does not claim to be. You hold the key in your code or secret manager the same way you do today, and you can rotate it anytime through your provider's dashboard if you ever want to cut access.
Routing your own API key through a middleware layer is a standard pattern — every major provider (OpenAI, Anthropic, xAI, Google) builds an API ecosystem around SDK wrappers, gateways, observability proxies, and caching layers. Ordica is structured to fit cleanly within that allowed pattern: you remain the authorized account holder, your requests go only to providers you designate, and we do not resell, pool, rent, or share API access across customers. We also do not retain prompt or response content after each request completes.
On that basis Ordica keeps you on the permitted side of each provider's current Terms. What providers actually prohibit (multi-tenant key pooling, API-key resale, unauthorized account sharing) is not what we do. Provider Terms vary by vendor and evolve over time — review your current provider's Terms to confirm compatibility with your specific workload. For enterprise documentation — DPAs, subprocessor list, data-flow diagrams — contact us.
We're onboarding customers in small batches. Request access and we'll get you set up.