
Reduce LLM costs by up to 50%

Edgee’s token compression runs at the edge before every request reaches LLM providers, automatically reducing prompt size by up to 50% while preserving semantic meaning and output quality. This is particularly effective for:
  • RAG pipelines with large document contexts
  • Long conversation histories in multi-turn agents
  • Verbose system instructions and formatting
  • Document analysis and summarization tasks

How It Works

Token compression happens automatically on every request through a four-step process:
  1. Semantic Analysis: Analyze the prompt structure to identify redundant context, verbose formatting, and compressible sections without losing critical information.
  2. Context Optimization: Compress repeated context (common in RAG), condense verbose formatting, and remove unnecessary whitespace while maintaining semantic relationships.
  3. Instruction Preservation: Preserve critical instructions, few-shot examples, and task-specific requirements. System prompts and user intent remain intact.
  4. Quality Verification: Verify that the compressed prompt maintains semantic equivalence to the original. If quality checks fail, the original prompt is used.
Compression is most effective for prompts with repeated context (RAG), long system instructions, or verbose multi-turn histories. Simple queries may see minimal compression.
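The four-step flow can be sketched in TypeScript. This is an illustrative toy, not Edgee's implementation: the compression and similarity functions below are simple stand-ins for the real semantic analysis and BERT scoring.

```typescript
type CompressionResult = { prompt: string; usedCompressed: boolean };

// Toy stand-in for steps 1–3: collapse redundant whitespace.
// The real compressor does semantic analysis, not just whitespace removal.
function toyCompress(prompt: string): string {
  return prompt.replace(/\s+/g, " ").trim();
}

// Toy stand-in for BERT scoring (step 4): word-overlap similarity in [0, 1].
function toySimilarity(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/));
  const wb = new Set(b.toLowerCase().split(/\s+/));
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;
  return shared / Math.max(wa.size, wb.size);
}

// Compress, verify similarity, and fall back to the original on failure.
function compressWithFallback(
  prompt: string,
  compressFn: (p: string) => string,
  threshold: number,
): CompressionResult {
  const compressed = compressFn(prompt); // steps 1–3
  const score = toySimilarity(prompt, compressed); // step 4
  return score >= threshold
    ? { prompt: compressed, usedCompressed: true }
    : { prompt, usedCompressed: false }; // quality check failed: use original
}
```

The key property is the last branch: a compressed prompt is only used when it passes the similarity check; otherwise the original is sent unchanged.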

Understanding compression ratio

The compression ratio (sometimes called compression rate in APIs) is compressed size ÷ original size: how large the compressed prompt is relative to the original.
  • 0.9 (Light) = compressed prompt is 90% of the original length → ~10% fewer tokens
  • 0.7 (Strong) = compressed prompt is 70% of the original → ~30% fewer tokens (more aggressive)
In the console you choose Light (0.9), Medium (0.8), or Strong (0.7). The compressor aims for that ratio; the actual ratio per request may vary. Strong (0.7) asks for more compression; Light (0.9) is more conservative and keeps more of the original text.
Ratio vs reduction: Ratio = compressed/original (e.g. 0.75). Reduction = 1 − ratio (e.g. 25%). When we say “50% reduction,” that corresponds to a ratio of 0.50.
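The two quantities convert trivially; a pair of small helpers makes the relationship explicit:

```typescript
// reduction (%) from ratio (compressed / original).
function reductionFromRatio(ratio: number): number {
  return Math.round((1 - ratio) * 100); // e.g. 0.75 -> 25 (% fewer tokens)
}

// ratio from reduction (%).
function ratioFromReduction(reductionPercent: number): number {
  return 1 - reductionPercent / 100; // e.g. 50 -> 0.50
}
```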

Semantic preservation and BERT score

To avoid changing the meaning of the prompt, we compare the compressed text to the original using BERT score (F1). It measures how semantically similar the two texts are on a scale of 0–1 (0%–100%).
  • Semantic preservation threshold (0–100%) is the minimum similarity we require. If the BERT score is below this threshold, we do not use the compressed prompt—we send the original instead, so quality is preserved.
  • In the console you choose Off (no check), Ultra Safe (0.95), Safe (0.85), or Edgy (0.75). Off = we always use the compressed prompt when compression runs; higher values = we only use the compressed prompt when it is very similar to the original; otherwise we fall back to the original.
This way you can allow aggressive compression (low ratio) while still guaranteeing that we never send a compressed prompt that is too different from what the user wrote.
In the Activity table, when we fell back to the original prompt because the similarity was below the threshold, the input token count is shown in red with a tooltip: “Didn’t match the semantic threshold – original prompt was used.”
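The console presets map to the thresholds listed above. This sketch shows the resulting decision, assuming a BERT score has already been computed for the request:

```typescript
// Preset names and thresholds as shown in the console; "off" disables the check.
const presets = { off: null, ultraSafe: 0.95, safe: 0.85, edgy: 0.75 } as const;

// True when the compressed prompt should be used, false when we fall back.
function shouldUseCompressed(bertScore: number, preset: keyof typeof presets): boolean {
  const threshold = presets[preset];
  return threshold === null || bertScore >= threshold;
}
```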

Enabling Token Compression

Token compression can be enabled in three ways, giving you flexibility to control compression at the request, API key, or organization level:

1. Per Request (SDK)

Enable compression for specific requests using the SDK:
const response = await edgee.send({
  model: 'gpt-4o',
  input: {
    messages: [
      { role: 'user', content: 'Your prompt here' }
    ],
    enable_compression: true,
    compression_rate: 0.8 // Target ratio: compressed = 80% of original (optional)
  }
});

2. Per API Key (Console)

Enable compression for specific API keys in your organization settings. This is useful when you want different compression settings for different applications or environments.
In the Edge Models section of your console:
  1. Toggle Enable token compression on
  2. Set Compression to Light (0.9), Medium (0.8), or Strong (0.7) — see Understanding compression ratio
  3. Set Semantic preservation threshold to Off, Ultra Safe (0.95), Safe (0.85), or Edgy (0.75) — see Semantic preservation and BERT score
  4. Under Scope, select Apply to specific API keys
  5. Choose which API keys should use compression

3. Organization-Wide (Console)

Enable compression for all requests across your entire organization. This is the recommended setting for most users to maximize savings automatically.
In the Edge Models section of your console:
  1. Toggle Enable token compression on
  2. Set Compression to Light (0.9), Medium (0.8), or Strong (0.7)
  3. Set Semantic preservation threshold to Off, Ultra Safe (0.95), Safe (0.85), or Edgy (0.75)
  4. Under Scope, select Apply to all org requests
  5. All API keys will now use compression by default
Compression controls how aggressively Edgee compresses prompts: Strong (0.7) aims for more compression; Light (0.9) is more conservative. Medium (0.8) is the default. See Understanding compression ratio.
SDK-level configuration takes precedence over console settings. If you enable compression in your code with enable_compression: true, it will override the console configuration for that specific request.
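The precedence rule amounts to a simple merge in which request-level fields win. The config shape here is an assumption for illustration, borrowing the field names from the SDK example:

```typescript
// Hypothetical config shape; field names follow the SDK example above.
type CompressionConfig = { enable_compression: boolean; compression_rate: number };

// Per-request SDK settings override console settings; unset fields
// fall through to the console configuration.
function resolveConfig(
  consoleConfig: CompressionConfig,
  requestConfig?: Partial<CompressionConfig>,
): CompressionConfig {
  return { ...consoleConfig, ...requestConfig };
}
```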

When It Works Best

Token compression delivers the highest savings for these common use cases:

RAG Pipelines

40-50% reduction. Large document contexts with redundant information compress effectively. Ideal for Q&A systems, knowledge bases, and semantic search.

Long Contexts

30-45% reduction. Lengthy conversation histories, documentation, or background information. Common in chatbots and assistant applications.

Document Analysis

35-50% reduction. Summarization, extraction, and analysis of long documents. Verbose source material compresses well.

Multi-Turn Agents

25-40% reduction. Conversational agents with growing context windows. Savings increase with conversation length.

Code Example

Every response includes compression metrics so you can track your savings:
import Edgee from 'edgee';

const edgee = new Edgee("your-api-key");

// Example: RAG Q&A with large context
const documents = [
  "Long document content here...",
  "Another document with context...",
  "More relevant information..."
];

const response = await edgee.send({
  model: 'gpt-4o',
  input: `Answer the question based on these documents:\n\n${documents.join('\n\n')}\n\nQuestion: What is the main topic?`,
  enable_compression: true, // Enable compression for this request
  compression_rate: 0.8, // Target ratio (0-1): 0.8 = compressed is 80% of original
});

console.log(response.text);

// Compression metrics
if (response.compression) {
  console.log(`Tokens saved: ${response.compression.saved_tokens}`);
  console.log(`Reduction: ${response.compression.reduction}%`);
  console.log(`Cost savings: $${(response.compression.cost_savings / 1_000_000).toFixed(4)}`);
  console.log(`Compression time: ${response.compression.time_ms}ms`);
}
Example output:
Tokens saved: 1,225
Reduction: 50%
Cost savings: $0.0061
Compression time: 14ms

Real-World Savings

Here’s what token compression means for your monthly AI bill:
| Use Case | Monthly Requests | Without Edgee | With Edgee (50% compression) | Monthly Savings |
| --- | --- | --- | --- | --- |
| RAG Q&A (GPT-4o) | 100,000 @ 2,000 tokens | $3,000 | $1,500 | $1,500 |
| Document Analysis (Claude 3.5) | 50,000 @ 4,000 tokens | $1,800 | $900 | $900 |
| Chatbot (GPT-4o-mini) | 500,000 @ 500 tokens | $375 | $188 | $187 |
| Multi-turn Agent (GPT-4o) | 200,000 @ 1,000 tokens | $3,000 | $1,500 | $1,500 |

Savings calculations use list pricing for GPT-4o ($5/1M input tokens), Claude 3.5 Sonnet ($3/1M input tokens), and GPT-4o-mini ($0.15/1M input tokens). Actual compression ratios vary by use case.
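You can run the same kind of estimate yourself. This back-of-the-envelope calculator covers input tokens at list pricing only; real bills also include output tokens, so treat the result as a rough lower-bound estimate:

```typescript
// Estimated monthly input-token savings in USD.
function monthlySavingsUSD(
  requestsPerMonth: number,
  tokensPerRequest: number,
  pricePerMillionUSD: number,
  compressionRatio: number, // compressed / original, e.g. 0.5
): number {
  const originalTokens = requestsPerMonth * tokensPerRequest;
  const savedTokens = originalTokens * (1 - compressionRatio);
  return (savedTokens / 1_000_000) * pricePerMillionUSD;
}
```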

Best Practices

  • Structure RAG contexts with clear sections
  • Use consistent formatting in document chunks
  • Avoid excessive whitespace in system prompts
  • Group similar information together
  • Monitor compression.saved_tokens and compression.cost_savings across requests
  • Use compression.reduction to gauge effectiveness per request
  • Calculate cumulative savings weekly or monthly
  • Use observability tools to identify high-compression opportunities
  • Compare costs across different use cases
  • Enable compression organization-wide so it applies to all requests by default, without per-request configuration
  • Track compression.reduction to understand effectiveness (e.g. 48 = 48% fewer tokens)
  • Monitor compression.time_ms to ensure compression latency fits your SLA
  • Use response metrics to optimize prompt design
  • Use automatic model selection for additional savings
  • Route to cheaper models when appropriate
  • Compression + routing can reduce costs by 60-70% total
  • Monitor both compression and routing savings
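To track cumulative savings as suggested above, a minimal accumulator over the per-response metrics (field names match the Response Fields section) might look like:

```typescript
// Per-response compression metrics, as returned in response.compression.
type CompressionMetrics = { saved_tokens: number; cost_savings: number; reduction: number };

class SavingsTracker {
  private totalSavedTokens = 0;
  private totalCostSavingsMicro = 0;
  private requests = 0;

  record(m: CompressionMetrics): void {
    this.totalSavedTokens += m.saved_tokens;
    this.totalCostSavingsMicro += m.cost_savings; // micro-units, e.g. 27000 = $0.027
    this.requests++;
  }

  summary() {
    return {
      requests: this.requests,
      savedTokens: this.totalSavedTokens,
      costSavingsUSD: this.totalCostSavingsMicro / 1_000_000,
    };
  }
}
```

Feed it `response.compression` after each request (when present) and read `summary()` weekly or monthly for cumulative totals.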

Response Fields

Every Edgee response includes detailed compression metrics:
// Usage information
response.usage.prompt_tokens          // Compressed token count (billed)
response.usage.completion_tokens      // Output tokens (unchanged)
response.usage.total_tokens           // Total for billing calculation

// Compression information (when applied)
response.compression.saved_tokens     // Tokens saved by compression
response.compression.cost_savings     // Estimated cost savings in micro-units (e.g. 27000 = $0.027)
response.compression.reduction        // Percentage reduction (e.g. 48 = 48%)
response.compression.time_ms          // Time taken for compression in milliseconds
Use these fields to:
  • Track savings in real-time
  • Build cost dashboards and budgeting tools
  • Identify high-value compression opportunities
  • Optimize prompt design for maximum compression

What’s Next