This comprehensive guide provides strategies for reducing Claude API costs by 50-95% through proven optimization techniques. Implementing these methods can dramatically reduce your API spending while maintaining or improving performance.
| Model | Input (per MTok) | Output (per MTok) | Best Use Cases |
|---|---|---|---|
| Claude Haiku 4.5 | $1 | $5 | Classification, extraction, simple Q&A |
| Claude Sonnet 4.5 | $3 | $15 | Code generation, complex analysis, production workloads |
| Claude Opus 4.5 | $5 | $25 | Mission-critical decisions, complex reasoning |
Impact: 3-5x cost reduction
- Start with Haiku 4.5: it achieves roughly 90% of Sonnet's performance at one-third the cost
- Use Sonnet 4.5 for balanced performance needs
- Reserve Opus 4.5 only for complex, mission-critical tasks
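As a quick sanity check on the table above, the per-MTok prices can be plugged into a small cost calculator (prices are hardcoded here from the table; verify them against current published pricing before relying on the numbers):

```python
# Per-MTok prices copied from the table above (assumed current at time of writing).
PRICES = {
    "haiku-4.5":  {"input": 1.00, "output": 5.00},
    "sonnet-4.5": {"input": 3.00, "output": 15.00},
    "opus-4.5":   {"input": 5.00, "output": 25.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request; prices are per million tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 requests at 2,000 input / 500 output tokens each:
haiku = 10_000 * request_cost("haiku-4.5", 2_000, 500)
sonnet = 10_000 * request_cost("sonnet-4.5", 2_000, 500)
print(f"Haiku: ${haiku:.2f}, Sonnet: ${sonnet:.2f}")  # Haiku: $45.00, Sonnet: $135.00
```

At this request shape the Sonnet bill is exactly 3x the Haiku bill, which is where the "3-5x" headline figure comes from.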
Impact: Up to 90% cost reduction on repeated content
- Cache write: 1.25x base price (5-min TTL) or 2x (1-hour TTL)
- Cache read: 0.1x base price (90% savings!)
- Break-even: Just 2 API calls
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "system": [{
    "type": "text",
    "text": "Your long system prompt or context here...",
    "cache_control": {"type": "ephemeral"}
  }],
  "messages": [{"role": "user", "content": "Your question"}]
}
```

Good candidates for caching:

- ✅ System prompts and instructions
- ✅ Large context documents (PDFs, codebases)
- ✅ Tool definitions
- ✅ Few-shot examples
- ✅ Conversation history prefixes
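To see why the break-even point is only two calls, here is a minimal sketch of the cache economics using the multipliers above (the functions and their defaults are illustrative, not part of any SDK):

```python
def cached_cost(base_input_price: float, cached_tokens: int, n_calls: int,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total input cost: one cache write, then cache reads (price is per MTok)."""
    per_token = base_input_price / 1_000_000
    return cached_tokens * per_token * (write_mult + read_mult * (n_calls - 1))

def uncached_cost(base_input_price: float, tokens: int, n_calls: int) -> float:
    """Total input cost when the same tokens are resent at full price every call."""
    return tokens * (base_input_price / 1_000_000) * n_calls

# 100k-token context on Sonnet ($3/MTok input), two calls:
print(cached_cost(3.0, 100_000, 2))    # ~$0.405 with caching
print(uncached_cost(3.0, 100_000, 2))  # ~$0.60 without
```

By the second call the cached path (1.25x + 0.1x = 1.35x of one pass) is already cheaper than paying full price twice (2.0x), and every call after that costs only 0.1x.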
Impact: 50% cost reduction for async workloads
Perfect for:
- Bulk content generation
- Large-scale data processing
- Non-time-sensitive analysis
- Training data preparation
Combine Batch + Caching for 95% savings!
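Assuming the Batch API discount and the cache-read discount stack multiplicatively (which is how the 95% figure is derived), the combined effect on the input price looks like this:

```python
def effective_input_price(base: float, batch: bool = False,
                          cache_read: bool = False) -> float:
    """Effective per-MTok input price, assuming discounts stack multiplicatively."""
    price = base
    if batch:
        price *= 0.5   # Batch API: 50% off
    if cache_read:
        price *= 0.1   # Cache read: 90% off
    return price

# Sonnet input at $3/MTok with both discounts:
print(effective_input_price(3.0, batch=True, cache_read=True))  # ~$0.15/MTok
```

$0.15 against the $3.00 base is a 95% reduction on cached input tokens.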
- Write concise system prompts
- Truncate conversation history
- Include only relevant context
- Compress few-shot examples
- Set appropriate `max_tokens` limits
- Request concise responses
- Use structured output (JSON)
- Implement stop sequences
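Several of the tips above can be combined in one request builder. This is a hypothetical helper, not part of the Anthropic SDK; the field names follow the Messages API, but the model ID, limits, system prompt, and stop sequence are placeholders to adapt:

```python
def build_request(user_msg: str, history: list[dict], max_history: int = 10) -> dict:
    """Trim conversation history and cap output spend before sending a request."""
    return {
        "model": "claude-haiku-4-5",            # placeholder model ID
        "max_tokens": 300,                      # cap output tokens (the expensive side)
        "stop_sequences": ["\n\nHuman:"],       # stop early when possible
        "system": "Answer concisely in JSON.",  # concise, structured output
        "messages": history[-max_history:] + [{"role": "user", "content": user_msg}],
    }

req = build_request("Summarize this.", [{"role": "user", "content": "hi"}] * 20)
print(len(req["messages"]))  # 11: ten recent turns kept, plus the new message
```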
- Classify request complexity (use Haiku)
- Route simple β Haiku, complex β Sonnet
- Reserve Opus for edge cases
- Implement confidence thresholds
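A minimal routing sketch, assuming a fast upstream classifier (e.g. a cheap Haiku call) produces a complexity score in [0, 1]; the model IDs and thresholds here are illustrative starting points, not tuned values:

```python
def route(prompt: str, complexity_score: float) -> str:
    """Pick the cheapest model expected to handle the request.

    complexity_score in [0, 1] would come from a fast classifier;
    the cutoffs below are placeholders to tune against your own traffic.
    """
    if complexity_score < 0.4:
        return "claude-haiku-4-5"    # simple: classification, extraction, Q&A
    if complexity_score < 0.85:
        return "claude-sonnet-4-5"   # moderate: code generation, analysis
    return "claude-opus-4-5"         # edge cases only

print(route("What is 2+2?", 0.1))          # routed to Haiku
print(route("Refactor this module", 0.6))  # routed to Sonnet
```

In practice the thresholds double as confidence gates: if the classifier is unsure, round the score up so borderline requests land on the stronger model.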
- Use Sonnet for planning
- Dispatch subtasks to Haiku instances
- Parallel processing for efficiency
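The orchestrator-worker pattern above might be wired up as follows; `plan` and `run_subtask` are stand-ins for real Sonnet and Haiku API calls, with a thread pool supplying the parallel dispatch:

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    """Stand-in for one Sonnet planning call that splits a task into subtasks."""
    return [f"{task}: part {i}" for i in range(1, 4)]

def run_subtask(subtask: str) -> str:
    """Stand-in for a cheap Haiku call handling a single subtask."""
    return f"done({subtask})"

def orchestrate(task: str) -> list[str]:
    subtasks = plan(task)                 # one call to the expensive planner
    with ThreadPoolExecutor() as pool:    # many cheap workers, in parallel
        return list(pool.map(run_subtask, subtasks))

results = orchestrate("analyze logs")
print(results)
```

The cost win comes from the ratio: one Sonnet call plans, while the bulk of the tokens flow through Haiku workers at a third of the price.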
Monitor these metrics in API responses:
```json
{
  "usage": {
    "input_tokens": 21,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 188086,
    "output_tokens": 393
  }
}
```

A high `cache_read_input_tokens` count means caching is working effectively!
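A small helper can turn those fields into a cache hit rate worth tracking over time (the field names match the usage block above; the helper itself is hypothetical):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, from a Messages API usage block."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (usage.get("input_tokens", 0)
             + usage.get("cache_creation_input_tokens", 0)
             + read)
    return read / total if total else 0.0

usage = {"input_tokens": 21, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 188086, "output_tokens": 393}
print(f"{cache_hit_rate(usage):.1%}")  # ~100%: nearly all input came from cache
```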
- Switch simple tasks to Haiku (immediate 3-5x savings)
- Enable prompt caching for repeated system prompts
- Implement dynamic model routing
- Set appropriate max_tokens limits
- Use Batch API for non-urgent tasks
- Monitor usage patterns and optimize
This repository includes:
- Cost monitoring dashboard
- Dynamic model routing system
- Prompt caching utilities
- Batch processing tools
- Usage analytics
💡 Pro Tip: Start with model selection and prompt caching - these alone can reduce costs by 80%+ while often improving response quality and speed.