Six months after I started building AI-powered applications seriously, my API bill was three times what it needed to be.
Not because I was doing anything wrong. Because I was doing everything with the same model. GPT-4 for summarization. GPT-4 for classification. GPT-4 for generating a three-word category label for a support ticket. The model that handled my hardest tasks was also handling my most trivial ones, at the same cost per token, thousands of times a day.
The fix wasn’t finding a single better model. It was building a decision framework that matches each task to the model that handles it best at an acceptable cost. That framework now cuts my monthly API bill and improves output quality on specific tasks, because the most expensive model turned out not to be the best one for every job.
Here’s the full decision matrix.
The Four Dimensions That Drive Every Choice
Before getting to specific models, the framework starts with four questions about every task. The answers determine which models are even candidates before any other consideration applies.
What does failure cost? A wrong answer in a customer-facing legal summary has different consequences than a wrong category label on an internal ticket nobody reads twice. High-stakes tasks narrow toward more capable, more reliable models regardless of cost. Low-stakes tasks open the full range including cheaper and faster options.
How long is the input? Some tasks involve short prompts. Others involve entire codebases, legal documents, or long conversation histories. Context window requirements eliminate candidates before capability even enters the conversation.
Is the user waiting? Real-time interactions where users watch a cursor blink have latency requirements that background processing jobs don’t. Speed narrows the candidates independently of capability and cost.
How often does this run? A task that runs twice a day has a negligible cost difference between model tiers. The same task running fifty thousand times a day has a cost difference that determines whether the feature is financially viable.
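The arithmetic behind that last question is easy to sketch. The per-token prices and the 500-token request size below are made-up placeholders, not real vendor rates; substitute your own numbers.

```python
# Hypothetical prices in dollars per million tokens (assumptions, not vendor rates).
PREMIUM_PRICE = 10.00
BUDGET_PRICE = 0.50


def daily_cost(runs_per_day, tokens_per_run, price_per_million):
    """Dollar cost per day for one task at one price tier."""
    return runs_per_day * tokens_per_run * price_per_million / 1_000_000


def daily_gap(runs_per_day, tokens_per_run=500):
    """Extra dollars per day paid for the premium tier over the budget tier."""
    premium = daily_cost(runs_per_day, tokens_per_run, PREMIUM_PRICE)
    budget = daily_cost(runs_per_day, tokens_per_run, BUDGET_PRICE)
    return premium - budget


for runs in (2, 50_000):
    print(f"{runs:>6} runs/day: ${daily_gap(runs):,.2f} per day extra")
```

With these placeholder numbers the gap is about a cent a day at two runs and well over two hundred dollars a day at fifty thousand, which is the difference the paragraph above describes.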
These four questions produce a shortlist. The following breakdown of each model family tells you which tasks each handles distinctively well.
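One way to picture the shortlist step is as a filter over candidate models. This is a minimal sketch, not a real routing library: the `Task` and `Candidate` fields, the candidate names, and every number are hypothetical stand-ins for whatever you measure yourself. Following the first question above, the budget check is skipped for high-stakes tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    high_stakes: bool    # What does failure cost?
    input_tokens: int    # How long is the input?
    user_waiting: bool   # Is the user waiting?
    runs_per_day: int    # How often does this run?


@dataclass(frozen=True)
class Candidate:
    name: str
    context_window: int             # maximum input tokens
    trusted_for_high_stakes: bool   # your own judgment from testing
    fast_enough_for_realtime: bool  # your own latency measurement
    cost_per_run: float             # dollars per call, hypothetical


def shortlist(task, candidates, daily_budget):
    """Return the names of candidates that survive all four questions."""
    kept = []
    for c in candidates:
        if task.input_tokens > c.context_window:
            continue  # input does not fit
        if task.user_waiting and not c.fast_enough_for_realtime:
            continue  # too slow for an interactive feature
        if task.high_stakes:
            if not c.trusted_for_high_stakes:
                continue  # reliability wins regardless of cost
        elif task.runs_per_day * c.cost_per_run > daily_budget:
            continue  # low-stakes task that would blow the budget
        kept.append(c.name)
    return kept


# Placeholder candidates; none of these figures describe a real model.
CANDIDATES = [
    Candidate("large-model", 128_000, True, False, 0.02),
    Candidate("small-model", 128_000, False, True, 0.0005),
    Candidate("long-context-model", 1_000_000, True, False, 0.03),
]

ticket_label = Task(high_stakes=False, input_tokens=300,
                    user_waiting=False, runs_per_day=50_000)
print(shortlist(ticket_label, CANDIDATES, daily_budget=50.0))
```

The point of the sketch is the ordering: hard constraints remove candidates first, and quality comparisons only happen among whatever is left.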
GPT-4o — The Default That Earns Its Default Status
GPT-4o is the model I reach for when I don’t have a strong reason to reach for something else. That happens less often than it used to, now that the alternatives have matured.
Where it consistently performs best in my experience: tasks requiring broad general knowledge across domains, creative generation where quality variation matters, multimodal tasks combining vision and text, and anything where the output needs to match a specific style or tone that’s hard to specify explicitly. It’s also the model with the most mature ecosystem — more tools, more examples, more community knowledge about how to get the best out of it.
The practical strength that keeps it as my default for certain tasks is consistency. On tasks where I’ve compared GPT-4o against alternatives, the output quality variance is lower — the worst outputs are less bad, even when the average isn’t meaningfully different. For production features where edge case quality matters as much as average quality, that consistency has value.
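The distinction between average quality and worst-case quality is worth measuring explicitly. Here is a small sketch using invented quality scores on a 1 to 10 scale; the numbers are illustrative only and do not come from any real comparison.

```python
import statistics


def summarize(scores):
    """Report the average, the worst case, and the spread of a set of scores."""
    return {
        "mean": statistics.mean(scores),
        "worst": min(scores),
        "spread": statistics.pstdev(scores),
    }


# Invented scores for two hypothetical models on the same five prompts.
steady_model = [7, 8, 8, 7, 8]
spiky_model = [9, 9, 9, 9, 2]

for name, scores in (("steady", steady_model), ("spiky", spiky_model)):
    print(name, summarize(scores))
```

Both lists have the same mean, but the second has a much lower worst score and a wider spread. A comparison that only reports averages would call them equivalent.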
Where I’ve stopped using it: long document processing where Claude’s context handling is better, coding tasks where it’s not meaningfully better than cheaper alternatives, high-volume simple tasks where cost matters more than marginal quality differences, and structured output tasks where I’ve found other models more reliably follow schema constraints.
The specific task type where GPT-4o is my unambiguous first choice: Multimodal tasks requiring visual reasoning combined with complex text generation. Processing an image and producing nuanced written analysis of it. Describing visual content for accessibility with appropriate detail and tone. These tasks play to GPT-4o’s training in ways alternatives haven’t fully matched in my testing.
Claude — The Model That Actually Reads the Whole Thing
Claude is the model I reach for when the context is long, when nuance matters, or when the task requires careful reasoning that benefits from thinking through implications rather than generating plausible-sounding output quickly.
The context window is the most immediately practical differentiator. Claude’s 200,000-token context window means I can pass an entire codebase, a complete legal document, or a long conversation history and ask questions about the whole thing. Tasks that require cross-referencing information across a long document (finding inconsistencies, synthesizing themes, answering questions that only make sense once the whole thing has been read) work qualitatively better in Claude than in models with smaller context windows, where the document has to be chunked.
The reasoning quality on complex analytical tasks is the second differentiator I’ve noticed in practice. For tasks involving multiple considerations that need to be weighed against each other — architectural decisions, nuanced writing that needs to balance competing concerns, analysis where jumping to a conclusion early produces worse results than thinking through the problem carefully — Claude produces more thorough reasoning and catches more edge cases in my testing.
Instruction following on long, complex prompts is the third area. When a prompt has multiple constraints that all need to be satisfied simultaneously — produce JSON in this schema, follow this style guide, don’t mention these specific things, include these specific elements — Claude tracks all of them more reliably than alternatives in my experience.
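Whether a model satisfied every constraint at once is something you can check mechanically instead of by eye. A minimal sketch, assuming the output is supposed to be a JSON object; the key names and phrases in the example are hypothetical, and a real schema check would be stricter than a list of required keys.

```python
import json


def check_output(raw, required_keys, banned_phrases, required_phrases):
    """Return a list of violated constraints; an empty list means all passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for key in required_keys:
        if key not in data:
            problems.append(f"missing key: {key}")

    text = raw.lower()
    for phrase in banned_phrases:
        if phrase.lower() in text:
            problems.append(f"banned phrase present: {phrase}")
    for phrase in required_phrases:
        if phrase.lower() not in text:
            problems.append(f"required phrase missing: {phrase}")
    return problems


# Hypothetical model output and constraints, for illustration only.
sample = '{"summary": "Refund approved for the customer.", "priority": "low"}'
print(check_output(sample,
                   required_keys=["summary", "priority", "owner"],
                   banned_phrases=["guarantee"],
                   required_phrases=["refund"]))
```

Running the same checker over outputs from each candidate model turns "tracks constraints more reliably" into a pass rate you can compare.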
Where I don’t default to Claude: high-volume simple tasks where cost matters, tasks where speed is more important than reasoning depth, and tasks where the output is short enough that the context window advantage doesn’t apply.
The task type where Claude is my unambiguous first choice: Analyzing large codebases or documents and answering specific questions about them. Reviewing an entire repository for security vulnerabilities. Summarizing a year’s worth of customer feedback threads. Understanding a long conversation history and extracting specific patterns. These tasks need the full context, and Claude handles the full context better than any alternative I’ve tested.
Gemini — The Underused Model With Specific Strengths
Gemini is the model most developers in my orbit underuse, partly because Google’s API ecosystem was slower to mature and partly because the early versions were genuinely behind. The current versions aren’t.
Gemini 1.5 Pro’s context window is one million tokens, five times the size of Claude’s 200,000. For tasks involving very long documents or very large codebases where even 200,000 tokens isn’t enough, Gemini 1.5 Pro is the serious alternative.
Where I’ve found Gemini specifically strong: tasks involving structured data reasoning, particularly when that data is in formats like tables or spreadsheets where the model needs to understand structure as well as content. Tasks requiring integration with Google Workspace data where the native integration reduces friction. Multilingual tasks where Gemini’s training includes strong coverage of non-English languages.
The cost profile at scale is competitive. For high-volume tasks where GPT-4o would be expensive, Gemini 1.5 Flash — the smaller, faster version — handles many tasks adequately at a fraction of the cost. I use it for tasks that need reasonable quality at high volume where Claude or GPT-4o would be cost-prohibitive.
Where I don’t default to Gemini: tasks where I have extensive prompt engineering invested in another model’s behavior and don’t want to re-optimize, tasks where the ecosystem tooling matters and GPT-4o’s ecosystem advantage is relevant, and tasks where I’ve empirically tested Gemini and found it behind on the specific output quality I need.
The task type where Gemini is my first choice: Very long document processing beyond Claude’s context window, tasks with native Google Workspace integration, and high-volume tasks where Gemini Flash’s speed and cost profile beats alternatives on the economics without sacrificing the quality the task actually requires.
Open Source Models — When Control Matters More Than Convenience
Open source models — Llama, Mistral, Qwen, and the growing ecosystem of fine-tuned variants — aren’t always the right choice. When they are the right choice, they’re dramatically the right choice.
The case for open source starts with data privacy. API calls to commercial models send your data to external servers. For tasks involving sensitive customer data, proprietary information, or anything subject to data residency requirements, running a model locally or on infrastructure you control eliminates a category of risk that commercial APIs introduce regardless of their privacy policies.
The second case is cost at extreme volume. A task that runs ten million times per day at commercial API pricing is often economically unviable. The same task running on self-hosted open source infrastructure, once you factor in the actual compute cost, is often viable. The crossover point depends on the specific task and model, but it exists for most high-volume production use cases.
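The crossover point can be estimated with a simple linear model: the API costs a fixed amount per call, while self-hosting costs a fixed amount per day plus a (usually smaller) amount per call. The sketch below assumes that model, and every dollar figure in it is a placeholder, not a quote from any provider.

```python
def crossover_runs_per_day(api_cost_per_call,
                           fixed_daily_hosting_cost,
                           self_hosted_cost_per_call=0.0):
    """Daily volume above which self-hosting is cheaper than the API.

    Returns None when self-hosting never wins, i.e. when its per-call
    cost is not lower than the API's.
    """
    saving_per_call = api_cost_per_call - self_hosted_cost_per_call
    if saving_per_call <= 0:
        return None
    return fixed_daily_hosting_cost / saving_per_call


# Placeholder figures: $0.002 per API call, $200 per day for GPU servers.
threshold = crossover_runs_per_day(api_cost_per_call=0.002,
                                   fixed_daily_hosting_cost=200.0)
print(f"Self-hosting breaks even near {threshold:,.0f} runs per day")
```

With these placeholder figures the break-even lands around one hundred thousand runs per day. The model ignores engineering time and idle capacity, both of which push the real crossover higher.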
The third case is customization. Fine-tuning an open source model on your specific domain, output format, and style, and owning that fine-tuned model completely, can produce a specialized tool that outperforms general-purpose commercial models on your specific task. The upfront investment in fine-tuning and serving infrastructure can pay back in quality and cost at scale.
Where open source models genuinely fall short: tasks requiring the reasoning capability of the largest commercial models, tasks where the development speed advantage of mature commercial APIs matters, and tasks where the infrastructure burden of self-hosting isn’t worth the control it provides.
The task types where I default to open source: High-volume classification and extraction tasks where a fine-tuned smaller model beats a general-purpose commercial model on the specific task. Anything involving sensitive data that can’t leave controlled infrastructure. Tasks where I want to fine-tune on proprietary data and own the result.
The Decision Matrix in Practice
The framework collapses to a sequence of questions applied to every new task.
Does this involve sensitive data that can’t leave controlled infrastructure? If yes — open source, self-hosted. The answer to this question overrides everything else.
Does this require more than 200,000 tokens of context? If yes — Gemini 1.5 Pro. Context window requirements eliminate candidates before capability comparisons apply.
Is this a high-volume simple task — classification, extraction, formatting, short generation? If yes — evaluate Gemini Flash or a fine-tuned open source model first. Cost at volume makes the economics of capable-but-expensive models unsustainable for simple tasks.
Does this require deep reasoning over a long document, complex instruction following, or nuanced analysis? If yes — Claude. This is where the reasoning depth and context handling justify the cost.
Does this involve vision combined with complex text generation, or creative output where quality consistency matters? If yes — GPT-4o.
Is this a medium-complexity task where the above don’t apply? Run a small comparison across Claude and GPT-4o on representative examples. Pick the one that produces better output for your specific task. Don’t assume — test.
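The sequence above translates almost directly into code. This is a sketch of the ordering only: the `Task` fields are hypothetical flags you would set by hand or from your own heuristics, and the returned strings are recommendations to evaluate, not API model identifiers.

```python
from dataclasses import dataclass


@dataclass
class Task:
    sensitive_data: bool = False
    context_tokens: int = 0
    high_volume_simple: bool = False
    deep_reasoning: bool = False
    vision_or_creative: bool = False


def choose_model(task):
    """Apply the questions in order; the first match wins."""
    if task.sensitive_data:
        return "open source, self-hosted"
    if task.context_tokens > 200_000:
        return "Gemini 1.5 Pro"
    if task.high_volume_simple:
        return "evaluate Gemini Flash or a fine-tuned open source model"
    if task.deep_reasoning:
        return "Claude"
    if task.vision_or_creative:
        return "GPT-4o"
    return "compare Claude and GPT-4o on representative examples"


# A sensitive task stays self-hosted even if it also needs deep reasoning.
print(choose_model(Task(sensitive_data=True, deep_reasoning=True)))
print(choose_model(Task(context_tokens=450_000)))
print(choose_model(Task()))
```

Because the checks run top to bottom, the sensitive-data question overrides everything beneath it, which is the behavior the first question calls for.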
The Testing Practice That Keeps This Framework Honest
Model capabilities change. A model that was behind six months ago may have improved. A model that led six months ago may have been surpassed.
Every quarter I run a benchmark of my highest-volume task types against current model versions. Not a general benchmark — a benchmark on my actual prompts with my actual data against my actual quality criteria. General benchmarks tell you how models perform on standardized tests. My benchmark tells me how they perform on my specific work.
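A task-specific benchmark does not need much machinery. The sketch below is one possible shape, not a description of any particular tool: each "model" is any function that takes a prompt and returns text, and the two stand-in functions here exist only so the example runs without network access. In practice each would wrap a real API call, and `exact_match` would be replaced by your own quality criteria.

```python
def run_benchmark(models, cases, score):
    """Score every model on every (prompt, expected) case.

    models: dict mapping a name to a function prompt -> output text
    cases:  list of (prompt, expected) pairs drawn from real workloads
    score:  function (output, expected) -> float, higher is better
    """
    results = {}
    for name, call_model in models.items():
        scores = [score(call_model(prompt), expected)
                  for prompt, expected in cases]
        results[name] = {
            "mean": sum(scores) / len(scores),
            "worst": min(scores),
        }
    return results


def exact_match(output, expected):
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Stand-in "models" (hypothetical); replace with real API wrappers.
def always_billing(prompt):
    return "billing"


def keyword_model(prompt):
    return "billing" if "invoice" in prompt.lower() else "technical"


CASES = [
    ("My invoice is wrong", "billing"),
    ("The app crashes on launch", "technical"),
]

report = run_benchmark(
    {"always_billing": always_billing, "keyword_model": keyword_model},
    CASES,
    exact_match,
)
for name, stats in report.items():
    print(name, stats)
```

Keeping the cases and the scorer fixed between quarterly runs is what makes the results comparable over time; only the model wrappers should change.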
The results regularly surprise me. Models I’d written off have caught up on specific tasks. Models I relied on have been surpassed by alternatives I hadn’t evaluated recently. The framework isn’t a fixed answer — it’s a structure for asking the question correctly and updating the answer when evidence changes.
The developers who are paying the right price for AI inference and getting the right quality for each task aren’t the ones who found the best model. They’re the ones who keep asking which model is best for each specific task and updating their answer when the landscape shifts.
That’s less glamorous than having a definitive answer. It’s also significantly cheaper and produces better results.
If this framework saved you from over-paying on a task you didn’t need your most expensive model for — follow for more. I write about the practical AI engineering decisions that determine whether building with AI is sustainable, not just possible.