The $3.50/Month Fix: How We Went From 41% System Health by Experimenting With AI Models
2026-03-08 · MoneyMachine
Date: March 8, 2026 Author: Jeff (written with AI assistance) Project: MoneyMachine — building an autonomous revenue-generating agent swarm, in public
The Problem Statement
After building our observability system (previous post), we had hard data: 41% overall system health. Two of our six agents were completely broken, a third was underperforming, and we were burning ChatGPT Pro tokens on agents that could run cheaper elsewhere. Here’s where each agent stood:
| Agent | Model | Health | Problem |
|---|---|---|---|
| Adrian (CEO) | ChatGPT Pro / Codex 5.3 | 78% | Healthy, errors from wrong workspace paths |
| Scout | Ollama / qwen3.5:35b | 65% | Functional but repetitive ENOENT errors |
| Revenue Ops | ChatGPT Pro / Codex 5.3 | 50% | Half of sessions produce zero tool calls |
| Site Builder | ChatGPT Pro / Codex 5.3 | 72% | Healthy |
| Domain Analyst | Ollama / llama3.3:70b | ~0% | Cannot make tool calls (hallucinates them as text) |
| Content Writer | Ollama / qwen3-coder:30b | 0% | Crashes on any error, 0/9 productive sessions |
Three agents on ChatGPT Pro ($200/month subscription), two on local Ollama (free but broken), one on local Ollama (free, functional but suboptimal). We needed to fix the broken agents without blowing our budget.
This is the story of how we researched, evaluated, and deployed a new model strategy that aims to bring the system to 90%+ health for about $3.50/month in additional costs. More importantly, it’s about building a framework for ongoing model experimentation — because the right model today won’t be the right model in three months.
The Constraint: Running 24/7 on a Budget
Our agents run on cron schedules, 24/7:
- Adrian: every 30 minutes
- Scout: every 2 hours + daily deep scan
- Revenue Ops: every 3 hours
- Domain Analyst: every 4 hours
- Site Builder: every 3 hours
- Content Writer: every 4 hours
That’s roughly 48+ agent sessions per day across the fleet. If every session averages 15K input tokens and 5K output tokens, we’re looking at significant throughput. The ChatGPT Pro subscription handles Adrian and Site Builder at $200/month (flat rate, usage-based allocation), but putting all six agents on Pro would burn through the allocation in days.
We also run three agents on a ThinkPad P16 laptop with 16GB of VRAM via Ollama. The VRAM constraint means only one model fits at a time — when multiple agents are scheduled close together, Ollama swaps models in and out, adding 5-10 minutes per session. And if the laptop goes offline (travel, updates, power loss), those agents go dark.
So we needed: cheap, reliable, tool-calling-capable models with cloud availability as a fallback.
The Research: Four Options Evaluated
Option A: Second ChatGPT Pro Subscription ($200/month)
The simplest option: just buy another subscription.
Why we rejected it:
- $400/month total is a lot when we’re generating $0 in revenue
- OpenAI’s terms are unclear about running multiple subscriptions for automated agent workloads
- Doesn’t solve the fundamental problem — we’d still hit usage limits with 6 agents at 24/7
- All eggs in one basket (one provider, one payment method)
Verdict: Expensive and risky.
Option B: Ollama Cloud ($20-100/month)
Ollama launched a cloud offering with tiered pricing: Free (rate-limited), $20/month, $100/month (production).
Why we rejected it:
- Still early-stage, reliability unclear for 24/7 production use
- The problem isn’t where the models run — it’s which models we’re running
- llama3.3:70b hallucinates tool calls whether it runs locally or in the cloud
- qwen3-coder:30b crashes on errors regardless of infrastructure
- Moving broken models to the cloud just means paying for broken models
Verdict: Doesn’t fix the root cause.
Option C: OpenRouter (Pay-per-token, massive model selection)
OpenRouter provides a unified API that routes to hundreds of models from dozens of providers. One API key, one endpoint, access to DeepSeek, Google, Anthropic, Meta, and more. Pricing is per-token with no subscriptions.
Why we chose it:
- OpenClaw has native OpenRouter support: model format
openrouter/<provider>/<model>, env varOPENROUTER_API_KEY - Access to models specifically known for strong tool calling
- Pay only for what we use — critical at $0 revenue
- Can switch models per-agent without infrastructure changes
- Built-in fallback: if one model is down, we can instantly switch to another
Option D: Direct API (DeepSeek, Google, Anthropic)
Going directly to each provider’s API instead of through OpenRouter.
Why we used OpenRouter instead:
- Would need separate API keys, billing, and configuration for each provider
- OpenRouter markup is minimal (usually <5% over direct API pricing)
- The unified interface means model swaps are a one-line config change
- OpenClaw’s OpenRouter integration is well-documented and tested
The Model Selection: Matching Models to Roles
This is where the research got deep. Not all models are created equal for agent work, and “good at chat” doesn’t mean “good at tool calling.”
What Makes a Good Agent Model?
For OpenClaw agents, a model must:
- Generate proper tool call format. Not hallucinate tool calls as text. Not output raw JSON. Actually invoke the function-calling API correctly.
- Handle tool errors gracefully. When a file doesn’t exist, don’t crash. Try a different path, create the file, or report the issue and continue.
- Follow complex multi-step instructions. Read DIRECTIVES.md, pick a task, break it into steps, execute each step with tools, report results.
- Stay on task across long sessions. 10-30 minute sessions with dozens of tool calls. No losing context, no reverting to “how can I help you?” mode.
What We Tested and Learned
Models that FAIL at OpenClaw agent work:
-
llama3.3:70b — Hallucinates tool calls as text. The model understands it should use tools but generates the JSON as a chat message instead of a function call. This is a well-documented issue with Llama 3.3’s tool calling implementation. No amount of prompt engineering fixes this because it’s a model architecture issue with how tool calls are generated.
-
qwen3-coder:30b — Crashes on first error. This model was trained primarily for code generation, not tool-calling agent workflows. When a tool returns an error (like file not found), the model doesn’t know how to recover. It produces an empty response and the session dies. 0% success rate in 9 sessions.
-
qwen3.5:35b — Technically works but suboptimal. Tool calls succeed but the model sometimes generates overly verbose reasoning before acting, and occasionally fails to parse complex tool results. Good enough for Scout’s simple research tasks but not reliable for critical workflows.
Models we evaluated as replacements:
| Model | Provider | Price (in/out per 1M) | Tool Calling | Best For |
|---|---|---|---|---|
| DeepSeek V3.2 | OpenRouter | $0.25 / $0.40 | Excellent | Structured tasks, analysis, coding |
| Gemini 2.5 Flash | OpenRouter | $0.30 / $2.50 | Good | Content writing, long-form, 1M context |
| Gemini 2.5 Flash Lite | OpenRouter | $0.10 / $0.40 | Good | Budget option for simpler tasks |
| Claude Sonnet 4 | OpenRouter | $3.00 / $15.00 | Excellent | Premium tasks, complex reasoning |
| Qwen 3 32B | Local Ollama | Free | Good | Local fallback, reliable tool calling |
The Decision Matrix
For each agent, we asked:
- What does this agent actually do? (Research? Write? Analyze? Build?)
- How critical is tool-calling reliability? (Every agent: very)
- How important is output quality? (Content Writer: critical. Domain Analyst: moderate.)
- What’s our budget tolerance? (Near zero until revenue starts)
Final Model Assignments
| Agent | Old Model | New Model | Monthly Est. Cost | Rationale |
|---|---|---|---|---|
| Adrian | Codex 5.3 (ChatGPT Pro) | No change | $0 (subscription) | CEO coordination needs top-tier reasoning |
| Scout | qwen3.5:35b (local) | qwen3:32b (local) | $0 | Better tool calling than 3.5, proven reliability |
| Revenue Ops | Codex 5.3 (ChatGPT Pro) | DeepSeek V3.2 (OpenRouter) | ~$1.50/mo | Frees ChatGPT Pro capacity for Adrian. DeepSeek excels at structured data/analysis |
| Domain Analyst | llama3.3:70b (local, broken) | DeepSeek V3.2 (OpenRouter) | ~$1.13/mo | Was 0% functional. Now actually makes tool calls |
| Site Builder | Codex 5.3 (ChatGPT Pro) | No change | $0 (subscription) | Code generation quality matters, keep on best model |
| Content Writer | qwen3-coder:30b (local, broken) | Gemini 2.5 Flash (OpenRouter) | ~$0.95/mo | Was 100% broken. Gemini strong at long-form SEO content |
Total additional cost: ~$3.50/month. Expected health improvement: 41% to 90%+.
The Implementation
Configuration (5 minutes)
OpenClaw’s OpenRouter integration made this trivial:
- Add
OPENROUTER_API_KEYto theenvsection ofopenclaw.json - Change agent model strings from
ollama/llama3.3:70btoopenrouter/deepseek/deepseek-v3.2 - Update the content writer cron job model override
- Add
qwen3:32bto the Ollama models list for Scout - Restart the gateway
No code changes. No new infrastructure. No SDK upgrades. Five config edits and a restart.
Fallback Configuration
We also added a global fallback model. If any agent’s primary model fails — including Scout’s local Ollama if the ThinkPad goes offline — the system automatically falls back to DeepSeek V3.2 via OpenRouter. This means:
- ThinkPad offline? Scout keeps working via OpenRouter (~$0.25/M tokens until it’s back)
- DeepSeek has an outage? Agents will retry and the cron scheduler handles transient failures
- OpenRouter down? Adrian and Site Builder continue on ChatGPT Pro, others wait for recovery
Smoke Tests (All Passed)
Before relying on cron schedules, we ran each agent once manually:
Domain Analyst (was: hallucinating tool calls, now: DeepSeek V3.2):
“Hello. Files I can confirm in the workspace: AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, BOOTSTRAP.md, memory/2026-03-08.md”
It read its files. It listed them. It actually used tools. First time in 24 hours.
Content Writer (was: crashing on every error, now: Gemini 2.5 Flash):
“Hello! My current task is to create landing page copy for aitoolchamp.com, including a hero headline, subhead, value proposition, and CTA, and save it as landing-page-copy.md in the workspace/drafts/ directory.”
It read its directives. It understood its task. It didn’t crash. Also a first.
Revenue Ops (was: 50% empty sessions on Codex 5.3, now: DeepSeek V3.2):
“Hello. DIRECTIVES.md says I’m Revenue Ops in Phase 1, tasked with building financial tracking infrastructure. Tools available: read, write, edit, grep, find, ls, apply_patch, exec…”
Identified its role, listed its tools. Clean execution.
The Experiment Framework: How We’ll Iterate
This model migration isn’t a one-time fix — it’s the start of a continuous experimentation process. Here’s the framework we’re building:
1. Measure Everything
Revenue Ops now tracks OpenRouter API costs alongside all other expenses:
- Token usage per agent per day
- Cost per session (input + output tokens * model pricing)
- Daily/weekly/monthly burn rate
- Cost vs. health score (are we paying for quality?)
2. Set Thresholds
- Health score floor: 80% per agent (below = investigate)
- Cost ceiling: $20/month for OpenRouter (above = optimize or downgrade)
- Session productivity: >70% of sessions must produce file changes
3. Experiment Protocol
When considering a model change:
- Run 5 manual test sessions with the candidate model
- Check: Does it make tool calls correctly? Handle errors? Follow directives?
- Deploy to one non-critical agent for 24 hours
- Compare health scores and output quality vs. the previous model
- If better: roll out. If worse or same: revert.
4. Models to Watch
The AI model landscape moves fast. Models we’re tracking for future evaluation:
- DeepSeek V3.2 Speciale ($0.40/$1.20) — premium tier of our current workhorse
- Gemini 2.5 Flash Lite ($0.10/$0.40) — could replace DeepSeek for simpler tasks at half the cost
- Qwen 3.5 updates — if Ollama-hosted models improve tool calling, we save API costs entirely
- MiniMax M2.5 — extremely cheap ($0.000295/$0.0012), worth testing for basic tasks
- Claude Sonnet — if we need premium quality for a specific agent, $3/$15 per 1M is still cheaper than a second ChatGPT Pro subscription
Cost Tracking: Revenue Ops’ New Job
We’ve updated Revenue Ops’ directives with explicit cost tracking requirements:
- Track OpenRouter costs by agent and model
- Calculate daily/weekly/monthly burn rates
- Alert Adrian if monthly costs exceed $20
- Compare actual vs. projected spend ($3-5/month target)
- Include cost data in every cycle report
Adrian’s directives now include a cost monitoring mandate — he checks Revenue Ops’ P&L tracker on every heartbeat and flags anomalies.
The goal: our total infrastructure cost should be $200 (ChatGPT Pro) + $8 (Contabo VPS) + $3-5 (OpenRouter) = ~$213/month until we’re generating revenue. Once revenue starts, we can afford to upgrade models for better quality.
Key Lessons
1. The cheapest model that works is the right model.
DeepSeek V3.2 at $0.25/M input tokens does 90% of what Codex 5.3 does for agent tasks. The remaining 10% (complex reasoning, nuanced code generation) matters for Adrian and Site Builder but not for Domain Analyst or Revenue Ops.
2. “Big” doesn’t mean “capable.”
llama3.3:70b is a 70 billion parameter model. It’s impressive at many tasks. It literally cannot make a function call in OpenClaw. Meanwhile, qwen3:32b (less than half the size) handles tool calling reliably. Model size tells you almost nothing about task-specific capability.
3. Local models are free but not zero-cost.
Running three models on 16GB of VRAM means constant swapping, 30-minute timeouts, and thermal throttling. The hidden cost is time — sessions that take 15 minutes locally take 30 seconds via API. For agents that run 24/7, those minutes add up.
4. Fallbacks prevent outages, not just errors.
Our ThinkPad travels with us across Europe. It goes offline during transit, when Wi-Fi is bad, during Windows updates. Without a fallback, Scout goes dark for hours. With OpenRouter fallback, Scout keeps working — we just pay a small API cost until the laptop reconnects.
5. Experiment, don’t assume.
We assumed Codex 5.3 would be the best at everything because it’s the “flagship” model. But for content writing, it’s mediocre. For structured data analysis, DeepSeek matches it. For agent tool calling, qwen3:32b at 0 cost beats llama3.3:70b at 0 cost. The only way to know is to test.
What’s Next
- Monitor health scores over the next 24 hours to see if the model migration delivers the expected improvement
- Track actual OpenRouter costs vs. our $3-5/month estimate
- Evaluate if Scout should move to OpenRouter permanently (eliminating the ThinkPad dependency entirely)
- Test Gemini 2.5 Flash Lite as a cheaper alternative for Domain Analyst tasks
- Consider a dedicated “model evaluator” agent that periodically tests new models and recommends upgrades
The model landscape changes weekly. New releases, price drops, capability improvements. Having an experimentation framework means we can ride those waves instead of being locked into yesterday’s choices.
This is Day 2 of building a revenue-generating AI agent swarm in public. For the observability story, see Day 2: Building Observability. For project overview, see the README.