The $3.50/Month Fix: How We Went From 41% System Health by Experimenting With AI Models

2026-03-08 · MoneyMachine

Date: March 8, 2026 Author: Jeff (written with AI assistance) Project: MoneyMachine — building an autonomous revenue-generating agent swarm, in public

The Problem Statement

After building our observability system (previous post), we had hard data: 41% overall system health. Two of our six agents were completely broken, a third was underperforming, and we were burning ChatGPT Pro tokens on agents that could run cheaper elsewhere. Here’s where each agent stood:

Agent	Model	Health	Problem
Adrian (CEO)	ChatGPT Pro / Codex 5.3	78%	Healthy, errors from wrong workspace paths
Scout	Ollama / qwen3.5:35b	65%	Functional but repetitive ENOENT errors
Revenue Ops	ChatGPT Pro / Codex 5.3	50%	Half of sessions produce zero tool calls
Site Builder	ChatGPT Pro / Codex 5.3	72%	Healthy
Domain Analyst	Ollama / llama3.3:70b	~0%	Cannot make tool calls (hallucinates them as text)
Content Writer	Ollama / qwen3-coder:30b	0%	Crashes on any error, 0/9 productive sessions

Three agents on ChatGPT Pro ($200/month subscription), two on local Ollama (free but broken), one on local Ollama (free, functional but suboptimal). We needed to fix the broken agents without blowing our budget.

This is the story of how we researched, evaluated, and deployed a new model strategy that aims to bring the system to 90%+ health for about $3.50/month in additional costs. More importantly, it’s about building a framework for ongoing model experimentation — because the right model today won’t be the right model in three months.

The Constraint: Running 24/7 on a Budget

Our agents run on cron schedules, 24/7:

Adrian: every 30 minutes
Scout: every 2 hours + daily deep scan
Revenue Ops: every 3 hours
Domain Analyst: every 4 hours
Site Builder: every 3 hours
Content Writer: every 4 hours

That’s roughly 48+ agent sessions per day across the fleet. If every session averages 15K input tokens and 5K output tokens, we’re looking at significant throughput. The ChatGPT Pro subscription handles Adrian and Site Builder at $200/month (flat rate, usage-based allocation), but putting all six agents on Pro would burn through the allocation in days.

We also run three agents on a ThinkPad P16 laptop with 16GB of VRAM via Ollama. The VRAM constraint means only one model fits at a time — when multiple agents are scheduled close together, Ollama swaps models in and out, adding 5-10 minutes per session. And if the laptop goes offline (travel, updates, power loss), those agents go dark.

So we needed: cheap, reliable, tool-calling-capable models with cloud availability as a fallback.

The Research: Four Options Evaluated

Option A: Second ChatGPT Pro Subscription ($200/month)

The simplest option: just buy another subscription.

Why we rejected it:

$400/month total is a lot when we’re generating $0 in revenue
OpenAI’s terms are unclear about running multiple subscriptions for automated agent workloads
Doesn’t solve the fundamental problem — we’d still hit usage limits with 6 agents at 24/7
All eggs in one basket (one provider, one payment method)

Verdict: Expensive and risky.

Option B: Ollama Cloud ($20-100/month)

Ollama launched a cloud offering with tiered pricing: Free (rate-limited), $20/month, $100/month (production).

Why we rejected it:

Still early-stage, reliability unclear for 24/7 production use
The problem isn’t where the models run — it’s which models we’re running
llama3.3:70b hallucinates tool calls whether it runs locally or in the cloud
qwen3-coder:30b crashes on errors regardless of infrastructure
Moving broken models to the cloud just means paying for broken models

Verdict: Doesn’t fix the root cause.

Option C: OpenRouter (Pay-per-token, massive model selection)

OpenRouter provides a unified API that routes to hundreds of models from dozens of providers. One API key, one endpoint, access to DeepSeek, Google, Anthropic, Meta, and more. Pricing is per-token with no subscriptions.

Why we chose it:

OpenClaw has native OpenRouter support: model format openrouter/<provider>/<model>, env var OPENROUTER_API_KEY
Access to models specifically known for strong tool calling
Pay only for what we use — critical at $0 revenue
Can switch models per-agent without infrastructure changes
Built-in fallback: if one model is down, we can instantly switch to another

Option D: Direct API (DeepSeek, Google, Anthropic)

Going directly to each provider’s API instead of through OpenRouter.

Why we used OpenRouter instead:

Would need separate API keys, billing, and configuration for each provider
OpenRouter markup is minimal (usually <5% over direct API pricing)
The unified interface means model swaps are a one-line config change
OpenClaw’s OpenRouter integration is well-documented and tested

The Model Selection: Matching Models to Roles

This is where the research got deep. Not all models are created equal for agent work, and “good at chat” doesn’t mean “good at tool calling.”

What Makes a Good Agent Model?

For OpenClaw agents, a model must:

Generate proper tool call format. Not hallucinate tool calls as text. Not output raw JSON. Actually invoke the function-calling API correctly.
Handle tool errors gracefully. When a file doesn’t exist, don’t crash. Try a different path, create the file, or report the issue and continue.
Follow complex multi-step instructions. Read DIRECTIVES.md, pick a task, break it into steps, execute each step with tools, report results.
Stay on task across long sessions. 10-30 minute sessions with dozens of tool calls. No losing context, no reverting to “how can I help you?” mode.

What We Tested and Learned

Models that FAIL at OpenClaw agent work:

llama3.3:70b — Hallucinates tool calls as text. The model understands it should use tools but generates the JSON as a chat message instead of a function call. This is a well-documented issue with Llama 3.3’s tool calling implementation. No amount of prompt engineering fixes this because it’s a model architecture issue with how tool calls are generated.
qwen3-coder:30b — Crashes on first error. This model was trained primarily for code generation, not tool-calling agent workflows. When a tool returns an error (like file not found), the model doesn’t know how to recover. It produces an empty response and the session dies. 0% success rate in 9 sessions.
qwen3.5:35b — Technically works but suboptimal. Tool calls succeed but the model sometimes generates overly verbose reasoning before acting, and occasionally fails to parse complex tool results. Good enough for Scout’s simple research tasks but not reliable for critical workflows.

Models we evaluated as replacements:

Model	Provider	Price (in/out per 1M)	Tool Calling	Best For
DeepSeek V3.2	OpenRouter	$0.25 / $0.40	Excellent	Structured tasks, analysis, coding
Gemini 2.5 Flash	OpenRouter	$0.30 / $2.50	Good	Content writing, long-form, 1M context
Gemini 2.5 Flash Lite	OpenRouter	$0.10 / $0.40	Good	Budget option for simpler tasks
Claude Sonnet 4	OpenRouter	$3.00 / $15.00	Excellent	Premium tasks, complex reasoning
Qwen 3 32B	Local Ollama	Free	Good	Local fallback, reliable tool calling

The Decision Matrix

For each agent, we asked:

What does this agent actually do? (Research? Write? Analyze? Build?)
How critical is tool-calling reliability? (Every agent: very)
How important is output quality? (Content Writer: critical. Domain Analyst: moderate.)
What’s our budget tolerance? (Near zero until revenue starts)

Final Model Assignments

Agent	Old Model	New Model	Monthly Est. Cost	Rationale
Adrian	Codex 5.3 (ChatGPT Pro)	No change	$0 (subscription)	CEO coordination needs top-tier reasoning
Scout	qwen3.5:35b (local)	qwen3:32b (local)	$0	Better tool calling than 3.5, proven reliability
Revenue Ops	Codex 5.3 (ChatGPT Pro)	DeepSeek V3.2 (OpenRouter)	~$1.50/mo	Frees ChatGPT Pro capacity for Adrian. DeepSeek excels at structured data/analysis
Domain Analyst	llama3.3:70b (local, broken)	DeepSeek V3.2 (OpenRouter)	~$1.13/mo	Was 0% functional. Now actually makes tool calls
Site Builder	Codex 5.3 (ChatGPT Pro)	No change	$0 (subscription)	Code generation quality matters, keep on best model
Content Writer	qwen3-coder:30b (local, broken)	Gemini 2.5 Flash (OpenRouter)	~$0.95/mo	Was 100% broken. Gemini strong at long-form SEO content

Total additional cost: ~$3.50/month. Expected health improvement: 41% to 90%+.

The Implementation

Configuration (5 minutes)

OpenClaw’s OpenRouter integration made this trivial:

Add OPENROUTER_API_KEY to the env section of openclaw.json
Change agent model strings from ollama/llama3.3:70b to openrouter/deepseek/deepseek-v3.2
Update the content writer cron job model override
Add qwen3:32b to the Ollama models list for Scout
Restart the gateway

No code changes. No new infrastructure. No SDK upgrades. Five config edits and a restart.

Fallback Configuration

We also added a global fallback model. If any agent’s primary model fails — including Scout’s local Ollama if the ThinkPad goes offline — the system automatically falls back to DeepSeek V3.2 via OpenRouter. This means:

ThinkPad offline? Scout keeps working via OpenRouter (~$0.25/M tokens until it’s back)
DeepSeek has an outage? Agents will retry and the cron scheduler handles transient failures
OpenRouter down? Adrian and Site Builder continue on ChatGPT Pro, others wait for recovery

Smoke Tests (All Passed)

Before relying on cron schedules, we ran each agent once manually:

Domain Analyst (was: hallucinating tool calls, now: DeepSeek V3.2):

“Hello. Files I can confirm in the workspace: AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, BOOTSTRAP.md, memory/2026-03-08.md”

It read its files. It listed them. It actually used tools. First time in 24 hours.

Content Writer (was: crashing on every error, now: Gemini 2.5 Flash):

“Hello! My current task is to create landing page copy for aitoolchamp.com, including a hero headline, subhead, value proposition, and CTA, and save it as landing-page-copy.md in the workspace/drafts/ directory.”

It read its directives. It understood its task. It didn’t crash. Also a first.

Revenue Ops (was: 50% empty sessions on Codex 5.3, now: DeepSeek V3.2):

“Hello. DIRECTIVES.md says I’m Revenue Ops in Phase 1, tasked with building financial tracking infrastructure. Tools available: read, write, edit, grep, find, ls, apply_patch, exec…”

Identified its role, listed its tools. Clean execution.

The Experiment Framework: How We’ll Iterate

This model migration isn’t a one-time fix — it’s the start of a continuous experimentation process. Here’s the framework we’re building:

1. Measure Everything

Revenue Ops now tracks OpenRouter API costs alongside all other expenses:

Token usage per agent per day
Cost per session (input + output tokens * model pricing)
Daily/weekly/monthly burn rate
Cost vs. health score (are we paying for quality?)

2. Set Thresholds

Health score floor: 80% per agent (below = investigate)
Cost ceiling: $20/month for OpenRouter (above = optimize or downgrade)
Session productivity: >70% of sessions must produce file changes

3. Experiment Protocol

When considering a model change:

Run 5 manual test sessions with the candidate model
Check: Does it make tool calls correctly? Handle errors? Follow directives?
Deploy to one non-critical agent for 24 hours
Compare health scores and output quality vs. the previous model
If better: roll out. If worse or same: revert.

4. Models to Watch

The AI model landscape moves fast. Models we’re tracking for future evaluation:

DeepSeek V3.2 Speciale ($0.40/$1.20) — premium tier of our current workhorse
Gemini 2.5 Flash Lite ($0.10/$0.40) — could replace DeepSeek for simpler tasks at half the cost
Qwen 3.5 updates — if Ollama-hosted models improve tool calling, we save API costs entirely
MiniMax M2.5 — extremely cheap ($0.000295/$0.0012), worth testing for basic tasks
Claude Sonnet — if we need premium quality for a specific agent, $3/$15 per 1M is still cheaper than a second ChatGPT Pro subscription

Cost Tracking: Revenue Ops’ New Job

We’ve updated Revenue Ops’ directives with explicit cost tracking requirements:

Track OpenRouter costs by agent and model
Calculate daily/weekly/monthly burn rates
Alert Adrian if monthly costs exceed $20
Compare actual vs. projected spend ($3-5/month target)
Include cost data in every cycle report

Adrian’s directives now include a cost monitoring mandate — he checks Revenue Ops’ P&L tracker on every heartbeat and flags anomalies.

The goal: our total infrastructure cost should be $200 (ChatGPT Pro) + $8 (Contabo VPS) + $3-5 (OpenRouter) = ~$213/month until we’re generating revenue. Once revenue starts, we can afford to upgrade models for better quality.

Key Lessons

1. The cheapest model that works is the right model.

DeepSeek V3.2 at $0.25/M input tokens does 90% of what Codex 5.3 does for agent tasks. The remaining 10% (complex reasoning, nuanced code generation) matters for Adrian and Site Builder but not for Domain Analyst or Revenue Ops.

2. “Big” doesn’t mean “capable.”

llama3.3:70b is a 70 billion parameter model. It’s impressive at many tasks. It literally cannot make a function call in OpenClaw. Meanwhile, qwen3:32b (less than half the size) handles tool calling reliably. Model size tells you almost nothing about task-specific capability.

3. Local models are free but not zero-cost.

Running three models on 16GB of VRAM means constant swapping, 30-minute timeouts, and thermal throttling. The hidden cost is time — sessions that take 15 minutes locally take 30 seconds via API. For agents that run 24/7, those minutes add up.

4. Fallbacks prevent outages, not just errors.

Our ThinkPad travels with us across Europe. It goes offline during transit, when Wi-Fi is bad, during Windows updates. Without a fallback, Scout goes dark for hours. With OpenRouter fallback, Scout keeps working — we just pay a small API cost until the laptop reconnects.

5. Experiment, don’t assume.

We assumed Codex 5.3 would be the best at everything because it’s the “flagship” model. But for content writing, it’s mediocre. For structured data analysis, DeepSeek matches it. For agent tool calling, qwen3:32b at 0 cost beats llama3.3:70b at 0 cost. The only way to know is to test.

What’s Next

Monitor health scores over the next 24 hours to see if the model migration delivers the expected improvement
Track actual OpenRouter costs vs. our $3-5/month estimate
Evaluate if Scout should move to OpenRouter permanently (eliminating the ThinkPad dependency entirely)
Test Gemini 2.5 Flash Lite as a cheaper alternative for Domain Analyst tasks
Consider a dedicated “model evaluator” agent that periodically tests new models and recommends upgrades

The model landscape changes weekly. New releases, price drops, capability improvements. Having an experimentation framework means we can ride those waves instead of being locked into yesterday’s choices.

This is Day 2 of building a revenue-generating AI agent swarm in public. For the observability story, see Day 2: Building Observability. For project overview, see the README.