Two Days Chasing a Codex Harness Bug: The Fix Was Bypassing It Entirely

2026-04-22 · MoneyMachine

The Escape Hatch Nobody Documents: Making OpenClaw’s Codex Harness Fall Back to Ollama

TL;DR: If you run OpenClaw agents on openai-codex/gpt-5.3-codex-spark (free via ChatGPT Pro) and you’ve hit the weekly cap, your fallback chain to ollama-cloud/glm-5.1 or openrouter/deepseek/deepseek-v3.2 will silently die with CodexAppServerRpcError: Model provider not found. The Codex subprocess can’t see providers outside the Codex family. Issue #35220 was closed as “not planned.” But there’s a barely documented config value (embeddedHarness.fallback: "pi") that might let the internal OpenClaw engine pick up where Codex dies. This post runs the experiment live.

Context: what broke

Four of my AI agents (Adrian the CEO, Builder, Release Engineer, QA Engineer) use openai-codex/gpt-5.3-codex-spark as their primary model. That’s the free-via-ChatGPT-Pro-subscription pathway — I pay $200/mo for ChatGPT Pro and the agents authenticate through it instead of burning OpenAI API credits. When the weekly cap hits, a fallback chain is supposed to take over:

openai-codex/gpt-5.3-codex-spark    ← primary (free via ChatGPT Pro)
  ↓ rate_limit
openai-codex/gpt-5.4                ← Codex fallback (also free)
  ↓ rate_limit
ollama-cloud/glm-5.1                ← cloud Ollama ($20/mo flat)
  ↓ error
openrouter/deepseek/deepseek-v3.2   ← API pay-as-you-go

On paper, this means the factory degrades gracefully: when the free tier runs out, it spends a few dollars on GLM-5.1 or DeepSeek and keeps shipping.

In practice, here’s what I watched happen in the gateway log this morning at 09:22:

[model-fallback/decision] decision=candidate_failed
  requested=openai-codex/gpt-5.3-codex-spark
  candidate=ollama-cloud/glm-5.1
  reason=unknown
  next=openrouter/deepseek/deepseek-v3.2
  detail=failed to load configuration: Model provider `ollama-cloud` not found

[diagnostic] lane task error: error="CodexAppServerRpcError:
  failed to load configuration: Model provider `openrouter` not found"

The Codex harness rejects every fallback outside its own provider family. ollama-cloud not found. openrouter not found. The agent fails completely. Release Engineer — the agent responsible for actually deploying products to Cloudflare Pages — has 81 approved products queued up and can’t ship any of them.
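
For anyone who wants to tally these decisions instead of reading them by eye, here is a minimal parsing sketch. It assumes the layout seen in the excerpt above (a `[model-fallback/decision]` header line followed by indented key=value lines); the real log may contain other shapes, and the function name is made up for illustration.

```python
import re

# Sample copied from the gateway log excerpt above.
SAMPLE = """\
[model-fallback/decision] decision=candidate_failed
  requested=openai-codex/gpt-5.3-codex-spark
  candidate=ollama-cloud/glm-5.1
  reason=unknown
  next=openrouter/deepseek/deepseek-v3.2
  detail=failed to load configuration: Model provider `ollama-cloud` not found
"""

HEADER = "[model-fallback/decision]"
FIELD = re.compile(r"^(\w+)=(.*)$")


def parse_fallback_decisions(text):
    """Return one dict per fallback-decision block.

    Assumed layout: a header line carrying the first key=value pair,
    then indented key=value continuation lines.
    """
    decisions = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(HEADER):
            current = {}
            decisions.append(current)
            line = line[len(HEADER):].strip()
        elif not raw.startswith((" ", "\t")):
            current = None  # an unindented line ends the current block
            continue
        if current is None:
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2)
    return decisions


if __name__ == "__main__":
    for decision in parse_fallback_decisions(SAMPLE):
        print(decision["candidate"], "->", decision["next"], "|", decision["detail"])
```

Grouping the parsed records by `detail` is a quick way to see whether every rejected candidate failed for the same reason.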

Why this is architectural, not a bug

OpenClaw has two execution paths for agents:

The internal pi engine — OpenClaw’s native model router. Knows about every provider you’ve configured (ollama-cloud, openrouter, arcee, everything). Handles fallback across providers cleanly.
The Codex harness — a subprocess that OpenClaw spawns to delegate execution to OpenAI’s Codex app-server. This is the only way to use free ChatGPT Pro tokens for agent runs. The subprocess is essentially a sandboxed Codex runtime with its own internal provider registry — and that registry only knows Codex providers (openai-codex/gpt-5.3-codex-spark, openai-codex/gpt-5.4, etc).

When the gateway’s fallback logic picks ollama-cloud/glm-5.1 as the next model and hands it to the Codex subprocess, the subprocess doesn’t know that handle. It returns the “provider not found” error. There is no bridge from the subprocess back out to the gateway’s provider registry.
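
The mismatch can be pictured as two separate lookup tables. This is a toy model only: the provider names come from this post, while the two sets and the helper functions are illustrative and say nothing about how OpenClaw or the Codex app-server are actually implemented.

```python
# Toy model: which providers each side can resolve (assumed, per the errors above).
GATEWAY_PROVIDERS = {"openai-codex", "ollama-cloud", "openrouter"}
HARNESS_PROVIDERS = {"openai-codex"}  # what the Codex subprocess appears to know

CHAIN = [
    "openai-codex/gpt-5.3-codex-spark",
    "openai-codex/gpt-5.4",
    "ollama-cloud/glm-5.1",
    "openrouter/deepseek/deepseek-v3.2",
]


def provider_of(model_id):
    """Treat the first path segment of a model handle as its provider."""
    return model_id.split("/", 1)[0]


def resolvable(chain, registry):
    """Split a fallback chain into models a registry can and cannot resolve."""
    ok = [m for m in chain if provider_of(m) in registry]
    missing = [m for m in chain if provider_of(m) not in registry]
    return ok, missing


if __name__ == "__main__":
    _, gateway_missing = resolvable(CHAIN, GATEWAY_PROVIDERS)
    _, harness_missing = resolvable(CHAIN, HARNESS_PROVIDERS)
    print("gateway cannot resolve:", gateway_missing)
    print("harness cannot resolve:", harness_missing)
```

In this model the gateway resolves the whole chain, while the harness drops the last two entries, which matches the two "not found" errors in the log.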

This is a known limitation. Issue #35220 documents it as “Codex Responses API streaming server_error does not trigger model fallback.” It was closed as “not planned.” The maintainers made a deliberate architectural call: the Codex harness is Codex-only by design.

This makes sense if you’re a normal OpenClaw user — you’re either all-in on Codex (and you tolerate the cap) or all-in on API-key providers (and you skip Codex). But my cost architecture is specifically the hybrid: Codex for free tokens when available, cheap API providers as backstop. That’s the exact edge the upstream team didn’t design for.

The knob hiding in the docs

Buried in the Model Providers docs is a line about embeddedHarness.fallback:

Codex-only deployments can disable automatic PI fallback with agents.defaults.embeddedHarness.fallback: "none".

The phrase “automatic PI fallback” implies that PI fallback is the default. If you leave fallback unset, OpenClaw supposedly falls back from Codex to its internal engine automatically. But every example I can find (and my own config) explicitly sets fallback: "none", turning off the very thing that might save us.

The three documented values:

`fallback`	Behavior (as documented)
`"codex"`	Codex-only, fail hard if the subprocess dies
`"pi"`	Hand off to OpenClaw’s internal PI engine
`"none"`	Strict — no fallback at all

What’s unclear (and the reason nobody seems to have tried this): does "pi" mean the internal engine just retries the SAME model via a different code path, or does it fully take over and honor the agent’s model.fallbacks chain including external providers?

The docs don’t say. Issue #35220 was closed without touching this. Nobody on GitHub has posted about it.

So: experiment.

The experiment

Target: Release Engineer. Most blocked agent (81 approved products, 0 shipped). If "pi" doesn’t work, rolling back is one config line.

Before state:

{
  "id": "release-engineer",
  "model": {
    "primary": "openai-codex/gpt-5.3-codex-spark",
    "fallbacks": [
      "openai-codex/gpt-5.4",
      "ollama-cloud/glm-5.1",
      "openrouter/deepseek/deepseek-v3.2"
    ]
  },
  "embeddedHarness": {
    "runtime": "codex",
    "fallback": "none"    ← strict Codex-only, fails hard
  }
}

Change: flip "fallback": "none" → "fallback": "pi". Nothing else.

Procedure:

# Back up the config (paranoid) and remember the backup path for the diff below
CONFIG=/home/agentops/.openclaw/openclaw.json
BACKUP="$CONFIG.bak-$(date +%Y%m%d-%H%M%S)"
sudo cp "$CONFIG" "$BACKUP"

# Make the surgical change (python to avoid JSON escaping hell)
sudo python3 -c "
import json
p = '$CONFIG'
with open(p) as f:
    c = json.load(f)
for a in c.get('agents', {}).get('definitions', []):
    if a.get('id') == 'release-engineer':
        # setdefault avoids a KeyError if the agent has no embeddedHarness block
        a.setdefault('embeddedHarness', {})['fallback'] = 'pi'
# write inside a with-block so the file is flushed and closed before the restart
with open(p, 'w') as f:
    json.dump(c, f, indent=2)
"

# Fix ownership (agentops owns the config)
sudo chown agentops:agentops "$CONFIG"
sudo chmod 600 "$CONFIG"

# Verify the diff
sudo diff "$BACKUP" "$CONFIG"
# Expected: exactly one line changed

# Restart the gateway
sudo systemctl restart openclaw-gateway.service
until sudo ss -tlnp 2>/dev/null | grep -q ":18789"; do sleep 3; done

# Force-run release-engineer
sudo -u agentops openclaw cron run 04b2376f-d1ce-4d69-8c4e-cd56517b94f5

# Watch the gateway log for fallback decisions
# (matches the [model-fallback/decision] tag; \bpi\b matches "pi" as a whole word only)
sudo -u agentops tail -F /home/agentops/.openclaw/gateway.log \
  | grep -E "release-engineer|fallback[ /]decision|harness|\bpi\b"
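
A line diff checks formatting as much as content: json.dump rewrites the whole file, so if the original was indented differently the diff can show more than the one intended change. A structural comparison avoids that. The sketch below is a hedged illustration; the helper names are hypothetical, and the nesting under agents.definitions follows the patch script above.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts and lists into {"a.b[0].c": value} pairs."""
    if isinstance(node, dict):
        items = [(f"{prefix}.{k}" if prefix else k, v) for k, v in node.items()]
    elif isinstance(node, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(node)]
    else:
        return {prefix: node}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def changed_paths(before, after):
    """Return {path: (old, new)} for every leaf value that differs."""
    a, b = flatten(before), flatten(after)
    return {
        p: (a.get(p), b.get(p))
        for p in sorted(a.keys() | b.keys())
        if a.get(p) != b.get(p)
    }


if __name__ == "__main__":
    before = {"agents": {"definitions": [
        {"id": "release-engineer",
         "embeddedHarness": {"runtime": "codex", "fallback": "none"}},
    ]}}
    after = json.loads(json.dumps(before))  # cheap deep copy
    after["agents"]["definitions"][0]["embeddedHarness"]["fallback"] = "pi"
    print(changed_paths(before, after))
    # {'agents.definitions[0].embeddedHarness.fallback': ('none', 'pi')}
```

To use it on the real files, load the backup and the live config with json.load and pass both to changed_paths; a single entry in the result means only the intended value moved.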

What I’m watching for:

Does the gateway still come up clean after the config change? (Lowest bar — if the gateway won’t start, "pi" isn’t a valid value in my OpenClaw version.)
Does release-engineer’s cron execute without immediately failing?
When Codex hits rate_limit or 401, does the gateway log show a handoff to the PI engine (instead of just cascading through more Codex provider errors)?
Does PI pick up and actually talk to ollama-cloud or openrouter?
Do products actually ship to Cloudflare Pages?

Each of those is a separate potential failure point. Results below.

One complication worth naming: today, gpt-5.3-codex-spark is at its weekly cap, but gpt-5.4 has quota. So the native Codex-internal fallback (spark → gpt-5.4) should work even without "pi". That means:

If the agent reaches gpt-5.4 and succeeds, the product ships but we haven’t actually tested "pi" — we’ve just confirmed the Codex-internal chain works when auth is healthy.
If gpt-5.4 also fails (auth hiccup, cap, network), and "pi" kicks in and reaches ollama-cloud/glm-5.1 successfully, the experiment is fully validated.

Either way I learn something; the second outcome is the gold case. Even the first outcome moves the factory forward by letting Release Engineer ship.

Unplanned bonus: a third failure mode appeared. While I was staging the experiment, the gateway log showed a new error class from 11:29 CEST: Error: codex app-server exited: code=1 signal=null. The Codex subprocess was crashing immediately on startup, not hitting rate limits or auth issues. If this pattern persists, the experiment gets cleaner: both release-engineer (test, fallback: "pi") and qa-engineer (control, fallback: "none") will fail Codex startup, and the only difference in their behavior will be what happens next. That’s the A/B I’d have had to engineer manually.

Results (live — being filled in as events happen)

Outcome 1 — Gateway restart

Config change at: 2026-04-21T11:44:13+02:00
Gateway restart at: 2026-04-21T11:44:43+02:00
Listener up at: 2026-04-21T11:44:46+02:00 (~3s)
Systemctl status: active
Result: ✅ "pi" is a valid config value; gateway parses and starts cleanly.

Outcome 2 — First cron run enqueues cleanly

Force-run at: 2026-04-21T11:46:27+02:00
Response: {"ok": true, "enqueued": true, "runId": "manual:...:1"}
[gateway] agent model: openai-codex/gpt-5.3-codex-spark — config still reads the Codex primary correctly
[gateway] ready (8 plugins...; 56.1s) — startup + 8 plugins loaded
[gateway] qmd memory startup initialization armed for agent "release-engineer" at 11:48:33
Result: ✅ Cron enqueues and the agent starts its qmd initialization without rejecting the new fallback: "pi" value.

Outcome 3 — Fallback behavior under Codex failure

What the cron history shows for release-engineer:

Pulling the run history from openclaw cron runs --id <id> is more revealing than tailing the live log. Every recorded run for the past ~4 hours — 52 consecutive errors — has the same terminal state:

{
  "status": "error",
  "error": "CodexAppServerRpcError: failed to load configuration: Model provider `openrouter` not found",
  "durationMs": 41867,     // ~40 seconds
  "consecutiveErrors": 52
}

The pattern: the Codex subprocess starts; spark hits rate_limit and falls to gpt-5.4; gpt-5.4 fails on auth and falls to ollama-cloud/glm-5.1, which Codex rejects; the chain falls to openrouter/deepseek/deepseek-v3.2, which Codex also rejects; the run ends in ~40s with that final error in the history.

Post-experiment: the run that was supposed to test "pi" is still marked "runningAtMs": 1776764844708 in cron/jobs.json — 13 minutes in. No recorded termination yet. Previous failing runs all completed in under 4 minutes, so this is already 3× longer than any failure pattern I have data for. That could mean:

"pi" kicked in, the internal engine spawned a working model call to ollama-cloud/glm-5.1, and the agent is legitimately processing 81 inbox items. Release Engineer’s workload is non-trivial — per-product it has to read a YAML spec, run wrangler, update DNS, write a handoff file. If it got 5 minutes of actual work done before timing out, that’s evidence the fallback path works.
Or the run is silently hung in some new state and will eventually hit a lane-level timeout.

One observation that’s definitive either way: during startup, the gateway log threw lane wait exceeded: lane=nested waitedMs=400094 queueAhead=0 at 11:54:33 — the experimental run sat in a queue for 6m40s with nothing ahead of it. That means the stuck-state isn’t blocking on upstream work, it’s blocking on itself (most likely waiting for the Codex subprocess to respond to a handshake).
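
The millisecond values in this section are easy to cross-check. A small sketch using only the Python standard library; the fixed +02:00 offset is taken from the timestamps quoted in the results above, and the helper names are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=2))  # the +02:00 offset used in the timestamps above


def epoch_ms_to_local(ms, tz=LOCAL):
    """Convert epoch milliseconds (as in runningAtMs) to an aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=tz)


def ms_to_min_sec(ms):
    """Split a millisecond duration into whole minutes and whole seconds."""
    return divmod(ms // 1000, 60)


print(epoch_ms_to_local(1776764844708).strftime("%Y-%m-%d %H:%M:%S"))  # 2026-04-21 11:47:24
print(ms_to_min_sec(400094))  # (6, 40)
```

The decoded runningAtMs lands about a minute after the force-run at 11:46:27, and 400094 ms is the 6m40s quoted above.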

Outcome 4 — Does a product actually ship?

No. And not for the reason I expected.

The experimental run terminated at 11:55:01 after 7 minutes 8 seconds with:

[diagnostic] lane task error: lane=session:agent:release-engineer:cron:...
  durationMs=428104
  error="Error: codex app-server exited: code=1 signal=null"

This is not a provider-not-found error. This is the Codex subprocess itself exiting with status code 1 during startup. The failure happened before the model-call layer, which is where fallback: "pi" would engage. PI never got a chance to do anything because there was no Codex subprocess alive for it to step in behind.

There’s a critical distinction here that the docs don’t make clear:

Failure layer	Does `fallback: "pi"` help?
Model rate limit (spark hits cap)	✅ Expected yes (not exercised in this run): Codex reports rate_limit, the gateway tries the next model, and if the last model in the chain fails, PI steps in
Model auth error (401 Unauthorized)	✅ Expected yes: same path as rate limit
Codex subprocess crash (exit code 1)	❌ No: no subprocess alive to invoke model fallback through
Codex subprocess hang (lane wait exceeded)	❓ Unknown: the lane timeout may or may not route to PI

My experiment hit case 3 and possibly case 4. Neither is the case "pi" is designed for.
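
One way to keep these layers apart in tooling is to classify the raw error text. The sketch below is hypothetical: the substrings are copied from the log excerpts in this post, and the layer names are illustrative labels, not anything OpenClaw defines.

```python
from enum import Enum


class FailureLayer(Enum):
    MODEL = "model"      # error reported by a live subprocess
    PROCESS = "process"  # the subprocess itself exited
    LANE = "lane"        # the gateway lane timed out waiting
    UNKNOWN = "unknown"


# Substrings taken from the log lines quoted in this post (assumed to be stable).
_PATTERNS = [
    ("codex app-server exited", FailureLayer.PROCESS),
    ("lane wait exceeded", FailureLayer.LANE),
    ("Model provider", FailureLayer.MODEL),
    ("rate_limit", FailureLayer.MODEL),
]


def classify(error_text):
    """Map an error string to the layer it most likely came from."""
    for needle, layer in _PATTERNS:
        if needle in error_text:
            return layer
    return FailureLayer.UNKNOWN


if __name__ == "__main__":
    print(classify("Error: codex app-server exited: code=1 signal=null"))
    print(classify("CodexAppServerRpcError: failed to load configuration: "
                   "Model provider `openrouter` not found"))
```

Counting errors per layer rather than per agent would have shown two separate problems instead of one long streak.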

What the gateway actually did on crash: milliseconds after the codex app-server exited: code=1 error, the log captured one fallback decision:

[model-fallback/decision] decision=candidate_failed
  requested=openai-codex/gpt-5.3-codex-spark
  candidate=openai-codex/gpt-5.3-codex-spark
  reason=unknown
  next=openai-codex/gpt-5.4
  detail=codex app-server exited: code=1 signal=null

Note what’s NOT here: no next=ollama-cloud/glm-5.1, no invocation of the PI engine, no handoff out of the Codex harness. The fallback went to gpt-5.4 — another Codex-family model that would try to spawn another Codex subprocess (which would also crash for the same underlying reason). And the outer lane timer had already fired its 428-second budget, so the run terminated before gpt-5.4 even got to spin up.

So, as far as this run shows, fallback: "pi" does not fire on subprocess crashes; it appears to apply only to per-model errors inside a live subprocess. That’s an important nuance the docs leave implicit.

The real finding

The interesting artifact of this experiment wasn’t the "pi" setting — it was that the Codex subprocess has been failing for Release Engineer in two distinct ways over the past 4 hours:

52 consecutive “provider not found” errors (the bug I thought I was fixing) — these happen when the subprocess is alive but falls back to external providers it can’t resolve.
At least 3 fresh “subprocess exited code=1” errors — these happen when the subprocess itself won’t start at all.

Mode 1 is architectural — fixable with "pi" in principle. Mode 2 is environmental — possibly codex CLI version drift, maybe memory pressure from the 6-day-old session state, maybe an upstream Codex app-server bug.

Either way, for Release Engineer today, the pragmatic fix isn’t "pi". It’s: remove Release Engineer from the Codex harness entirely. Change its primary from openai-codex/gpt-5.3-codex-spark to ollama-cloud/glm-5.1, change embeddedHarness.runtime from "codex" to "auto" (which lets OpenClaw pick the native PI engine when the primary is non-codex), and lose the free-token benefit in exchange for actually shipping products.

Math: on ollama-cloud/glm-5.1 via Ollama Cloud Pro ($20/mo flat, no per-call cost), Release Engineer is free at the margin anyway. The “free via ChatGPT Pro” Codex benefit was valuable for Adrian and Builder (high-volume reasoning work). For Release Engineer, whose job is “read a YAML spec, run wrangler, write a handoff file,” Codex was probably overkill. GLM-5.1 has 200K context and is perfectly capable of that work.

Decision

Keep fallback: "pi" on release-engineer’s config as a safety net for future model-layer failures. It’s harmless when inactive and might prevent a future outage.
Switch release-engineer’s primary to ollama-cloud/glm-5.1 and harness runtime to "auto". This bypasses Codex entirely and unblocks deploys today.
Keep Adrian + Builder on Codex — their workload justifies the free tokens, and both have fallback chains internal to Codex (spark → gpt-5.4) that work fine when the subprocess is healthy.
Revisit Codex subprocess crashes as a separate investigation — likely a bug-report-to-upstream or a codex-cli version fix, not something to work around in agent config.

What the experiment did prove

Even though the headline hypothesis (“"pi" saves us today”) didn’t pan out, four things became concrete:

embeddedHarness.fallback: "pi" is a valid config value in OpenClaw 2026.4.19-beta.2 — the gateway accepts it and the agent starts cleanly with it set.
The Codex subprocess failure mode I’d been battling for hours was actually two different bugs stacked on top of each other — provider-not-found (model layer) AND subprocess-exit-code-1 (process layer). They look similar in cron error counters; they need different fixes.
Cron state can get stuck with runningAtMs set if a run dies between the gateway-lane layer and the cron-scheduler layer. Fix: openclaw cron disable <id>; openclaw cron enable <id> clears the stuck flag.
A counter like consecutiveErrors: 52 should have been visible on the dashboard. When a single agent errors identically fifty-two times in a row, that’s a five-alarm signal buried in the cron store. The factory’s observability has a gap.

That last one is going on task #59 (dashboard session + fallback-usage metrics).

Side finding — 52 consecutive errors, zero Telegram alerts

Before the experiment, release-engineer had failed 52 consecutive times with the same error message. Nothing on Telegram. Nothing on the dashboard alert strip. The factory’s 81-item deploy backlog was entirely invisible in routine monitoring. It surfaced only because I read through a gateway log dump with fresh eyes.

Lesson: “consecutive errors” is exactly the kind of metric you want on the dashboard. If a single agent errors 52 times in a row with identical error text, that should ring a bell. This falls under task #59 (dashboard session-length + fallback-usage metrics).

What this means for anyone running this architecture

If you’re running Codex-harness agents with external-provider fallbacks, here’s the hierarchy of what actually matters:

First, make sure your Codex subprocess can even start. If you’re seeing codex app-server exited: code=1, nothing else matters: no fallback mechanism can rescue a process that won’t boot. Check your codex CLI version, your auth state, your memory pressure, and the session-file sizes the agent is pulling in.
Second, decide whether Codex is worth it per-agent. For heavy reasoning work, yes — the free ChatGPT-Pro tokens are meaningful. For “read YAML, call API, write file” agents like Release Engineer, an Ollama Cloud or OpenRouter agent is probably simpler, cheaper at the margin, and just as effective.
Third, if you do want Codex + external fallbacks, embeddedHarness.fallback: "pi" is the knob to reach for. It’s valid, it’s cheap to try, it doesn’t break anything when it’s not triggered. But it only helps with model-layer failures inside a running subprocess — not subprocess crashes.

The upstream fixes worth filing:

Document embeddedHarness.fallback: "pi" in the provider docs. Today it’s only visible via an aside about disabling it with "none". The positive-case semantics aren’t explained.
Surface consecutiveErrors on the dashboard or emit Telegram alerts when it exceeds a threshold. 52 silent failures is an observability hole.
Disambiguate “subprocess crash” vs “model error” in the cron run log. Right now they both show as CodexAppServerRpcError, which makes diagnosis harder than it needs to be.

Update — 2026-04-22 evening: the real fix

I came back to this 31 hours after the first attempt. The ChatGPT Pro weekly cap was still exhausted, release-engineer had accumulated 77 consecutive identical errors, and still zero Telegram alerts.

First I upgraded OpenClaw from 2026.4.19-beta.2 to 2026.4.21. The changelog explicitly included “clean cron state files” (fixes our stuck runningAtMs state) and “safer session pruning” (fixes the unbounded-session bug from Sunday’s outage). Upgrade was clean — the jobs.json split into jobs.json + jobs-state.json during startup, no manual migration required.

The upgrade did NOT fix the subprocess crash. First force-run after upgrade, same error: codex app-server exited: code=1 signal=null in 36 seconds. This ruled out “it’s a bug fixed in a later version” as a theory.

At that point the remaining option was ripping Codex off release-engineer entirely. Three config changes:

model.primary: openai-codex/gpt-5.3-codex-spark → ollama-cloud/glm-5.1
model.fallbacks: Codex-family → ["openrouter/deepseek/deepseek-v3.2", "openrouter/google/gemini-2.5-flash"]
embeddedHarness.runtime: "codex" → "auto" (lets OpenClaw pick native PI when primary isn’t Codex)
Kept embeddedHarness.fallback: "pi" for the safety net
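
Expressed against the agent definition shown in the "Before state" listing, the changes look roughly like this. It is a sketch on an in-memory dict, not a patch for the live file; the function name is hypothetical, and the DeepSeek model ID is written the way it appears in the earlier config listing.

```python
import copy


def move_off_codex(agent):
    """Return a copy of an agent definition with the Codex harness bypassed."""
    updated = copy.deepcopy(agent)
    updated["model"]["primary"] = "ollama-cloud/glm-5.1"
    updated["model"]["fallbacks"] = [
        "openrouter/deepseek/deepseek-v3.2",
        "openrouter/google/gemini-2.5-flash",
    ]
    harness = updated.setdefault("embeddedHarness", {})
    harness["runtime"] = "auto"  # lets OpenClaw pick PI for a non-Codex primary
    harness["fallback"] = "pi"   # kept as the safety net
    return updated


if __name__ == "__main__":
    before = {
        "id": "release-engineer",
        "model": {
            "primary": "openai-codex/gpt-5.3-codex-spark",
            "fallbacks": ["openai-codex/gpt-5.4", "ollama-cloud/glm-5.1"],
        },
        "embeddedHarness": {"runtime": "codex", "fallback": "pi"},
    }
    print(move_off_codex(before))
```

Working on a copy keeps the original definition around for a quick rollback comparison.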

Restarted gateway. Force-triggered release-engineer.

Five minutes later: session file 40KB, 32 messages, 28 real LLM turns, successful exec tool calls returning exitCode: 0. Agent is actively running. No Codex subprocess spawning. No crash. It’s just… working.

The fix wasn’t to make Codex work better. It was to take Codex away from the agent that didn’t need it.

The architectural lesson

When I set up release-engineer I put it on openai-codex/gpt-5.3-codex-spark reflexively, because that’s Adrian’s primary and Adrian is the CEO agent. Same free-tokens-via-ChatGPT-Pro economics. Seemed sensible.

But release-engineer’s job is: read a YAML file, call wrangler, write a handoff note. It does not need the Codex harness’s agentic loop. It does not need strict-agentic mode. It does not need reasoning-heavy inference. It’s a glorified deploy script wrapper. Routing it through a sandboxed Codex subprocess was operational overhead masquerading as architectural consistency.

A broader principle falls out of this: match the harness to the workload, not the pattern. The free tokens aren’t worth much if the harness crashes every run.

What I’m not doing — migrating frameworks

Two days of pain is enough time to start Googling “OpenClaw alternatives.” I did. I looked at Hatchet, Temporal, Inngest, Restate, DBOS Transact, and NousResearch’s Hermes Agent. Wrote up a full evaluation with migration cost estimates.

The punchline of that evaluation: migrating now would be chasing shiny objects. Every alternative has its own version of these bugs. Hermes has a critical open state.db corruption issue that mirrors our session-bloat pain exactly. Temporal’s self-host complexity is wrong for our single-VPS scale. Hatchet IS the right answer if we ever migrate — but the migration itself is a 2-3 week distraction at $0 revenue, and it doesn’t solve the problem for Adrian who legitimately benefits from Codex.

The right answer today is a three-setting config change on one agent plus a /usr/local/bin/mm-check-consecutive-errors cron that fires a Telegram alert when any agent hits 10 identical errors in a row. The latter is the real lesson: the observability gap is what let this become a two-day incident. A framework swap would paper over the surface; the alert closes the gap.
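
The core of such a check is small. Below is a minimal sketch of the counting logic only, under stated assumptions: each run record carries "status" and "error" fields like the cron history entry shown earlier, runs are ordered oldest first, and the function names are invented. Fetching the history and sending the Telegram message are deliberately left out.

```python
def trailing_identical_errors(runs):
    """Count how many of the most recent runs failed with the same error text.

    `runs` is ordered oldest first; each item is a dict with "status" and
    "error" keys (schema assumed from the cron run record in this post).
    """
    count = 0
    last_error = None
    for run in reversed(runs):
        if run.get("status") != "error":
            break  # a successful run ends the streak
        error = run.get("error")
        if count == 0:
            last_error = error
        elif error != last_error:
            break  # a different error text ends the streak
        count += 1
    return count, last_error


def should_alert(runs, threshold=10):
    """Return (alert?, streak length, error text) for one agent's history."""
    count, error = trailing_identical_errors(runs)
    return count >= threshold, count, error


if __name__ == "__main__":
    history = [{"status": "ok"}] + [
        {"status": "error", "error": "codex app-server exited: code=1 signal=null"}
    ] * 12
    print(should_alert(history))
```

Comparing the full error text keeps the two failure modes from this incident in separate streaks, so a change in error class resets the count instead of hiding inside it.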

Calendar reminder for 2026-05-22: revisit this decision. If the monitoring is live and consecutive-error alerts have fired exactly zero times in the intervening month, we stay. If they’ve fired more than twice, we pull the trigger on a Hatchet two-tier migration.

Building in public is sometimes “I found a hidden config knob” and sometimes “I tried four theories over two days before the obvious one worked.” This one was the second kind. Publishing both.

Why I’m posting this

Most content about tools like OpenClaw is marketing (“look how many providers we support!”) or bug reports that never close. The real operational knowledge — which undocumented knob does what under degraded conditions — lives in Slack DMs and never reaches anyone.

If you’re running a hybrid Codex + external-provider architecture on OpenClaw, this post is the thing I wish I’d found this morning at 08:00 when I was watching my release pipeline fail in a new way. Take what you need.

Experiment in progress. This post will update with the actual outcomes.