The Codex ACL Trap: When One Permission Bit Kills Your Entire Agent Factory

2026-04-24 · MoneyMachine

This one cost me about 40 minutes of debugging that I’d love you to not have to repeat. If you’re running OpenClaw or any agent framework that spawns the OpenAI Codex CLI as a subprocess, there’s a real chance your agents are silently broken right now.

Symptom: cron jobs that never actually run

Last night I was trying to migrate my main orchestrator agent (Adrian) from gpt-5.3-codex-spark to the brand-new gpt-5.5 that OpenAI shipped yesterday. Standard cron edit, changed the model override, restarted the gateway. And then… nothing.

Adrian’s cron scheduler kept firing on time. The jobs-state.json file said runningAtMs: 22:53:06. But there was no main session JSONL being written. No codex process in ps aux. No embedded run agent end in the gateway log. Just silence.

I thought I was dealing with a scheduler bug, maybe something left over from all the restarts I’d done. I cleared the runningAtMs flag by hand. Restarted the gateway. Used openclaw cron disable / enable to reset in-memory state. Every force-run returned {ok: true, enqueued: true} — and then produced exactly zero work.

I was starting to suspect an OpenClaw regression. Then I ran a stupidly basic check:

sudo -u agentops timeout 8 bash -c 'codex --version'

Output:

Error: spawn /usr/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/codex/codex EACCES

The bug

Here’s what was going on under the hood. The Codex CLI ships as a JS wrapper (/usr/bin/codex) that internally spawns a native Rust binary at /usr/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/codex/codex. As the error above shows, even the version query goes through that native binary, so if the runtime user can’t execute it, nothing Codex-related works.

$ ls -la /usr/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/codex/codex
-rwxr--r-x+ 1 root root 192581960 Apr 21 09:35 codex

See that + suffix? That’s a POSIX ACL. And the ACL mask was the killer:

$ getfacl <path>
# owner: root
user::rwx
group::r-x	#effective:r--
group:agentops:r--
mask::r--           ← this line
other::r-x

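mask::r-- caps every group-class entry at read-only: the owning group’s r-x is cut down to an effective r--, and the named-group entry group:agentops has no x to begin with. That is also why ls shows r-- in the group slot: on a file with an ACL, those three characters display the mask, not the group:: entry. And other::r-x doesn’t rescue anything, because once a process matches a group entry, access is decided on the group entries alone and never falls through to other. No execute. Can’t spawn.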
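To make the mask arithmetic concrete, here is a minimal sketch in POSIX sh. The function name effective_perms is hypothetical, not part of any ACL tooling; it only intersects two rwx strings the way getfacl's #effective comment does.

```sh
# Hypothetical helper: intersect an ACL entry's rwx string with the mask.
# A permission survives only if both the entry and the mask grant it.
effective_perms() (
  entry=$1
  mask=$2
  out=""
  for i in 1 2 3; do
    e=$(printf '%s' "$entry" | cut -c "$i")
    m=$(printf '%s' "$mask" | cut -c "$i")
    if [ "$e" != "-" ] && [ "$e" = "$m" ]; then
      out="$out$e"
    else
      out="$out-"
    fi
  done
  printf '%s\n' "$out"
)
```

Calling effective_perms r-x r-- prints r--, which matches the #effective:r-- comment getfacl put next to group::r-x.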

The user agentops (the one that runs our OpenClaw gateway) couldn’t execute the binary. Every attempt to spawn Codex exited with EACCES. The JS wrapper caught the error, logged it to stderr, and the gateway marked the job as “running” — but the actual work never happened. Cron kept scheduling. Nothing executed.

This had been happening for hours before I noticed. Every single agent cycle that tried to use Codex was silently failing at spawn. GLM-5.1 cycles were still working because they don’t touch Codex. Silent partial degradation is the worst kind of bug.

The fix

Two commands:

sudo chmod a+rx /usr/lib/node_modules/@openai/codex/...../codex
sudo setfacl -b /usr/lib/node_modules/@openai/codex/...../codex

Now:

$ ls -la
-rwxr-xr-x 1 root root 192581960 Apr 21 09:35 codex
$ sudo -u agentops codex --version
codex-cli 0.124.0

Within 30 seconds of the fix, Adrian’s next cron tick actually ran. First successful GPT-5.5 cycle fired at 23:07 CEST. 22 turns, 188 seconds, zero errors.

Why this happened

Honestly I don’t know the exact umask or install-time setfacl call that produced the bad ACL. Best guess: either npm install -g @openai/codex inherited a restrictive umask at install time, or some earlier setup script (probably from when we originally configured the OpenClaw gateway as a separate user) applied a default ACL to /usr/lib that cascades to new files.

What I do expect: it will come back on the next npm upgrade. Every time you update the Codex CLI globally, the native binary gets re-extracted with fresh permissions, which on this system presumably means the bad ACL again.

Lessons

1. When agent runs hang with no error output, check process-spawn permissions first.

I spent 40 minutes chasing a scheduler bug that didn’t exist. The actual symptom (runningAtMs set but no subprocess in ps aux) is a near-certain sign of spawn failure, not scheduling failure. A codex --version that succeeds under some other account, root included, is a false-positive signal, because permissions are evaluated per user. Always test the native binary directly, as the runtime user: sudo -u <runtime-user> <path-to-native-binary> --version.

2. “I’ll write a tiny monitor for this” is the right instinct.

This bug is now a planned skill in our resolver: codex-binary-integrity. Runs daily. Verifies agentops can actually spawn the binary. If not, alerts. One more recurring-failure-turned-test.
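The skill itself isn't written yet, so the following is only a rough sketch of what such a check could look like in POSIX sh. The function name check_codex_spawn, the exit codes, and the ALERT wording are all assumptions for illustration; a real version would hook into whatever alerting the resolver uses.

```sh
# Hypothetical monitor: succeeds only if the *current* user can spawn
# "<binary> --version". Run it as the runtime user, not as root.
check_codex_spawn() (
  bin=$1
  if [ ! -e "$bin" ]; then
    echo "ALERT: $bin does not exist" >&2
    exit 2
  fi
  # test -x asks the kernel, so the ACL mask is taken into account.
  if [ ! -x "$bin" ]; then
    echo "ALERT: $(id -un) cannot execute $bin (check mode bits and ACL mask)" >&2
    exit 1
  fi
  if ! "$bin" --version >/dev/null 2>&1; then
    echo "ALERT: $bin --version failed for $(id -un)" >&2
    exit 1
  fi
  echo "OK: $(id -un) can spawn $bin"
)
```

The important design choice is that the check runs under the same account as the gateway (for example from that user's own crontab or via sudo -u), since a check run as root would pass on the broken file. Wrapping the call in timeout, as in the original diagnostic command, would also guard against a hang.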

3. Silent spawn failures are worse than loud ones.

OpenClaw’s gateway caught the EACCES, wrote it to the log, and marked the job “running.” Nothing alerted. Nothing retried with a different user. Nothing rolled back to a working model. This is a pattern I’m going to push back on across our agent tooling: a run that didn’t produce any output is never “successful” and should never be silently marked running.
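As an illustration of the behavior I'd want instead, here is a small POSIX sh sketch of a job wrapper that refuses to treat a failed spawn or an empty run as success. It is not OpenClaw's code; run_checked is a hypothetical name, and the only thing it relies on is the standard shell convention that exit status 126 means "found but not executable" and 127 means "not found".

```sh
# Hypothetical wrapper: run a command and fail loudly on spawn errors
# or on a run that produced no output at all.
run_checked() (
  output=$("$@")
  status=$?
  case $status in
    0) ;;
    126) echo "FAILED: cannot execute $1 (permission problem)" >&2; exit 126 ;;
    127) echo "FAILED: $1 not found" >&2; exit 127 ;;
    *)   echo "FAILED: $1 exited with status $status" >&2; exit "$status" ;;
  esac
  if [ -z "$output" ]; then
    echo "FAILED: $1 produced no output" >&2
    exit 1
  fi
  printf '%s\n' "$output"
)
```

A scheduler built on something like this would have flagged the very first EACCES as a failed job rather than leaving it marked "running".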

4. POSIX ACLs can silently override the ls -la display.

If the file has a + suffix in ls -la, the middle rwx triplet is showing the ACL mask, not the owning group’s entry. Run getfacl and check the mask:: line. The mask is an upper bound on the owning group and on every named user or named group entry: if it’s r--, none of them gets execute regardless of what’s declared.
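Both checks are easy to script. The sketch below, in POSIX sh, uses two hypothetical helper names: has_acl_marker inspects an ls-style mode string, and acl_mask pulls the mask out of getfacl output read from stdin.

```sh
# True when an ls-style mode string (e.g. "-rwxr--r-x+") carries the ACL marker.
has_acl_marker() {
  case $1 in
    *+) return 0 ;;
    *)  return 1 ;;
  esac
}

# Read getfacl output on stdin and print the three mask permission characters.
acl_mask() {
  sed -n 's/^mask::\([rwx-][rwx-][rwx-]\).*/\1/p'
}
```

In practice you would feed them real output, for example passing the first field of ls -ld for the binary to has_acl_marker and piping getfacl for the same path into acl_mask. A result without an x is the warning sign described above.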

5. Fix-and-keep-going beats fix-and-document-perfectly when revenue is on the line.

I took exactly one artifact from this: a memory entry in my Claude Code workspace so the next session doesn’t rediscover the bug. That’s it. No RFC, no root-cause doc, no postmortem template. The permanent fix is a planned skill that’ll run daily. Everything else gets captured in the commit message when I ship it.

Next recurring failure to skillify: the Stripe delivery gap. More on that tomorrow.

Support: hello@withjz.com · Building with Jeff & Zachary at buildwithjz.com