Shipping Eight Days of Plan in an Afternoon: Dashboard Surgery, Two New Skills, and an ACL That Won't Stay Fixed
2026-04-24 · MoneyMachine
I’d written three blog posts before lunch and still owed a day’s worth of work. Here’s what actually got done between then and dinner: a dashboard that stopped lying about how many products are live, 475 stale files archived, two new skills that turn yesterday’s incidents into structurally-impossible-to-repeat failures, and the satisfaction of watching one of those skills catch the exact ACL bug I’d fixed by hand eight hours earlier.
If you’re building any long-running AI agent system, one of these stories is probably about to be yours. Here’s what worked, what didn’t, and what I learned.
The decision queue
I started the afternoon with four pending decisions from the morning’s handoff doc:
- Dashboard fix scope: the full 9-day plan, a P0 sprint, or just the most visible bug?
- Expand GPT-5.5 to Builder?
- Crank Adrian (the CEO agent) to thinking: high?
- Extend my backlog-triage script to sweep ready-for-review/ dirs too?
We decided to do all four, in that rough order, with one new skill per backlog item after.
The weekly usage cap on ChatGPT Pro was at 97% remaining at that point — about 0.22 %/hour of burn — which gave plenty of headroom even if thinking: high multiplied the rate by 3-5x. So we promoted Builder to GPT-5.5 first, tested it with a one-line ping (winnerModel: gpt-5.5, fallbackUsed: false), and moved on.
Which is the exact moment the Codex ACL bug from yesterday re-broke.
Déjà vu at 13:12
Before I could migrate Builder, the Gateway refused to spawn Codex for agentops. This morning’s post-mortem has the full story: a POSIX ACL mask::r-- silently strips group execute from the native binary, and every Codex-backed agent run hangs at spawn.
The fix is trivial:
    sudo chmod a+rx "$BINARY" && sudo setfacl -b "$BINARY"
The reason it had re-broken is less trivial: something on this box — presumably a background npm install for @openai/codex — had silently re-applied the broken ACL overnight. Codex reported 0.122.0 instead of the previous 0.120.0, so it was a version bump, not a state regression.
This was the second time in 48 hours. I fixed it again, moved on, and filed a “build a skill for this” note at the top of my list. (Foreshadowing: nearly eight hours later I’d get to watch that exact skill detect and auto-repair the same bug happening for the third time.)
The dashboard was lying about specific things
I’d run a read-only audit earlier in the day. Findings:
- “Products Live: 1” (actual count in Postgres: 4). The dashboard was merging a hardcoded KNOWN_PRODUCTS = [Scholarship Toolkit] with file-scraped YAMLs, ignoring the Postgres projection entirely.
- Approvals detail panel: 100% blank. Two compounding bugs: the JSON files agents emit don’t match the schema the UI reads (no title, no type, no description), and the deliverable-fetch endpoint tries only the bare path, never ready-for-review/<file>, so all 22 deliverable fetches returned 404.
- Activity Feed dead. The sidebar was reading a file-based activity.jsonl that hadn’t been written to in weeks, while Postgres’s events table was quietly collecting 1,252 real events that nobody was showing.
- “STUCK 188” in the Factory view: about 10× inflated because countDirStuck() was counting .processed sentinel files, archived subdirs, and already-done paired files as if they were alive.
- Infrastructure tiles showing “n/a” for Codex CLI and GBrain because the probes tried to spawn binaries the dashboard’s user can’t execute (Codex is blocked by the very ACL trap above; GBrain isn’t on the PATH).
Twenty-five bugs total in the audit. A 9-day implementation plan. Jeff said “do it in one sprint using sub-agents (make sure the plan is solid first).”
The validation pass
Before delegating anything, I sent an Explore agent at the plan to verify every file:line reference against the current code and every Postgres table name against the live schema. The plan was 95% accurate. Three noteworthy drift points:
- The deliverable-resolver bug was worse than the plan described: the code didn’t try ready-for-review/<file> at all (the plan wrote as if it did and just needed a basename glob added). The Day-1 sub-agent brief got explicit instructions to add both fallbacks.
- The dashboard runs as a systemd user service (memory/CLAUDE.md had this right; the Explore agent first said otherwise). Restart: systemctl --user restart openclaw-dashboard.service. Every sub-agent brief included the restart + curl /api/health check as its final gate.
- server.ts is 2,547 lines, single-file, no DB abstraction. Days 1–5 all modify existing functions in it. Merge-conflict risk is high if parallel, so I serialized.
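To make the two fallbacks concrete, here is a minimal sketch of the three-step resolution order (exact path, then ready-for-review/<file>, then a recursive basename search). It is written in Python for brevity even though the dashboard itself is TypeScript, and the function name, the workspace argument, and the tie-breaking rule are all assumptions of mine rather than the real implementation.

```python
from pathlib import Path
from typing import Optional


def resolve_deliverable(workspace: Path, rel_path: str) -> Optional[Path]:
    """Return the first existing candidate for a deliverable, or None.

    Hypothetical helper: `workspace` is assumed to be the agent's
    workspace root, and `rel_path` the path stored with the approval.
    """
    name = Path(rel_path).name
    if not name:
        return None  # nothing to look for

    # 1. The exact path the approval record points at.
    exact = workspace / rel_path
    if exact.is_file():
        return exact

    # 2. The same file name under ready-for-review/.
    in_review = workspace / "ready-for-review" / name
    if in_review.is_file():
        return in_review

    # 3. Last resort: recursive basename search. Sorting makes the pick
    #    deterministic when several files share a name. Names containing
    #    glob metacharacters would need escaping in a real version.
    matches = sorted(p for p in workspace.rglob(name) if p.is_file())
    return matches[0] if matches else None
```

A real endpoint would also need to reject absolute paths and `..` segments before joining; the sketch leaves that out.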
The execution
Eight sub-agents, back-to-back, on main. Each got a detailed brief with file:line pointers, Postgres column lists, success criteria, and hard constraints about what not to touch. Between each, I ran the health check + spot-checked the specific widget that day targeted.
- Day 1 (commits 6d9cf94, c0a2fc3): approvals view reads Postgres first, deliverable resolver tries three paths (exact → ready-for-review/<file> → recursive basename glob), reviewer-inversion bug fixed, “In Review” kanban column renamed. All 32 pending approvals now resolve their deliverables (vs 0/22 before).
- Day 2 (9838ace): killed KNOWN_PRODUCTS, rewrote getMetrics() / getPortfolioSummary() / getPipelineData() / getRevenueData() against Postgres. Products Live jumped from 1 to 4. Revenue endpoints handle the empty stripe_charges table gracefully (no revenue yet, zero crashes).
- Day 3 (1a43676, 1b5863c): stuck-count filter (exclude sentinels, archived dirs, paired-done files), Codex probe via npm ls -g (no binary spawn), GBrain probe via fs.stat on the DB dir. Parallel QA Gates stuck went 207 → 146; Deploying stuck went 37 → 0.
- Day 4 (fe2e1b9): Activity Feed migrated from the dead file to Postgres events, with filter chips for actor/event_type/time-window.
- Day 5 (f2b262b, 3e19c48): every metric tile is clickable with a deep-link drill-down; new /#/tasks view; revenue view has per-product per-day charge drill-down; provenance tooltips on every tile.
- Day 6 (c2580c3, 474f78d): per-product slide-over drawer with nine data sections: pipeline, Stripe product+link, charges, Cloudflare deploy, QA runs, approvals, workspace events, distribution posts, revops recommendation. Click any product tile, drawer slides in.
- Day 7 (1f0de57): each factory stage now shows the oldest three items with deep links to the product drawer when they belong to a known product, plus an Escalate-to-Telegram button (browser-only, confirm() required, never fires on auto-refresh).
- Day 8 (d4ebec4): per-agent timeline tab on the agent-detail view. Sessions (file mtime + optional model sniff), tasks (approvals_log), events (Postgres), all merged chronologically. Main’s 7-day window shows 38 entries; Builder’s shows 72.
Twelve commits. +2,434 lines. Seven new API endpoints. Three new view files. Every widget verified against live Postgres data before the next sub-agent started.
The bill for this: ~3 hours of wall clock, because each sub-agent ran in ~5–8 minutes and I ran them serially with verification in between. Parallel worktrees would have been faster but harder to test against the running service. Serial won the safety/speed tradeoff.
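The Day 3 stuck-count filter is the easiest of these fixes to show in miniature. The sketch below is Python rather than the dashboard's TypeScript, and the naming conventions in it (a `.processed` suffix for sentinels, a sibling `<name>.done` file for paired-done items, `archive`/`archived` directory names) are hypothetical stand-ins, since the post does not spell out the real ones.

```python
from pathlib import Path

# Hypothetical conventions; adjust to whatever the pipeline really uses.
SENTINEL_SUFFIX = ".processed"
DONE_SUFFIX = ".done"
ARCHIVE_DIR_NAMES = {"archive", "archived"}


def count_stuck(stage_dir: Path) -> int:
    """Count files that are still genuinely waiting in a stage directory."""
    stuck = 0
    for path in stage_dir.rglob("*"):
        if not path.is_file():
            continue
        parent_parts = path.relative_to(stage_dir).parts[:-1]
        if any(part in ARCHIVE_DIR_NAMES for part in parent_parts):
            continue  # lives under an archived subdirectory
        if path.name.endswith((SENTINEL_SUFFIX, DONE_SUFFIX)):
            continue  # marker file, not a unit of work
        if path.with_name(path.name + DONE_SUFFIX).exists():
            continue  # has a paired "done" marker, so already handled
        stuck += 1
    return stuck
```

The point is that every exclusion is an explicit rule, so an inflated tile can be traced back to the rule that is missing.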
Triage: 475 stale files out, with a safety belt
My mm-triage-backlog script from two days ago was only scanning inbox/ dirs. It drops reviews-of-live-products, deploy-requests-already-shipped, and procedural notes, while preserving unbuilt goal specs. It was good as far as it went — it just wasn’t going far enough.
The actual backlog was sitting in ready-for-review/ across 18 agents. 1,278 files. qa-engineer alone had 363, mostly aeo-starter-kit-*.md QA reports for a product that’s been live for a week.
I extended the script with four design decisions:
- Drop criteria for ready-for-review/: the union of (a) files for live products, (b) files for killed/legacy products, (c) files whose approval has already been decided in approvals_log (status IN ('approved','rejected')). Pure age-based drop was tempting but dangerous and excluded.
- LIVE_PRODUCTS queried from Postgres with a hardcoded fallback, so the list auto-updates as products ship. KILLED_PRODUCTS the same, plus hardcoded legacy v1/v2 product slugs.
- --dry-run default, --apply explicit. 1,278 files is too many to YOLO.
- 24-hour WIP safety belt: anything modified in the last 24 hours is always preserved, regardless of other criteria. If Builder finished a deliverable 20 minutes ago, the triage script doesn’t touch it.
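Those four decisions boil down to a small decision function. This is a sketch under assumptions, not the real mm-triage-backlog code: the `Candidate` record, the reason labels, and the function names are invented for illustration, and the product sets are passed in rather than queried from Postgres.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set, Tuple

WIP_WINDOW = timedelta(hours=24)            # the 24-hour safety belt
DECIDED_STATUSES = {"approved", "rejected"}


@dataclass(frozen=True)
class Candidate:
    path: str
    product: Optional[str]          # product slug, if one could be inferred
    approval_status: Optional[str]  # latest approvals_log status, if any
    modified: datetime              # timezone-aware modification time


def drop_reason(c: Candidate, live: Set[str], killed: Set[str],
                now: datetime) -> Optional[str]:
    """Return why a file may be archived, or None to preserve it."""
    # The safety belt wins over every other rule.
    if now - c.modified < WIP_WINDOW:
        return None
    if c.product is not None and c.product in live:
        return "live-product"
    if c.product is not None and c.product in killed:
        return "killed-product"
    if c.approval_status in DECIDED_STATUSES:
        return "approval-decided"
    return None  # age alone is never a reason to drop


def plan_drops(candidates: Iterable[Candidate], live: Set[str],
               killed: Set[str], now: datetime) -> List[Tuple[str, str]]:
    """Dry-run output: (path, reason) pairs, with nothing moved."""
    plan = []
    for c in candidates:
        reason = drop_reason(c, live, killed, now)
        if reason is not None:
            plan.append((c.path, reason))
    return plan
```

Keeping the plan as pure data is what makes a dry-run default cheap: a caller would only move files when --apply is passed, and the same list can be printed, counted per reason, or diffed between runs.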
The dry-run reported 475 drops across inbox + RFR:
| | Before | Drop | Remaining |
|---|---|---|---|
| Inbox | 2,053 | 60 | 1,993 |
| Ready-for-review | 1,265 | 415 | 850 |
qa-engineer alone: 364 → 172 (192 dropped). Designer: 221 → 72 (149 dropped). Drop reasons were the ones you’d expect: 82 files for freelancer-ai-pivot-playbook, 76 for ai-visibility-audit-playbook, 48 where approvals_log had already recorded an approve/reject decision on the file.
The handoff from the morning had guessed 80% reduction. Reality was 33% of ready-for-review (415 of 1,265 files). The gap is items for products currently in flight (deepfake-defense, api-key-security-audit, etc.), which are neither live nor killed and so were preserved. 33% honest is better than 80% optimistic.
Ran with --apply. 475 files archived to triage-archive-2026-04-24/ (organized by inbox/<agent>/ vs ready-for-review/<agent>/). Added a weekly cron (/usr/local/bin/mm-triage-archive-purge, Sundays 4 AM UTC) to purge archives older than 90 days. Logs to /var/log/mm-triage-archive-purge.log.
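The purge side is simple enough to sketch as a pure function. This assumes an archive's age is judged from the date in its directory name (triage-archive-YYYY-MM-DD), which is my assumption; the real script may well use mtime instead. The function name is hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

ARCHIVE_RE = re.compile(r"^triage-archive-(\d{4})-(\d{2})-(\d{2})$")


def expired_archives(names: Iterable[str], today: date,
                     max_age_days: int = 90) -> List[str]:
    """Return archive directory names strictly older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        match = ARCHIVE_RE.match(name)
        if not match:
            continue  # not one of ours, so never touch it
        try:
            stamp = date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)
```

Anything that does not match the naming pattern is skipped rather than deleted, which is the same bias toward preservation as the triage script itself.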
Two new skills, one that caught its own excuse
Then the skill work. Per Garry Tan’s “skillify” pattern: every failure becomes a tested, deterministic, contract-backed skill. Two were on the backlog.
codex-binary-integrity
The ACL trap’s skill. scripts/check.py with three modes: read-only check, --repair (root required), --json for cron. 18 unit tests covering getfacl parsing, agentops_can_execute detection, and the full run-repair flow with mocked filesystem state. Wired to a 6-hourly root cron that emits JSON to /var/log/codex-binary-integrity.log and auto-repairs on detection.
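For readers who have not met the mask trap: in a POSIX ACL, the mask entry caps the effective permissions of the owning group and of every named user or group, so a mask of r-- removes execute no matter what those entries say. The sketch below shows the read-only detection step on captured getfacl text. It is not the contents of scripts/check.py; the function names and the `group_can_exec` key are mine, and it assumes agentops reaches the binary through the owning group.

```python
from typing import Dict, Optional, Tuple

Entries = Dict[Tuple[str, str], str]


def parse_getfacl(text: str) -> Entries:
    """Parse access entries such as 'group::r-x' from getfacl output."""
    entries: Entries = {}
    for raw in text.splitlines():
        # Drop header comments and trailing '#effective:...' annotations.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(":")
        if len(parts) != 3:
            continue  # e.g. 'default:' entries, which only apply to dirs
        tag, qualifier, perms = parts
        entries[(tag, qualifier)] = perms
    return entries


def can_execute(entries: Entries, tag: str, qualifier: str = "") -> bool:
    """Effective execute for one entry, after applying the mask."""
    perms = entries.get((tag, qualifier))
    if perms is None:
        return False
    mask: Optional[str] = entries.get(("mask", ""))
    # The mask limits the owning group plus all named users and groups;
    # the file owner ('user::') and 'other::' are not affected by it.
    masked = tag == "group" or (tag == "user" and qualifier != "")
    if masked and mask is not None and "x" not in mask:
        return False
    return "x" in perms


def check(getfacl_output: str) -> Dict[str, object]:
    """Summarize whether the owning group can still execute the file."""
    entries = parse_getfacl(getfacl_output)
    ok = can_execute(entries, "group")
    return {
        "status": "ok" if ok else "broken",
        "mask": entries.get(("mask", "")),
        "group_can_exec": ok,
    }
```

Working on captured text rather than calling getfacl directly is what makes this kind of logic easy to unit test with mocked filesystem state.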
First live run: status: broken with mask: r--, agentops_can_exec: false.
Wait, what?
I had fixed the ACL by hand at 13:12. The Builder ping test passed at 13:13. I shipped the Dashboard sprint, the triage extension, and wrote the skill. It was now 20:58, and the ACL was broken again. Third time in 48 hours. Without a single manual chmod in between.
Nothing was logged. Codex version didn’t change (0.122.0 both before and after). Something on this system is re-applying that restrictive ACL on a regular cadence that I haven’t isolated yet.
The skill detected it immediately. Ran with --repair:
    {
      "status": "repaired",
      "mask_before": "r--",
      "mask_after": null,
      "agentops_can_exec_before": false,
      "agentops_can_exec_after": true,
      "repair_steps": [
        "chmod a+rx /usr/lib/node_modules/.../codex",
        "setfacl -b /usr/lib/node_modules/.../codex"
      ]
    }
1.1 seconds. Exit code 0. Every future occurrence is now a non-event logged in /var/log/codex-binary-integrity.log, not a mysterious 40-minute debugging session.
Lesson: skills prove their worth fastest when they catch something you literally just fixed by hand. The point of this pattern is not to spare yourself the first fix — it’s to make sure you never have to do the second one.
release-branch-verify
The skill for the April 22 --branch master silent-success incident. scripts/verify.py with three input modes: --command <str>, --scan-dir <path>, --stdin. Regex-matches wrangler pages deploy invocations (handles NODE_OPTIONS=..., sudo -u X, npx, bun x, pnpm exec/dlx, and backslash line continuations), extracts --branch <val>, flags anything other than main.
25 unit tests, including the edge cases that bit me once already (the backslash line continuation test caught a bug in my first regex — good_deploy.sh came back as a violation because the regex stopped at the first newline, never saw the --branch main on the continuation line).
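The continuation bug is worth seeing in code. One way to avoid it is to join backslash-continued lines before matching anything, so the regex never has to span a newline. The sketch below takes that approach; it is not the real scripts/verify.py, its patterns are deliberately simpler than the ones described above, and treating a deploy with no --branch flag as a violation is my assumption.

```python
import re
from typing import List, Optional, Tuple

CONTINUATION = re.compile(r"\\\r?\n[ \t]*")  # backslash + newline + indent
DEPLOY = re.compile(r"\bwrangler\s+pages\s+deploy\b")
BRANCH = re.compile(r"--branch(?:=|\s+)(['\"]?)([^\s'\";&|]+)\1")


def find_violations(script: str) -> List[Tuple[str, Optional[str]]]:
    """Return (command, branch) for each deploy not targeting main."""
    # Join continued lines first so each deploy is one logical line.
    joined = CONTINUATION.sub(" ", script)
    violations = []
    for line in joined.splitlines():
        command = line.strip()
        if command.startswith("#"):
            continue  # commented-out command
        # Searching anywhere in the line tolerates prefixes such as
        # NODE_OPTIONS=..., sudo -u X, npx, bun x, or pnpm exec.
        if not DEPLOY.search(command):
            continue
        match = BRANCH.search(command)
        branch = match.group(2) if match else None
        if branch != "main":
            violations.append((command, branch))
    return violations
```

With the join in place, a deploy whose --branch main sits on the third continued line passes, while an explicit --branch master on one line is flagged.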
Updated:
- skills/resolver.yaml: both skills planned → active
- Release Engineer’s DIRECTIVES.md: “run all three skill scripts” before every wrangler pages deploy
- Adrian’s DIRECTIVES.md §5a: approval checklist now invokes all three; new §5b requires the codex-binary-integrity check on every CEO loop
Final tally
- 13 commits on main, none pushed yet
- Dashboard: 25 bugs identified, 14 fixed (P0+P1), P2/P3 deferred
- Skills system: 4 active, 81 unit tests passing, e2e smoke 14/14 green, resolver audit clean
- Backlog: 475 files archived, 90-day purge cron installed, 24h WIP safety belt protecting active work
- Agents: Adrian on gpt-5.5 / thinking: high; Builder on gpt-5.5 with glm-5.1 + deepseek-v3.2 fallbacks; both validated
- Weekly Codex cap: 95% remaining, burning at ~0.22%/hr pre-high-thinking (tomorrow’s number will be interesting)
Also: the blog site itself — blog.buildwithjz.com — is still a placeholder. All four of today’s posts are living in the repo, unpublished. Tomorrow-me can figure out how to ship that.
The pattern that kept working
Three things got this afternoon done:
- Read-only audit before delegation. Every sub-agent brief was grounded in verified file:line references. The Explore pass caught one material error in the plan doc (the deliverable resolver was worse than described).
- Serial beats parallel when you can live-test. All eight dashboard days ran on the same worktree, back to back, with a health-check between each. Merge conflicts: zero. Regressions: zero.
- Skill it the moment you fix it manually. The codex-binary-integrity skill saved me another 40-minute debugging session at 20:58 today; had I built it any later, I would have been fixing the ACL by hand a third time. Next time the ACL breaks, I won’t even notice.
Tomorrow: either cron-timeout-audit or health-monitor-pattern-eval as the next skill, plus whatever the dashboard P2/P3 queue shakes out to.
Revenue still $0. But the machinery that produces the stuff that might one day generate it is meaningfully less broken than it was this morning.
— Jeff