Shipping Eight Days of Plan in an Afternoon: Dashboard Surgery, Two New Skills, and an ACL That Won't Stay Fixed

2026-04-24 · MoneyMachine

I’d written three blog posts before lunch and still owed a day’s worth of work. Here’s what actually got done between then and dinner: a dashboard that stopped lying about how many products are live, 475 stale files archived, two new skills that turn yesterday’s incidents into structurally-impossible-to-repeat failures, and the satisfaction of watching one of those skills catch the exact ACL bug I’d fixed by hand eight hours earlier.

If you’re building any long-running AI agent system, one of these stories is probably about to be yours. Here’s what worked, what didn’t, and what I learned.

The decision queue

I started the afternoon with four pending decisions from the morning’s handoff doc:

Dashboard fix scope: the full 9-day plan, a P0 sprint, or just the most visible bug?
Expand GPT-5.5 to Builder?
Crank Adrian (the CEO agent) to thinking: high?
Extend my backlog-triage script to sweep ready-for-review/ dirs too?

We decided to do all four, in roughly that order, and then build one new skill per backlog item.

The weekly usage cap on ChatGPT Pro was at 97% remaining at that point, burning about 0.22%/hour, which gave plenty of headroom even if thinking: high multiplied the rate by 3–5×. So we promoted Builder to GPT-5.5 first, tested it with a one-line ping (winnerModel: gpt-5.5, fallbackUsed: false), and moved on.
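The headroom arithmetic is easy to sanity-check. This is a minimal sketch that assumes the burn rate stays constant and that the thinking: high multiplier applies to all usage; the function name is hypothetical.

```python
def hours_until_cap(remaining_pct: float, burn_pct_per_hour: float, multiplier: float = 1.0) -> float:
    """Hours until the cap is exhausted, assuming a constant burn rate."""
    return remaining_pct / (burn_pct_per_hour * multiplier)


for m in (1, 3, 5):
    print(f"{m}x: {hours_until_cap(97, 0.22, m):.0f} hours")
# 1x: 441 hours
# 3x: 147 hours
# 5x: 88 hours
```

A week is 168 hours, so how comfortable the 5× case is depends on how far away the weekly reset was.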

Or rather, we tried to: the promotion was the exact moment the Codex ACL bug from yesterday re-broke.

Déjà vu at 13:12

Before I could migrate Builder, the Gateway refused to spawn Codex for agentops. This morning’s post-mortem has the full story: a POSIX ACL mask::r-- silently strips group execute from the native binary, and every Codex-backed agent run hangs at spawn.

The fix is trivial:

sudo chmod a+rx "$BINARY" && sudo setfacl -b "$BINARY"

The reason it had re-broken is less trivial: something on this box (presumably a background npm install for @openai/codex) had silently re-applied the broken ACL overnight. Codex reported 0.122.0 instead of the previous 0.120.0, so the binary had been replaced by a version bump; the file I had fixed hadn’t regressed.

This was the second time in 48 hours. I fixed it again, moved on, and filed a “build a skill for this” note at the top of my list. (Foreshadowing: about eight hours later I’d get to watch that exact skill detect and auto-repair the same bug happening for the third time.)

The dashboard was lying about specific things

I’d run a read-only audit earlier in the day. Findings:

“Products Live: 1” — actual count in Postgres: 4. The dashboard was merging a hardcoded KNOWN_PRODUCTS = [Scholarship Toolkit] with file-scraped YAMLs, ignoring the Postgres projection entirely.
Approvals detail panel: 100% blank. Two compounding bugs: the JSON files agents emit don’t match the schema the UI reads (no title, no type, no description), and the deliverable-fetch endpoint doesn’t try ready-for-review/<file> — only the bare path — so all 22 deliverable fetches returned 404.
Activity Feed dead. The sidebar was reading a file-based activity.jsonl that hadn’t been written to in weeks, while Postgres’s events table was quietly collecting 1,252 real events that nobody was showing.
“STUCK 188” in the Factory view: about 10× inflated because countDirStuck() was counting .processed sentinel files, archived subdirs, and already-done paired files as if they were alive.
Infrastructure tiles showing “n/a” for Codex CLI and GBrain because the probes tried to spawn binaries the dashboard’s user can’t execute (Codex is blocked by the very ACL trap above; GBrain isn’t on the PATH).
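To make the “STUCK” inflation concrete, here is a minimal sketch of a stuck counter that skips the three kinds of false positives named above. The dashboard itself lives in server.ts, so this Python version only illustrates the logic. The directory layout is an assumption: sentinels named `<file>.processed` sitting next to the finished file, and archived items under subdirectories whose names start with `archive`.

```python
from pathlib import Path


def count_stuck(stage_dir: Path) -> int:
    """Count work items in a stage directory that are genuinely still waiting.

    Assumed (hypothetical) layout: finishing `foo.md` leaves a sentinel
    `foo.md.processed` beside it, and old items live under `archive*/` subdirs.
    """
    stuck = 0
    for path in stage_dir.rglob("*"):
        if not path.is_file():
            continue
        parent_parts = path.relative_to(stage_dir).parts[:-1]
        if any(part.startswith("archive") for part in parent_parts):
            continue  # lives inside an archived subdirectory
        if path.name.endswith(".processed"):
            continue  # the sentinel itself is not a work item
        if path.with_name(path.name + ".processed").exists():
            continue  # paired with a sentinel, so already done
        stuck += 1
    return stuck
```

A naive count of every file under the directory would include all three skipped categories, which is how a stage can look far more stuck than it is.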

Twenty-five bugs total in the audit. A 9-day implementation plan. Jeff said “do it in one sprint using sub-agents (make sure the plan is solid first).”

The validation pass

Before delegating anything, I sent an Explore agent at the plan to verify every file:line reference against the current code and every Postgres table name against the live schema. The plan was 95% accurate. Three noteworthy drift points:

The deliverable-resolver bug was worse than the plan described — the code didn’t try ready-for-review/<file> at all (the plan wrote as if it did and just needed a basename glob added). The Day-1 sub-agent brief got explicit instructions to add both fallbacks.
The dashboard runs as a systemd user service (memory/CLAUDE.md had this right; the Explore agent first said otherwise). Restart: systemctl --user restart openclaw-dashboard.service. Every sub-agent brief included the restart + curl /api/health check as its final gate.
server.ts is 2,547 lines, single-file, no DB abstraction. Days 1–5 all modify existing functions in it. Merge-conflict risk is high if parallel, so I serialized.

The execution

Eight sub-agents, back-to-back, on main. Each got a detailed brief with file:line pointers, Postgres column lists, success criteria, and hard constraints about what not to touch. Between each, I ran the health check + spot-checked the specific widget that day targeted.

Day 1 (commits 6d9cf94, c0a2fc3): approvals view reads Postgres first, deliverable resolver tries three paths (exact → ready-for-review/<file> → recursive basename glob), reviewer-inversion bug fixed, “In Review” kanban column renamed. All 32 pending approvals now resolve their deliverables (vs 0/22 before).
Day 2 (9838ace): killed KNOWN_PRODUCTS, rewrote getMetrics() / getPortfolioSummary() / getPipelineData() / getRevenueData() against Postgres. Products Live jumped from 1 to 4. Revenue endpoints handle the empty stripe_charges table gracefully (no revenue yet, zero crashes).
Day 3 (1a43676, 1b5863c): stuck-count filter (exclude sentinels, archived dirs, paired-done files), Codex probe via npm ls -g (no binary spawn), GBrain probe via fs.stat on the DB dir. Parallel QA Gates stuck went 207 → 146; Deploying stuck went 37 → 0.
Day 4 (fe2e1b9): Activity Feed migrated from the dead file to Postgres events, with filter chips for actor/event_type/time-window.
Day 5 (f2b262b, 3e19c48): every metric tile is clickable with a deep-link drill-down; new /#/tasks view; revenue view has per-product per-day charge drill-down; provenance tooltips on every tile.
Day 6 (c2580c3, 474f78d): per-product slide-over drawer with nine data sections — pipeline, Stripe product+link, charges, Cloudflare deploy, QA runs, approvals, workspace events, distribution posts, revops recommendation. Click any product tile, drawer slides in.
Day 7 (1f0de57): each factory stage now shows the oldest three items with deep links to the product drawer when they belong to a known product, plus an Escalate-to-Telegram button (browser-only, confirm() required, never fires on auto-refresh).
Day 8 (d4ebec4): per-agent timeline tab on the agent-detail view. Sessions (file mtime + optional model sniff), tasks (approvals_log), events (Postgres), all merged chronologically. Main’s 7-day window shows 38 entries; Builder’s shows 72.
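The Day 1 resolver order (exact → ready-for-review/<file> → recursive basename glob) can be sketched in a few lines. This is an illustration in Python, not the code that landed in server.ts. The function name, the workspace-root parameter, and the alphabetical tie-break when several files share a basename are all assumptions.

```python
import glob
from pathlib import Path
from typing import Optional


def resolve_deliverable(workspace: Path, rel_path: str) -> Optional[Path]:
    """Resolve a deliverable by trying three locations in order."""
    # 1. The exact path as recorded in the approval.
    exact = workspace / rel_path
    if exact.is_file():
        return exact

    basename = Path(rel_path).name
    if not basename:
        return None

    # 2. The same file name under ready-for-review/.
    in_review = workspace / "ready-for-review" / basename
    if in_review.is_file():
        return in_review

    # 3. Any file with that basename anywhere below the workspace.
    #    glob.escape keeps characters like "[" in file names literal.
    matches = sorted(p for p in workspace.rglob(glob.escape(basename)) if p.is_file())
    return matches[0] if matches else None
```

A real endpoint would also need to reject paths that escape the workspace root before touching the filesystem.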

Twelve commits. +2,434 lines. Seven new API endpoints. Three new view files. Every widget verified against live Postgres data before the next sub-agent started.

The bill for this: ~3 hours of wall clock. Each sub-agent ran in ~5–8 minutes, and I ran them serially with verification in between. Parallel worktrees would have been faster but harder to test against the running service. Serial won on the safety/speed tradeoff.

Triage: 475 stale files out, with a safety belt

My mm-triage-backlog script from two days ago was only scanning inbox/ dirs. It drops reviews-of-live-products, deploy-requests-already-shipped, and procedural notes, while preserving unbuilt goal specs. It was good as far as it went — it just wasn’t going far enough.

The actual backlog was sitting in ready-for-review/ across 18 agents. 1,278 files. qa-engineer alone had 363, mostly aeo-starter-kit-*.md QA reports for a product that’s been live for a week.

I extended the script with four design decisions:

Drop criteria for ready-for-review/: the union of (a) files for live products, (b) files for killed/legacy products, (c) files whose approval has already been decided in approvals_log (status IN ('approved','rejected')). A pure age-based drop was tempting but dangerous, so I excluded it.
LIVE_PRODUCTS queried from Postgres with a hardcoded fallback, so the list auto-updates as products ship. KILLED_PRODUCTS the same, plus hardcoded legacy v1/v2 product slugs.
--dry-run default, --apply explicit. 1,278 files is too many to YOLO.
24-hour WIP safety belt: anything modified in the last 24 hours is always preserved, regardless of other criteria. If Builder finished a deliverable 20 minutes ago, the triage script doesn’t touch it.
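Put together, the four decisions boil down to one small predicate. This is a minimal sketch, not the actual mm-triage-backlog code: the `Candidate` shape, the reason strings, and the way a file is mapped to a product slug are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Set

WIP_WINDOW = timedelta(hours=24)


@dataclass
class Candidate:
    path: str
    product: str                           # product slug (how it is derived is assumed)
    modified: datetime                     # file mtime, timezone-aware
    approval_status: Optional[str] = None  # latest approvals_log status, if any


def drop_reason(c: Candidate, now: datetime, live: Set[str], killed: Set[str]) -> Optional[str]:
    """Return why a file should be archived, or None to preserve it."""
    if now - c.modified < WIP_WINDOW:
        return None  # safety belt: recent work wins over every drop rule
    if c.product in live:
        return "live-product"
    if c.product in killed:
        return "killed-product"
    if c.approval_status in ("approved", "rejected"):
        return "approval-decided"
    return None  # in-flight product with no decision yet: keep


now = datetime(2026, 4, 24, 18, 0, tzinfo=timezone.utc)
fresh = Candidate("qa.md", "aeo-starter-kit", now - timedelta(minutes=20))
print(drop_reason(fresh, now, {"aeo-starter-kit"}, set()))  # None
```

Checking the safety belt first is the point: a file for a live product that was touched 20 minutes ago is still preserved.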

The dry-run reported 475 drops across inbox + RFR:

| Queue | Before | Drop | Remaining |
| --- | --- | --- | --- |
| Inbox | 2,053 | 60 | 1,993 |
| Ready-for-review | 1,265 | 415 | 850 |

qa-engineer alone: 364 → 172 (192 dropped). Designer: 221 → 72 (149 dropped). Drop reasons were the ones you’d expect: 82 files for freelancer-ai-pivot-playbook, 76 for ai-visibility-audit-playbook, 48 where approvals_log had already recorded an approve/reject decision on the file.

The handoff from the morning had guessed 80% reduction. Reality was 33% of ready-for-review (415 of 1,265). The gap is items for products currently in flight (deepfake-defense, api-key-security-audit, etc.: not live yet, not killed, so preserved). 33% honest is better than 80% optimistic.

Ran with --apply. 475 files archived to triage-archive-2026-04-24/ (organized by inbox/<agent>/ vs ready-for-review/<agent>/). Added a weekly cron (/usr/local/bin/mm-triage-archive-purge, Sundays 4 AM UTC) to purge archives older than 90 days. Logs to /var/log/mm-triage-archive-purge.log.
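The 90-day rule can be sketched as a pure function over archive directory names. This assumes the purge keys off the date embedded in the triage-archive-YYYY-MM-DD name; the real purge script may use file mtimes instead, and `expired_archives` is a hypothetical helper.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

ARCHIVE_RE = re.compile(r"^triage-archive-(\d{4})-(\d{2})-(\d{2})$")


def expired_archives(names: Iterable[str], today: date, max_age_days: int = 90) -> List[str]:
    """Return archive directory names whose embedded date is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        match = ARCHIVE_RE.match(name)
        if not match:
            continue  # not an archive dir; never touch it
        try:
            stamp = date(*map(int, match.groups()))
        except ValueError:
            continue  # looks like a date but isn't one (e.g. month 13)
        if stamp < cutoff:
            expired.append(name)
    return expired


print(expired_archives(["triage-archive-2026-04-24", "notes"], date(2026, 7, 24)))
# ['triage-archive-2026-04-24']
```

Keeping the selection separate from the deletion makes it easy to run the purge in a dry-run mode first, in the same spirit as the triage script.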

Two new skills, one that caught its own excuse

Then the skill work. Per Garry Tan’s “skillify” pattern: every failure becomes a tested, deterministic, contract-backed skill. Two were on the backlog.

`codex-binary-integrity`

The ACL trap’s skill. scripts/check.py with three modes: read-only check, --repair (root required), --json for cron. 18 unit tests covering getfacl parsing, agentops_can_execute detection, and the full run-repair flow with mocked filesystem state. Wired to a 6-hourly root cron that emits JSON to /var/log/codex-binary-integrity.log and auto-repairs on detection.
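For readers who haven’t hit this trap, here is a minimal sketch of the detection idea: parse getfacl output and apply a simplified version of the POSIX ACL access check. It is not the contents of scripts/check.py. The function names and the sample ACL text are hypothetical, and the check ignores root privileges and `default:` entries.

```python
from typing import List, Optional, Set, Tuple

Entry = Tuple[str, str, str]  # (tag, qualifier, perms), e.g. ("group", "agentops", "r-x")


def parse_getfacl(text: str) -> Tuple[Optional[str], Optional[str], List[Entry]]:
    """Extract owner, owning group, and access entries from `getfacl` output."""
    owner = group = None
    entries: List[Entry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# owner:"):
            owner = line.split(":", 1)[1].strip()
        elif line.startswith("# group:"):
            group = line.split(":", 1)[1].strip()
        elif line and not line.startswith(("#", "default:")):
            body = line.split("#", 1)[0].strip()  # drop any "#effective:..." suffix
            tag, qualifier, perms = body.split(":", 2)
            entries.append((tag, qualifier, perms))
    return owner, group, entries


def can_execute(text: str, user: str, groups: Set[str]) -> bool:
    """Simplified ACL check: owner, then named user, then groups, then other."""
    owner, owning_group, entries = parse_getfacl(text)
    mask = next((p for t, _, p in entries if t == "mask"), None)
    mask_allows_x = mask is None or "x" in mask

    if user == owner:  # owner entry is never masked
        return any("x" in p for t, q, p in entries if t == "user" and q == "")
    named = [p for t, q, p in entries if t == "user" and q == user]
    if named:
        return "x" in named[0] and mask_allows_x
    matched = [
        p for t, q, p in entries
        if t == "group" and ((q == "" and owning_group in groups) or (q != "" and q in groups))
    ]
    if matched:  # a matching group entry decides; "other" is never consulted
        return mask_allows_x and any("x" in p for p in matched)
    return any("x" in p for t, _, p in entries if t == "other")


BROKEN = """# file: codex
# owner: root
# group: root
user::rwx
group::r-x
group:agentops:r-x
mask::r--
other::r-x
"""
print(can_execute(BROKEN, "agentops", {"agentops"}))  # False
```

The sample shows why the failure is so quiet: the group entry says r-x and other says r-x, but the mask caps the group entry at r--, and a matching group entry means the other bits are never consulted.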

First live run: status: broken with mask: r--, agentops_can_exec: false.

Wait, what?

I had fixed the ACL by hand at 13:12. The Builder ping test passed at 13:13. I shipped the Dashboard sprint, the triage extension, and wrote the skill. It was now 20:58, and the ACL was broken again. Third time in 48 hours. Without a single manual chmod in between.

Nothing was logged. Codex version didn’t change (0.122.0 both before and after). Something on this system is re-applying that restrictive ACL on a regular cadence that I haven’t isolated yet.

The skill detected it immediately. Ran with --repair:

{
  "status": "repaired",
  "mask_before": "r--",
  "mask_after": null,
  "agentops_can_exec_before": false,
  "agentops_can_exec_after": true,
  "repair_steps": [
    "chmod a+rx /usr/lib/node_modules/.../codex",
    "setfacl -b /usr/lib/node_modules/.../codex"
  ]
}

1.1 seconds. Exit code 0. Every future occurrence is now a non-event logged in /var/log/codex-binary-integrity.log, not a mysterious 40-minute debugging session.

Lesson: skills prove their worth fastest when they catch something you literally just fixed by hand. The point of this pattern is not to spare yourself the first fix — it’s to make sure you never have to do the second one.

`release-branch-verify`

The skill for the April 22 --branch master silent-success incident. scripts/verify.py with three input modes: --command <str>, --scan-dir <path>, --stdin. Regex-matches wrangler pages deploy invocations (handles NODE_OPTIONS=..., sudo -u X, npx, bun x, pnpm exec/dlx, and backslash line continuations), extracts --branch <val>, flags anything other than main.

25 unit tests, including the edge cases that bit me once already. The backslash line-continuation test caught a bug in my first regex: good_deploy.sh came back as a violation because the regex stopped at the first newline and never saw the --branch main on the continuation line.
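The continuation bug is easiest to see in code. This is a minimal sketch of the approach, not the regexes in scripts/verify.py: fold backslash-newline continuations into one logical line first, then look for the deploy command and its --branch value. The pattern names and message formats are hypothetical, and treating a missing --branch as a violation is an assumption.

```python
import re
from typing import List

CONTINUATION = re.compile(r"\\\r?\n[ \t]*")  # backslash, newline, leading indent
DEPLOY = re.compile(r"\bwrangler\s+pages\s+deploy\b(?P<args>[^\n;|&]*)")
BRANCH = re.compile(r"--branch(?:=|\s+)[\"']?([\w./-]+)[\"']?")


def find_violations(script: str, allowed: str = "main") -> List[str]:
    """Flag `wrangler pages deploy` invocations that do not target the allowed branch."""
    joined = CONTINUATION.sub(" ", script)  # one logical line per command
    violations = []
    for match in DEPLOY.finditer(joined):
        branch = BRANCH.search(match.group("args"))
        if branch is None:
            violations.append("missing --branch: " + match.group(0).strip())
        elif branch.group(1) != allowed:
            violations.append("--branch " + branch.group(1) + ": " + match.group(0).strip())
    return violations


good = "npx wrangler pages deploy ./dist \\\n  --project-name demo \\\n  --branch main\n"
bad = "NODE_OPTIONS=--max-old-space-size=4096 sudo -u deploy npx wrangler pages deploy dist --branch master\n"
print(find_violations(good))       # []
print(len(find_violations(bad)))   # 1
```

Because the search anchors on the command itself, prefixes such as environment assignments, sudo, or npx need no special handling in this version.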

Updated:

skills/resolver.yaml: both skills planned → active
Release Engineer’s DIRECTIVES.md: “run all three skill scripts” before every wrangler pages deploy
Adrian’s DIRECTIVES.md §5a: approval checklist now invokes all three; new §5b requires the codex-binary-integrity check on every CEO loop

Final tally

13 commits on main, none pushed yet
Dashboard: 25 bugs identified, 14 fixed (P0+P1), P2/P3 deferred
Skills system: 4 active, 81 unit tests passing, e2e smoke 14/14 green, resolver audit clean
Backlog: 475 files archived, 90-day purge cron installed, 24h WIP safety belt protecting active work
Agents: Adrian on gpt-5.5 / thinking: high; Builder on gpt-5.5 with glm-5.1 + deepseek-v3.2 fallbacks; both validated
Weekly Codex cap: 95% remaining, burning at ~0.22%/hr pre-high-thinking (tomorrow’s number will be interesting)

Also: the blog site itself — blog.buildwithjz.com — is still a placeholder. All four of today’s posts are living in the repo, unpublished. Tomorrow-me can figure out how to ship that.

The pattern that kept working

Three things got this afternoon done:

Read-only audit before delegation. Every sub-agent brief was grounded in verified file:line references. The Explore pass caught one material error in the plan doc (the deliverable resolver was worse than described).
Serial beats parallel when you can live-test. All eight dashboard days ran on the same worktree, back to back, with a health-check between each. Merge conflicts: zero. Regressions: zero.
Skill it the moment you fix it manually. The codex-binary-integrity skill spared me another 40-minute debugging session at 20:58 today; had I built it any later, I would have been debugging by hand again. Next time the ACL breaks, I won’t even notice.

Tomorrow: either cron-timeout-audit or health-monitor-pattern-eval as the next skill, plus whatever the dashboard P2/P3 queue shakes out to.

Revenue still $0. But the machinery that produces the stuff that might one day generate it is meaningfully less broken than it was this morning.

— Jeff