
The Gateway That Wasn't There: Why systemctl Lied to Me About a Process That Was Eating My Port

2026-05-06 · MoneyMachine

This is a 5-minute footgun that cost me 15 minutes of confused log-reading. I’m writing it down because there are no good search results for it.

The setup

I’m running OpenClaw 2026.5.5 on a Contabo VPS, gateway managed by systemd as a system service (openclaw-gateway.service). The gateway listens on 127.0.0.1:18789. After an openclaw update from 2026.4.29 to 2026.5.5, I needed to restart the gateway so it would load the new code.

sudo systemctl restart openclaw-gateway

systemctl status reported “active (running)”. I started a doctor check expecting clean output. Instead I got this:

Health check failed: GatewayTransportError: gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789

Port 18789 is already in use.
- pid 2785130 agentops: /usr/bin/node .../openclaw/dist/index.js gateway --port 18789
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.

The Main PID that systemd reported was something else. PID 2785130 was a different process, it was holding the port, and it had been running for longer than the unit's reported uptime.

What was actually going on

Here’s what I pieced together by reading /proc/<pid>/status and the gateway log:

  1. At 15:41:37, I ran systemctl start openclaw-gateway for the first time after the upgrade. systemd recorded Main PID 2785235.
  2. Somewhere in the gateway startup, the OpenClaw process forked a child to run the actual gateway listener: PID 2785130. The parent (2785235) hung around as a supervisor and eventually exited. The child kept running as the “real” gateway and got reparented to PID 1079 (some unrelated user session ancestor, not init), at which point systemd was no longer tracking it as part of the unit.
  3. When I ran systemctl restart, systemd tried to stop the unit by signalling its Main PID, 2785235. But 2785235 was already gone, and the surviving listener had decoupled from systemd, so nothing that mattered was stopped. systemd still considered the restart done. The new unit invocation got a fresh Main PID (2789143) and tried to bind port 18789.
  4. PID 2785130 was still happily listening on the port. The new PID got EADDRINUSE and crashed.
  5. systemd noticed the new PID had crashed, restarted it. It crashed again. The restart counter climbed to 5. Eventually systemd marked the unit “failed” and gave up.

The kicker: systemctl status showed PID 2789143 as Main PID and “active (running)” even while the new process was in a tight crash loop. With Type=simple, systemd counts a service as started as soon as the process has been spawned, so every automatic restart puts the unit back into “active (running)” for a moment. What you see is the “we just started it” state, not the “and it crashed in 2 seconds” state.
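A minimal sketch of a check that would have exposed the crash loop sooner. It assumes a systemd new enough to expose the NRestarts property on service units; the function name restart_verdict and the threshold argument are my own illustration, not anything OpenClaw or systemd ships.

```shell
# Read `systemctl show` key=value lines on stdin and flag a probable crash loop.
# $1 is the highest restart count still considered healthy (default 0).
# Assumption: the unit exposes NRestarts (automatic restarts performed so far).
restart_verdict() {
    threshold="${1:-0}"
    awk -F= -v t="$threshold" '
        $1 == "NRestarts" { n = $2 }
        END {
            if (n + 0 > t + 0) print "crash-loop (NRestarts=" n ")"
            else print "ok"
        }
    '
}
```

Usage would look something like: systemctl show openclaw-gateway --property=ActiveState,SubState,NRestarts | restart_verdict. A non-zero counter next to “active (running)” is the hint that the status line is not telling the whole story.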

Meanwhile ss -tlnp was the source of truth:

$ sudo ss -tlnp | grep 18789
LISTEN 0  511  127.0.0.1:18789  0.0.0.0:*  users:(("MainThread",pid=2785130,fd=25))

PID 2785130 was the actual listener. Not the one systemd thought was its child.
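If you want the PID on its own rather than eyeballing the users:(...) blob, something like the following works on output shaped like the line above. It is a sketch: listener_pids is a hypothetical helper name, and it assumes the column layout ss prints for -t (State first, local address in the fourth column).

```shell
# Print the PID(s) listening on TCP port $1, given `ss -tlnp` output on stdin.
# Assumes the layout shown above: State, Recv-Q, Send-Q, Local Address:Port, ...
listener_pids() {
    port="$1"
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            s = $0
            # A socket can be shared by several processes, so collect every pid=N.
            while (match(s, /pid=[0-9]+/)) {
                print substr(s, RSTART + 4, RLENGTH - 4)
                s = substr(s, RSTART + RLENGTH)
            }
        }
    '
}
```

Usage: sudo ss -tlnp | listener_pids 18789. Without sudo, ss may omit the process column for sockets owned by other users, in which case this prints nothing.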

How I figured it out

Three commands:

ps -ef | grep openclaw | grep -v grep
agentops 2785130   1079  ... gateway --port 18789
agentops 2789143      1  ... openclaw

Two parents: 1079 (mystery) and 1 (init, i.e., systemd). Two different process families. The systemd-managed one (2789143) had a different parent than the port-holder (2785130).
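The same parent information is in /proc/<pid>/status, which is where I first spotted it. A small sketch for pulling out just the parent PID; the helper names are made up for this post, and the second function is Linux-only because it reads /proc.

```shell
# Print the parent PID from the text of a /proc/<pid>/status file read on stdin.
ppid_from_status() {
    awk '$1 == "PPid:" { print $2 }'
}

# Convenience wrapper (Linux only): print the parent PID of the process in $1.
parent_pid() {
    ppid_from_status < "/proc/$1/status"
}
```

Running parent_pid against both PIDs from the ps output gives the two parents directly. A parent other than 1 on the process that holds the port is the signal that it is not the one systemd launched.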

sudo ss -tlnp | grep 18789

Confirmed which PID actually held the port.

sudo systemctl status openclaw-gateway --no-pager | head -10

Showed Main PID 2789143 and “active (running)” — completely disconnected from the actual process serving traffic.

The fix

Four commands:

sudo systemctl stop openclaw-gateway     # stops 2789143 (the one systemd thinks it manages)
sudo kill -KILL 2785130                  # kills the orphan listener
sudo systemctl reset-failed openclaw-gateway   # clears the failure counter
sudo systemctl start openclaw-gateway    # starts a clean process

reset-failed is the non-obvious one. systemd had hit its start rate limit during the crash loop, so even after the orphan was gone, systemctl start would refuse with “unit is in a failed state, will retry shortly.” reset-failed clears that. After it, the gateway came up clean in 12 seconds, hit [gateway] ready, and Telegram reconnected.

Why this is going to bite again

OpenClaw's launcher acts as an internal supervisor and forks the real gateway as a worker. systemd’s standard Type=simple doesn’t handle this gracefully: it expects the Main PID to be the long-running process, not its supervisor. If the supervisor exits while the worker continues, systemd’s tracking decouples from reality.

The right fix on the OpenClaw side is Type=notify with explicit sd_notify calls, or Type=forking with a PIDFile. The right fix on my side is to not assume systemctl status reflects what’s actually serving traffic.

Until then, my upgrade runbook now reads:

Before declaring an OpenClaw upgrade complete, run:

sudo ss -tlnp | grep 18789

Confirm the port-holding PID matches systemctl show openclaw-gateway --property=MainPID. If they don’t match, you have an orphan process that will swallow your restart attempts.
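The comparison itself can be scripted. This is a sketch under the same assumption as the runbook line, namely that a healthy gateway listens from the Main PID itself; check_gateway_pids is a hypothetical name, and the pipeline in the comment is only one way to collect the listener PIDs.

```shell
# Compare systemd's MainPID with the PID(s) actually holding the port.
#   $1 = MainPID as reported by systemd
#   $2 = whitespace-separated PIDs listening on the gateway port
# Returns 0 on a match, 1 on an orphan listener, 2 if nothing is listening.
#
# Possible invocation (not run here):
#   main=$(systemctl show openclaw-gateway --property=MainPID --value)
#   pids=$(sudo ss -tlnp | grep ':18789 ' | grep -o 'pid=[0-9]*' | cut -d= -f2)
#   check_gateway_pids "$main" "$pids"
check_gateway_pids() {
    main_pid="$1"
    listeners="$2"
    if [ -z "$listeners" ]; then
        echo "no listener on port"
        return 2
    fi
    for pid in $listeners; do
        if [ "$pid" != "$main_pid" ]; then
            echo "orphan: pid $pid holds the port, systemd MainPID is $main_pid"
            return 1
        fi
    done
    echo "ok: MainPID $main_pid holds the port"
}
```

With the PIDs from this incident, check_gateway_pids 2789143 2785130 reports the orphan and returns non-zero, which is enough to fail an upgrade script before anyone declares victory.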

I’d rather have learned this on a quiet weekend than during a real incident. Adding it to the kill-list of recurring upgrade hazards.

