QSOE/L: The first spawn from real storage

The last post ended with QSOE/L still in the emulator, and the next job named in plain words: carry the image across to the Unmatched and find out which of my QEMU assumptions were lies. The image went across. It got further than I'd have guessed — far enough to bring up the PCIe root, raise devb-nvme over real DesignWare MSI, spin up the Samsung NVMe, and mount an on-disk filesystem off it — and then it wedged. This post is the lie it found, which turned out to be a foundational one. Not in seL4, not in the userspace, but in the seam between them.

Where v0.13 stood

By v0.13, QSOE/L had a storage stack. On the board the boot reached this and meant every word of it:

[init] mounting /dev/nvme0n1p8 at /usr...
fs-qrv: qrvfs v2, 4096 blocks, 128 inodes
fs-qrv: mounted qrvfs at /usr (dev=/dev/nvme0n1p8)

And then nothing. No panic, no fault report, no login prompt — a clean wedge exactly one line after a mount that had plainly worked.

Three facts framed the hunt, and all three turned out to matter:

  • QEMU/L did not hang. The identical userspace booted all the way to a shell under qemu-system-riscv64.
  • The board hung only after the mount. Everything up to and including mounting /usr ran; the very next step did not.
  • QSOE/N did not hang on the same board. Skimmer, booted from the same Unmatched off the same NVMe, reached a login prompt.

Same hardware, same disk, same userspace — one kernel hangs and the other does not. That asymmetry is the whole story. It just took a while to read it.

The red herring

The fingerprint — works on QEMU, fails deterministically on the U74 — is a famous one on RISC-V, and I have a standing rule about famous fingerprints: before any strange bug hunt, search the QRV history for the same symptom. The sister project, a QNX port to the same silicon, almost certainly hit it first.

It had. Claude Code surfaced QRV commit 8d42587b in one search — "riscv: I-cache coherence on program load (fix FU740 spawn faults)" — describing exactly this shape. RISC-V keeps no coherence between data stores and instruction fetch, so a program written into recycled pages by a loader and then dispatched to a hart still holding the previous tenant's I-cache lines fetches stale instructions and jumps into garbage. Invisible on QEMU, which models no I-cache; deterministic on the U74. I'd even said as much in the last post: fence.i is the cost QEMU never makes you pay and the U74s will.

On QSOE/N the fix was already in place, crediting that commit by name, which explained why Skimmer was clean. On QSOE/L it looked worse: seL4 exposes no userspace instruction-cache operation on RISC-V at all, and its internal ifence runs only at boot and in the reschedule IPI, never on a page map. So user code loaded on seL4/RISC-V is genuinely never made coherent except by the accident of a cross-hart reschedule. The theory fit. A small patch to the vendored seL4 — ifence on the exec-page map — slotted neatly into the patch block the FU740 already needed. It was ready to apply.

I didn't apply it. Patching a verified kernel on the strength of a matching fingerprint is exactly the kind of plausible wrong move that costs a day and leaves a gratuitous modification in the one component whose entire value is that it is not modified. A matching symptom is a hypothesis, not a diagnosis. The cost of confirming first was one more boot at a higher debug level. So before touching the kernel, we read more trace.

Reading the silence

taskman has two debug surfaces. One is the loader line: every successful spawn prints spawn: <name> ... after the ELF has been located and read, so its presence or absence cleanly separates "couldn't load the program" from "loaded it and then misbehaved." The other, at the TRACE level, is a line per incoming message (tm_msg 0x<label>): a wall of text for which firehose is the right word.

Even at the plain level, the trace killed the I-cache theory on shape alone. Every spawn: line in the boot was a program from the boot cpio — slogger, pci-server, devc-sersifive, devb-nvme, fs-qrv. The trace then stopped at the mount with no spawn: line after it. We never reached the point of running a single byte of code loaded from disk. And that is decisive: the I-cache bug manifests after a successful load, as a hart fetching stale instructions and faulting — and a fault on QSOE/L is an seL4 fault delivered to taskman's handler, which prints. What we had was a silent wedge before any disk-loaded program ran. Stale-cache execution is a noisy crash, not a quiet block. Whatever was wrong was happening inside taskman while it was still trying to read the next program, not run it.

The firehose names it

At the TRACE level, the last four lines before the silence were the entire diagnosis:

tm_msg 0x30b   <- TM_REQ_ACCESS   [ -x /usr/sbin/sysinit/level1.sh ]
tm_msg 0x200   <- TM_REQ_MMAP     (spawn args page)
tm_msg 0x200
tm_msg 0x11b   <- TM_REQ_SPAWN    exec level1.sh    --- then DEAD

Two things in those lines closed the case.

First, the 0x11b (TM_REQ_SPAWN) has no spawn: loader line behind it. Every other spawn in the whole boot prints that line immediately. This one took the request and wedged inside the spawn, before the image was loaded.

Second, the run-up names the program. The TM_REQ_ACCESS is init's [ -x /usr/sbin/sysinit/level1.sh ] test, and it succeeded — because the inode was already cached from the mount. Then init ran exec /usr/sbin/sysinit/level1.sh, which became the fatal spawn. And level1.sh is the first program in the entire boot that is not in the cpio. Every earlier spawn was a driver or server baked into taskman.elf. This is the first time taskman must go to the disk for an image — and therefore the first time taskman itself becomes a client of fs-qrv.
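The first of those two checks is mechanical enough to script. What follows is a minimal sketch, assuming only the log shapes described above (one tm_msg 0x<label> line per message, one spawn: line per successful load) and the 0x11b label from the trace. The function name and the sample excerpt are invented for illustration and are not taskman code or real boot output.

```python
import re

TM_REQ_SPAWN = 0x11B  # label value as it appears in the trace above

def find_stuck_spawn(log_lines):
    """Return the index of a TM_REQ_SPAWN message that no later
    'spawn:' loader line answers, or None if every spawn was answered."""
    pending = None
    for i, line in enumerate(log_lines):
        m = re.match(r"\s*tm_msg 0x([0-9a-fA-F]+)", line)
        if m and int(m.group(1), 16) == TM_REQ_SPAWN:
            pending = i          # spawn request seen, loader line still owed
        elif line.lstrip().startswith("spawn:"):
            pending = None       # loader line arrived, request accounted for
    return pending

# Hypothetical excerpt in the shape of the real trace (names illustrative).
sample = [
    "tm_msg 0x11b",
    "spawn: fs-qrv",
    "tm_msg 0x30b",
    "tm_msg 0x200",
    "tm_msg 0x200",
    "tm_msg 0x11b",
]

print(find_stuck_spawn(sample))
```

On this excerpt the function points at the final tm_msg 0x11b, the request that never got its loader line.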

The other half of the proof was already up in the trace, during the mount: long, regular bursts of exactly four opcodes — SYNC_WAIT, PULSE_SEND, PULSE_FETCH, SYNC_WAKE, dozens of times. Those are messages to taskman. That is how fs-qrv and devb-nvme coordinate an NVMe I/O completion: the driver's interrupt thread waits, and on the MSI completion it wakes a condition variable — a Sync object — and the wakeup travels through taskman, which is the server that implements SyncWait / SyncWake and pulse delivery.

The deadlock

Put the two halves together.

During the mount, the blocked reader was fs-qrv — a separate process — and taskman was free to sit in its dispatch loop and service the Sync and pulse traffic that carried each NVMe completion. Every block read finished.

On the spawn, the blocked reader is taskman itself. It sends a read request to fs-qrv and blocks waiting for the reply. fs-qrv issues the NVMe read; the completion needs a SYNC_WAKE delivered through taskman — and taskman is parked, waiting for fs-qrv's reply. It cannot service the wake that fs-qrv is waiting on. Circular wait. The board goes quiet.
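The difference between the two cases can be written down as a wait-for graph: an edge from A to B means A cannot make progress until B does something. This is a toy model under stated assumptions, not anything extracted from QSOE. The edges are a simplification of the chain described above, and find_cycle is a hypothetical helper written for this illustration.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.
    Plain depth-first search; fine for a graph of three or four nodes."""
    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in waits_for:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

# During the mount: fs-qrv is the blocked reader, taskman waits on nobody.
mount = {
    "fs-qrv": ["devb-nvme"],     # waiting for the block read
    "devb-nvme": ["taskman"],    # completion wake is delivered via taskman
    "taskman": [],               # free in its dispatch loop
}

# During the spawn: taskman itself blocks on fs-qrv's reply.
spawn = {
    "taskman": ["fs-qrv"],
    "fs-qrv": ["devb-nvme"],
    "devb-nvme": ["taskman"],
}

print(find_cycle(mount))
print(find_cycle(spawn))
```

The mount graph has no cycle. The spawn graph differs by a single edge, taskman waiting on fs-qrv, and that one edge closes the loop.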

This is the project's own "no back-traffic to taskman" rule, seen from the other side. The rule normally reads: a resource manager that taskman forwards to must not call back into taskman before replying. Here the violation is inverted: taskman is the blocking client of a chain (fs-qrv → devb-nvme → NVMe completion) that loops back to taskman for the very synchronization it needs to make progress. Same cycle, entered from the top instead of the bottom.

That also explains why the ACCESS test a few messages earlier survived. It is an fs operation too, but it was served from the inode table the mount had already pulled into fs-qrv's memory: no new NVMe read, no completion, no Sync round-trip, no deadlock. The spawn is the first request that forces a data-block read of a file the mount had not cached, and so the first request to drive the NVMe completion path while taskman is the one blocked on it.

Why QEMU and Skimmer escape it

The asymmetry I started with falls straight out of the mechanism — and this is where the two-variant design stops being a build option and becomes a diagnostic instrument.

QEMU/L doesn't hang because under QEMU the block device is devb-virtio, not devb-nvme, and virtio-mmio completion does not transit taskman's Sync the way the NVMe interrupt-thread-plus-condvar path does. taskman can block on a virtio-backed read without parking the machinery that read needs.

QSOE/N doesn't hang on the same board because Skimmer's synchronization primitives are kernel-mediated, not taskman-mediated. When NQ's taskman blocks on an fs-qrv read, the completion's wake runs through the kernel, not through the blocked taskman. The cycle never forms.

"QEMU works, hardware hangs" and "NQ works, LQ hangs" aren't noise to explain away. They're two independent constraints any correct theory has to satisfy. The deadlock satisfies both — virtio vs. NVMe completion, kernel-mediated vs. taskman-mediated Sync. The I-cache theory satisfied only the first. That, on its own, was enough to know the I-cache patch would have been a day spent modifying a verified kernel to fix a bug it doesn't have.

And it puts the bug exactly where the whole project says the interesting bugs live: not in seL4, not in the shared userspace, but in the seam — in a decision about where a QNX primitive lives. Putting Sync in taskman is reasonable, QNX-shaped, and works perfectly until the day taskman needs to be a client of something that itself needs Sync. The shared userspace never had to know. Two kernels running the identical userland is what made the seam legible: the variant that mediates Sync in the kernel is fine, the variant that mediates it in taskman is not, and the diff between those two sentences is the diagnosis.

The fix, in outline

The cause is pinned; the cure isn't landed yet, and I'd rather record the reasoning now and the patch when it lands. The shape is clear: taskman must not be a blocking client of a chain that loops back to it for synchronization. The candidates, in rough order of preference:

  • Offload the spawn-image read to a helper thread. The dispatch loop stays live to service the SYNC_WAKE and pulse traffic the read's NVMe completion depends on, while a worker owns the blocking fs-qrv read and hands the loaded image back. Keeps Sync-in-taskman intact.
  • Make the fs-image read asynchronous: the same idea, expressed as a state machine in the dispatch loop instead of a second thread.
  • Route NVMe completion off the taskman path entirely, so no fs read ever depends on taskman being schedulable. The largest change, and the one that most questions where Sync belongs at all.
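The first candidate can be illustrated with a toy model in ordinary threads. This is only a sketch of the shape of the fix, not the eventual patch: the message names SPAWN and SPAWN_DONE, the functions, and the queue-based inbox are all hypothetical, and a timeout stands in for the real wedge (which never times out) so that the inline case returns instead of hanging.

```python
import queue
import threading

def fs_read(inbox, timeout=0.5):
    """Toy fs-qrv read: it completes only once the dispatch loop
    services the SYNC_WAKE that the read's completion depends on."""
    woken = threading.Event()
    inbox.put(("SYNC_WAKE", woken))
    return b"image" if woken.wait(timeout) else None

def dispatch_loop(inbox, offload):
    """Toy taskman loop. Returns the loaded image, or None if the read wedged."""
    while True:
        kind, payload = inbox.get()
        if kind == "SYNC_WAKE":
            payload.set()                      # deliver the wake
        elif kind == "SPAWN":
            if offload:
                # A worker owns the blocking read; the loop stays live.
                threading.Thread(
                    target=lambda: inbox.put(("SPAWN_DONE", fs_read(inbox))),
                    daemon=True,
                ).start()
            else:
                # The loop is parked inside the read, so nobody can
                # service the SYNC_WAKE the read is waiting for.
                return fs_read(inbox)
        elif kind == "SPAWN_DONE":
            return payload

def run(offload):
    inbox = queue.Queue()
    inbox.put(("SPAWN", None))
    return dispatch_loop(inbox, offload)

print("inline read:   ", run(offload=False))
print("offloaded read:", run(offload=True))
```

With the read inline, the wake sits unserviced in the inbox until the timeout fires. With the read on a worker, the loop handles the wake, the worker finishes, and the image comes back as an ordinary message. The second candidate would keep the same message flow and replace the worker thread with per-request state in the loop.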

One cheap thing is worth doing before any of them: a single-hart build. If the board still hangs on one hart, every residual cross-hart and coherence theory — the I-cache lead chief among them — is ruled out for good and the deadlock diagnosis stands completely alone.

What I'd keep from this

If there's one thing to take from the hunt, it's the order of operations, not the bug:

  1. Check the sister project's history first. The I-cache lead came out of QRV's log by commit hash in one search. The rule earns its keep even when the lead is wrong — it's far cheaper to rule a known cause in or out than to rediscover it.
  2. Don't patch a verified kernel on a fingerprint. A matching symptom is a hypothesis. Confirming cost one more boot; guessing wrong would have cost a permanent, gratuitous modification to the one component whose value is that it's unmodified.
  3. The last message before the silence is the diagnosis. tm_msg 0x11b with no spawn: after it said "wedged inside the spawn, before the load," and that was the whole answer. Everything else was reading the run-up to confirm it.
  4. Let the asymmetry name the mechanism. Two independent "works here, fails there" constraints narrow the search far faster than staring at the wedge does.

The seam between a QNX userspace and a capability kernel is mostly a set of decisions about where each primitive lives, and most of them are invisible until the day a primitive's owner has to become a client of the primitive. This was the first such day on real storage. It won't be the last.

The bug-hunt was paired work with Claude Code — it found the QRV commit by hash; I made it stop before patching the kernel. Both halves were the point.
