QSOE/L: the same userspace, on seL4

The three QSOE/N posts ended with a promise — the other half of the umbrella, QSOE/L, the same userspace running on seL4. This is that post. It is also, in a sense, the first post, because QSOE/L is where the whole thing began. Tomorrow is one month to the day since I started the real QSOE, and the first thing that booted was an seL4 image. QSOE/N, the from-scratch-kernel variant I wrote about first, is the younger sibling. I led with it because the milestone there was loud: a microkernel I wrote myself taking its first breath on real silicon. QSOE/L's story is quieter and, for the shared userspace, more foundational. Everything that runs on Skimmer today was made to run on seL4 first.

No kernel to write

On seL4 there is no kernel to write, and that is the entire point. seL4 is a verified, capability-based microkernel; the job was never to build a kernel but to build a QNX-Neutrino-shape operating system on top of one. seL4 has opinions, and the strongest of them is about IPC. Its IPC is a protected procedure call — a control-flow primitive, in Heiser's framing, not a data pipe — and bulk data is expected to move through shared memory. Every message in the system rides a fixed-size, per-thread IPC buffer.

That single constraint shaped the whole design, because QSOE's wire protocol and its MsgSend / MsgReceive / MsgReply surface had to be honest about it rather than pretend the kernel offered a fat pipe it doesn't. This is the same fixed-buffer limit that led Andrew Warkentin to fork seL4 outright for UX/RT. I chose, for now, to live within it and push bulk data out of band.

taskman, as an seL4 rootserver

QSOE/N's taskman is the procnto analogue, spawned before any other user thread. QSOE/L's taskman is the same idea wearing seL4's clothes: it is the rootserver. seL4 hands it the BootInfo — every untyped memory region on the board, the initial CSpace, the initial caps — and from there taskman builds everything else by hand. There is no libsel4 runtime under it in the early versions; the first commits carried hand-rolled invocation wrappers whose marshalling I checked against the RISC-V output of upstream's syscall_stub_gen.py. That matters, because getting a CNode guard field shifted by six bits gives you a child process whose every capability lookup silently lands in a null slot. It did, once. Bits 0–5 are the guard size; bits 6 and up are the guard value. That one cost an evening.

The division of labour is the part worth stating plainly, because it is the seL4-specific heart of the design:

  • taskman holds the master Endpoint caps and all the registries — channels, connections, processes, paths. When a process calls ChannelCreate, taskman mints a badged Send cap; on ConnectAttach, another. taskman owns the sole master cap to each channel, so a single Revoke on exit tears the whole thing down.
  • libqsoe — now folded into libc — keeps only per-process handle tables: chid to slot, coid to slot. It never sees a master cap. For anything that mints or destroys, it talks to taskman over the wire.

That split is what lets the QNX API sit on capabilities cleanly. MsgSend is a badged seL4_Call, and the badge taskman sees on arrival is the caller's pid, minted into the cap at spawn — so taskman attributes every request without trusting a byte of the payload. MsgReceive and MsgReply are seL4_Recv and seL4_ReplyRecv. On the non-MCS kernel I started with, the reply cap was the per-thread caller slot, which is why MsgReply took an rcvid only for QNX symmetry; the real cap was implicit. That changed with MCS — more below.

The QNX touches go all the way down. Every process is born with a pre-bound SYSMGR_COID — the side-channel connection to taskman, bit 30 set, matching QNX's _NTO_SIDE_CHANNEL convention — so the fd allocator can hand out 0/1/2 for stdio without ever colliding with it. That side channel is load-bearing for the fd ≡ coid identity the whole resource-manager wire is built on.

From four IPC calls to a shell

The first month's worth of versions is the QNX userland coming up one layer at a time, each layer a tag.

v0.2 was the four channel/connection calls running inside taskman. v0.3 split them across the IPC boundary: a second process, tester, spawned by a hand-rolled ELF loader that retypes the child's whole object set and scratch-maps each segment through taskman's own VSpace; then the Msg* trio; then real cross-process round-trips; then the introspection calls and the side-channel coid.

SMP arrived at v0.3.4 as what it is on seL4 — a build flag. -DSMP=TRUE -DNUM_NODES=4, a five-minute kernel rebuild, secondary harts brought up over OpenSBI HSM. The QNX threading API at v0.4 — ThreadCreate / Join / Cancel / Ctl, TLS through the RISC-V tp register, a trampoline that calls ThreadDestroy on return — was correct under SMP from the first commit. That is not a thing to be proud of; it is seL4 doing its job. The SMP work worth being proud of is on the Skimmer side, designed in rather than configured on, and that is the QSOE/N story.

Then the rest of what an OS needs: a process lifecycle with a real pid allocator; QNX pulses as async fixed-size messages, delivered through per-channel notifications bound to the owner's TCB; argv/envp built into a SysV initial stack; a path manager as a prefix-tree registry; /dev/console as the first resource manager; stdio inherited as three badged caps minted into every child at spawn. At v0.5 musl linked and the first printf crossed the wire to the console one byte at a time. At v0.6 the embedded CPIO archive became cpiofs, a read-only filesystem mounted at /, and /sbin/init spawned the first userland driver — devc-ser8250, a 16550 UART resource manager taking PLIC interrupts. At v0.6.4 qsh printed its first #. By v0.7 it was a usable shell on the real UART, with a line discipline that handles backspace and erase, and /sbin/pipe as the first non-driver system program.

Every one of these came with a cap-leak check — delta zero over a hundred channel create/destroy cycles — because on a capability kernel a leak is not a slow memory creep, it is a hard wall you hit at a fixed count.

What seL4 makes you confront

A capability microkernel hands you safety and hands you bookkeeping. Two problems a from-scratch kernel hides inside its own allocator, seL4 puts squarely in the rootserver's lap — and the channel-cycle check above didn't catch either of them, because they live in the spawn path.

Memory. Every object a process needs — TCB, page tables, image frames, stack, anonymous mmap pages — is retyped out of untyped memory. An untyped block's free watermark only rewinds when the block is empty, and taskman's own permanent objects keep the big block populated forever. So for a while every spawn advanced the watermark for good and the system died after a few dozen launches. The fix is to give each process its own untyped block, route all its retypes through it, and Revoke the block on exit — which leaves it child-free, so seL4 resets the watermark and the memory comes back. Blocks are carved across every RAM untyped on the board, not just the largest, so the whole board is reachable.

CSpace. The same shape one level up. taskman held a cap to every object it retyped, in its flat 4096-slot root CNode, and the ~150 image-frame caps per spawn never came back — capping the system around thirty launches. The write-once frame caps now move into a per-process object CNode at the end of spawn; their root slots return to the free list at once, and on exit the whole per-process CNode is Revoked, freeing them together.

Crashes. A spawned thread with no fault endpoint that dereferences NULL does not die — it wedges in the kernel with "no fault handler." Each process now gets a badged Send+GrantReply cap to taskman's primary endpoint installed as its fault handler; taskman spots the fault badge, logs it, and takes the process down with WIFSIGNALED / SIGSEGV, so the parent's waitpid sees a clean signal death.

Time. The biggest single shift was moving onto seL4's MCS kernel — time as a first-class citizen. The hand-rolled invocation layer grew reply objects (the reply cap rides register a6 now; a deferred reply moves the reply object aside instead of saving the caller), and every thread runs on a bound scheduling context — a TCB cannot run without one, and affinity comes from the per-core SchedControl. This is the variant difference that matters most. On the native kernel I own the timer; on seL4-MCS the kernel owns it for scheduling, so until taskman grows a budget-throttled tick thread, an idle nanosleep sleeps by polling the RISC-V time CSR and yielding each turn. That is an honest stopgap, marked LQ-only, and the real kernel timer is a v1.0 item.

The convergence

For the first three weeks, QSOE/L's userland lived in its own tree — userland/taskman, userland/libqsoe, userland/libc, the utilities. Then, on June 2, it collapsed. The umbrella reorganization moved every variant-independent program out of the L tree into shared sibling repos — quser for the userland, a shared libc, a shared libtaskman — and left behind exactly two things on the L side: taskman, and the per-kernel libc seam.

That seam is the whole bet of the project, so it is worth being concrete about where it lands. The shared libc carries the OS-independent body: the musl-derived POSIX entry points, stdio, the string and math code. The L seam under lq/libc/qsoe/ carries the seL4-specific half — the QNX-native API (Channel*, Connect*, Msg*, Thread*, Sync*), the mmap wrappers that turn into TM_REQ_MMAP wire calls, signals as per-process pulse dispositions, posix_spawn over the spawn wire. When the shared userspace needs something the L kernel cannot do yet — sysinfo before taskman could assemble the record, the Timer* family before the MCS tick lands — the seam provides the symbol as a stub that announces itself loudly on first call (STUB: <name>) and returns ENOSYS, so the one shared binary keeps linking and loading and the gap shows up in the boot trace instead of as a link error. The same qsh, the same ls, the same conformance suite run on both kernels; only taskman and that thin seam know which one they are on.

This is the payoff for having designed the wire protocol to be honest about seL4's constraints back at v0.2. A choice that only made sense on a from-scratch kernel would have broken the shared half.

Hardware, in QEMU

By mid-May QSOE/L had a PCI stack: a /sbin/pci-server that maps the ECAM window with MAP_PHYS, scans bus 0, and serves /dev/pci; a libpci whose client API mirrors QRV's; an FDT walker in taskman that decodes the ECAM, the windows, and the PLIC vectors out of the device tree; and an in-taskman resource-manager database under the QRV/QNX rsrcdbmgr_* names. On qemu-virt it enumerates the virtio-net device and the emulated NVMe controller — 1b36:0010, class 0x010802 — which is the doorway the NVMe driver and a real filesystem go through next.

Around it there is a system logger (/sbin/slogger, a 64 KiB event ring with the QRV-compatible slogf surface and severity codes), a synthetic /dev that lists its char devices with Linux-recognisable major/minor numbers, /proc and /sys served from the shared libtaskman cores, and the BSD-style boot where /sbin/init is a shell script and the shebang machinery lives in the spawn path.

Where that leaves it

One month in, QSOE/L is an interactive system in QEMU. It boots seL4-MCS on four harts, brings up taskman, runs /sbin/init as a script, starts the logger and the PCI server and the UART driver, and drops you at a qsh prompt where ls, ps, cat /proc/<pid>/info, sysinfo, and shutdown all work. Processes spawn and exit and give their memory and their CSpace slots back. A NULL deref takes the process down and not the system. Dynamic linking works end to end — rtld, a shared libc.so, qsh.

What it is not yet is on hardware. The FU740 build exists — make PLAT=hifive produces a second qsoe.elf for the SiFive Unmatched, with the board's FirstHartID=1 and per-platform kernel build dirs — but it has not been booted on the board. That is the inversion I opened with: QSOE/N reached real silicon first, and QSOE/L, the elder variant, is still in the emulator. The Unmatched is on my desk. The bring-up checklist QRV wrote in blood — image header, optional PCI, declarative physmem, asymmetric topology, fence.i — applies just as much to the seL4 image, and fence.i in particular is a cost QEMU will never make me pay and the U74s will.

So the near work is two-pronged: the NVMe driver and a filesystem on top of the PCI stack that already enumerates in QEMU, and the kernel timer story that turns the polling nanosleep stopgap into a real MCS tick. And then the same thing I did for QSOE/N — carry the image across to the board and find out which of my QEMU assumptions were lies.

As with the rest of QSOE, this was paired work: Claude Code as co-author, with most of the seL4 invocation-layer archaeology and the reclamation bug-hunts done together.

Comments

Popular posts from this blog

How QSOE started

QSOE/N, part 2: from a kernel to an operating system