QSOE/N, part 2: from a kernel to an operating system

Part 1 ended with Skimmer at v0.5: a sound microkernel with QNX-shape synchronous IPC, and nothing running on top of it. This post covers the rest: how Skimmer became QSOE/N, an actual operating system with processes, a C library, drivers, and a conformance suite that completes a full lap. All of it is still in QEMU; real hardware is part 3.

The umbrella

QSOE — "Quick and Secure Operating Environment" — is a QNX-Neutrino-style operating environment for 64-bit RISC-V, built entirely under Apache-2.0. It comes in two variants that share almost everything:

QSOE/N (project NQ), the native-kernel variant, running on Skimmer.
QSOE/L (project LQ), running the same userspace on the seL4 microkernel.

The split is deliberately thin. The kernel differs, the system process (taskman) differs, and a small OS-dependent slice of libc differs. Everything else — the C library body, the drivers, the utilities — is shared between the two systems. That constraint shaped a lot of the decisions below, because a wire format or an API choice that only made sense on Skimmer would have broken the shared half.

On May 30 I migrated Skimmer into the umbrella as QSOE/N. The repository was restructured into the layers it still has: include/skimmer/ (the kernel headers), kernel/lwkt/ (the DragonFly-inspired substrate — the heart), kernel/arch/riscv/ (the only place that names "riscv", Sv39, or SBI), kernel/ (arch-neutral top level), taskman/, the libc seam, and the test suite. The principle the layering encodes: per-hart state, no cross-CPU locks on the hot path, cross-hart work flows as messages. A future port — x86-64 is the obvious one — lands as a sibling kernel/arch/ directory and nothing above it should care.

taskman — the procnto analogue

taskman is QSOE/N's answer to QNX's procnto: a single statically-linked U-mode binary, loaded from the boot CPIO archive by the kernel and spawned before any other user thread. It owns the well-known channel (TASKMAN_CHID, encoded (gen=1, idx=1) = 65537) and runs a single-threaded MsgReceive → dispatch → MsgReply loop — the same shape the QSOE/L taskman runs on seL4. It owns process creation, the path namespace, memory mapping, credentials, and waitpid; it is not in the data path for ordinary file I/O, which matters a lot in the next section.
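The (gen=1, idx=1) = 65537 encoding is consistent with a generation count sitting above a 16-bit slot index, since 65537 is 0x10001. A minimal sketch of that packing follows; the helper names and the exact field widths are assumptions for illustration, not Skimmer's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: generation above bit 16, slot index in the low 16 bits.
 * This reproduces (gen=1, idx=1) == 65537 but is otherwise a guess. */
#define CHID_IDX_BITS 16u
#define CHID_IDX_MASK ((1u << CHID_IDX_BITS) - 1u)

static_assert(CHID_IDX_BITS < 32u, "index field must fit in a 32-bit chid");

/* Pack a generation and a slot index into one channel id. */
static inline uint32_t chid_encode(uint32_t gen, uint32_t idx)
{
    return (gen << CHID_IDX_BITS) | (idx & CHID_IDX_MASK);
}

/* Unpack the two halves again. */
static inline uint32_t chid_gen(uint32_t chid) { return chid >> CHID_IDX_BITS; }
static inline uint32_t chid_idx(uint32_t chid) { return chid & CHID_IDX_MASK; }
```

A generation field like this lets a stale id for a recycled slot be told apart from the slot's current occupant.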

There is, by design, no SYS_SPAWN syscall. Ever. Process creation is taskman's TM_REQ_SPAWN, which composes kernel primitives: create a VSpace, map zero or more regions into it, create a thread in it. The kernel provides mechanism; taskman provides the policy that adds up to a process.

fd === coid: the resource-manager wire

This is the architectural centre of QSOE/N, and the thing I'm most pleased with.

In QSOE/N a POSIX file descriptor is a kernel connection id. coid is the slot index in the calling process's connection pool; the integer you get back from open() is that coid handed straight back as the fd. Every per-fd operation is then a single MsgSend directly to the channel that owns the connection:

open() canonicalises the path, asks pathmgr once for (server_pid, server_chid), ConnectAttaches, and sends _IO_CONNECT so the server can build its per-coid OCB. The returned coid is the fd.
read() / write() / fstat() / lseek() / close() are each one MsgSend(fd, _IO_*) straight to the resource manager.
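To make the "fd is the coid" point concrete, here is a small sketch of a read() that is nothing but one send on the fd. The transport is a mock, and the request struct, the REQ_IO_READ constant, and the function names are all hypothetical; the real wire format and MsgSend signature are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request header, standing in for the real wire format. */
enum { REQ_IO_READ = 1 };
struct io_req { int type; size_t nbytes; };

/* Mock transport standing in for MsgSend: records the coid it was given
 * and replies with bytes from a fixed string, like a tiny resmgr. */
static int last_coid = -1;
static int send_count = 0;

static long mock_msg_send(int coid, const struct io_req *req,
                          void *reply, size_t rbytes)
{
    static const char data[] = "hello";
    size_t n = req->nbytes < rbytes ? req->nbytes : rbytes;
    if (n > sizeof data - 1)
        n = sizeof data - 1;
    assert(req->type == REQ_IO_READ);
    last_coid = coid;
    send_count++;
    memcpy(reply, data, n);
    return (long)n;
}

/* read(): the fd is the coid, so there is no fd-table lookup and no
 * detour through taskman. One send, straight to the connection's owner. */
static long sketch_read(int fd, void *buf, size_t nbytes)
{
    struct io_req req = { REQ_IO_READ, nbytes };
    return mock_msg_send(fd, &req, buf, nbytes);
}
```

The point of the shape is what is absent: no translation from fd to some other handle, and no second hop.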

taskman is no longer in the read/write/close/fstat/lseek path at all. It answers the structural calls — open, spawn, waitpid, mmap, pathmgr registration — and gets out of the way for the bytes. dup is ConnectServerInfo (QNX shape — it resolves a coid to its (nd, pid, chid, scoid) tuple) followed by a fresh ConnectAttach at the requested slot. fcntl(F_GETFD/F_SETFD) goes to a kernel syscall directly. The whole universal-forwarder scaffolding that an earlier draft of taskman carried — forward_io_read, the per-process fd table, the handle_fcntl/handle_dup opcodes — got deleted once this landed, because nothing routed through it anymore.

The wire itself is unified to a single qsoe_ipcbuf layout shared with QSOE/L: a tag in the first eight bytes, four scalar words, then variable payload. The typed per-request structs went away. Making the wire the same shape on both kernels is what keeps the shared-userspace promise honest.
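As a rough picture of that layout, the sketch below lays out a tag, four scalars, and a trailing payload. Only the ordering comes from the description above; the struct name, the field names, and the choice of 64-bit scalar words are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field names and a 64-bit scalar width; only the ordering
 * (8-byte tag, four scalars, variable payload) comes from the text. */
struct ipcbuf_sketch {
    uint64_t tag;        /* first eight bytes: message tag */
    uint64_t scalar[4];  /* four scalar words */
    uint8_t  payload[];  /* variable-length payload follows */
};

/* Pin the offsets so a layout change is a build error, not a wire bug. */
static_assert(offsetof(struct ipcbuf_sketch, tag) == 0, "tag leads the buffer");
static_assert(offsetof(struct ipcbuf_sketch, scalar) == 8, "scalars follow the tag");
static_assert(offsetof(struct ipcbuf_sketch, payload) == 40, "payload follows the scalars");

/* Total bytes on the wire for a given payload length. */
static inline size_t ipcbuf_wire_size(size_t payload_len)
{
    return offsetof(struct ipcbuf_sketch, payload) + payload_len;
}
```

With one fixed header, a request type is a tag value plus a convention for the scalars rather than a new struct per request.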

Getting ConnectServerInfo right was a real bug, not a refactor. The QNX 8 contract is that it returns the matched coid (a miss scans upward for the next-higher live connection), and info.pid carries the channel owner's real pid. libc's F_DUPFD feeds that (pid, chid) straight back into ConnectAttach. While the kernel still gated ConnectAttach on "pid must be 0", the now-real pid made every dup fail — which surfaced as qsh being unable to open /sbin/init. The fix was to teach ConnectAttach the QNX (pid, chid) addressing it always should have had.
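The "return the matched coid, scanning upward on a miss" contract can be sketched over a toy connection pool. The pool size, struct names, and the plain -1 failure value are simplifications chosen for this example; the kernel's real data structures and error returns will differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SLOTS 8   /* toy pool size, chosen for the example */

struct conn { bool live; int pid; int chid; };
struct server_info { int pid; int chid; };

/* Returns the coid that matched: the requested one if it is live,
 * otherwise the next-higher live coid. Returns -1 if none is left. */
static int sketch_server_info(const struct conn pool[POOL_SLOTS], int coid,
                              struct server_info *info)
{
    assert(info != NULL);
    if (coid < 0)
        return -1;
    for (int i = coid; i < POOL_SLOTS; i++) {
        if (pool[i].live) {
            info->pid  = pool[i].pid;   /* the channel owner's real pid */
            info->chid = pool[i].chid;
            return i;
        }
    }
    return -1;
}
```

A caller can tell a hit from a miss by comparing the returned coid with the one it asked for, and the (pid, chid) pair it gets back is what a subsequent attach would address.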

Multi-process: per-process VSpaces and spawn

For all of v0.1–v0.5 there was one address space. Real processes need their own. The kernel gained a per-process VSpace allocator (struct sk_vspace — root PT, ASID, refcount), td_vspace plumbed through lwkt_switch so satp + sfence land on any context switch that crosses VSpaces (a plain pointer compare; same-VSpace switches stay zero-cost), and a set of privileged TM_PRIV_VSPACE_* ops for taskman to drive.
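The pointer-compare gate on the switch path is simple enough to model in a few lines. In this sketch the satp write and sfence are replaced by a counter, and every name other than td_vspace is a stand-in invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct vspace { unsigned long satp; };          /* stand-in for struct sk_vspace */
struct thread { struct vspace *td_vspace; };

static struct vspace *current_vspace = NULL;
static int satp_writes = 0;                     /* counts simulated satp + sfence */

/* Stand-in for the real satp write and sfence.vma pair. */
static void install_vspace(struct vspace *vs)
{
    assert(vs != NULL);
    current_vspace = vs;
    satp_writes++;
}

/* Called on the switch path: a pointer compare decides whether the
 * address space actually changes. Same-VSpace switches skip the work. */
static void switch_vspace(const struct thread *next)
{
    if (next->td_vspace != current_vspace)
        install_vspace(next->td_vspace);
}
```

Two threads of the same process switching back and forth never touch satp in this model, which is the zero-cost property described above.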

The spawn flow is a small dance. taskman creates a fresh child VSpace, then loads the image — ELF segments, the dynamic loader, libc.so, stack, TCB — into its own boot page table, where direct stores for memcpy and relocations Just Work, and then hands ownership of the loaded pages to the child via TM_PRIV_VSPACE_TRANSPLANT. Transplant walks each leaf in the VA window, installs the same PA at the same VA in the child's root preserving perms, and zeros taskman's slot. The pages aren't freed — they just belong to the child now, and taskman's load VAs are free again for the next spawn. Then a thread is created against the child VSpace and runs.
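The transplant step can be modelled with two flat arrays standing in for page tables. This is a toy: real Sv39 tables are three-level, and the function name, the 16-page window, and the "zero means unmapped" convention are assumptions made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 16                 /* toy address space: 16 pages, flat table */
typedef uint64_t pte_t;           /* 0 means "not mapped" in this model */

/* Move every mapped leaf in [first, first + count) from src to dst at the
 * same index, keeping the PTE (PA and permission bits) unchanged, and
 * clear the source slot. Returns the number of pages moved. */
static size_t sketch_transplant(pte_t src[NPAGES], pte_t dst[NPAGES],
                                size_t first, size_t count)
{
    size_t moved = 0;
    assert(first <= NPAGES && count <= NPAGES - first);
    for (size_t vpn = first; vpn < first + count; vpn++) {
        if (src[vpn] == 0)
            continue;             /* hole in the window: nothing to move */
        dst[vpn] = src[vpn];      /* same PA, same VA, same perms */
        src[vpn] = 0;             /* the loader's slot is free again */
        moved++;
    }
    return moved;
}
```

Nothing is copied and nothing is freed: ownership of the frame moves with the PTE, and the source window is empty and reusable afterwards.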

One preparatory sweep deserves a mention because it's invisible until it isn't: every kernel-side dereference of a physical address had to move from an identity-low mapping to the canonical-high mapping that every VSpace shares (the analogue of QRV's PHYS_TO_PTR sweep). While there was one address space, identity-low and high-half resolved to the same bytes and it didn't matter. The moment a per-process VSpace switch happens, only the high-half path stays valid. Doing that sweep before the spawn flow landed is the difference between a clean bring-up and a week of "why is the kernel reading user memory."
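For readers who have not met this pattern, a high-half direct map boils down to a fixed offset between physical and kernel-virtual addresses. The sketch below uses the lowest canonical high-half address in Sv39 as the base; that choice, and both function names, are assumptions rather than Skimmer's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the shared high-half direct map. The value is the
 * lowest canonical high-half address in Sv39; the real base is whatever
 * Skimmer's layout picks. */
#define SKETCH_DIRECT_MAP_BASE UINT64_C(0xFFFFFFC000000000)

static_assert((SKETCH_DIRECT_MAP_BASE >> 38) == 0x3FFFFFF,
              "bits 63..38 all set: canonical Sv39 high half");

/* Kernel virtual address at which physical address pa is visible in
 * every VSpace. */
static inline uint64_t sketch_phys_to_virt(uint64_t pa)
{
    return SKETCH_DIRECT_MAP_BASE + pa;
}

/* Inverse mapping, for addresses inside the direct map. */
static inline uint64_t sketch_virt_to_phys(uint64_t va)
{
    return va - SKETCH_DIRECT_MAP_BASE;
}
```

Because the same high-half entries are present in every root table, an address produced this way stays valid across a satp switch, which an identity-low address does not.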

End to end: /sbin/init spawns slogger, a PCI server, a serial driver, and qsh into four distinct VSpaces.

Drivers leave the kernel

The in-kernel PCI scanner and NVMe driver from v0.4 were always meant to be temporary — bring-up convenience. Nothing about either needs privileged mode, and keeping them in the kernel cut against the whole stance: the kernel is the absolute minimum, drivers are resource managers in U-mode. So both got deleted from the kernel. PCI moved to a userland ECAM library any resmgr can link; NVMe is headed for a devb-nvme resmgr.

After that, the kernel owns exactly: harts and threads, VSpaces and Sv39 page tables, channels and per-process connection pools, the PLIC and IST glue, the FDT walker, and the SBI console seam. No device drivers. That's the line I want to hold.

The POSIX surface, and the conformance suite

With multi-process working, the rest was filling in the system-call surface and proving it against a conformance suite that runs as an ordinary user program. Highlights:

One error vocabulary. The old dense internal error dialect (SK_E*) was retired in favour of negated POSIX-shape errno values, Linux-style (return -EINVAL;), single-sourced in a kernel errno header that mirrors the libc ABI and is pinned by a _Static_assert so the two can't drift. This was prompted by a real debugging cost: an internal -2 travelled through the seam and strerror reported "No such file or directory" for what was actually a bad-argument rejection.
Threads. Real tids, ThreadJoin (the full QNX 8 error matrix: ESRCH/EINVAL/EDEADLK/EBUSY), and ThreadCancel/ThreadCtl with QNX's deferred-cancellation model: every blocking call is a cancellation point except MsgSendvnc and SyncMutexLock. A worker thread that returns from its entry function now ends in ThreadDestroy(self, retval) instead of jumping to ra=0 and faulting the kernel.
Timers. TimerCreate/Destroy/SetTime/Timeout, QNX shape, driven off the existing 1 kHz tick. Signals are pulses in QSOE, so a timer must name its pulse target — event == NULL is -EINVAL, there is no SIGALRM default. nanosleep rides TimerTimeout directly, with no taskman round-trip, because pure time needs no resource manager.
Signals as pulses, end to end, including SIGCHLD fired at the parent's signal channel on child exit — sent before the waiter release so a SIGCHLD handler that calls waitpid finds the zombie in place.
Process lifecycle. PROC_EXIT, real WEXITSTATUS, waitpid(-1)/WNOHANG, full end-of-process teardown: owned channels die hard (in-flight senders complete with -ESRVRFAULT instead of parking forever), connection pools drain, page tables and user frames return to the pool, children reparent to pid 1. Then the housekeeping that a long soak forces out — per-process mmap VA cursors (one global never-rewound cursor was burning ~17 MiB of anonymous window per suite lap), pool telemetry, and a round-robin pid recycler so TM_MAXPID caps simultaneous processes rather than lifetime spawns.
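The "one error vocabulary" item in the list above lends itself to a short sketch: kernel-side constants pinned to the libc values at compile time, and a seam that turns a negated return into errno. The SK_-prefixed names and both helper functions here are hypothetical, and this example checks against the host's errno.h rather than QSOE's own headers.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical kernel-side names; in a real tree these would live in the
 * kernel errno header rather than next to <errno.h>. */
#define SK_ENOENT 2
#define SK_EINVAL 22

/* The pin: if either side's value ever changes, the build breaks here. */
static_assert(SK_ENOENT == ENOENT, "kernel ENOENT drifted from libc");
static_assert(SK_EINVAL == EINVAL, "kernel EINVAL drifted from libc");

/* Kernel convention: 0 on success, negated errno on failure. */
static int sketch_check_len(long len)
{
    return len < 0 ? -SK_EINVAL : 0;
}

/* libc seam: turn a negated kernel result into errno plus -1. */
static int sketch_seam(int kret)
{
    if (kret < 0) {
        errno = -kret;
        return -1;
    }
    return kret;
}
```

With the numbering shared, a value that crosses the seam means the same thing on both sides, so a bad-argument rejection can no longer surface as ENOENT.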

A couple of those were proper bugs. SyncSemPost had a lost wakeup — it either woke a parked waiter or incremented the count, never both, so the woken waiter's re-check saw count 0 and parked forever. It was the long-standing suite hang, only fully exposed once nanosleep really slept and the waiter genuinely parked. Fixed Mesa-style: publish the count first, then wake. And the sync objects were keyed by raw user VA, which was fine with one address space and a latent disaster with many — two processes parking on the same address (likely, since identical binaries lay out identically) shared one wait queue and one owner. The key grew an address-space component, with a teardown sweep that releases every sync object of a dying space.
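The lost wakeup is easy to reproduce in a single-threaded model that tracks only a count and the number of parked waiters. Everything in this sketch (the struct, the function names, the idea of running the waiter's re-check inline) is a simplification for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded model: one counter plus a count of parked waiters. */
struct sem_model { unsigned count; unsigned parked; };

/* What a woken waiter does: re-check the count, take it or park again.
 * Returns true if the waiter got through. */
static bool waiter_recheck(struct sem_model *s)
{
    if (s->count > 0) {
        s->count--;
        return true;
    }
    s->parked++;
    return false;
}

/* Buggy post: wakes OR increments, never both. */
static bool post_buggy(struct sem_model *s)
{
    if (s->parked > 0) {
        s->parked--;
        return waiter_recheck(s);   /* sees count 0 and parks again */
    }
    s->count++;
    return false;
}

/* Fixed post, Mesa-style: publish the count first, then wake. */
static bool post_fixed(struct sem_model *s)
{
    s->count++;
    assert(s->count > 0);           /* guard against counter wrap */
    if (s->parked > 0) {
        s->parked--;
        return waiter_recheck(s);   /* sees the published count */
    }
    return false;
}
```

Starting from one parked waiter and a count of zero, the buggy post leaves the waiter parked and the fixed post lets it through.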

When the last unimplemented syscall (SchedYield, which is exactly what lwkt_yield already did) got wired, the conformance suite completed end to end for the first time: 117 tests, 104 passing, 13 failing, zero panics, consistent across 25 consecutive runs on 8 harts. The remaining failures clustered into named work packages and got knocked down over the following days; the suite now soaks for 20-plus laps in a single boot with a flat page-pool floor.

A note on message size

One design choice runs through all of this and is worth pulling out, because it's where the QSOE/N and QSOE/L stories rhyme. A QNX-shape kernel does synchronous IPC; the framing I follow (Gernot Heiser's, on seL4) is that IPC is a protected procedure call for control flow, and bulk data should move out of band. seL4 takes that to a fixed, small message buffer — small enough that Andrew Warkentin forked it for UX/RT partly over the constraint. Skimmer's inline cap is 4 KiB. Past that, QSOE/N copies the payload page-by-page through the other side's page-table physical addresses — no aliased satp window, preemption-safe, no channel lock held — sound only because the sender is REPLY-parked for the whole rendezvous and Skimmer has no demand paging. A 16 MiB ceiling fails runaway lengths fast. It's the pragmatic middle: small messages stay a fast synchronous copy, large ones don't bounce through a scratch buffer, and neither needs shared-memory setup for the common case.
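The size thresholds and the page-by-page walk can be sketched as two small functions. The 4 KiB and 16 MiB figures come from the text; treating both limits as inclusive, splitting chunks at page boundaries on both sides, and all of the names are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_B   4096u
#define INLINE_CAP    4096u                      /* 4 KiB inline cap */
#define XFER_CEILING  (16u * 1024u * 1024u)      /* 16 MiB ceiling */

static_assert(INLINE_CAP < XFER_CEILING, "inline cap must sit below the ceiling");

enum xfer_path { XFER_INLINE, XFER_PAGED, XFER_REJECT };

/* Pick the transfer path for a message of len bytes. */
static enum xfer_path classify_xfer(size_t len)
{
    if (len > XFER_CEILING) return XFER_REJECT;
    if (len > INLINE_CAP)   return XFER_PAGED;
    return XFER_INLINE;
}

/* Size of the next chunk of a page-by-page copy: never cross a page
 * boundary on either side, so each chunk touches one frame per side. */
static size_t next_chunk(size_t src_va, size_t dst_va, size_t remaining)
{
    size_t src_room = PAGE_SIZE_B - (src_va % PAGE_SIZE_B);
    size_t dst_room = PAGE_SIZE_B - (dst_va % PAGE_SIZE_B);
    size_t n = src_room < dst_room ? src_room : dst_room;
    return n < remaining ? n : remaining;
}

/* Number of chunks a paged copy of len bytes takes. */
static size_t count_chunks(size_t src_va, size_t dst_va, size_t len)
{
    size_t chunks = 0;
    while (len > 0) {
        size_t n = next_chunk(src_va, dst_va, len);
        src_va += n;
        dst_va += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

Aligned buffers cost one chunk per page; misaligned ones cost more, because each chunk stops at whichever side reaches a page boundary first.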

Where that leaves it

QSOE/N now boots to a shell, spawns and reaps processes, runs userland drivers as resource managers, and passes a POSIX conformance suite on repeat. It is an operating system. But every line of it has run only in QEMU, which is generous about a class of hardware behaviour it doesn't model at all.

The HiFive Unmatched on my desk is not generous. That's part 3.
