QSOE/N, part 2: from a kernel to an operating system

Part 1 ended with Skimmer at v0.5 — a sound microkernel with QNX-shape synchronous IPC, and nothing running on top of it. This post is the rest: how Skimmer became QSOE/N, an actual operating system with processes, a C library, drivers, and a conformance suite that passes a full lap. All of it still in QEMU; real hardware is part 3.

The umbrella

QSOE — "Quick and Secure Operating Environment" — is a QNX-Neutrino-style operating environment for 64-bit RISC-V, built entirely under Apache-2.0. It comes in two variants that share almost everything:

  • QSOE/N (project NQ), the native-kernel variant, running on Skimmer.
  • QSOE/L (project LQ), running the same userspace on the seL4 microkernel.

The split is deliberately thin. The kernel differs, the system process (taskman) differs, and a small OS-dependent slice of libc differs. Everything else — the C library body, the drivers, the utilities — is shared between the two systems. That constraint shaped a lot of the decisions below, because a wire format or an API choice that only made sense on Skimmer would have broken the shared half.

On May 30 I migrated Skimmer into the umbrella as QSOE/N. The repository was restructured into the layers it still has: include/skimmer/ (the kernel headers), kernel/lwkt/ (the DragonFly-inspired substrate — the heart), kernel/arch/riscv/ (the only place that names "riscv", Sv39, or SBI), kernel/ (arch-neutral top level), taskman/, the libc seam, and the test suite. The principle the layering encodes: per-hart state, no cross-CPU locks on the hot path, cross-hart work flows as messages. A future port — x86-64 is the obvious one — lands as a sibling kernel/arch/ directory and nothing above it should care.

taskman — the procnto analogue

taskman is QSOE/N's answer to QNX's procnto: a single statically-linked U-mode binary, loaded from the boot CPIO archive by the kernel and spawned before any other user thread. It owns the well-known channel (TASKMAN_CHID, encoded (gen=1, idx=1) = 65537) and runs a single-threaded MsgReceive → dispatch → MsgReply loop — the same shape the QSOE/L taskman runs on seL4. It owns process creation, the path namespace, memory mapping, credentials, and waitpid; it is not in the data path for ordinary file I/O, which matters a lot in the next section.
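As a rough illustration of that channel-id encoding, here is a minimal sketch. The 16/16 bit split is an assumption inferred only from the arithmetic (gen=1, idx=1 giving 65537); the helper names are hypothetical and the real field widths in Skimmer may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: generation in the upper 16 bits, slot index in the lower 16.
 * Inferred from (gen=1, idx=1) == 65537, not taken from the Skimmer headers. */
#define CHID_IDX_BITS 16u
#define CHID_IDX_MASK ((1u << CHID_IDX_BITS) - 1u)

/* Pack a generation counter and a slot index into one channel id. */
static inline uint32_t chid_encode(uint32_t gen, uint32_t idx)
{
    assert(idx <= CHID_IDX_MASK);   /* the index must fit its field */
    return (gen << CHID_IDX_BITS) | idx;
}

/* Recover the two fields from a channel id. */
static inline uint32_t chid_gen(uint32_t chid) { return chid >> CHID_IDX_BITS; }
static inline uint32_t chid_idx(uint32_t chid) { return chid & CHID_IDX_MASK; }
```

With this layout a stale id that names a reused slot carries the wrong generation, so a lookup can reject it by comparing generations.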

There is, by design, no SYS_SPAWN syscall. Ever. Process creation is taskman's TM_REQ_SPAWN, which composes kernel primitives: create a VSpace, map zero or more regions into it, create a thread in it. The kernel provides mechanism; taskman provides the policy that adds up to a process.
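To make "mechanism in the kernel, policy in taskman" concrete, here is a toy sketch of a spawn composed from three primitives. Every name here (vspace_create, vspace_map, thread_create, tm_spawn) is a hypothetical stand-in that only records state on the host; none of them is the actual QSOE/N interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel primitives; they only track bookkeeping. */
struct vspace { int live; int regions; };
struct region { unsigned long va, len; };

static struct vspace pool[4];

static struct vspace *vspace_create(void)
{
    for (size_t i = 0; i < 4; i++)
        if (!pool[i].live) {
            pool[i].live = 1;
            pool[i].regions = 0;
            return &pool[i];
        }
    return NULL;
}

static void vspace_destroy(struct vspace *vs) { vs->live = 0; }

static int vspace_map(struct vspace *vs, const struct region *r)
{
    if (r->len == 0)
        return -EINVAL;
    vs->regions++;
    return 0;
}

static int thread_create(struct vspace *vs, unsigned long entry)
{
    (void)entry;
    return vs->live ? 1 : -ESRCH;   /* first tid in the new space */
}

/* The policy: a "process" is a VSpace, zero or more regions, and a thread. */
static int tm_spawn(const struct region *regs, size_t n, unsigned long entry)
{
    assert(regs != NULL || n == 0);
    struct vspace *vs = vspace_create();
    if (!vs)
        return -ENOMEM;
    for (size_t i = 0; i < n; i++) {
        int rc = vspace_map(vs, &regs[i]);
        if (rc < 0) {               /* unwind: nothing half-built survives */
            vspace_destroy(vs);
            return rc;
        }
    }
    int tid = thread_create(vs, entry);
    if (tid < 0)
        vspace_destroy(vs);
    return tid;
}
```

The point of the shape is that the kernel never sees the word "process"; a failed step only has to undo the primitives already applied.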

fd === coid: the resource-manager wire

This is the architectural centre of QSOE/N, and the thing I'm most pleased with.

In QSOE/N a POSIX file descriptor is a kernel connection id. coid is the slot index in the calling process's connection pool; the integer you get back from open() is that coid handed straight back as the fd. Every per-fd operation is then a single MsgSend directly to the channel that owns the connection:

  • open() canonicalises the path, asks pathmgr once for (server_pid, server_chid), ConnectAttaches, and sends _IO_CONNECT so the server can build its per-coid OCB. The returned coid is the fd.
  • read() / write() / fstat() / lseek() / close() are each one MsgSend(fd, _IO_*) straight to the resource manager.
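A toy model of the "fd is the coid" path, runnable on a host. The message struct, the opcode value, and toy_msg_send are all assumptions made for illustration; toy_msg_send merely stands in for a MsgSend that reaches the resource manager owning the connection.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_IO_READ = 1 };               /* hypothetical opcode value */

struct toy_io_read { uint16_t type; uint32_t nbytes; };

/* Toy resource-manager state: one OCB per connection slot. */
struct toy_ocb { const char *data; size_t len, off; int live; };
static struct toy_ocb toy_pool[8];

/* Stand-in for MsgSend: delivers the request to the owner of slot `coid`. */
static long toy_msg_send(int coid, const void *smsg, size_t sbytes,
                         void *rmsg, size_t rbytes)
{
    if (coid < 0 || coid >= 8 || !toy_pool[coid].live)
        return -EBADF;
    const struct toy_io_read *m = smsg;
    if (sbytes < sizeof *m || m->type != TOY_IO_READ)
        return -ENOSYS;
    struct toy_ocb *o = &toy_pool[coid];
    size_t n = o->len - o->off;         /* bytes left in the "file" */
    if (n > m->nbytes) n = m->nbytes;
    if (n > rbytes)    n = rbytes;
    memcpy(rmsg, o->data + o->off, n);
    o->off += n;
    return (long)n;
}

/* libc-side read(): the fd is used, unchanged, as the connection id. */
static long toy_read(int fd, void *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    struct toy_io_read m = { .type = TOY_IO_READ, .nbytes = (uint32_t)n };
    return toy_msg_send(fd, &m, sizeof m, buf, n);
}
```

There is no table lookup between the fd and the connection: the integer the caller holds is the index the send uses.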

taskman is no longer in the read/write/close/fstat/lseek path at all. It answers the structural calls — open, spawn, waitpid, mmap, pathmgr registration — and gets out of the way for the bytes. dup is ConnectServerInfo (QNX shape — it resolves a coid to its (nd, pid, chid, scoid) tuple) followed by a fresh ConnectAttach at the requested slot. fcntl(F_GETFD/F_SETFD) goes to a kernel syscall directly. The whole universal-forwarder scaffolding that an earlier draft of taskman carried — forward_io_read, the per-process fd table, the handle_fcntl/handle_dup opcodes — got deleted once this landed, because nothing routed through it anymore.

The wire itself is unified to a single qsoe_ipcbuf layout shared with QSOE/L: a tag in the first eight bytes, four scalar words, then variable payload. The typed per-request structs went away. Making the wire the same shape on both kernels is what keeps the shared-userspace promise honest.
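One way to picture that layout is the struct below. It is a sketch under assumptions: the text gives only "eight-byte tag, four scalar words, variable payload", so the 64-bit word width, the field names, and the struct name are all guesses rather than the real qsoe_ipcbuf definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: field names and the 64-bit scalar width are assumptions. */
struct qsoe_ipcbuf_sketch {
    uint64_t tag;        /* first eight bytes: what kind of message this is */
    uint64_t w[4];       /* four scalar words */
    uint8_t  payload[];  /* variable-length payload follows the header */
};

/* Pin the offsets so both sides of the wire agree at compile time. */
_Static_assert(offsetof(struct qsoe_ipcbuf_sketch, tag) == 0, "tag leads");
_Static_assert(offsetof(struct qsoe_ipcbuf_sketch, w) == 8, "scalars follow tag");
_Static_assert(offsetof(struct qsoe_ipcbuf_sketch, payload) == 40,
               "payload follows scalars");

/* Total bytes on the wire for a given payload length. */
static inline size_t ipcbuf_total(size_t payload_len)
{
    assert(payload_len <= SIZE_MAX - offsetof(struct qsoe_ipcbuf_sketch, payload));
    return offsetof(struct qsoe_ipcbuf_sketch, payload) + payload_len;
}
```

A single header shape means a server's dispatch switch only ever inspects the tag and the scalars before deciding whether it needs to touch the payload.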

Getting ConnectServerInfo right meant fixing a real bug, not just refactoring. The QNX 8 contract is that it returns the matched coid (a miss scans upward for the next-higher live connection), and info.pid carries the channel owner's real pid. libc's F_DUPFD feeds that (pid, chid) straight back into ConnectAttach. While the kernel still gated ConnectAttach on "pid must be 0", the now-real pid made every dup fail, which surfaced as qsh being unable to open /sbin/init. The fix was to teach ConnectAttach the QNX (pid, chid) addressing it always should have had.
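The scan-upward contract and the dup built on it can be sketched over a toy connection pool. The pool layout, the function names, and the error codes chosen for "nothing found" (-ESRCH) and "no free slot" (-EMFILE) are assumptions for illustration, not the kernel's actual choices.

```c
#include <assert.h>
#include <errno.h>

#define POOL_SLOTS 8

struct conn { int live; int pid; int chid; };
struct server_info { int pid; int chid; };

/* Returns the matched coid; on a miss, scans upward to the next live slot. */
static int connect_server_info(const struct conn *pool, int coid,
                               struct server_info *info)
{
    if (coid < 0)
        return -EINVAL;
    for (int i = coid; i < POOL_SLOTS; i++) {
        if (pool[i].live) {
            info->pid  = pool[i].pid;    /* the channel owner's real pid */
            info->chid = pool[i].chid;
            return i;
        }
    }
    return -ESRCH;                       /* assumed code for "nothing found" */
}

/* Attach to (pid, chid) at the lowest free slot at or above min_slot. */
static int connect_attach(struct conn *pool, int pid, int chid, int min_slot)
{
    assert(min_slot >= 0);
    for (int i = min_slot; i < POOL_SLOTS; i++)
        if (!pool[i].live) {
            pool[i] = (struct conn){ 1, pid, chid };
            return i;
        }
    return -EMFILE;
}

/* dup: resolve the fd to (pid, chid), then attach a fresh connection. */
static int toy_dupfd(struct conn *pool, int fd, int min_slot)
{
    struct server_info info;
    int got = connect_server_info(pool, fd, &info);
    if (got != fd)                       /* a hit on a higher slot is a miss */
        return -EBADF;
    return connect_attach(pool, info.pid, info.chid, min_slot);
}
```

Because the scan can return a different coid than the one asked about, a caller like dup has to compare the result against its argument rather than treat any non-negative return as success.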

Multi-process: per-process VSpaces and spawn

For all of v0.1–v0.5 there was one address space. Real processes need their own. The kernel gained a per-process VSpace allocator (struct sk_vspace — root PT, ASID, refcount), td_vspace plumbed through lwkt_switch so satp + sfence land on any context switch that crosses VSpaces (a plain pointer compare; same-VSpace switches stay zero-cost), and a set of privileged TM_PRIV_VSPACE_* ops for taskman to drive.
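The pointer-compare fast path looks roughly like this. It is a host-side sketch: the struct fields and function names are assumptions, and write_satp_and_fence only records what a real kernel would do with a satp CSR write followed by sfence.vma. The satp field positions follow the Sv39 layout in the RISC-V privileged specification (MODE in bits 63:60, ASID in 59:44, PPN in 43:0).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape; the text only says root PT, ASID, refcount. */
struct sk_vspace_sketch { uint64_t root_ppn; uint16_t asid; };
struct thread { struct sk_vspace_sketch *td_vspace; };

static uint64_t cur_satp;
static unsigned satp_writes;

/* Stand-in for "csrw satp" followed by "sfence.vma". */
static void write_satp_and_fence(uint64_t v)
{
    cur_satp = v;
    satp_writes++;
}

/* Build an Sv39 satp value: MODE=8, then ASID, then root-table PPN. */
static uint64_t make_satp(const struct sk_vspace_sketch *vs)
{
    return (8ull << 60)
         | ((uint64_t)vs->asid << 44)
         | (vs->root_ppn & ((1ull << 44) - 1));
}

/* Called from the context switch: pay for satp only when the space changes. */
static void switch_vspace(const struct thread *from, const struct thread *to)
{
    assert(to->td_vspace != 0);
    if (from->td_vspace == to->td_vspace)
        return;                          /* same VSpace: nothing to do */
    write_satp_and_fence(make_satp(to->td_vspace));
}
```

Threads of the same process share one td_vspace pointer, so switching between them never touches the MMU state.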

The spawn flow is a small dance. taskman creates a fresh child VSpace, then loads the image — ELF segments, the dynamic loader, libc.so, stack, TCB — into its own boot page table, where direct stores for memcpy and relocations Just Work, and then hands ownership of the loaded pages to the child via TM_PRIV_VSPACE_TRANSPLANT. Transplant walks each leaf in the VA window, installs the same PA at the same VA in the child's root preserving perms, and zeros taskman's slot. The pages aren't freed — they just belong to the child now, and taskman's load VAs are free again for the next spawn. Then a thread is created against the child VSpace and runs.
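The transplant step can be modelled with a flat array of leaf entries standing in for the page-table walk. This is a deliberately simplified sketch: the toy_pte and toy_pt types, the window expressed as page indices, and the -EINVAL bounds check are all assumptions, and a real Sv39 walk has three levels.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGES 16

/* One leaf mapping: physical address, permission bits, valid flag. */
struct toy_pte { uint64_t pa; uint8_t perms; uint8_t valid; };
struct toy_pt  { struct toy_pte leaf[TOY_PAGES]; };

/* Move every valid leaf in [first, first+count) from src to dst:
 * same PA, same VA slot, same perms; then clear the source slot. */
static int transplant(struct toy_pt *src, struct toy_pt *dst,
                      size_t first, size_t count)
{
    assert(src != dst);
    if (first > TOY_PAGES || count > TOY_PAGES - first)
        return -EINVAL;
    for (size_t i = first; i < first + count; i++) {
        if (!src->leaf[i].valid)
            continue;                    /* holes in the window are fine */
        dst->leaf[i] = src->leaf[i];     /* ownership moves, page is not freed */
        memset(&src->leaf[i], 0, sizeof src->leaf[i]);
    }
    return 0;
}
```

Nothing is copied and nothing is freed: the physical page count before and after is the same, only the table that references each page changes.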

One preparatory sweep deserves a mention because it's invisible until it isn't: every kernel-side dereference of a physical address had to move from an identity-low mapping to the canonical-high mapping that every VSpace shares (the analogue of QRV's PHYS_TO_PTR sweep). While there was one address space, identity-low and high-half resolved to the same bytes and it didn't matter. The moment a per-process VSpace switch happens, only the high-half path stays valid. Doing that sweep before the spawn flow landed is the difference between a clean bring-up and a week of "why is the kernel reading user memory."
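The arithmetic behind such a sweep is small. In the sketch below the direct-map base is an assumption (the first address of the Sv39 upper half); the actual base and helper names in QSOE/N are not given in this post. The helpers return plain integers so the example runs on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed direct-map base: start of the Sv39 canonical upper half. */
#define HIGH_BASE 0xFFFFFFC000000000ull

/* Physical address to the kernel virtual address that maps it. */
static inline uint64_t phys_to_kva(uint64_t pa)
{
    assert(pa < (1ull << 38));           /* must fit inside the upper half */
    return HIGH_BASE + pa;
}

/* Inverse: kernel virtual address back to physical. */
static inline uint64_t kva_to_phys(uint64_t va)
{
    assert(va >= HIGH_BASE);
    return va - HIGH_BASE;
}

/* Sv39 upper-half addresses have bits 63:38 all set. */
static inline int is_canonical_high(uint64_t va)
{
    return (va >> 38) == 0x3FFFFFFull;
}
```

An identity-low dereference uses the physical address itself as the pointer, which only resolves while the low half of the active page table happens to map the kernel's view of memory.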

End to end: /sbin/init spawns slogger, a PCI server, a serial driver, and qsh into four distinct VSpaces.

Drivers leave the kernel

The in-kernel PCI scanner and NVMe driver from v0.4 were always meant to be temporary — bring-up convenience. Nothing about either needs privileged mode, and keeping them in the kernel cut against the whole stance: the kernel is the absolute minimum, drivers are resource managers in U-mode. So both got deleted from the kernel. PCI moved to a userland ECAM library any resmgr can link; NVMe is headed for a devb-nvme resmgr.

After that, the kernel owns exactly: harts and threads, VSpaces and Sv39 page tables, channels and per-process connection pools, the PLIC and IST glue, the FDT walker, and the SBI console seam. No device drivers. That's the line I want to hold.

The POSIX surface, and the conformance suite

With multi-process working, the rest was filling in the system-call surface and proving it against a conformance suite that runs as an ordinary user program. Highlights:

  • One error vocabulary. The old dense internal error dialect (SK_E*) was retired in favour of negated POSIX-shape errno values, Linux-style (return -EINVAL;), single-sourced in a kernel errno header that mirrors the libc ABI and is pinned by a _Static_assert so the two can't drift. This was prompted by a real debugging cost: an internal -2 travelled through the seam and strerror reported "No such file or directory" for what was actually a bad-argument rejection.
  • Threads. Real tids, ThreadJoin (the full QNX 8 error matrix — ESRCH/EINVAL/EDEADLK/EBUSY), and ThreadCancel/ThreadCtl with QNX's deferred-cancellation model: every blocking call is a cancellation point except MsgSendvnc and SyncMutexLock. A returning worker thread now becomes ThreadDestroy(self, retval) instead of jumping to ra=0 and faulting the kernel.
  • Timers. TimerCreate/Destroy/SetTime/Timeout, QNX shape, driven off the existing 1 kHz tick. Signals are pulses in QSOE, so a timer must name its pulse target: event == NULL is -EINVAL, and there is no SIGALRM default. nanosleep rides TimerTimeout directly, with no taskman round-trip, because pure time needs no resource manager.
  • Signals as pulses, end to end, including SIGCHLD fired at the parent's signal channel on child exit — sent before the waiter release so a SIGCHLD handler that calls waitpid finds the zombie in place.
  • Process lifecycle. PROC_EXIT, real WEXITSTATUS, waitpid(-1)/WNOHANG, full end-of-process teardown: owned channels die hard (in-flight senders complete with -ESRVRFAULT instead of parking forever), connection pools drain, page tables and user frames return to the pool, children reparent to pid 1. Then the housekeeping that a long soak forces out — per-process mmap VA cursors (one global never-rewound cursor was burning ~17 MiB of anonymous window per suite lap), pool telemetry, and a round-robin pid recycler so TM_MAXPID caps simultaneous processes rather than lifetime spawns.
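The errno pinning from the first bullet can be sketched as follows. The K_* names and the old SK_OLD_* values are hypothetical; the host's errno.h stands in for the libc ABI side, and the numeric values assume the usual Linux numbering.

```c
#include <assert.h>
#include <errno.h>    /* stands in for the libc ABI side of the seam */
#include <stddef.h>

/* Hypothetical kernel-side header: POSIX-shape values, single-sourced. */
#define K_ENOENT  2
#define K_EINVAL 22

/* The pin: if either side renumbers, the build breaks instead of the seam. */
_Static_assert(K_ENOENT == ENOENT, "kernel ENOENT drifted from libc");
_Static_assert(K_EINVAL == EINVAL, "kernel EINVAL drifted from libc");

/* A made-up dense dialect of the retired kind: codes numbered 0, 1, 2... */
enum { SK_OLD_OK = 0, SK_OLD_NOMEM = 1, SK_OLD_BADARG = 2 };

/* Kernel-style validation returning a negated errno. */
static int check_len(size_t len)
{
    if (len == 0)
        return -K_EINVAL;
    return 0;
}
```

The failure mode described above falls out of the numbering: -SK_OLD_BADARG is -2, which any strerror on the libc side reads as ENOENT.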

A couple of those were proper bugs. SyncSemPost had a lost wakeup: it either woke a parked waiter or incremented the count, never both, so the woken waiter's re-check saw count 0 and parked forever. That was the long-standing suite hang, only fully exposed once nanosleep really slept and the waiter genuinely parked. Fixed Mesa-style: publish the count first, then wake. And the sync objects were keyed by raw user VA, which was fine with one address space and a latent disaster with many: two processes parking on the same address (likely, since identical binaries lay out identically) shared one wait queue and one owner. The key grew an address-space component, with a teardown sweep that releases every sync object of a dying space.
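The lost wakeup is easy to reproduce in a single-threaded model. The sketch below is not the kernel code: toy_sem, its counters, and waiter_recheck are assumptions that model only the ordering of "publish the count" and "wake", with no real threads involved.

```c
#include <assert.h>

/* Toy semaphore: a count, parked waiters, and wakeups not yet consumed. */
struct toy_sem { unsigned count; unsigned parked; unsigned wakeups; };

/* Buggy post: wake OR increment, never both. */
static void post_buggy(struct toy_sem *s)
{
    if (s->parked)
        s->wakeups++;
    else
        s->count++;
}

/* Mesa-style post: publish the count first, then wake a waiter if any. */
static void post_fixed(struct toy_sem *s)
{
    s->count++;
    if (s->parked)
        s->wakeups++;
}

/* A woken waiter re-checks the condition, as Mesa semantics require.
 * Returns 1 if it acquired the semaphore, 0 if it stays (or goes back) parked. */
static int waiter_recheck(struct toy_sem *s)
{
    assert(s->parked > 0);
    if (s->wakeups == 0)
        return 0;                        /* never woken */
    s->wakeups--;
    if (s->count == 0)
        return 0;                        /* woken, saw nothing, parks again */
    s->count--;
    s->parked--;
    return 1;
}
```

Under Mesa semantics a wakeup is only a hint to re-check, so the state the waiter re-checks has to be updated before the hint is delivered.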

When the last unimplemented syscall (SchedYield, which is exactly what lwkt_yield already did) got wired, the conformance suite completed end to end for the first time: 117 tests, 104 passing, 13 failing, zero panics, consistent across 25 consecutive runs on 8 harts. The remaining failures clustered into named work packages and got knocked down over the following days; the suite now soaks for 20-plus laps in a single boot with a flat page-pool floor.

A note on message size

One design choice runs through all of this and is worth pulling out, because it's where the QSOE/N and QSOE/L stories rhyme. A QNX-shape kernel does synchronous IPC; the framing I follow (Gernot Heiser's, on seL4) is that IPC is a protected procedure call for control flow, and bulk data should move out of band. seL4 takes that to a fixed, small message buffer — small enough that Andrew Warkentin forked it for UX/RT partly over the constraint. Skimmer's inline cap is 4 KiB. Past that, QSOE/N copies the payload page-by-page through the other side's page-table physical addresses — no aliased satp window, preemption-safe, no channel lock held — sound only because the sender is REPLY-parked for the whole rendezvous and Skimmer has no demand paging. A 16 MiB ceiling fails runaway lengths fast. It's the pragmatic middle: small messages stay a fast synchronous copy, large ones don't bounce through a scratch buffer, and neither needs shared-memory setup for the common case.
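The size policy and the page-by-page chunking can be sketched like this. The constants come from the text (4 KiB inline cap, 16 MiB ceiling); treating both limits as inclusive, the -EMSGSIZE error code, and all function names are assumptions. The sketch only counts chunks; in a kernel each chunk would be one copy between two translated physical pages.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define INLINE_CAP    4096u
#define MSG_CEILING   (16u << 20)

enum { PATH_INLINE = 0, PATH_PAGEWISE = 1 };

/* Pick the copy path for a message, or fail a runaway length fast. */
static int msg_path(size_t len)
{
    if (len > MSG_CEILING)
        return -EMSGSIZE;                /* assumed error code */
    return len <= INLINE_CAP ? PATH_INLINE : PATH_PAGEWISE;
}

/* Largest step that crosses a page boundary on neither side. */
static size_t next_chunk(uint64_t src_va, uint64_t dst_va, size_t remaining)
{
    size_t s = TOY_PAGE_SIZE - (size_t)(src_va & (TOY_PAGE_SIZE - 1));
    size_t d = TOY_PAGE_SIZE - (size_t)(dst_va & (TOY_PAGE_SIZE - 1));
    size_t n = s < d ? s : d;
    return n < remaining ? n : remaining;
}

/* Number of per-page copies needed to move len bytes. */
static unsigned count_chunks(uint64_t src_va, uint64_t dst_va, size_t len)
{
    unsigned chunks = 0;
    while (len) {
        size_t n = next_chunk(src_va, dst_va, len);
        assert(n > 0);
        src_va += n;
        dst_va += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

When source and destination have different offsets within their pages, the chunk count goes up because every boundary on either side ends a chunk.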

Where that leaves it

QSOE/N now boots to a shell, spawns and reaps processes, runs userland drivers as resource managers, and passes a POSIX conformance suite on repeat. It is an operating system. But every line of it has run only on QEMU, which is generous about a class of hardware behaviour it doesn't model at all.

The HiFive Unmatched on my desk is not generous. That's part 3.
