QSOE/N, part 2: from a kernel to an operating system
Part 1 ended with Skimmer at v0.5 — a sound microkernel with QNX-shape synchronous IPC, and nothing running on top of it. This post is the rest: how Skimmer became QSOE/N, an actual operating system with processes, a C library, drivers, and a conformance suite that passes a full lap. All of it still in QEMU; real hardware is part 3.
The umbrella
QSOE — "Quick and Secure Operating Environment" — is a QNX-Neutrino-style operating environment for 64-bit RISC-V, built entirely under Apache-2.0. It comes in two variants that share almost everything:
- QSOE/N (project NQ), the native-kernel variant, running on Skimmer.
- QSOE/L (project LQ), running the same userspace on the seL4 microkernel.
The split is deliberately thin. The kernel differs, the system process (taskman) differs, and a small OS-dependent slice of libc differs. Everything else — the C library body, the drivers, the utilities — is shared between the two systems. That constraint shaped a lot of the decisions below, because a wire format or an API choice that only made sense on Skimmer would have broken the shared half.
On May 30 I migrated Skimmer into the umbrella as QSOE/N. The repository was restructured into the layers it still has: include/skimmer/ (the kernel headers), kernel/lwkt/ (the DragonFly-inspired substrate — the heart), kernel/arch/riscv/ (the only place that names "riscv", Sv39, or SBI), kernel/ (arch-neutral top level), taskman/, the libc seam, and the test suite. The principle the layering encodes: per-hart state, no cross-CPU locks on the hot path, cross-hart work flows as messages. A future port — x86-64 is the obvious one — lands as a sibling kernel/arch/ directory and nothing above it should care.
taskman — the procnto analogue
taskman is QSOE/N's answer to QNX's procnto: a single statically-linked U-mode binary, loaded from the boot CPIO archive by the kernel and spawned before any other user thread. It owns the well-known channel (TASKMAN_CHID, encoded (gen=1, idx=1) = 65537) and runs a single-threaded MsgReceive → dispatch → MsgReply loop — the same shape the QSOE/L taskman runs on seL4. It owns process creation, the path namespace, memory mapping, credentials, and waitpid; it is not in the data path for ordinary file I/O, which matters a lot in the next section.
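The channel-id arithmetic is easy to check by hand. Here is a minimal sketch, assuming the id packs a generation above a 16-bit slot index, which is the layout that (gen=1, idx=1) = 65537 implies. The helper names and the 16-bit split are assumptions for illustration, not Skimmer's actual header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: generation in the bits above a 16-bit slot index.
 * Inferred only from (gen=1, idx=1) == 65537; the real header may differ. */
#define CHID_IDX_BITS 16u
#define CHID_IDX_MASK ((1u << CHID_IDX_BITS) - 1u)

/* Pack a (generation, index) pair into one channel id. */
static inline uint32_t chid_encode(uint32_t gen, uint32_t idx)
{
    assert(idx <= CHID_IDX_MASK);   /* index must fit in the low field */
    return (gen << CHID_IDX_BITS) | idx;
}

/* Unpack the two fields again. */
static inline uint32_t chid_gen(uint32_t chid) { return chid >> CHID_IDX_BITS; }
static inline uint32_t chid_idx(uint32_t chid) { return chid & CHID_IDX_MASK; }
```

With this layout, chid_encode(1, 1) gives 65537, matching the value quoted for TASKMAN_CHID.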
There is, by design, no SYS_SPAWN syscall. Ever. Process creation is taskman's TM_REQ_SPAWN, which composes kernel primitives: create a VSpace, map zero or more regions into it, create a thread in it. The kernel provides mechanism; taskman provides the policy that adds up to a process.
fd === coid: the resource-manager wire
This is the architectural centre of QSOE/N, and the thing I'm most pleased with.
In QSOE/N a POSIX file descriptor is a kernel connection id. coid is the slot index in the calling process's connection pool; the integer you get back from open() is that coid handed straight back as the fd. Every per-fd operation is then a single MsgSend directly to the channel that owns the connection:
- open() canonicalises the path, asks pathmgr once for (server_pid, server_chid), ConnectAttaches, and sends _IO_CONNECT so the server can build its per-coid OCB. The returned coid is the fd.
- read()/write()/fstat()/lseek()/close() are each one MsgSend(fd, _IO_*) straight to the resource manager.
taskman is no longer in the read/write/close/fstat/lseek path at all. It answers the structural calls — open, spawn, waitpid, mmap, pathmgr registration — and gets out of the way for the bytes. dup is ConnectServerInfo (QNX shape — it resolves a coid to its (nd, pid, chid, scoid) tuple) followed by a fresh ConnectAttach at the requested slot. fcntl(F_GETFD/F_SETFD) goes to a kernel syscall directly. The whole universal-forwarder scaffolding that an earlier draft of taskman carried — forward_io_read, the per-process fd table, the handle_fcntl/handle_dup opcodes — got deleted once this landed, because nothing routed through it anymore.
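To make the "fd is the coid" shape concrete, here is a rough sketch of a read() that is nothing but one send on the descriptor. Everything prefixed demo_ is hypothetical: the request header, the request code, and demo_msg_send, which stands in for the kernel's MsgSend and is answered by an in-process fake server rather than a real resource manager.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request code and header; not the real QSOE wire. */
enum { DEMO_IO_READ = 1 };
struct demo_io_hdr { int type; size_t nbytes; };

/* Fake "server": answers reads from a fixed buffer. The offset is the
 * kind of per-connection state a real server would keep in its OCB. */
static const char demo_file[] = "hello";
static size_t demo_offset;

/* Stand-in for MsgSend: returns bytes replied, or a negated errno. */
static long demo_msg_send(int coid, const void *smsg, size_t sbytes,
                          void *rmsg, size_t rbytes)
{
    const struct demo_io_hdr *h = smsg;
    if (coid < 0 || sbytes < sizeof *h || h->type != DEMO_IO_READ)
        return -EINVAL;
    size_t left = sizeof demo_file - 1 - demo_offset;
    size_t n = h->nbytes < left ? h->nbytes : left;
    if (n > rbytes)
        n = rbytes;
    memcpy(rmsg, demo_file + demo_offset, n);
    demo_offset += n;
    return (long)n;
}

/* read(): the fd is the coid, so it is passed straight through.
 * No fd table lookup and no forwarding process in between. */
static long demo_read(int fd, void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    struct demo_io_hdr h = { DEMO_IO_READ, len };
    return demo_msg_send(fd, &h, sizeof h, buf, len);
}
```

The point of the sketch is what is absent: there is no translation step between the integer the caller holds and the connection the message travels on.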
The wire itself is unified to a single qsoe_ipcbuf layout shared with QSOE/L: a tag in the first eight bytes, four scalar words, then variable payload. The typed per-request structs went away. Making the wire the same shape on both kernels is what keeps the shared-userspace promise honest.
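As an illustration of that layout, a sketch of what such a buffer could look like in C. Only the ordering (an eight-byte tag, four scalar words, then payload) comes from the description above; the field names, the 64-bit scalar width, and the struct name demo_ipcbuf are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names and the 64-bit scalar width are assumptions. */
struct demo_ipcbuf {
    uint64_t tag;         /* first eight bytes: request/reply tag */
    uint64_t scalar[4];   /* four scalar words (width assumed) */
    uint8_t  payload[];   /* variable-length payload follows */
};

/* Pin the offsets so both sides of the wire agree at compile time. */
_Static_assert(offsetof(struct demo_ipcbuf, tag) == 0,
               "tag leads the buffer");
_Static_assert(offsetof(struct demo_ipcbuf, scalar) == 8,
               "scalars follow the 8-byte tag");
_Static_assert(offsetof(struct demo_ipcbuf, payload) == 40,
               "payload starts after four 64-bit words");

/* Total bytes on the wire for a message carrying payload_len bytes. */
static inline size_t demo_ipcbuf_size(size_t payload_len)
{
    return offsetof(struct demo_ipcbuf, payload) + payload_len;
}
```

Pinning offsets with static assertions is one way to keep two kernels honest about a shared wire; whether QSOE does exactly this for the ipcbuf is not stated here.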
Getting ConnectServerInfo right was a real bug, not a refactor. The QNX 8 contract is that it returns the matched coid (a miss scans upward for the next-higher live connection), and info.pid carries the channel owner's real pid. libc's F_DUPFD feeds that (pid, chid) straight back into ConnectAttach. While the kernel still gated ConnectAttach on "pid must be 0", the now-real pid made every dup fail — which surfaced as qsh being unable to open /sbin/init. The fix was to teach ConnectAttach the QNX (pid, chid) addressing it always should have had.
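The lookup rule described above (return the matched coid, and on a miss scan upward to the next live connection) can be sketched over a toy connection pool. The pool layout, the demo_ names, and the -1 "nothing found" return are stand-ins, not the kernel's real data structures or error convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy connection pool slot and the info tuple handed back to the caller. */
struct demo_conn { bool live; int pid; int chid; };
struct demo_info { int pid; int chid; };

/* Returns the coid actually matched: `coid` itself if that slot is live,
 * otherwise the next-higher live slot. Returns -1 if none exists. */
static int demo_connect_server_info(const struct demo_conn *pool, size_t n,
                                    int coid, struct demo_info *info)
{
    assert(info != NULL);
    if (coid < 0)
        return -1;
    for (size_t i = (size_t)coid; i < n; i++) {
        if (pool[i].live) {
            info->pid  = pool[i].pid;    /* channel owner's real pid */
            info->chid = pool[i].chid;
            return (int)i;
        }
    }
    return -1;
}
```

Under this rule a caller such as dup has to compare the returned coid with the one it asked about, because a different value means the requested slot was not live.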
Multi-process: per-process VSpaces and spawn
For all of v0.1–v0.5 there was one address space. Real processes need their own. The kernel gained a per-process VSpace allocator (struct sk_vspace — root PT, ASID, refcount), td_vspace plumbed through lwkt_switch so satp + sfence land on any context switch that crosses VSpaces (a plain pointer compare; same-VSpace switches stay zero-cost), and a set of privileged TM_PRIV_VSPACE_* ops for taskman to drive.
The spawn flow is a small dance. taskman creates a fresh child VSpace, then loads the image — ELF segments, the dynamic loader, libc.so, stack, TCB — into its own boot page table, where direct stores for memcpy and relocations Just Work, and then hands ownership of the loaded pages to the child via TM_PRIV_VSPACE_TRANSPLANT. Transplant walks each leaf in the VA window, installs the same PA at the same VA in the child's root preserving perms, and zeros taskman's slot. The pages aren't freed — they just belong to the child now, and taskman's load VAs are free again for the next spawn. Then a thread is created against the child VSpace and runs.
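The transplant step can be modelled with a flat toy page table. This is a sketch of the idea only: a real Sv39 walk has three levels, and the demo_ types, the slot count, and the "pa == 0 means unmapped" convention are all assumptions made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOTS 8   /* toy page table: one leaf per virtual page */

struct demo_pte { uint64_t pa; unsigned perms; };   /* pa == 0: unmapped */
struct demo_pt  { struct demo_pte leaf[DEMO_SLOTS]; };

/* Move every mapped leaf in [first, first + count) from src to dst,
 * keeping the VA slot, the PA and the perms, and clearing the source
 * slot. Nothing is freed. Returns the number of pages moved. */
static size_t demo_transplant(struct demo_pt *src, struct demo_pt *dst,
                              size_t first, size_t count)
{
    size_t moved = 0;
    assert(first <= DEMO_SLOTS && count <= DEMO_SLOTS - first);
    for (size_t i = first; i < first + count; i++) {
        if (src->leaf[i].pa == 0)
            continue;                           /* hole in the window */
        dst->leaf[i] = src->leaf[i];            /* same PA, same VA, same perms */
        src->leaf[i] = (struct demo_pte){ 0, 0 };   /* loader's slot is free again */
        moved++;
    }
    return moved;
}
```

After the call the loader's window is empty and ready for the next spawn, while the child sees the image at the addresses it was linked for.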
One preparatory sweep deserves a mention because it's invisible until it isn't: every kernel-side dereference of a physical address had to move from an identity-low mapping to the canonical-high mapping that every VSpace shares (the analogue of QRV's PHYS_TO_PTR sweep). While there was one address space, identity-low and high-half resolved to the same bytes and it didn't matter. The moment a per-process VSpace switch happens, only the high-half path stays valid. Doing that sweep before the spawn flow landed is the difference between a clean bring-up and a week of "why is the kernel reading user memory."
End to end: /sbin/init spawns slogger, a PCI server, a serial driver, and qsh into four distinct VSpaces.
Drivers leave the kernel
The in-kernel PCI scanner and NVMe driver from v0.4 were always meant to be temporary — bring-up convenience. Nothing about either needs privileged mode, and keeping them in the kernel cut against the whole stance: the kernel is the absolute minimum, drivers are resource managers in U-mode. So both got deleted from the kernel. PCI moved to a userland ECAM library any resmgr can link; NVMe is headed for a devb-nvme resmgr.
After that, the kernel owns exactly: harts and threads, VSpaces and Sv39 page tables, channels and per-process connection pools, the PLIC and IST glue, the FDT walker, and the SBI console seam. No device drivers. That's the line I want to hold.
The POSIX surface, and the conformance suite
With multi-process working, the rest was filling in the system-call surface and proving it against a conformance suite that runs as an ordinary user program. Highlights:
- One error vocabulary. The old dense internal error dialect (SK_E*) was retired in favour of negated POSIX-shape errno values, Linux-style (return -EINVAL;), single-sourced in a kernel errno header that mirrors the libc ABI and is pinned by a _Static_assert so the two can't drift. This was prompted by a real debugging cost: an internal -2 traveled through the seam and strerror reported "No such file or directory" for what was actually a bad-argument rejection.
- Threads. Real tids, ThreadJoin (the full QNX 8 error matrix: ESRCH/EINVAL/EDEADLK/EBUSY), and ThreadCancel/ThreadCtl with QNX's deferred-cancellation model: every blocking call is a cancellation point except MsgSendvnc and SyncMutexLock. A returning worker thread now becomes ThreadDestroy(self, retval) instead of jumping to ra=0 and faulting the kernel.
- Timers. TimerCreate/Destroy/SetTime/Timeout, QNX shape, driven off the existing 1 kHz tick. Signals are pulses in QSOE, so a timer must name its pulse target: event == NULL is -EINVAL, there is no SIGALRM default. nanosleep rides TimerTimeout directly, with no taskman round-trip, because pure time needs no resource manager.
- Signals as pulses, end to end, including SIGCHLD fired at the parent's signal channel on child exit, sent before the waiter release so a SIGCHLD handler that calls waitpid finds the zombie in place.
- Process lifecycle. PROC_EXIT, real WEXITSTATUS, waitpid(-1)/WNOHANG, full end-of-process teardown: owned channels die hard (in-flight senders complete with -ESRVRFAULT instead of parking forever), connection pools drain, page tables and user frames return to the pool, children reparent to pid 1. Then the housekeeping that a long soak forces out: per-process mmap VA cursors (one global never-rewound cursor was burning ~17 MiB of anonymous window per suite lap), pool telemetry, and a round-robin pid recycler so TM_MAXPID caps simultaneous processes rather than lifetime spawns.
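The error-vocabulary item in the list above is the easiest one to show in code. This is a sketch of the pattern, not the actual kernel header: the DEMO_K_ constants and both functions are hypothetical, and the numeric values are the ones a Linux-style libc uses for ENOENT and EINVAL.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical kernel-side errno constants (values as on Linux-style libcs). */
#define DEMO_K_ENOENT 2
#define DEMO_K_EINVAL 22

/* Pin the kernel's numbers to libc's so the two cannot drift apart. */
_Static_assert(DEMO_K_ENOENT == ENOENT, "kernel ENOENT drifted from libc");
_Static_assert(DEMO_K_EINVAL == EINVAL, "kernel EINVAL drifted from libc");

/* Kernel side: report failure as a negated errno value. */
static long demo_sys_check(int arg)
{
    if (arg < 0)
        return -DEMO_K_EINVAL;   /* a bad argument is EINVAL, never a stray -2 */
    return 0;
}

/* libc seam: turn a negative return into errno plus -1. */
static int demo_libc_wrap(long kret)
{
    if (kret < 0) {
        errno = (int)-kret;
        return -1;
    }
    return (int)kret;
}
```

With one vocabulary on both sides, the seam needs no translation table, which is exactly where the old -2 got misread.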
A couple of those were proper bugs. SyncSemPost had a lost wakeup — it either woke a parked waiter or incremented the count, never both, so the woken waiter's re-check saw count 0 and parked forever. It was the long-standing suite hang, only fully exposed once nanosleep really slept and the waiter genuinely parked. Fixed Mesa-style: publish the count first, then wake. And the sync objects were keyed by raw user VA, which was fine with one address space and a latent disaster with many — two processes parking on the same address (likely, since identical binaries lay out identically) shared one wait queue and one owner. The key grew an address-space component, with a teardown sweep that releases every sync object of a dying space.
When the last unimplemented syscall (SchedYield, which is exactly what lwkt_yield already did) got wired, the conformance suite completed end to end for the first time: 117 tests, 104 passing, 13 failing, zero panics, consistent across 25 consecutive runs on 8 harts. The remaining failures clustered into named work packages and got knocked down over the following days; the suite now soaks for 20-plus laps in a single boot with a flat page-pool floor.
A note on message size
One design choice runs through all of this and is worth pulling out, because it's where the QSOE/N and QSOE/L stories rhyme. A QNX-shape kernel does synchronous IPC; the framing I follow (Gernot Heiser's, on seL4) is that IPC is a protected procedure call for control flow, and bulk data should move out of band. seL4 takes that to a fixed, small message buffer — small enough that Andrew Warkentin forked it for UX/RT partly over the constraint. Skimmer's inline cap is 4 KiB. Past that, QSOE/N copies the payload page-by-page through the other side's page-table physical addresses — no aliased satp window, preemption-safe, no channel lock held — sound only because the sender is REPLY-parked for the whole rendezvous and Skimmer has no demand paging. A 16 MiB ceiling fails runaway lengths fast. It's the pragmatic middle: small messages stay a fast synchronous copy, large ones don't bounce through a scratch buffer, and neither needs shared-memory setup for the common case.
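A sketch of the size policy, under stated assumptions. The 4 KiB inline cap and the 16 MiB ceiling come from the paragraph above; the 4 KiB page size, the -EMSGSIZE error code, and the demo_ function names are mine. The planner only counts the chunks a page-by-page copy would need, splitting wherever either side crosses a page boundary; it does not touch page tables.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_INLINE_CAP  (4u * 1024u)            /* inline cap, from the text */
#define DEMO_MSG_CEILING (16u * 1024u * 1024u)   /* hard ceiling, from the text */
#define DEMO_PAGE_SIZE   4096u                   /* assumed page size */

/* Small messages take the plain synchronous copy. */
static bool demo_is_inline(size_t len)
{
    return len <= DEMO_INLINE_CAP;
}

/* Largest run that stays inside one source page and one destination page. */
static size_t demo_chunk(size_t src_va, size_t dst_va, size_t remaining)
{
    size_t s = DEMO_PAGE_SIZE - (src_va % DEMO_PAGE_SIZE);
    size_t d = DEMO_PAGE_SIZE - (dst_va % DEMO_PAGE_SIZE);
    size_t n = s < d ? s : d;
    return n < remaining ? n : remaining;
}

/* Number of per-page copies for a large message, or a negated errno
 * when the length is past the ceiling (error code is an assumption). */
static long demo_plan_copy(size_t src_va, size_t dst_va, size_t len)
{
    long chunks = 0;
    if (len > DEMO_MSG_CEILING)
        return -EMSGSIZE;        /* fail runaway lengths before copying */
    while (len > 0) {
        size_t n = demo_chunk(src_va, dst_va, len);
        src_va += n;
        dst_va += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

Each chunk corresponds to one physical page on each side, which is why the real copy needs the sender to stay parked and the pages to stay resident for the duration.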
Where that leaves it
QSOE/N now boots to a shell, spawns and reaps processes, runs userland drivers as resource managers, and passes a POSIX conformance suite on repeat. It is an operating system. But every line of it has so far run only on QEMU, which is generous about a class of hardware behaviour it doesn't model at all.
The HiFive Unmatched on my desk is not generous. That's part 3.