QSOE/N, part 1: proving the core (Skimmer v0.1–v0.5)

This is the first of three posts tracing QSOE/N from its first commit to its first boot on real hardware. This one is about the kernel alone — Skimmer — before there was any userspace to speak of. The whole point of v0.1 through v0.5 was to answer one question: is the core sound? Everything after it assumes the answer is yes.

I built all of it alongside Claude Code (Opus 4.7 for this stretch, 4.8 later). The commits carry the co-author line; I'm not going to repeat it every paragraph, but the race hunts below were genuinely a two-party effort.

Where the design comes from

Skimmer is a from-scratch microkernel for 64-bit RISC-V, Apache-2.0, written to a QNX shape. The concurrency model is not QNX's, though — it's DragonFly BSD's. The two ideas I took are Light Weight Kernel Threads (LWKT) and message ports: per-CPU run queues with no cross-CPU locks on the hot path, and cross-CPU work expressed as messages routed through per-CPU rings rather than as foreign writes into another core's state. That maps almost too neatly onto a QNX-style synchronous-IPC kernel, where the natural unit of work already is a message.

The other input is QRV — my three-month QNX-to-RISC-V port that wound down on May 26. QRV taught me where the bodies are buried on this architecture, and several Skimmer commits cite QRV commit hashes directly: the trap-entry sscratch design, the "clear SIE before restoring sstatus" ordering, the per-CPU lockless trace ring, the global queue-lock wart I decided to fix up front this time. Skimmer is not QRV — different kernel, different lineage — but it inherits QRV's scar tissue.

On licensing: the early lwkt_* files I studied from DragonFly were deleted before v0.1 and retyped fresh against my own headers in include/skimmer/. The design shows through; the code is mine, under Apache-2.0. Any future DragonFly imports will keep their BSD-3-Clause headers.

v0.1 — the core, and nothing else

The first release boots in S-mode under OpenSBI on QEMU virt and does deliberately almost nothing: no userspace, no paging, no trap path beyond a panic-spin. Eight harts come up via SBI HSM; each gets a per-hart LWKT scheduler and an sp-based cpu_switch. Thread-port message ports work (lwkt_sendmsg / lwkt_domsg / lwkt_replymsg), and cross-hart sends route through polled single-producer/single-consumer IPIQ rings. The demo is one server LWKT plus seven clients on separate harts doing fourteen synchronous rendezvous round-trips, all err=0.
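To make the ring idea concrete, here is a minimal single-producer/single-consumer ring in C11 atomics. It is a sketch under stated assumptions, not Skimmer's IPIQ: the names ipiq_push, ipiq_drain and ipiq_count, the slot count, and the message layout are all invented for illustration. The single-producer/single-consumer property only holds if each ring has exactly one sending hart and one receiving hart.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPIQ_SLOTS 16u   /* hypothetical size; must be a power of two */
_Static_assert((IPIQ_SLOTS & (IPIQ_SLOTS - 1)) == 0, "IPIQ_SLOTS must be 2^n");

struct ipiq_msg {            /* hypothetical payload: a function to run remotely */
    void (*fn)(void *);
    void *arg;
};

struct ipiq {
    _Atomic uint32_t widx;   /* advanced only by the producer hart */
    _Atomic uint32_t ridx;   /* advanced only by the consumer hart */
    struct ipiq_msg slot[IPIQ_SLOTS];
};

/* Producer side. Returns false when the ring is full. */
static bool ipiq_push(struct ipiq *q, struct ipiq_msg m)
{
    assert(m.fn != NULL);
    uint32_t w = atomic_load_explicit(&q->widx, memory_order_relaxed);
    uint32_t r = atomic_load_explicit(&q->ridx, memory_order_acquire);
    if (w - r == IPIQ_SLOTS)
        return false;
    q->slot[w & (IPIQ_SLOTS - 1)] = m;
    /* Release: the slot contents become visible before the new index. */
    atomic_store_explicit(&q->widx, w + 1, memory_order_release);
    return true;
}

/* Consumer side. Runs every pending message, returns how many ran. */
static unsigned ipiq_drain(struct ipiq *q)
{
    unsigned n = 0;
    uint32_t r = atomic_load_explicit(&q->ridx, memory_order_relaxed);
    uint32_t w = atomic_load_explicit(&q->widx, memory_order_acquire);
    while (r != w) {
        struct ipiq_msg m = q->slot[r & (IPIQ_SLOTS - 1)];  /* copy out first */
        atomic_store_explicit(&q->ridx, ++r, memory_order_release);
        m.fn(m.arg);
        n++;
    }
    return n;
}

/* Demo callback: counts how many messages were delivered. */
static void ipiq_count(void *arg)
{
    ++*(unsigned *)arg;
}
```

In a polled design like v0.1's, something on the receiving hart has to call the drain routine periodically; with a real IPI the interrupt handler would call it instead.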

That's about 1700 lines across 25 files. No real IPI yet (the IPIQ is polled), cooperative scheduling within a priority band, modern SBI only. The point was to prove that the LWKT + msgport substrate works on RISC-V before adding Sv39, an ELF loader, syscalls, and personas on top. It did.

v0.2 — the trap path, and five SMP races

v0.2 is where it got real: full S-mode trap entry/exit, strict-priority preemption, real RISC-V IPI (SSIP, retiring the polled drainer), and a stress workload called msgstorm — 8 servers × 8 clients × 128 rounds, 8192 synchronous round-trips.

It started at 0 passes out of 20.

The critical fix, which took it to 19/20, was this: every run-queue mutation (the local lwkt_schedule path, lwkt_yield, lwkt_deschedule_self, lwkt_exit, lwkt_setpri_self) has to wrap its TAILQ insert/remove in local_irq_save / local_irq_restore. Critical sections alone are not enough. A critical section defers the preempt-switch, but the IPI handler still runs, and it calls runq_insert via lwkt_schedule, so it can splice the list while the interrupted path is halfway through its own insert or remove. The corruption manifested as a cyclic TAILQ ("watcher → serv3 → watcher → serv3 → …"), which silently drops every other entry and starves the high-priority threads. The DragonFly-style critcount discipline in lwkt_switch (every thread crosses cpu_switch with critcount >= 1) closes the related preempt-during-switch class.
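The shape of that bracket can be sketched on a host machine. This is a model, not the kernel code: model_sie stands in for the sstatus.SIE bit, the signatures of local_irq_save / local_irq_restore are my assumption, and lwkt_schedule_local and lwkt_pick_next are hypothetical names. The assert in the raw mutators is the useful part: it turns "someone touched the run queue with interrupts on" into an immediate failure instead of a cyclic list found hours later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Host-side stand-in for sstatus.SIE; a real kernel would use CSR accesses. */
static bool model_sie = true;

typedef bool irq_flags_t;

static irq_flags_t local_irq_save(void)
{
    irq_flags_t was = model_sie;
    model_sie = false;           /* interrupts off from here */
    return was;
}

static void local_irq_restore(irq_flags_t was)
{
    model_sie = was;             /* back to whatever the caller had */
}

struct thread {
    int pri;
    TAILQ_ENTRY(thread) link;
};
TAILQ_HEAD(runq, thread);

/* Raw mutators: only legal with interrupts off, and they check it. */
static void runq_insert(struct runq *rq, struct thread *td)
{
    assert(!model_sie);
    TAILQ_INSERT_TAIL(rq, td, link);
}

static void runq_remove(struct runq *rq, struct thread *td)
{
    assert(!model_sie);
    TAILQ_REMOVE(rq, td, link);
}

/* Every public path brackets the mutation, so an IPI handler that also
 * schedules a thread can never observe a half-linked list. */
static void lwkt_schedule_local(struct runq *rq, struct thread *td)
{
    irq_flags_t f = local_irq_save();
    runq_insert(rq, td);
    local_irq_restore(f);
}

static struct thread *lwkt_pick_next(struct runq *rq)
{
    irq_flags_t f = local_irq_save();
    struct thread *td = TAILQ_FIRST(rq);
    if (td != NULL)
        runq_remove(rq, td);
    local_irq_restore(f);
    return td;
}
```

Save/restore rather than disable/enable matters here: a caller that already has interrupts off gets them back off, not unexpectedly on.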

The methodology lessons from that week were harder to internalise than the fix, and they've held for everything since:

  • Never print from hot paths. The original storm sprayed a status line from every client on every round-trip. The resulting sbi_console_lock contention was the deadlock for many of the failing runs — the bug I was hunting and the instrumentation I was hunting it with were the same bug. All hot-path prints came out; diagnostics go to a per-CPU lockless trace ring, dumped only at quiescent points.
  • Change one variable at a time. Several parallel "fixes" in flight made it impossible to attribute any pass/fail change to any single edit. I reverted partial work more than once to recover a clean baseline.
  • Use gdb to look, not to infer. Two minutes with a sk-runq helper showed the cyclic list directly. Hours of reading code had not. The gdb helpers (sk-state, sk-pcpu, sk-runq, sk-thread, sk-pool, sk-storm) earned their place in the tree that week.

After that: 200/200, zero panics, zero hangs.

v0.3 — Sv39, U-mode, and the channel skeleton

v0.3 adds paging and the first user threads. A single master Sv39 page table for the whole system: kernel identity-mapped at 2 MiB megapages, the user image at 4 KiB pages with U=1, ASID=1 (never 0). The U-mode trap path extends the entry/exit with an sscratch-swap to a per-thread kernel stack and a unified return that picks sret-to-U vs sret-to-S from the saved SSTATUS.SPP.
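For readers who have not stared at these registers recently, the encodings involved look roughly like this. The bit positions are from the RISC-V privileged specification; the helper names (make_satp, make_leaf_pte, returns_to_user) are hypothetical and this is a sketch, not Skimmer's code. It leaves out the A and D bits, which a real kernel has to handle depending on how the implementation treats them.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_SV39  8ull              /* satp.MODE value selecting Sv39 */
#define SATP_MODE_SHIFT 60
#define SATP_ASID_SHIFT 44
#define SATP_PPN_MASK   ((1ull << 44) - 1)

#define SSTATUS_SPP     (1ull << 8)       /* previous privilege: 1 = S, 0 = U */

#define PTE_V (1ull << 0)
#define PTE_R (1ull << 1)
#define PTE_W (1ull << 2)
#define PTE_X (1ull << 3)
#define PTE_U (1ull << 4)                 /* page accessible from U-mode */

/* Build a satp value from the physical address of the root table. */
static uint64_t make_satp(uint64_t root_pa, uint16_t asid)
{
    assert((root_pa & 0xfff) == 0);       /* root table is page-aligned */
    return (SATP_MODE_SV39 << SATP_MODE_SHIFT) |
           ((uint64_t)asid << SATP_ASID_SHIFT) |
           ((root_pa >> 12) & SATP_PPN_MASK);
}

/* Build a leaf PTE: the PPN lives in bits 53:10, permissions in the low bits. */
static uint64_t make_leaf_pte(uint64_t pa, uint64_t perms)
{
    assert((pa & 0xfff) == 0);
    return ((pa >> 12) << 10) | perms | PTE_V;
}

/* The unified return path only needs the saved SPP bit to pick its target. */
static int returns_to_user(uint64_t saved_sstatus)
{
    return (saved_sstatus & SSTATUS_SPP) == 0;
}
```

With a root table at 0x80200000 and ASID 1, make_satp yields 0x8000100000080200.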

Four syscalls land — THREAD_CREATE / THREAD_DESTROY / CHANNEL_CREATE / CHANNEL_DESTROY — and the channel skeleton goes in: three rendezvous queues (send / receive / reply) declared as dormant TAILQ heads, ready for MsgSend in v0.4, with a per-channel lock from day one (QRV used one global queue-lock; I fixed that on the first pass here). The chid is a generation-counter encoding so a stale or recycled id mismatches instead of aliasing the next occupant.
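A generation-counter id can be sketched in a few lines. The 16/16 bit split, the table size, and every function name here (chid_encode, chid_lookup, channel_create_at, channel_destroy) are assumptions for illustration; the source only says that the chid carries a generation counter. Locking is left out of the sketch entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHID_INDEX_BITS 16u                       /* hypothetical split */
#define CHID_INDEX_MASK ((1u << CHID_INDEX_BITS) - 1)
#define NCHAN           64u                       /* hypothetical table size */

struct channel {
    uint16_t gen;      /* bumped every time the slot is released */
    int      in_use;
};

static struct channel chan_table[NCHAN];

/* Pack slot index and generation into one id handed to userspace. */
static uint32_t chid_encode(uint32_t index, uint16_t gen)
{
    assert(index <= CHID_INDEX_MASK);
    return ((uint32_t)gen << CHID_INDEX_BITS) | index;
}

/* Resolve an id. A stale id fails here because its generation is old. */
static struct channel *chid_lookup(uint32_t chid)
{
    uint32_t index = chid & CHID_INDEX_MASK;
    uint16_t gen = (uint16_t)(chid >> CHID_INDEX_BITS);

    if (index >= NCHAN)
        return NULL;
    struct channel *ch = &chan_table[index];
    if (!ch->in_use || ch->gen != gen)
        return NULL;
    return ch;
}

/* Claim a specific slot (a real allocator would search for a free one). */
static uint32_t channel_create_at(uint32_t index)
{
    struct channel *ch = &chan_table[index];
    ch->in_use = 1;
    return chid_encode(index, ch->gen);
}

/* Release the slot and invalidate every id that was ever issued for it. */
static void channel_destroy(uint32_t chid)
{
    struct channel *ch = chid_lookup(chid);
    if (ch != NULL) {
        ch->in_use = 0;
        ch->gen++;
    }
}
```

The property that matters: after a destroy and a re-create of the same slot, the old id resolves to nothing rather than to the new occupant. A 16-bit generation does wrap eventually, so the guarantee is probabilistic over very long reuse chains.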

The U-mode test bootstraps 16 children, each running 1000 × 128 channel create/destroy with a Fisher-Yates-shuffled destroy order — 2 million channel ops per boot — with a completion gate that fails visibly on a single leaked channel. A kernel-side kstress workload runs a server/client ping-pong on separate harts the whole time, keeping the msgport and IPI paths under load. 50/50.
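The shuffled destroy order is a plain Fisher-Yates pass over the array of ids. A minimal sketch follows, with the caveat that the PRNG (xorshift64) and the names are my choice for the example and not necessarily what the test uses. The modulo introduces a tiny bias, which does not matter for a stress test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift64: small PRNG with no libc dependency. State must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fisher-Yates: walk from the end, swapping each element with a random
 * element at or before it. Every permutation is (nearly) equally likely. */
static void shuffle_ids(uint32_t *ids, size_t n, uint64_t *rng)
{
    assert(*rng != 0);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)(xorshift64(rng) % i);   /* 0 <= j < i */
        uint32_t tmp = ids[i - 1];
        ids[i - 1] = ids[j];
        ids[j] = tmp;
    }
}
```

Destroying in shuffled order means slots are freed out of allocation order, which exercises the free-list and generation-counter paths far harder than a create-then-destroy-in-order loop would.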

v0.4 — the interrupt path under stress

v0.4 is hardware bring-up inside QEMU, and the milestone where I proved the interrupt-under-stress path that QRV chronically failed at. It adds a hand-rolled FDT walker (no libfdt — a tag-state-machine that finds PLIC, PCI ECAM, and RAM bounds in one pass), a PLIC driver with priority-based masking and per-IRQ hart pinning (each source enabled on exactly one hart — no multi-hart delivery, no missed-IRQ races), an interrupt-service-thread (IST) infrastructure, MMIO mapping, and — temporarily in-kernel — a PCI enumerator and a ~500-line NVMe driver.
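Per-IRQ hart pinning comes down to which enable bits are set. A rough sketch, assuming the standard PLIC register layout (enable bits at offset 0x2000, 0x80 bytes per context) and the QEMU virt convention that hart h's S-mode context is 2h+1. The function names are hypothetical, and this is an illustration rather than the Skimmer driver.

```c
#include <assert.h>
#include <stdint.h>

#define PLIC_ENABLE_BASE   0x2000u   /* start of the enable-bit arrays */
#define PLIC_ENABLE_STRIDE 0x80u     /* bytes of enable bits per context */
#define NHART              8u

/* Assumption: QEMU virt numbering, contexts alternate M, S per hart. */
static unsigned plic_s_context(unsigned hart)
{
    return 2 * hart + 1;
}

/* Address of the 32-bit enable word holding `irq` for context `ctx`. */
static volatile uint32_t *plic_enable_word(volatile uint8_t *base,
                                           unsigned ctx, unsigned irq)
{
    return (volatile uint32_t *)(base + PLIC_ENABLE_BASE +
                                 ctx * PLIC_ENABLE_STRIDE + (irq / 32) * 4);
}

/* Enable `irq` on exactly one hart's S-mode context, clear it on the rest. */
static void plic_pin_irq(volatile uint8_t *base, unsigned irq,
                         unsigned target_hart)
{
    assert(irq != 0 && target_hart < NHART);   /* source 0 does not exist */
    for (unsigned h = 0; h < NHART; h++) {
        volatile uint32_t *w = plic_enable_word(base, plic_s_context(h), irq);
        if (h == target_hart)
            *w |= 1u << (irq % 32);
        else
            *w &= ~(1u << (irq % 32));
    }
}
```

With a source enabled on one context only, exactly one hart can claim it, so there is no window in which two harts race for the same claim/complete pair.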

Two bugs are worth recording. The FDT work surfaced a latent head.S bug: boot arguments in a0/a1 were being saved after the BSS-zero loop had already wiped them. v0.1–v0.3 never read the FDT, so it was harmless; the moment v0.4 needed it, it wasn't. Fix: stash the values across the zeroing loop and write them to .bss afterwards. And the NVMe driver shipped with inverted phase-tag polarity, which produced sixteen stray-CQE prints before a polling-timeout panic; flipping the XOR sense fixed it.
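The phase-tag rule is easy to get backwards, so here is what the correct polarity looks like in a minimal sketch. The completion-entry layout follows the NVMe specification (phase tag in bit 0 of the 16-bit status halfword); the struct and function names are hypothetical, and a real driver also needs a read barrier after the phase check and a doorbell write after consuming.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nvme_cqe {            /* 16-byte completion queue entry */
    uint32_t dw0;            /* command-specific result */
    uint32_t dw1;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;            /* command identifier */
    uint16_t status;         /* bit 0 = phase tag, bits 15:1 = status field */
};

struct nvme_cq {
    volatile struct nvme_cqe *ring;
    uint16_t depth;
    uint16_t head;
    uint8_t  phase;          /* phase a NEW entry will carry; starts at 1 */
};

/* An entry is new when its phase bit matches the phase we expect. The queue
 * memory starts zeroed, so on the first pass new entries carry phase 1. */
static bool cqe_is_new(const struct nvme_cq *cq)
{
    return (cq->ring[cq->head].status & 1u) == cq->phase;
}

/* Consume one completion if present. The expected phase flips on wrap. */
static bool cq_consume(struct nvme_cq *cq, uint16_t *cid, uint16_t *status)
{
    assert(cq->depth > 0);
    if (!cqe_is_new(cq))
        return false;
    *cid = cq->ring[cq->head].cid;
    *status = (uint16_t)(cq->ring[cq->head].status >> 1);
    if (++cq->head == cq->depth) {
        cq->head = 0;
        cq->phase ^= 1u;     /* second pass expects 0, third expects 1, ... */
    }
    return true;
}
```

With the comparison inverted, every untouched (zeroed) slot looks like a fresh completion and every real one looks stale, which matches the symptom of stray CQEs followed by a timeout.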

The acceptance run puts all four workloads on at once: msgstorm, the kstress ping-pong, the v0.3 U-mode channel storm, and continuous NVMe reads driven by a real PLIC IRQ on hart 3. 25/25 boots, mean ~27,600 NVMe reads per boot, zero panics, zero hangs, zero canary stomps.

v0.5 — QNX-shape synchronous IPC

v0.5 is the payoff: real synchronous MsgSend / MsgReceive / MsgReply over the three rendezvous queues v0.3 planted. The channel becomes a new lwkt_port implementation — the channel-port vtable does the three-queue rendezvous in about 50 lines of policy on top of the v0.1/v0.2 msgport substrate. ConnectAttach / ConnectDetach add the indirection so the U-mode ABI passes connection ids (coids), not raw chids — the QNX shape. Priority inheritance is immediate-only. The per-channel lock is held across both copy stages — an seL4-style per-object discipline, no SMR, no refcounting.

It passed acceptance, and then a roughly 1% intermittent failure showed up in post-acceptance hardening — the same shape as two QRV commits (7830786b, 97094dad). The window between gd->gd_curthread = new and cpu_switch(...) was preemptible against fresh U-threads whose td_critcount was still 0; a trap-tail preempt fired a nested lwkt_switch that saved the running sp into the new thread's td_sp, permanently corrupting it. The fix is one local_irq_save/restore bracket around the curthread-update-through-cpu_switch window. After it: 500/500 PASS, zero panics, zero hangs, running the heaviest workload — one U-server and sixteen U-clients × 1000 MsgSend, 16,000 messages — concurrently with the kernel stressers.

That race is also why the v0.5 work cross-references Elad Lahav's writeup of the QNX 8 interrupt model: an IST-only interrupt model demands a race-tight scheduler, and the REPLACE-race fix described above is what finally delivers one.

Where that leaves it

At v0.5, Skimmer is a kernel that boots eight harts, pages, takes interrupts under load, and does QNX-shape synchronous IPC between user processes, with a stress harness that has caught every SMP race I've thrown threads at. There is still no libc, no real userspace, no procnto-equivalent. The core is proven; it is not yet an operating system.

That's the next post.
