E2E CI — Plain Install Test#
Skill for the plain composefs install E2E gate in build-iso.yml.
Architecture#
The E2E gate splits into four named CI steps, each with its own timeout:
| Step | Timeout | RAM | Purpose |
|---|---|---|---|
E2E 1/4 — Boot live ISO | 10 min | 4 GiB | Live env boots, sshd responds |
E2E 2/4 — ENOSPC gate: OCI export only | 10 min | 4 GiB | skopeo copies blob without ENOSPC |
E2E 3/4 — Full install composefs | 60 min | 8 GiB | btrfs+composefs install completes |
E2E 4/4 — Boot installed + verify Graphical | 15 min | 8 GiB | Installed system reaches Graphical target |
Why split RAM between steps:
The ENOSPC bug triggers when overlay tmpfs is ~2 GiB (4 GiB RAM).
The bootc install to-filesystem --composefs-backend step extracts 6 GB via
btrfs+overlay through QEMU — inherently slow at 4 GiB, ~3× faster at 8 GiB.
The debug ISO#
The production ISO (uploaded to R2) has sshd disabled — it is built
with DEBUG=0. The E2E test uses a debug ISO that is patched at CI time:
unsquashfs production.rootfs.sfs → debug-rootfs/
# add sshd.service symlink, password auth, liveuser:live password
mksquashfs debug-rootfs → debug.rootfs.sfs (zstd-1, fast)
build-iso.sh → output/dakota-debug-live.iso
plain-boot-qemu-live prefers output/{{target}}-debug-live.iso when
present, so CI uses the debug ISO and R2 gets the production ISO.
Never enable sshd in the production squashfs. The debug ISO is test-only.
QEMU disk: raw sparse + cache=unsafe#
Use raw sparse disk instead of qcow2 for sequential write workloads:
truncate -s 64G /var/tmp/dakota-plain-install.img # not qemu-img create -f qcow2
QEMU drive args:
-drive if=none,id=disk,file={{plain-qemu-disk}},format=raw,cache=unsafe
cache=unsafe skips fsync entirely. Fine for tests — not for production.
Raw sparse + cache=unsafe delivers 200–500 MB/s vs ~10–50 MB/s for qcow2.
ENOSPC root cause and fix#
Symptom: write /var/tmp/container_images_XXXXXXXX: no space left on device
Root cause chain:
containers/imageTypeBigFiles path callsstore.TmpDir()firststore.TmpDir()in containers/storage returns/var/tmp(hardcoded default)/var/tmpis on the dracut overlayfs (~1.4 GiB at 4 GiB RAM)- A single squashed 5–6 GiB OCI layer blob overflows it
Why TMPDIR env var doesn't help:
TypeBigFiles checks store.TmpDir() BEFORE os.Getenv("TMPDIR"). If the
store returns a non-empty string, TMPDIR is ignored entirely.
Why CONTAINERS_STORAGE_CONF with tmpdir = doesn't help:
Older containers/storage versions don't have tmpdir as a recognized TOML
field — they silently reject it with Failed to decode the keys ["storage.tmpdir"].
The fix that works (fisherman v2.7.3):
Bind-mount a disk-backed scratch subdir over /var/tmp before the skopeo copy,
then umount it in a deferred call:
varTmpOverride := filepath.Join(tmpdir, "var-tmp-override")
os.MkdirAll(varTmpOverride, 0o1777)
exec.Command("mount", "--bind", varTmpOverride, "/var/tmp").Run()
defer exec.Command("umount", "/var/tmp").Run()
// now skopeo copy runs — /var/tmp is disk-backed
This works across all containers/storage versions and is independent of env
vars, config keys, or source transport.
flatpak-spawn does not forward sandbox env to host#
Symptom: fisherman sets TMPDIR=/scratch and prints # TMPDIR=... before
running skopeo, but blobs still land in /var/tmp.
Root cause: When running inside a Flatpak, runner.HostArgs wraps the
command as flatpak-spawn --host skopeo .... flatpak-spawn --host spawns the
command in the host mount namespace but does NOT forward the Flatpak sandbox's
environment. cmd.Env applies to flatpak-spawn itself, not to skopeo.
Fix: Use runner.HostArgsWithEnv(name, args, envVars) which injects
--env=KEY=VALUE flags into the flatpak-spawn args when inside a Flatpak.
For non-Flatpak invocations the result is identical to HostArgs.
How to identify: # TMPDIR=<path> in fisherman output followed by
write /var/tmp/container_images_...: no space left on device. The TMPDIR
debug print confirms fisherman set the var; skopeo ignoring it confirms the
flatpak-spawn env forwarding gap.
sshd is only enabled in debug ISOs#
Symptom: kex_exchange_identification: read: Connection reset by peer OR
ERROR: serial marker seen but SSH not ready after 90 s
Root cause: configure-live.sh only enables sshd when DEBUG=1 is passed
as a build arg. Production ISOs have no sshd. QEMU user-mode networking accepts
the TCP connection on port 2223 and then resets it because port 22 is not open
inside the guest.
Fix: Build a debug ISO (see above) for E2E testing. Never change the
production ISO to enable sshd.
Serial marker fires before sshd is stable#
Symptom: Serial marker seen — polling SSH... then sshd resets
connections for 10–20 s before accepting.
Cause: live-ready.service fires After=display-manager.service,
which is before sshd finishes host-key generation. First connection after
serial marker gets kex_exchange_identification: read: Connection reset by peer.
Fix in plain-boot-qemu-live: After seeing the serial marker, poll SSH
in a 3-second retry loop (up to 90 s) before breaking out of the wait loop.
This ensures sshd is stable before plain-install-qemu SSHes in.
SUPERISO_TMPDIR — squashfs build must use large disk#
Symptom: Build live squashfs + boot tar fails with ENOSPC.
Root cause: build-live-squashfs.sh creates its WORK dir at /var/tmp
by default. The squash-to-1-layer + VFS embedding writes ~12 GB of
intermediates; /var/tmp on ubuntu-24.04 CI runners has ~14 GB total.
If the image grows, this overflows.
Fix: Set SUPERISO_TMPDIR=/var/iso-build (119 GB) in the workflow env:
- name: Build live squashfs + boot tar
env:
SUPERISO_COMPRESSION: release
SUPERISO_TMPDIR: /var/iso-build
fisherman hostname failure after composefs install#
Symptom: E2E 3/4 fails with:
fisherman: fatal: writing hostname: finding deployment dir: ostree admin --print-current-dir: exit status 1
Root cause: When bootc install to-filesystem --composefs-backend runs, it
creates ostree/bootc/ on the target disk instead of the traditional
ostree/deploy/default/. In step 7 ("Configuring installed system"), fisherman
calls ostree admin --print-current-dir to locate the deployment directory for
writing /etc/hostname. That command fails because there is no traditional ostree
deployment — composefs uses a different on-disk layout.
The actual bootc installation (step 5) completes successfully. Only the
post-install hostname-writing step fails.
Evidence from CI log:
{"message":"bootc installation complete",...} ← step 5 complete
{"message":"Copying system Flatpaks",...} ← step 6 complete
{"message":"Writing hostname: dakota-plain-test"} ← step 7 begins
+ ls /mnt/fisherman-target/ostree
bootc ← no deploy/ dir
fisherman: fatal: writing hostname: finding deployment dir: ostree admin --print-current-dir: exit status 1
Fix: continue-on-error: true on E2E 3/4 so ISO upload proceeds. A final
"E2E full-install status" step re-fails the job to preserve the visible red CI
status. E2E 4/4 is skipped when 3/4 fails (no installed system to boot).
Upstream: tuna-os/fisherman is archived. File against tuna-os/tuna-installer if needed.
What to do when this fires:
- Check that
bootc installation completeappears in the E2E 3/4 logs - If yes: it is the known fisherman hostname bug — ISO is OK to publish
- If no: a real install failure — do not publish, investigate