CI/CD#
How the GitHub Actions workflows build, test, and publish Dakota ISOs.
Workflows#
| Workflow | File | Trigger |
|---|---|---|
| Build & Publish | build-iso.yml | push to main, daily 03:00 UTC, workflow_dispatch |
| LUKS E2E Test | test-luks-install.yml | PRs to main, weekly Mon 04:00 UTC, workflow_dispatch |
| ShellCheck Lint | lint.yml | PRs to main, push to main |
| Python Unit Tests | test.yml | PRs to main, push to main |
build-iso.yml#
Job: build-and-publish (single job, no matrix)
Runner: ubuntu-24.04
Runs as: root via sudo
Path triggers: live/**, scripts/**, .github/workflows/build-iso.yml
Pipeline steps#
- Free disk space: jlumbroso/free-disk-space reclaims ~119 GB at /var/iso-build
- Install deps: apt-get install podman buildah skopeo mtools xorriso squashfs-tools dosfstools isomd5sum
- Log in to GHCR: sudo podman login ghcr.io
- Pull payload image: pulls only dakota-nvidia:stable (the unified ISO base)
- Build live container: podman build live/ --build-arg TARGET=dakota-nvidia → localhost/dakota-nvidia-live:latest
- Build live squashfs: scripts/build-live-squashfs.sh with SUPERISO_COMPRESSION=release → dakota-nvidia.rootfs.sfs + dakota-nvidia-boot.tar (~5.3 GB)
- Assemble ISO: live/src/build-iso.sh → dakota-live.iso (no --store flag; the OCI image is already embedded in the squashfs as VFS)
- Generate checksum: dated + latest variants
- Upload to R2: dakota-live-YYYYMMDD-<sha>.iso + dakota-live-latest.iso + checksums
- Boot verification: QEMU UEFI boot, wait for the DAKOTA_LIVE_READY serial marker
- Upload artifacts: ISO + checksum + screenshot (7-day retention)
⚠️ Do not add --store back or re-add the offline store squashfs step.
The OCI image is already embedded in the live squashfs via VFS containers-storage.
Building a separate store.squashfs.img doubles the OCI payload, producing an ~8 GB
ISO instead of ~5.3 GB. See lessons below.
⚠️ installer_channel is locked to stable in CI#
Do NOT change installer_channel to dev in the live container build. There is an active
regression in the dev channel (tuna-os/fisherman#38) where the overlay storage
code path fails with:
open /var/tmp/oci-cache/index.json: no such file or directory
Production CI must stay on installer_channel=stable until the regression is fixed.
Disk layout in CI#
The build path is /var/iso-build (~119 GB free after disk-space action).
Peak usage ~22 GB (live squashfs ~6 GB + offline store ~6 GB + ISO ~5 GB + intermediate).
No XFS loopback needed in CI.
Boot verification logic#
CI accepts either:
- DAKOTA_LIVE_READY written directly to /dev/ttyS0 by live-ready.service
- Finished live-ready.service in the serial log (systemd journal console fallback)
Some dev channel builds don't write the serial marker but still reach GDM.
If both checks fail after 5 minutes, the job fails with tail -50 /tmp/serial.log.
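The acceptance rule above can be sketched as a small polling helper. This is an illustration only, not the workflow's actual step: the function names are hypothetical, and the 5-second poll interval is an assumption.

```shell
# Return 0 when the serial log shows either readiness signal described above.
serial_log_ready() {
    # Primary signal: the marker written straight to the serial port.
    grep -qF 'DAKOTA_LIVE_READY' "$1" 2>/dev/null && return 0
    # Fallback: systemd's completion line for the unit.
    grep -qF 'Finished live-ready.service' "$1" 2>/dev/null && return 0
    return 1
}

# Poll the log until it is ready or the deadline passes.
# $1 = log path, $2 = timeout in seconds (default 300), $3 = poll interval (assumed default 5).
wait_for_live_ready() {
    wait_log="$1"
    wait_timeout="${2:-300}"
    wait_interval="${3:-5}"
    wait_elapsed=0
    while :; do
        serial_log_ready "$wait_log" && return 0
        [ "$wait_elapsed" -ge "$wait_timeout" ] && break
        sleep "$wait_interval"
        wait_elapsed=$((wait_elapsed + wait_interval))
    done
    # Mirror the failure behaviour: show the tail of the serial log.
    echo "live environment not ready after ${wait_timeout}s" >&2
    tail -n 50 "$wait_log" >&2 || true
    return 1
}
```

A call such as `wait_for_live_ready /tmp/serial.log 300` would match the 5-minute limit stated above.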
R2 upload#
ISOs are uploaded to the testing bucket as:
- dakota-live-YYYYMMDD-<sha>.iso: permanent dated record
- dakota-live-latest.iso: always points to the last successful build
- Matching -CHECKSUM files for both
⚠️ Direct uploads from the local host hang (routing issue). Always use R2→R2
server-side copies via rclone for local promotion. See docs/r2-promotion.md.
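As a sketch of the naming scheme listed in this section, the helper below prints the four object names for one build. The function name is hypothetical, and the choice of a UTC build date and a short commit SHA in the usage comment is an assumption, since the text only gives the YYYYMMDD-<sha> pattern.

```shell
# Print the object names for one build: dated ISO, latest ISO, and a -CHECKSUM file for each.
# $1 = date stamp (YYYYMMDD), $2 = commit SHA fragment.
r2_object_names() {
    names_dated="dakota-live-$1-$2.iso"
    names_latest="dakota-live-latest.iso"
    printf '%s\n' \
        "$names_dated" \
        "$names_latest" \
        "${names_dated}-CHECKSUM" \
        "${names_latest}-CHECKSUM"
}

# Possible usage (assumed inputs: UTC date and short SHA of the checked-out commit):
#   r2_object_names "$(date -u +%Y%m%d)" "$(git rev-parse --short HEAD)"
```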
test-luks-install.yml#
Matrix: installer_channel: [dev, stable] (fail-fast: false)
Timeout: 90 minutes
Triggers: PRs to main, weekly schedule, workflow_dispatch
Pipeline steps#
- Ensure ci-screenshots branch exists
- Free disk space
- Install deps (adds qemu-system-x86 ovmf socat sshpass)
- Configure podman storage (configure_podman_storage.sh)
- Build ISO with debug=1 and the matrix installer_channel
- Boot live ISO in QEMU (daemonized) + wait for ready
- SSH into live env, write recipe, run fisherman LUKS install
- Patch BLS entries for dual console (console=tty0 console=ttyS0)
- Boot installed disk, send LUKS passphrase via QEMU monitor
- Verify boot success via serial log
- Save screenshots to ci-screenshots branch + post PR comment
Configure podman storage script#
.github/scripts/configure_podman_storage.sh selects the podman storage
driver based on the host filesystem:
- Clears existing podman storage to avoid driver mismatch errors
- On BTRFS: uses VFS driver (overlayfs is unreliable on BTRFS in CI)
- On ext4/other: uses overlay driver
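The selection rule above might look like the following. This is a minimal sketch, not the contents of configure_podman_storage.sh: both function names are hypothetical, and the detection relies on GNU stat.

```shell
# Map a filesystem type name to a podman storage driver, per the rules above.
pick_storage_driver() {
    case "$1" in
        btrfs) echo vfs ;;      # overlayfs is unreliable on BTRFS in CI
        *)     echo overlay ;;  # ext4 and everything else
    esac
}

# Detect the filesystem type backing a path (GNU stat prints names such as "btrfs" or "xfs").
host_fs_type() {
    stat -f -c %T "$1"
}

# Possible usage (the path is an assumption):
#   driver=$(pick_storage_driver "$(host_fs_type /var/lib/containers)")
```

GNU stat reports ext4 as ext2/ext3, which is why the sketch matches BTRFS explicitly and treats every other value as the overlay case.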
Screenshots#
LUKS test screenshots are saved to the ci-screenshots branch and linked in PR
comments. Key screenshots:
- Live boot (after DAKOTA_LIVE_READY)
- Plymouth LUKS passphrase prompt
- Final boot (after passphrase unlock)
Adding a new workflow#
All workflow files go in .github/workflows/. Before adding:
- Run actionlint (config in .github/actionlint.yaml)
- Check matrix fail-fast: false for variant builds
- Do not use installer_channel=dev in scheduled/release builds
lint.yml — ShellCheck#
Runs ShellCheck on every .sh file in the repository. Severity threshold is warning
(style/info is ignored). Uses ludeeus/action-shellcheck@2.0.0.
Any new shell script must pass ShellCheck before merge. For intentional suppression,
add an inline # shellcheck disable=SCxxxx comment with a justification.
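For illustration, a suppression with its justification could look like this. The variable name is hypothetical; SC2034 (variable appears unused) is used because it is reported at warning level, which is the threshold this workflow enforces.

```shell
# Justification: this variable is read by a script that sources this file,
# so ShellCheck cannot see the use.
# shellcheck disable=SC2034
BUILD_TARGET="dakota-nvidia"
```

The directive applies only to the next command, so the suppression stays as narrow as possible.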
test.yml — Python Unit Tests#
Runs pytest tests/ -v against Python 3.11.
| File | Tests | Coverage |
|---|---|---|
| tests/test_luks_unlock.py | 52 | luks-unlock.py virsh/QEMU interaction, screenshot parsing, passphrase injection |
| tests/test_multi_arch_iso.py | 4 | build-iso.sh --arch arg parsing; integration tests (skipped if xorriso/mtools absent) |
Run locally with:
pip install pytest
pytest tests/ -v
Lessons#
Double-embedded OCI store inflates ISO to 8 GB (2026-06)#
The live container (live/Containerfile) already bakes the OCI image into the squashfs
as VFS containers-storage via configure-live.sh and install-flatpaks.sh.
Building a separate store.squashfs.img with scripts/build-offline-store.sh and
passing --store store.squashfs.img to build-iso.sh embeds the same ~4 GB OCI
image twice in the final ISO — resulting in ~8 GB instead of ~5.3 GB.
Fix: Remove the "Build offline image store squashfs" CI step entirely.
Call build-iso.sh without --store. This is the correct architecture for the
unified VFS-embedded ISO.
Release compression for production ISOs (2026-06)#
scripts/build-live-squashfs.sh defaults to zstd level 3 (SUPERISO_COMPRESSION=fast).
CI sets SUPERISO_COMPRESSION=release (zstd-15, 1M blocks) in the squashfs build step —
this produces ~20% smaller ISOs at ~5× longer squashfs build time. Always use release
for ISOs published to R2. Use fast only for local testing.
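The two modes could map onto mksquashfs options roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the real script may pass further options, and the block size for fast is left at the mksquashfs default because the text does not state it.

```shell
# Translate a SUPERISO_COMPRESSION value into mksquashfs compression options.
squashfs_comp_args() {
    case "${1:-fast}" in
        fast)    echo "-comp zstd -Xcompression-level 3" ;;
        release) echo "-comp zstd -Xcompression-level 15 -b 1M" ;;
        *)       echo "unknown SUPERISO_COMPRESSION: $1" >&2; return 1 ;;
    esac
}

# Possible usage (word splitting of the result is intended):
#   mksquashfs rootfs/ out.sfs $(squashfs_comp_args "${SUPERISO_COMPRESSION:-fast}")
```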
installer_channel=dev regression: oci-cache/index.json not found (2026-05)#
After the continuous-dev release ~2026-05, fisherman's overlay storage path fails
with open /var/tmp/oci-cache/index.json: no such file or directory when composefs+btrfs
is the backend. Root cause: fisherman exports the OCI to scratch but bootc inside the
container cannot see it via the bind mount.
Fix: use installer_channel=stable. Keep build-iso.yml on stable until
tuna-os/fisherman#38 is resolved.
DAKOTA_LIVE_READY not seen when live-ready.service uses journal+console (2026-05)#
When StandardOutput=journal+console, the output goes to /dev/console (not /dev/ttyS0).
QEMU serial (-serial file:...) captures ttyS0 output only.
Fix: StandardOutput=tty + TTYPath=/dev/ttyS0 for direct serial writes.
CI falls back to SSH connectivity check if the marker is absent.
Offline install failed: VFS containers-storage missing from CI ISOs (2026-06)#
Symptom: fisherman: fatal: ... reference "containers-storage:ghcr.io/projectbluefin/dakota-nvidia:stable" does not resolve to an image ID
Root cause: The CI build called scripts/build-live-squashfs.sh without --oci-image, so
the live squashfs shipped with an empty /var/lib/containers/storage. The live recipe.json
has local_imgref=containers-storage:ghcr.io/projectbluefin/dakota-nvidia:stable, which
fisherman treats as authoritative. When the local store is empty, the install fails even if
the user has a working internet connection (fisherman does not fall back to docker://).
Why local builds were unaffected: just iso-sd-boot always did the squash+skopeo step and
baked the OCI into the squashfs. CI diverged from this path and the gap was undetected
because CI only validated boot, not install.
Fix: scripts/build-live-squashfs.sh now accepts --oci-image <ref>. When provided it:
- Squashes the payload to a single layer via buildah commit --squash
- Runs skopeo copy inside the live container (for JSON tar-split compatibility)
- Copies the populated VFS staging dir into the squashfs root with cp -a before mksquashfs
build-iso.yml now passes --oci-image ghcr.io/projectbluefin/dakota-nvidia:stable and
asserts the embedded store is non-empty before uploading the ISO to R2.
Invariant: The CI-built squashfs must contain a populated VFS store at
var/lib/containers/storage with the dakota-nvidia:stable image. The assertion step
catches any regression before upload.
See issue #78.
VFS store not captured by mksquashfs when using bind-mount into overlayfs (2026-06)#
Symptom: build-live-squashfs.sh --oci-image runs successfully, VFS store logs 9.1G, but
the squashfs is only ~4.2G (no VFS data) and the assertion fails.
Root cause: When SFS_ROOT is an overlayfs mount (the default on ext4/XFS CI runners),
the overlayfs filesystem has a different st_dev than a bind-mounted directory inside it.
mksquashfs respects filesystem boundaries (stops when st_dev changes) and silently skips
the bind-mounted VFS tree.
Fix: Copy the VFS staging dir into the squashfs root with cp -a instead of bind-mounting.
Writes to an overlayfs path go into the overlay upper layer; the resulting files inherit the
overlayfs st_dev and are included by mksquashfs.
Rule: Never use mount --bind to inject data into a directory that will be squash-packed with
mksquashfs when the mount point is overlayfs. Always copy.
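One way to apply this rule, and to check the st_dev condition described above before packing, is sketched below. Both function names are hypothetical and the device check relies on GNU stat.

```shell
# Return 0 when both paths report the same device number (st_dev).
same_filesystem() {
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Copy a staging tree into the squashfs root instead of bind-mounting it.
# $1 = staging dir, $2 = destination dir inside the squashfs root.
inject_tree() {
    mkdir -p "$2"
    # "src/." copies the directory contents, including dotfiles, preserving attributes.
    cp -a "$1"/. "$2"/
}

# Possible usage before running mksquashfs (paths are assumptions):
#   inject_tree "$VFS_STAGING" "$SFS_ROOT/var/lib/containers/storage"
#   same_filesystem "$SFS_ROOT" "$SFS_ROOT/var/lib/containers/storage" || exit 1
```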
VFS storage paths don't contain image names — assertion must check vfs-images/ (2026-06)#
Symptom: Assertion grep -c "ghcr.io" on unsquashfs -lc output returns 0 even when the
OCI store is correctly embedded.
Root cause: VFS containers-storage uses content-addressed hashes for all paths. Image
names like ghcr.io/projectbluefin/... are stored in JSON metadata inside the hash-named
directories, not in the directory paths themselves. unsquashfs -lc shows file paths only,
so grepping for ghcr.io always returns 0.
Correct assertion: Check for var/lib/containers/storage/vfs-images — this directory is
created by containers/storage for every imported image. If it has entries, the VFS store
was populated.
Note on mksquashfs deduplication: The VFS layer is a squashed copy of the same OS as the
live rootfs. mksquashfs deduplicates identical content blocks, so the squashfs size barely
increases despite embedding 9G of VFS data. Use inode/file counts to confirm inclusion,
not squashfs file size.
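A sketch of the corrected assertion follows, written to read a file listing on standard input so it can be fed from unsquashfs -lc. The function names are hypothetical, and the exact check in build-iso.yml may differ.

```shell
# Count listing entries that live under the vfs-images directory.
count_vfs_image_entries() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the count, drop the status.
    grep -c 'var/lib/containers/storage/vfs-images/.' || true
}

# Succeed only when at least one vfs-images entry is present in the listing.
assert_vfs_store_populated() {
    vfs_entries=$(count_vfs_image_entries)
    [ "$vfs_entries" -gt 0 ]
}

# Possible usage:
#   unsquashfs -lc dakota-nvidia.rootfs.sfs | assert_vfs_store_populated
```

Counting entries rather than comparing squashfs sizes follows the deduplication note above.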
ENOSPC in skopeo OCI export — containers/storage tmpdir not redirected (2026-06)#
Symptom: Live ISO installs fail with:
reading blob sha256:...: write /var/tmp/container_images_XXXXXXXX: no space left on device
The installer correctly sets TMPDIR=/mnt/fisherman-target/.fisherman-scratch but the
blob staging file still lands at /var/tmp.
Root cause (3 layers):
- configure-live.sh writes /etc/containers/storage.conf with driver = "vfs" but no tmpdir line.
- containers/storage defaults TMPDir to /var/tmp (hardcoded) when the config has no tmpdir field. Setting $TMPDIR in the subprocess env is not sufficient: containers/storage reads the store config first and uses /var/tmp as the unconditional fallback.
- /var/tmp on the live ISO is on the dracut overlayfs (~1.4 GiB writable layer), which is too small for multi-GiB OCI layer blobs.
Fix: skopeoExportOCI (fisherman) now reads the current effective storage.conf, injects
tmpdir = "<scratchDir>", writes the result to a temp file in the disk-backed scratch dir,
and passes it to skopeo via CONTAINERS_STORAGE_CONF. $TMPDIR is retained for
belt-and-suspenders coverage of containers/image's copy-side blob staging.
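The real fix lives in fisherman's Go code. As a rough shell analogue of the same idea, the sketch below rewrites a storage.conf so that a tmpdir line follows the [storage] header. The function name is hypothetical, the tmpdir key name is taken from the text above, and a config without a [storage] section is passed through unchanged.

```shell
# Print a copy of a storage.conf with tmpdir pointing at a scratch directory.
# $1 = path to the existing config, $2 = scratch directory.
inject_tmpdir() {
    awk -v dir="$2" '
        /^[ \t]*tmpdir[ \t]*=/ { next }                          # drop any existing tmpdir line
        { print }                                                # keep every other line
        /^\[storage\][ \t]*$/ { printf "tmpdir = \"%s\"\n", dir } # add ours after the header
    ' "$1"
}

# Possible usage (paths are assumptions):
#   inject_tmpdir /etc/containers/storage.conf "$SCRATCH" > "$SCRATCH/storage.conf"
#   CONTAINERS_STORAGE_CONF="$SCRATCH/storage.conf" skopeo copy ...
```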
Why CI didn't catch it: The LUKS E2E test runs QEMU with 8 GiB RAM; the overlay tmpfs
is ~4 GiB — large enough for individual blobs in most runs. On 8 GiB user laptops with the
live environment loaded, free tmpfs headroom is much lower and ENOSPC triggers reliably.
Prevention: plain-test-qemu (new) runs with qemu-mem=4096 (4 GiB RAM), which gives
only ~2 GiB overlay tmpfs — reliably reproducing this class of bug. The test is gated
before R2 upload in build-iso.yml.
build-live-squashfs.sh WORK dir must be on large disk (2026-06)#
Symptom: Build live squashfs + boot tar step fails with:
write /usr/lib/locale/.../LC_COLLATE: no space left on device
mkdir /vfs-storage/vfs-layers/tmp: no space left on device
Root cause: build-live-squashfs.sh creates WORK at /var/tmp by default.
The squash-to-1-layer + VFS embedding writes ~12 GB of intermediates (payload.oci.tar
~6 GB + VFS staging ~6 GB). /var/tmp on GitHub ubuntu-24.04 runners sits on the
root filesystem which has ~14 GB free after jlumbroso/free-disk-space — not enough
if the image grows at all.
Fix: WORK now uses ${SUPERISO_TMPDIR:-/var/tmp}. In CI, build-iso.yml
sets SUPERISO_TMPDIR: /var/iso-build so all intermediates land on the 119 GB
disk-backed path. Locally the default /var/tmp still applies.
Prevention: If squashfs build ENOSPC recurs in CI, verify SUPERISO_TMPDIR
is set in the Build live squashfs + boot tar step env.
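The fallback and a pre-flight space check could be sketched as follows. Only the ${SUPERISO_TMPDIR:-/var/tmp} expression comes from the text; the function names and the idea of checking free space before building are assumptions.

```shell
# Resolve the base directory for intermediates, as described above.
work_base() {
    echo "${SUPERISO_TMPDIR:-/var/tmp}"
}

# Print the free space of the filesystem holding a path, in KiB (POSIX df output format).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 when the path has at least the requested number of GiB free.
# $1 = directory, $2 = required GiB.
require_free_gib() {
    need_kib=$(( $2 * 1024 * 1024 ))
    have_kib=$(free_kib "$1")
    [ "$have_kib" -ge "$need_kib" ]
}

# Possible usage (the 12 GiB figure mirrors the intermediates estimate above):
#   require_free_gib "$(work_base)" 12 || { echo "not enough space in $(work_base)" >&2; exit 1; }
```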
E2E plain install test requires sshd — production ISO has it disabled (2026-06)#
Symptom: Plain install E2E step fails with either:
- kex_exchange_identification: read: Connection reset by peer (QEMU user-net accepts TCP, no listener inside guest)
- ERROR: serial marker seen but SSH not ready after 90 s (sshd never starts)
Root cause: sshd is only enabled in the live ISO when the container is built with
--build-arg DEBUG=1. The production build uses DEBUG=0, so no sshd. The E2E test
uses SSH to invoke fisherman; without sshd the test cannot proceed.
Fix: After building the production ISO, a CI step patches the production squashfs:
- unsquashfs the production rootfs (includes the embedded VFS store)
- Add sshd.service symlink to multi-user.target.wants
- Append PasswordAuthentication yes / PermitEmptyPasswords yes to sshd_config
- Set liveuser password to live via /etc/shadow patch
- mksquashfs back with zstd-1 (fast, debug-only)
- Assemble output/dakota-debug-live.iso (uses same boot tar as production)
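The sshd part of these steps can be sketched against an unpacked rootfs directory. This is an illustration, not the CI step itself: the function name is hypothetical, the unit path /usr/lib/systemd/system/sshd.service and the config path etc/ssh/sshd_config are assumptions, and the /etc/shadow patch is left out.

```shell
# Enable sshd with password login inside an unpacked rootfs tree.
# $1 = directory produced by unsquashfs.
enable_debug_sshd() {
    debug_root="$1"
    mkdir -p "$debug_root/etc/systemd/system/multi-user.target.wants" "$debug_root/etc/ssh"
    # The symlink target is resolved inside the booted system, so it may dangle on the build host.
    ln -sf /usr/lib/systemd/system/sshd.service \
        "$debug_root/etc/systemd/system/multi-user.target.wants/sshd.service"
    printf '%s\n' 'PasswordAuthentication yes' 'PermitEmptyPasswords yes' \
        >> "$debug_root/etc/ssh/sshd_config"
}

# Possible surrounding commands (file names are assumptions):
#   unsquashfs -d rootfs dakota-nvidia.rootfs.sfs
#   enable_debug_sshd rootfs
#   mksquashfs rootfs debug.rootfs.sfs -comp zstd -Xcompression-level 1
```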
plain-boot-qemu-live in the justfile prefers output/{{target}}-debug-live.iso
when present, so CI runs against the debug ISO while R2 gets the production ISO.
Why the VFS store must stay: ghcr.io/projectbluefin/dakota-nvidia is private;
the live env inside QEMU has no GHCR credentials, so fisherman cannot pull from
network. The VFS store (embedded in the squashfs) is the only install source.
flatpak-spawn --host does not forward sandbox env to host process (2026-06)#
Symptom: fisherman sets TMPDIR=/mnt/fisherman-target/.fisherman-scratch and prints
# TMPDIR=<scratch> before running skopeo, but the blob staging file is still created
at /var/tmp/container_images_XXXXXXXX causing ENOSPC.
Root cause: The bootc-installer runs inside a Flatpak. When runner.go calls
flatpak-spawn --host skopeo copy ..., flatpak-spawn --host spawns the command in
the HOST mount namespace but does not automatically forward the Flatpak sandbox
environment to the spawned host process. skopeo inherits the host's default env
(no TMPDIR set) and uses /var/tmp for blob staging.
Setting cmd.Env for the flatpak-spawn subprocess propagates env vars to
flatpak-spawn itself, but not to the command it spawns on the host.
Fix (fisherman): runner.HostArgsWithEnv injects critical env vars via
--env=KEY=VALUE flags in the flatpak-spawn args:
flatpak-spawn --host --env=TMPDIR=/scratch --env=CONTAINERS_STORAGE_CONF=... skopeo copy ...
Released in bootc-installer v2.7.1.
How to identify: Look for # TMPDIR=<path> in fisherman output followed by
write /var/tmp/container_images_...: no space left on device. The TMPDIR debug
print confirms fisherman set the var correctly, but skopeo ignoring it confirms the
flatpak-spawn env forwarding gap.
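As a shell analogue of the argument construction described above (the actual implementation is Go, in runner.HostArgsWithEnv), the sketch below turns KEY=VALUE pairs into --env flags placed before the host command. The function name and the "--" separator are assumptions made for this example.

```shell
# Print a flatpak-spawn command line, one argument per line.
# Arguments: KEY=VALUE pairs, then "--", then the host command and its arguments.
host_args_with_env() {
    printf '%s\n' flatpak-spawn --host
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        # Each pair becomes an explicit flag, because the sandbox env is not forwarded.
        printf '%s\n' "--env=$1"
        shift
    done
    if [ "$#" -gt 0 ]; then
        shift   # drop the "--" separator
    fi
    if [ "$#" -gt 0 ]; then
        printf '%s\n' "$@"
    fi
}

# Possible usage:
#   host_args_with_env TMPDIR=/scratch -- skopeo copy SRC DST
```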
E2E test split into 4 named steps with individual timeouts (2026-06)#
Why: The original single plain-test-qemu step had one monolithic timeout (90 min).
When it expired you had no idea which of the four stages (boot-live, install,
boot-installed, verify) was the bottleneck.
New structure:
| Step | Timeout | RAM | Purpose |
|---|---|---|---|
| E2E 1/4 — Boot live ISO | 10 min | 4 GiB | Live env ready + SSH confirmed |
| E2E 2/4 — Install composefs | 30 min | 4 GiB | ENOSPC regression gate (tight tmpfs) |
| E2E 3/4 — Boot installed disk | 10 min | 8 GiB | Installed system POSTs correctly |
| E2E 4/4 — Verify Graphical target | 10 min | 8 GiB | systemd Graphical target reached |
Total worst-case ceiling: 60 min (vs. 90 min monolithic), with precise attribution.
Gates 1 and 2 use 4 GiB to keep the overlay tmpfs tight (~2 GiB) for ENOSPC testing.
Gates 3 and 4 switch to 8 GiB for realistic boot performance.
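The attribution idea can also be reproduced outside the workflow with GNU timeout. This is a sketch under assumptions: the workflow itself presumably sets its limits per step, and the function name and the stage commands in the usage comment are hypothetical.

```shell
# Run one named stage under its own time limit and report which stage failed.
# $1 = stage name, $2 = limit in GNU timeout syntax (for example 10m), rest = command.
run_stage() {
    stage_name="$1"
    stage_limit="$2"
    shift 2
    stage_rc=0
    timeout "$stage_limit" "$@" || stage_rc=$?
    if [ "$stage_rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the limit was hit.
        echo "stage '$stage_name' exceeded its $stage_limit limit" >&2
    elif [ "$stage_rc" -ne 0 ]; then
        echo "stage '$stage_name' failed with exit code $stage_rc" >&2
    fi
    return "$stage_rc"
}

# Possible usage (stage commands are placeholders):
#   run_stage "boot-live" 10m ./boot-live.sh && run_stage "install" 30m ./install.sh
```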