ci
Type: External
Status: Published
Created: Jun 13, 2026
Updated: Jun 13, 2026

CI/CD#

How the GitHub Actions workflows build, test, and publish Dakota ISOs.

Workflows#

| Workflow | File | Trigger |
| --- | --- | --- |
| Build & Publish | build-iso.yml | push to main, daily 03:00 UTC, workflow_dispatch |
| LUKS E2E Test | test-luks-install.yml | PRs to main, weekly Mon 04:00 UTC, workflow_dispatch |
| ShellCheck Lint | lint.yml | PRs to main, push to main |
| Python Unit Tests | test.yml | PRs to main, push to main |

build-iso.yml#

Job: build-and-publish (single job, no matrix)
Runner: ubuntu-24.04
Runs as: root via sudo
Path triggers: live/**, scripts/**, .github/workflows/build-iso.yml

Pipeline steps#

  1. Free disk space: jlumbroso/free-disk-space reclaims ~119 GB at /var/iso-build
  2. Install deps: apt-get install podman buildah skopeo mtools xorriso squashfs-tools dosfstools isomd5sum
  3. Log in to GHCR: sudo podman login ghcr.io
  4. Pull payload image: pulls only dakota-nvidia:stable (the unified ISO base)
  5. Build live container: podman build live/ --build-arg TARGET=dakota-nvidia → localhost/dakota-nvidia-live:latest
  6. Build live squashfs: scripts/build-live-squashfs.sh with SUPERISO_COMPRESSION=release → dakota-nvidia.rootfs.sfs + dakota-nvidia-boot.tar (~5.3 GB)
  7. Assemble ISO: live/src/build-iso.sh → dakota-live.iso (no --store flag; the OCI image is already embedded in the squashfs as VFS)
  8. Generate checksum: dated + latest variants
  9. Upload to R2: dakota-live-YYYYMMDD-<sha>.iso + dakota-live-latest.iso + checksums
  10. Boot verification: QEMU UEFI boot, wait for DAKOTA_LIVE_READY serial marker
  11. Upload artifacts: ISO + checksum + screenshot (7-day retention)

⚠️ Do not add --store back or re-add the offline store squashfs step.
The OCI image is already embedded in the live squashfs via VFS containers-storage.
Building a separate store.squashfs.img doubles the OCI payload, producing an ~8 GB
ISO instead of ~5.3 GB. See lessons below.

⚠️ installer_channel is locked to stable in CI#

Do NOT change installer_channel to dev in the live container build. There is an active
regression in the dev channel (tuna-os/fisherman#38) where the overlay storage
code path fails with:

open /var/tmp/oci-cache/index.json: no such file or directory

Production CI must stay on installer_channel=stable until the regression is fixed.

Disk layout in CI#

The build path is /var/iso-build (~119 GB free after disk-space action).
Peak usage ~22 GB (live squashfs ~6 GB + offline store ~6 GB + ISO ~5 GB + intermediate).
No XFS loopback needed in CI.

Boot verification logic#

CI accepts either:

  1. DAKOTA_LIVE_READY written directly to /dev/ttyS0 by live-ready.service
  2. Finished live-ready.service in the serial log (systemd journal console fallback)

Some dev channel builds don't write the serial marker but still reach GDM.
If both checks fail after 5 minutes, the job fails with tail -50 /tmp/serial.log.
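
The acceptance rule above can be sketched in Python. This is an illustration only, not the workflow's actual implementation: the function names, the 5-second poll interval, and the injectable `read_log` callback are assumptions; the two marker strings and the 5-minute timeout come from the text.

```python
import time

# Marker strings taken from the section above.
READY_MARKER = "DAKOTA_LIVE_READY"
FALLBACK_MARKER = "Finished live-ready.service"


def boot_verified(serial_log: str) -> bool:
    """Return True if either accepted readiness signal appears in the log."""
    return READY_MARKER in serial_log or FALLBACK_MARKER in serial_log


def wait_for_boot(read_log, timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Poll read_log() until a marker shows up or timeout_s elapses.

    read_log is a hypothetical callback returning the serial log text so far.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if boot_verified(read_log()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

In a real job, `read_log` would read /tmp/serial.log, and a False result would be followed by printing the last 50 lines of that file.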

R2 upload#

ISOs are uploaded to the testing bucket as:

  • dakota-live-YYYYMMDD-<sha>.iso — permanent dated record
  • dakota-live-latest.iso — always points to the last successful build
  • Matching -CHECKSUM files for both
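
As a rough illustration of the naming scheme, the object names can be derived from the build date and commit SHA. The helper name is hypothetical, and the exact checksum file name (here the ISO name with a -CHECKSUM suffix) is an assumption; the text only says the checksum files "match".

```python
from datetime import date


def r2_object_names(build_date: date, sha: str) -> list[str]:
    """Return the four object names uploaded for one successful build."""
    dated = f"dakota-live-{build_date:%Y%m%d}-{sha}.iso"
    latest = "dakota-live-latest.iso"
    # Assumption: each checksum file is the ISO name plus "-CHECKSUM".
    return [dated, latest, f"{dated}-CHECKSUM", f"{latest}-CHECKSUM"]
```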

⚠️ Direct uploads from the local host hang (routing issue). Always use R2→R2
server-side copies via rclone for local promotion. See docs/r2-promotion.md.

test-luks-install.yml#

Matrix: installer_channel: [dev, stable] (fail-fast: false)
Timeout: 90 minutes
Triggers: PRs to main, weekly schedule, workflow_dispatch

Pipeline steps#

  1. Ensure ci-screenshots branch exists
  2. Free disk space
  3. Install deps (adds qemu-system-x86 ovmf socat sshpass)
  4. Configure podman storage (configure_podman_storage.sh)
  5. Build ISO with debug=1 and the matrix installer_channel
  6. Boot live ISO in QEMU (daemonized) + wait for ready
  7. SSH into live env, write recipe, run fisherman LUKS install
  8. Patch BLS entries for dual console (console=tty0 console=ttyS0)
  9. Boot installed disk, send LUKS passphrase via QEMU monitor
  10. Verify boot success via serial log
  11. Save screenshots to ci-screenshots branch + post PR comment

Configure podman storage script#

.github/scripts/configure_podman_storage.sh selects the podman storage driver
based on the host filesystem type:

  • Clears existing podman storage to avoid driver mismatch errors
  • On BTRFS: uses VFS driver (overlayfs is unreliable on BTRFS in CI)
  • On ext4/other: uses overlay driver
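
The selection rule is small enough to sketch in Python. The real script is shell; the function names here are hypothetical, and the generated storage.conf fragment is a minimal assumption that shows only the driver line.

```python
def choose_storage_driver(fstype: str) -> str:
    """Pick the podman storage driver for a host filesystem type."""
    # Per the rule above: VFS on BTRFS, overlay everywhere else.
    return "vfs" if fstype.strip().lower() == "btrfs" else "overlay"


def storage_conf(fstype: str) -> str:
    """Render a minimal storage.conf fragment for the chosen driver."""
    return f'[storage]\ndriver = "{choose_storage_driver(fstype)}"\n'
```

The filesystem type could come from something like `stat -f -c %T` on the storage path, though how the script actually detects it is not stated here.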

Screenshots#

LUKS test screenshots are saved to the ci-screenshots branch and linked in PR
comments. Key screenshots:

  • Live boot (after DAKOTA_LIVE_READY)
  • Plymouth LUKS passphrase prompt
  • Final boot (after passphrase unlock)

Adding a new workflow#

All workflow files go in .github/workflows/. Before adding:

  • Run actionlint (config in .github/actionlint.yaml)
  • Check matrix fail-fast: false for variant builds
  • Do not use installer_channel=dev in scheduled/release builds

lint.yml — ShellCheck#

Runs ShellCheck on every .sh file in the repository. Severity threshold is warning
(style/info is ignored). Uses ludeeus/action-shellcheck@2.0.0.

Any new shell script must pass ShellCheck before merge. For intentional suppression,
add an inline # shellcheck disable=SCxxxx comment with a justification.

test.yml — Python Unit Tests#

Runs pytest tests/ -v against Python 3.11.

| File | Tests | Coverage |
| --- | --- | --- |
| tests/test_luks_unlock.py | 52 | luks-unlock.py virsh/QEMU interaction, screenshot parsing, passphrase injection |
| tests/test_multi_arch_iso.py | 4 | build-iso.sh --arch arg parsing; integration tests (skipped if xorriso/mtools absent) |

Run locally with:

pip install pytest
pytest tests/ -v

Lessons#

Double-embedded OCI store inflates ISO to 8 GB (2026-06)#

The live container (live/Containerfile) already bakes the OCI image into the squashfs
as VFS containers-storage via configure-live.sh and install-flatpaks.sh.
Building a separate store.squashfs.img with scripts/build-offline-store.sh and
passing --store store.squashfs.img to build-iso.sh embeds the same ~4 GB OCI
image twice in the final ISO — resulting in ~8 GB instead of ~5.3 GB.

Fix: Remove the "Build offline image store squashfs" CI step entirely.
Call build-iso.sh without --store. This is the correct architecture for the
unified VFS-embedded ISO.

Release compression for production ISOs (2026-06)#

scripts/build-live-squashfs.sh defaults to zstd level 3 (SUPERISO_COMPRESSION=fast).
CI sets SUPERISO_COMPRESSION=release (zstd-15, 1M blocks) in the squashfs build step —
this produces ~20% smaller ISOs at ~5× longer squashfs build time. Always use release
for ISOs published to R2. Use fast only for local testing.
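
One way to picture the two profiles is as a mapping from SUPERISO_COMPRESSION to mksquashfs arguments. This is a sketch under assumptions: the real script is shell, the function name is hypothetical, and the block size for the fast profile is not stated in the text, so it is left at the mksquashfs default.

```python
import os

# Levels and the 1M block size come from the section above.
PROFILES = {
    "fast": ["-comp", "zstd", "-Xcompression-level", "3"],
    "release": ["-comp", "zstd", "-Xcompression-level", "15", "-b", "1M"],
}


def mksquashfs_args(source: str, dest: str, env=None) -> list[str]:
    """Build a mksquashfs command line for the selected profile."""
    env = os.environ if env is None else env
    profile = env.get("SUPERISO_COMPRESSION") or "fast"  # default is fast
    if profile not in PROFILES:
        raise ValueError(f"unknown SUPERISO_COMPRESSION: {profile!r}")
    return ["mksquashfs", source, dest, *PROFILES[profile]]
```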

installer_channel=dev regression: oci-cache/index.json not found (2026-05)#

After the continuous-dev release ~2026-05, fisherman's overlay storage path fails
with open /var/tmp/oci-cache/index.json: no such file or directory when composefs+btrfs
is the backend. Root cause: fisherman exports the OCI to scratch but bootc inside the
container cannot see it via the bind mount.

Fix: use installer_channel=stable. Keep build-iso.yml on stable until
tuna-os/fisherman#38 is resolved.

DAKOTA_LIVE_READY not seen when live-ready.service uses journal+console (2026-05)#

When StandardOutput=journal+console, the output goes to /dev/console (not /dev/ttyS0).
QEMU serial (-serial file:...) captures ttyS0 output only.

Fix: StandardOutput=tty + TTYPath=/dev/ttyS0 for direct serial writes.
CI falls back to SSH connectivity check if the marker is absent.

Offline install failed: VFS containers-storage missing from CI ISOs (2026-06)#

Symptom: fisherman: fatal: ... reference "containers-storage:ghcr.io/projectbluefin/dakota-nvidia:stable" does not resolve to an image ID

Root cause: The CI build called scripts/build-live-squashfs.sh without --oci-image, so
the live squashfs shipped with an empty /var/lib/containers/storage. The live recipe.json
has local_imgref=containers-storage:ghcr.io/projectbluefin/dakota-nvidia:stable, which
fisherman treats as authoritative. When the local store is empty, the install fails even if
the user has a working internet connection (fisherman does not fall back to docker://).

Why local builds were unaffected: just iso-sd-boot always did the squash+skopeo step and
baked the OCI into the squashfs. CI diverged from this path and the gap was undetected
because CI only validated boot, not install.

Fix: scripts/build-live-squashfs.sh now accepts --oci-image <ref>. When provided it:

  1. Squashes the payload to a single layer via buildah commit --squash
  2. Runs skopeo copy inside the live container (for JSON tar-split compatibility)
  3. Copies the populated VFS staging dir into the squashfs root with cp -a before mksquashfs

build-iso.yml now passes --oci-image ghcr.io/projectbluefin/dakota-nvidia:stable and
asserts the embedded store is non-empty before uploading the ISO to R2.

Invariant: The CI-built squashfs must contain a populated VFS store at
var/lib/containers/storage with the dakota-nvidia:stable image. The assertion step
catches any regression before upload.

See issue #78.

VFS store not captured by mksquashfs when using bind-mount into overlayfs (2026-06)#

Symptom: build-live-squashfs.sh --oci-image runs successfully, VFS store logs 9.1G, but
the squashfs is only ~4.2G (no VFS data) and the assertion fails.

Root cause: When SFS_ROOT is an overlayfs mount (the default on ext4/XFS CI runners),
the overlayfs filesystem has a different st_dev than a bind-mounted directory inside it.
mksquashfs respects filesystem boundaries (stops when st_dev changes) and silently skips
the bind-mounted VFS tree.

Fix: Copy the VFS staging dir into the squashfs root with cp -a instead of bind-mounting.
Writes to an overlayfs path go into the overlay upper layer; the resulting files inherit the
overlayfs st_dev and are included by mksquashfs.

Rule: Never use mount --bind to inject data into a directory that will be squash-packed with
mksquashfs when the mount point is overlayfs. Always copy.
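
A pre-flight check for this class of problem could compare st_dev values before packing. The sketch below is not part of the repository; the function name is hypothetical. It reports directories under the squashfs root that live on a different device than the root itself, which is the condition described above.

```python
import os


def dirs_on_other_devices(root: str) -> list[str]:
    """List directories under root whose st_dev differs from root's."""
    root_dev = os.lstat(root).st_dev
    foreign = []
    for dirpath, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_dev != root_dev:
                foreign.append(path)
                dirnames.remove(name)  # do not descend into the other device
    return foreign
```

An empty result means every directory shares the root's device; a non-empty result would flag a mount point that should be replaced with a copy before packing.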

VFS storage paths don't contain image names — assertion must check vfs-images/ (2026-06)#

Symptom: Assertion grep -c "ghcr.io" on unsquashfs -lc output returns 0 even when the
OCI store is correctly embedded.

Root cause: VFS containers-storage uses content-addressed hashes for all paths. Image
names like ghcr.io/projectbluefin/... are stored in JSON metadata inside the hash-named
directories, not in the directory paths themselves. unsquashfs -lc shows file paths only,
so grepping for ghcr.io always returns 0.

Correct assertion: Check for var/lib/containers/storage/vfs-images — this directory is
created by containers/storage for every imported image. If it has entries, the VFS store
was populated.

Note on mksquashfs deduplication: The VFS layer is a squashed copy of the same OS as the
live rootfs. mksquashfs deduplicates identical content blocks, so the squashfs size barely
increases despite embedding 9G of VFS data. Use inode/file counts to confirm inclusion,
not squashfs file size.
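
The corrected assertion might look like the following when applied to the text output of an unsquashfs listing. This is an illustrative sketch: the function name is hypothetical, and it assumes each listing line is a plain path, optionally prefixed with the extraction directory name.

```python
VFS_IMAGES = "var/lib/containers/storage/vfs-images/"


def count_vfs_image_entries(listing: str) -> int:
    """Count listed paths that sit below the vfs-images directory."""
    count = 0
    for line in listing.splitlines():
        _, sep, rest = line.partition(VFS_IMAGES)
        # Require something after the trailing slash, so the bare
        # vfs-images directory entry itself is not counted.
        if sep and rest.strip():
            count += 1
    return count
```

A count of zero would fail the gate. Counting entries rather than comparing squashfs sizes follows the deduplication note above.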

ENOSPC in skopeo OCI export — containers/storage tmpdir not redirected (2026-06)#

Symptom: Live ISO installs fail with:

reading blob sha256:...: write /var/tmp/container_images_XXXXXXXX: no space left on device

The installer correctly sets TMPDIR=/mnt/fisherman-target/.fisherman-scratch but the
blob staging file still lands at /var/tmp.

Root cause (3 layers):

  1. configure-live.sh writes /etc/containers/storage.conf with driver = "vfs" but no
    tmpdir line.
  2. containers/storage defaults TMPDir to /var/tmp (hardcoded) when the config has no
    tmpdir field. Setting $TMPDIR in the subprocess env is not sufficient — containers/storage
    reads the store config first and uses /var/tmp as the unconditional fallback.
  3. /var/tmp on the live ISO is on the dracut overlayfs (~1.4 GiB writable layer) — too small
    for multi-GiB OCI layer blobs.

Fix: skopeoExportOCI (fisherman) now reads the current effective storage.conf, injects
tmpdir = "<scratchDir>", writes the result to a temp file in the disk-backed scratch dir,
and passes it to skopeo via CONTAINERS_STORAGE_CONF. $TMPDIR is retained as
belt-and-suspenders coverage of containers/image's copy-side blob staging.
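
The config rewrite described in the fix can be approximated as a text transformation. fisherman itself is written in Go, so this Python version is only a sketch of the idea: the function name is hypothetical, the tmpdir key is used exactly as the text above describes it, and the scratch path is assumed to contain no characters that need TOML escaping.

```python
def inject_tmpdir(conf_text: str, scratch_dir: str) -> str:
    """Return conf_text with tmpdir set to scratch_dir in the [storage] table."""
    new_line = f'tmpdir = "{scratch_dir}"'
    out = []
    in_storage = False
    injected = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_storage = stripped == "[storage]"
            out.append(line)
            if in_storage and not injected:
                out.append(new_line)  # place tmpdir right after the header
                injected = True
            continue
        if in_storage and stripped.split("=")[0].strip() == "tmpdir":
            continue  # drop any previous tmpdir value
        out.append(line)
    if not injected:
        # No [storage] table found: add one at the top.
        out = ["[storage]", new_line, *out]
    return "\n".join(out) + "\n"
```

The result would be written to a file inside the disk-backed scratch directory and exported through CONTAINERS_STORAGE_CONF for the skopeo subprocess.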

Why CI didn't catch it: The LUKS E2E test runs QEMU with 8 GiB RAM; the overlay tmpfs
is ~4 GiB — large enough for individual blobs in most runs. On 8 GiB user laptops with the
live environment loaded, free tmpfs headroom is much lower and ENOSPC triggers reliably.

Prevention: plain-test-qemu (new) runs with qemu-mem=4096 (4 GiB RAM), which gives
only ~2 GiB overlay tmpfs — reliably reproducing this class of bug. The test is gated
before R2 upload in build-iso.yml.

build-live-squashfs.sh WORK dir must be on large disk (2026-06)#

Symptom: Build live squashfs + boot tar step fails with:

write /usr/lib/locale/.../LC_COLLATE: no space left on device
mkdir /vfs-storage/vfs-layers/tmp: no space left on device

Root cause: build-live-squashfs.sh creates WORK at /var/tmp by default.
The squash-to-1-layer + VFS embedding writes ~12 GB of intermediates (payload.oci.tar
~6 GB + VFS staging ~6 GB). /var/tmp on GitHub ubuntu-24.04 runners sits on the
root filesystem which has ~14 GB free after jlumbroso/free-disk-space — not enough
if the image grows at all.

Fix: WORK now uses ${SUPERISO_TMPDIR:-/var/tmp}. In CI, build-iso.yml
sets SUPERISO_TMPDIR: /var/iso-build so all intermediates land on the 119 GB
disk-backed path. Locally the default /var/tmp still applies.

Prevention: If squashfs build ENOSPC recurs in CI, verify SUPERISO_TMPDIR
is set in the Build live squashfs + boot tar step env.
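
The `${SUPERISO_TMPDIR:-/var/tmp}` fallback, plus a headroom check that could surface the problem before the build starts, can be sketched as below. Both function names are hypothetical, and the headroom check is an illustration rather than something the script is known to do.

```python
import os
import shutil


def work_base(env=None) -> str:
    """Mirror ${SUPERISO_TMPDIR:-/var/tmp}: unset or empty falls back."""
    env = os.environ if env is None else env
    return env.get("SUPERISO_TMPDIR") or "/var/tmp"


def has_headroom(path: str, needed_gb: float) -> bool:
    """Return True if the filesystem holding path has needed_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 1000**3
```

With the figures quoted above, a call such as `has_headroom(work_base(), 12)` would be the kind of guard that fails fast when WORK lands on the root filesystem.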

E2E plain install test requires sshd — production ISO has it disabled (2026-06)#

Symptom: Plain install E2E step fails with either:

  • kex_exchange_identification: read: Connection reset by peer (QEMU user-net accepts TCP, no listener inside guest)
  • ERROR: serial marker seen but SSH not ready after 90 s (sshd never starts)

Root cause: sshd is only enabled in the live ISO when the container is built with
--build-arg DEBUG=1. The production build uses DEBUG=0, so no sshd. The E2E test
uses SSH to invoke fisherman; without sshd the test cannot proceed.

Fix: After building the production ISO, a CI step patches the production squashfs:

  1. unsquashfs the production rootfs (includes the embedded VFS store)
  2. Add sshd.service symlink to multi-user.target.wants
  3. Append PasswordAuthentication yes / PermitEmptyPasswords yes to sshd_config
  4. Set liveuser password to live via /etc/shadow patch
  5. mksquashfs back with zstd-1 (fast, debug-only)
  6. Assemble output/dakota-debug-live.iso (uses same boot tar as production)
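
Steps 2 and 3 of the patch can be sketched against an already-unpacked rootfs directory. This is a debug-only illustration, not the CI step itself: the function name is hypothetical, and the unit file location (/usr/lib/systemd/system/sshd.service) and config path (etc/ssh/sshd_config) are assumptions about the image layout.

```python
from pathlib import Path

SSHD_LINES = ["PasswordAuthentication yes", "PermitEmptyPasswords yes"]


def enable_sshd(rootfs: Path) -> None:
    """Enable sshd and permissive password auth in an unpacked rootfs."""
    wants = rootfs / "etc/systemd/system/multi-user.target.wants"
    wants.mkdir(parents=True, exist_ok=True)
    link = wants / "sshd.service"
    if not link.is_symlink():
        # Assumed unit path; the target is resolved inside the booted image.
        link.symlink_to("/usr/lib/systemd/system/sshd.service")

    conf = rootfs / "etc/ssh/sshd_config"
    conf.parent.mkdir(parents=True, exist_ok=True)
    existing = conf.read_text() if conf.exists() else ""
    missing = [l for l in SSHD_LINES if l not in existing.splitlines()]
    if missing:
        prefix = "" if not existing or existing.endswith("\n") else "\n"
        with conf.open("a") as fh:
            fh.write(prefix + "\n".join(missing) + "\n")
```

These settings are deliberately insecure and only make sense for the throwaway debug ISO; the production ISO uploaded to R2 is left untouched.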

plain-boot-qemu-live in the justfile prefers output/{{target}}-debug-live.iso
when present, so CI runs against the debug ISO while R2 gets the production ISO.

Why the VFS store must stay: ghcr.io/projectbluefin/dakota-nvidia is private;
the live env inside QEMU has no GHCR credentials, so fisherman cannot pull from
network. The VFS store (embedded in the squashfs) is the only install source.

flatpak-spawn --host does not forward sandbox env to host process (2026-06)#

Symptom: fisherman sets TMPDIR=/mnt/fisherman-target/.fisherman-scratch and prints
# TMPDIR=<scratch> before running skopeo, but the blob staging file is still created
at /var/tmp/container_images_XXXXXXXX causing ENOSPC.

Root cause: The bootc-installer runs inside a Flatpak. When runner.go calls
flatpak-spawn --host skopeo copy ..., flatpak-spawn --host spawns the command in
the HOST mount namespace but does not automatically forward the Flatpak sandbox
environment to the spawned host process. skopeo inherits the host's default env
(no TMPDIR set) and uses /var/tmp for blob staging.

Setting cmd.Env for the flatpak-spawn subprocess propagates env vars to
flatpak-spawn itself, but not to the command it spawns on the host.

Fix (fisherman): runner.HostArgsWithEnv injects critical env vars via
--env=KEY=VALUE flags in the flatpak-spawn args:

flatpak-spawn --host --env=TMPDIR=/scratch --env=CONTAINERS_STORAGE_CONF=... skopeo copy ...

Released in bootc-installer v2.7.1.
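
The argument construction can be illustrated in a few lines. The actual fix lives in fisherman's Go code (runner.HostArgsWithEnv); this Python function is a hypothetical analogue that only shows how the --env flags are placed ahead of the host command.

```python
def host_args_with_env(env: dict[str, str], command: list[str]) -> list[str]:
    """Build a flatpak-spawn argv that forwards env vars to the host command."""
    flags = [f"--env={key}={value}" for key, value in env.items()]
    return ["flatpak-spawn", "--host", *flags, *command]
```

Passing the variables as flags, rather than through the subprocess environment, is what lets them reach the command that flatpak-spawn starts on the host.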

How to identify: Look for # TMPDIR=<path> in fisherman output followed by
write /var/tmp/container_images_...: no space left on device. The TMPDIR debug
print confirms fisherman set the var correctly, but skopeo ignoring it confirms the
flatpak-spawn env forwarding gap.

E2E test split into 4 named steps with individual timeouts (2026-06)#

Why: The original single plain-test-qemu step had one monolithic 90-minute timeout.
When it expired, nothing indicated which of the four stages (boot-live, install,
boot-installed, verify) was the bottleneck.

New structure:

| Step | Timeout | RAM | Purpose |
| --- | --- | --- | --- |
| E2E 1/4 — Boot live ISO | 10 min | 4 GiB | Live env ready + SSH confirmed |
| E2E 2/4 — Install composefs | 30 min | 4 GiB | ENOSPC regression gate (tight tmpfs) |
| E2E 3/4 — Boot installed disk | 10 min | 8 GiB | Installed system POSTs correctly |
| E2E 4/4 — Verify Graphical target | 10 min | 8 GiB | systemd Graphical target reached |

Total worst-case ceiling: 60 min (vs. 90 min monolithic), with precise attribution.

Gates 1 and 2 use 4 GiB to keep the overlay tmpfs tight (~2 GiB) for ENOSPC testing.
Gates 3 and 4 switch to 8 GiB for realistic boot performance.
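
The 60-minute ceiling is simply the sum of the per-step timeouts. A small sketch, using a hypothetical Gate record with the timeout and RAM figures listed for the four steps:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    timeout_min: int
    ram_gib: int


GATES = [
    Gate("Boot live ISO", 10, 4),
    Gate("Install composefs", 30, 4),
    Gate("Boot installed disk", 10, 8),
    Gate("Verify Graphical target", 10, 8),
]


def worst_case_minutes(gates: list[Gate]) -> int:
    """Upper bound on total E2E runtime if every gate hits its timeout."""
    return sum(gate.timeout_min for gate in gates)
```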
