How we test

The fidelity loop: a deterministic IDML generator emits fixtures, InDesign exports matching reference PDFs, and a per-page ΔE2000 plus SSIM image diff gates the renderer in CI on the CPU backend, with thresholds that only ever tighten.

The fidelity loop measures the renderer against InDesign's own output, page by page, and fails the build if the pixels drift.

In short: Paged proves visual correctness with a closed loop. A deterministic generator emits IDML fixtures in which each page exercises exactly one renderable feature variant, and Adobe InDesign exports a reference PDF from each of those same documents. In CI, the renderer rasterises every page through the CPU backend, the reference PDF is rasterised alongside it, and an image-diff tool compares the two using ΔE2000 (perceptual colour difference) and SSIM (structural similarity). Every gated page must stay within its per-fixture tolerance, or the merge is blocked. This page walks the loop end to end and explains the one rule that keeps it honest: never loosen a threshold to make a failure go away.

The generator: one feature per page

The fixtures are not hand-drawn. They are emitted by a small Rust generator (crates/paged-gen) that builds IDML packages directly — a builder per part (builders/designmap.rs, spread.rs, story.rs, resources.rs, master.rs, and friends) assembled by sample definitions in crates/paged-gen/src/samples/. There is one sample module per feature family: geometry, geometry_groups, gradients, images, strokes_fills, tables, text, text_advanced, text_letterspacing, text_wrap, transparency, effects, anchored.

Each sample is a multi-page mega-file, and the design is deliberate: every page holds one variant of one feature. The geometry fixture, for instance, is a run of A4 pages each carrying a single filled rectangle under one ItemTransform variant — identity, translation, rotation, scale, skew. The page's Page.Name carries the variant descriptor (something like geometry · rect · rotate-45), so when a diff fails, the failure names the exact case without a separate sidecar file. One InDesign export then covers many test cases at once, and a regression points at a page, not a haystack.
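Because the descriptor lives in Page.Name, a failure report can split it back into its parts. The sketch below is illustrative only: it assumes the three-part, middle-dot-separated shape of the example above, and the type and function names are hypothetical rather than taken from paged-gen.

```rust
/// Hypothetical parsed form of a `Page.Name` variant descriptor.
#[derive(Debug, PartialEq)]
pub struct VariantName<'a> {
    pub family: &'a str,
    pub item: &'a str,
    pub variant: &'a str,
}

/// Splits a descriptor such as "geometry · rect · rotate-45" on the middle dot.
/// Returns None unless there are exactly three non-empty parts (an assumption
/// about the naming scheme, based only on the example in the text).
pub fn parse_variant_name(name: &str) -> Option<VariantName<'_>> {
    let mut parts = name.split('·').map(str::trim);
    let family = parts.next()?;
    let item = parts.next()?;
    let variant = parts.next()?;
    let has_extra = parts.next().is_some();
    if has_extra || family.is_empty() || item.is_empty() || variant.is_empty() {
        return None;
    }
    Some(VariantName { family, item, variant })
}
```

Calling parse_variant_name("geometry · rect · rotate-45") yields the family geometry, the item rect, and the variant rotate-45, which is enough to label a failing page in a report.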

The generator is deterministic: same inputs, byte-stable output. The emitted .idml files are reproducible build artifacts rather than committed blobs — CI regenerates each one from paged-gen immediately before diffing it. What is committed in corpus/generated/ is the durable half of each pair: the InDesign reference PDF, an export.meta.json recording how it was exported (the InDesign version and PDF preset), and the gate's configuration.

The reference: InDesign's own export

The thing the renderer is measured against is not a spec reading or another renderer: it is Adobe InDesign rendering the same document. For each fixture, InDesign opens the generated IDML and exports a PDF (the export tooling lives outside the scope of these docs, in the engine's tools/indesign-export/). That PDF is the ground truth: whatever InDesign painted is, by definition, correct.

This is also where the corpus earns the label "license-clear". The fixtures are generated by our own code from our own content, so the IDML and the PDF are both ours to redistribute, unlike real-world sample documents, which carry third-party fonts and artwork. (The engine's CI has a separate, advisory hook for diffing genuine third-party IDML/PDF pairs that a developer stages locally, but those pairs gate nothing and are never committed.)

One honest caveat travels with baked reference PDFs: a PDF is exported once, and if the export host lacked the IDML's declared font, InDesign substituted its own and baked that substitution into the PDF. The engine handles this deliberately, fixture by fixture (either re-export on a host that has the font, or make the renderer substitute to match), and records the choice in a per-fixture *.fonts.sh. The reference is only as good as the export that produced it, and the project treats that as a fact to manage, not hide.

The diff: ΔE2000 and SSIM

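Comparing two page images pixel by pixel in raw RGB is a poor fit for this job. Exact equality flags differences no human could see, and raw RGB distance is not perceptually uniform, so a tolerance loose enough to absorb that noise can miss differences a human would notice. The fidelity crate (crates/paged-fidelity) uses two perceptually grounded metrics instead, exposed through the paged-diff CLI: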

Metric	What it measures
mean ΔE	Average CIEDE2000 colour difference across all pixels. Catches broad colour or tone drift — a gradient that is subtly off, a fill in the wrong space.
p99 ΔE	99th-percentile ΔE. Catches localised errors — a misplaced glyph edge, a stroke a pixel off — that a mean would average away.
max ΔE	Worst single-pixel difference. Reported for triage; not itself a gate.
SSIM	Structural similarity (1.0 = identical). Catches structural drift — shifted text, broken layout — independent of absolute colour.
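To make the three ΔE rows concrete, here is a minimal sketch of how a list of per-pixel ΔE values could be reduced to mean, p99 and max. It is not the paged-fidelity implementation: the names are hypothetical, the CIEDE2000 computation itself is left out, and the nearest-rank percentile definition is an assumption.

```rust
/// Summary of per-pixel colour differences for one page (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct DeltaESummary {
    pub mean: f64,
    pub p99: f64,
    pub max: f64,
}

/// Reduces per-pixel ΔE values to mean, 99th percentile and max.
/// Returns None for an empty input or one containing NaN.
pub fn summarise_delta_e(per_pixel: &[f64]) -> Option<DeltaESummary> {
    if per_pixel.is_empty() || per_pixel.iter().any(|v| v.is_nan()) {
        return None;
    }
    let mut sorted = per_pixel.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // Nearest-rank percentile (an assumption): the smallest value with at
    // least 99% of samples at or below it. Integer maths gives ceil(0.99 * n).
    let rank = (99 * n + 99) / 100;
    Some(DeltaESummary {
        mean,
        p99: sorted[rank - 1],
        max: sorted[n - 1],
    })
}
```

With 98 pixels at ΔE 0 and two at ΔE 40, the mean is only 0.8 while p99 and max are both 40: the localised error that the mean averages away is exactly what the percentile exposes.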

paged-diff takes the reference PNG first and the candidate PNG second (the ordering is load-bearing), emits the four numbers as JSON, and optionally writes a heatmap that shows where the differences are. The crate also carries the project's long-term target as constants: mean ΔE ≤ 1.0, p99 ΔE ≤ 2.5, SSIM ≥ 0.99 on every page.

The gate: per-fixture tolerances

The long-term target is the destination, not the daily gate. The daily gate is corpus/generated/fidelity-thresholds.json, which gives each fixture its own worst-page budget: max_mean_de, max_p99_de, min_ssim, and a max_pages_with_pdf bound. A fixture passes only when every page that has a matching reference PDF page satisfies all three: mean ΔE under budget, p99 ΔE under budget, and SSIM at or above the floor.
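The per-page rule can be sketched as a small predicate. This is an illustration, not the logic in diff.sh: the struct and function names are hypothetical, only the three threshold key names come from the text, and treating a measurement exactly at its budget as a pass is an assumption.

```rust
/// Metrics measured for one page. Field names are illustrative; the real
/// report.json schema is not shown on this page.
#[derive(Debug, Clone, Copy)]
pub struct PageMetrics {
    pub mean_de: f64,
    pub p99_de: f64,
    pub ssim: f64,
}

/// Per-fixture budget, using the key names from fidelity-thresholds.json.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub max_mean_de: f64,
    pub max_p99_de: f64,
    pub min_ssim: f64,
}

/// Returns the name of every budget the page breaks; empty means it passes.
pub fn violations(m: &PageMetrics, t: &Thresholds) -> Vec<&'static str> {
    let mut broken = Vec::new();
    // Negated comparisons are deliberate: a NaN metric counts as a failure.
    if !(m.mean_de <= t.max_mean_de) {
        broken.push("max_mean_de");
    }
    if !(m.p99_de <= t.max_p99_de) {
        broken.push("max_p99_de");
    }
    if !(m.ssim >= t.min_ssim) {
        broken.push("min_ssim");
    }
    broken
}
```

Returning the list of broken budgets, rather than a bare boolean, lets a report say which of the three conditions a page failed.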

The max_pages_with_pdf bound exists because IDML fixtures grow faster than their exports: a sample may ship more pages than the last InDesign export covered, and the gate only checks the first N pages that have a reference. Pages without a matching reference are skipped rather than failing — the images fixture, for example, ships fourteen pages but its reference PDF covers only the first five, so only those five are gated.
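One way to read that rule is as a simple minimum. The function below is a sketch under an assumption: that the gated pages are the leading pages covered by the IDML, by the reference PDF, and by max_pages_with_pdf alike. The function name is hypothetical.

```rust
/// Splits a fixture's pages into (gated, skipped) counts.
/// Assumes the gate covers the leading pages that are present in the IDML,
/// present in the reference PDF, and within `max_pages_with_pdf`.
pub fn gated_and_skipped(
    idml_pages: usize,
    reference_pages: usize,
    max_pages_with_pdf: usize,
) -> (usize, usize) {
    let gated = idml_pages.min(reference_pages).min(max_pages_with_pdf);
    // `gated` never exceeds `idml_pages`, so the subtraction cannot underflow.
    (gated, idml_pages - gated)
}
```

For the images fixture described above, fourteen IDML pages against a five-page reference with a bound of five (the bound value is assumed here) gives five gated pages and nine skipped.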

The orchestration lives in corpus/generated/diff.sh. Per fixture it: regenerates the IDML from paged-gen; renders every page through the renderer to candidate PNGs; rasterises the matching PDF pages to reference PNGs; runs paged-diff per page into a report.json; then compares each page against that fixture's thresholds, writing a gate.json verdict. Any page over budget fails the fixture, and any failed fixture fails the run.
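The last step, rolling page results up into fixture and run verdicts, might look like the following. It is a sketch only: the real gate.json schema is not shown here, so the type, its fields, and the function names are all hypothetical.

```rust
/// Hypothetical per-fixture verdict; not the actual gate.json layout.
#[derive(Debug)]
pub struct FixtureVerdict {
    pub fixture: String,
    /// 1-based numbers of gated pages that went over budget.
    pub failing_pages: Vec<usize>,
}

impl FixtureVerdict {
    /// `page_ok[i]` says whether gated page `i + 1` stayed within budget.
    pub fn from_pages(fixture: &str, page_ok: &[bool]) -> Self {
        let failing_pages = page_ok
            .iter()
            .enumerate()
            .filter(|(_, ok)| !**ok)
            .map(|(i, _)| i + 1)
            .collect();
        Self {
            fixture: fixture.to_string(),
            failing_pages,
        }
    }

    /// A fixture passes only when no gated page is over budget.
    pub fn passed(&self) -> bool {
        self.failing_pages.is_empty()
    }
}

/// The run passes only when every fixture passes.
pub fn run_passed(verdicts: &[FixtureVerdict]) -> bool {
    verdicts.iter().all(|v| v.passed())
}
```

Keeping the failing page numbers in the verdict means a red run points at specific pages, which pairs naturally with the variant descriptor stored in each page's Page.Name.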

The CPU backend, and why CI uses it

The renderer can rasterise two ways: the default CPU backend (tiny-skia) and a GPU backend (Vello, via wgpu). The fidelity gate runs on the CPU backend, on purpose. CI runners are headless and have no usable GPU, so a GPU-only gate could not run there at all. The CPU path is deterministic and host-independent, which is exactly what a regression gate needs: the same fixture diffs to the same numbers on every runner, so a threshold means the same thing everywhere. The reference PDF rasterisation likewise runs through a standard, scriptable rasteriser rather than a GPU.

The discipline: thresholds only tighten

The single rule that keeps the loop trustworthy: never loosen a threshold to make a failing test pass. The thresholds are sized to the current worst observed page plus modest headroom (roughly 15–25%) so ordinary rasteriser noise does not cause flapping. When a regression trips a threshold, the fix is to fix the regression — not to raise the number until green returns.

Movement is allowed in one direction. As the renderer improves and a fixture's real numbers fall well under budget, the threshold is tightened toward the long-term target so it keeps catching future drift instead of going slack. The file's own calibration notes spell out the ratchet: a measurement sitting under half its budget has the budget dropped to roughly worst × 1.20; one under three-quarters, to worst × 1.15; one already near the edge is left alone. Each fixture's rationale records what the worst page actually measured and why its budget is where it is, so loosening a budget would mean contradicting a written record in the commit message of a change that still has to pass the merge-blocking gate.
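The ratchet for a ΔE budget can be written down in a few lines. This is a sketch of the rule as described, not the project's calibration tooling: the function name is hypothetical, it covers only "lower is better" budgets (mean ΔE and p99 ΔE), and it does not model the SSIM floor or any clamp at the long-term target.

```rust
/// Proposes a new ceiling for a "lower is better" budget given the worst
/// measured page. The result is never looser than the current budget.
pub fn ratchet_budget(budget: f64, worst: f64) -> f64 {
    let proposed = if worst < 0.5 * budget {
        // Far under budget: tighten to the worst page plus 20% headroom.
        worst * 1.20
    } else if worst < 0.75 * budget {
        // Comfortably under budget: tighten to the worst page plus 15%.
        worst * 1.15
    } else {
        // Near the edge (or over it): leave the budget alone.
        budget
    };
    // Belt and braces: thresholds only ever tighten.
    proposed.min(budget)
}
```

With a budget of 10.0, a worst page of 4.0 tightens the budget to about 4.8, a worst page of 6.0 tightens it to about 6.9, and a worst page of 9.0 leaves it at 10.0. A worst page of 12.0 is a regression, and the budget still stays at 10.0.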

The hard gate in CI

All of this runs as a required check. The engine's fidelity workflow runs on pushes to main and on any pull request touching the crates, the corpus, or the workflow itself, across both Linux and macOS runners. It installs the PDF rasteriser, builds the generator, the renderer, and the diff CLI, runs a golden-snapshot regression test, then runs diff.sh as the hard gate. On failure it uploads the candidate, reference, and heatmap PNGs plus the JSON reports as artifacts, so a regression can be triaged from the exact images that tripped it. A failure here blocks the merge. That is the whole point: visual correctness is not something someone remembers to check; it is something the build refuses to ship without.

Frequently asked questions

Why generate fixtures instead of collecting real InDesign documents? Two reasons. Generated fixtures are license-clear — built from our own code and content, so both the IDML and its reference PDF are ours to redistribute, where real documents carry third-party fonts and artwork that cannot ship publicly. And they are controlled: one feature variant per page, named in Page.Name, so a failure attributes itself to an exact case instead of a busy real-world layout where a dozen features overlap on one page.

What is the difference between ΔE and SSIM, and why use both? ΔE2000 measures colour difference perceptually — how different two pixels look — and is read both as a mean (broad drift) and a 99th percentile (localised errors). SSIM measures structural similarity — whether the same shapes sit in the same places — largely independent of absolute colour. A page can pass one and fail the other: a small uniform colour shift moves ΔE while leaving SSIM near 1.0; text shifted by a pixel barely moves the mean ΔE but drops SSIM. Gating on both catches both failure modes.
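The SSIM half of that contrast can be seen in a toy computation. The sketch below evaluates the standard SSIM formula over a single window covering the whole buffer, which is a simplification: production SSIM averages the same expression over small local windows, and this is not the paged-fidelity implementation. The function name is hypothetical.

```rust
/// Single-window SSIM over two equally sized 8-bit luminance buffers.
/// Returns None if the buffers are empty or differ in length.
pub fn global_ssim(x: &[u8], y: &[u8]) -> Option<f64> {
    if x.is_empty() || x.len() != y.len() {
        return None;
    }
    let n = x.len() as f64;
    let mean = |v: &[u8]| v.iter().map(|&p| p as f64).sum::<f64>() / n;
    let (mx, my) = (mean(x), mean(y));
    let (mut vx, mut vy, mut cov) = (0.0, 0.0, 0.0);
    for (&a, &b) in x.iter().zip(y) {
        let (da, db) = (a as f64 - mx, b as f64 - my);
        vx += da * da;
        vy += db * db;
        cov += da * db;
    }
    let (vx, vy, cov) = (vx / n, vy / n, cov / n);
    // Standard stabilising constants for 8-bit data:
    // (0.01 * 255)^2 and (0.03 * 255)^2.
    let (c1, c2) = (6.5025, 58.5225);
    let numerator = (2.0 * mx * my + c1) * (2.0 * cov + c2);
    let denominator = (mx * mx + my * my + c1) * (vx + vy + c2);
    Some(numerator / denominator)
}
```

On a striped test signal such as [10, 10, 200, 200, 10, 10, 200, 200], adding 2 to every sample keeps the score above 0.99, while sliding the stripes by one sample collapses it to near zero. That is the uniform-shift versus moved-structure distinction in miniature.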

What does the fidelity gate not catch? Anything no fixture exercises. The gate is pixel-deep but only over the generated corpus, so a construct without a fixture is visually untested no matter how green the run is. It also inherits the limits of its inputs: a reference PDF exported with a substituted font bakes that substitution in, so the gate measures against whatever InDesign actually painted that day, not against a Platonic ideal. The gate is strong evidence within its coverage and silent outside it.
