How Sigil works · Merge gate for coding agents

§ 01 · LIFECYCLE

Five steps from spec to seal.

A human writes the intent. Sigil compiles it into scenarios. The agent implements. Sigil scores and the agent iterates against opaque feedback. A signed decision closes the loop.

01 SPEC

human

Write the intent

A person writes the spec in markdown — acceptance criteria, invariants, edge cases. One source of truth for what "done" means.

02 GENERATE

sigil

Compile to scenarios

sigil scenario generate turns the spec into Lua scenarios — a visible half, and an age-encrypted holdout the agent will never see.

03 IMPLEMENT

agent

Write the patch

The coding agent reads the spec and the visible scenarios, writes code, runs the visible scenarios locally, and opens a PR.

04 ITERATE

sigil ↔ agent

Opaque feedback

Sigil runs PR and baseline through every scenario. The agent sees only step_3: fail — no values, no source — and retries until it converges.

05 DECIDE

sigil → human

Seal the merge

ALLOW · REVIEW · BLOCK, gated by the trust ladder. ALLOW merges automatically (only at the AUTO tier); REVIEW goes to a human; BLOCK halts the queue.

§ 02 · EVAL PIPELINE

One step of the protocol, up close. PR ref in, signed decision out.

sigil eval <pr-ref> is the mechanics of steps 04 and 05 above. It resolves the PR to an image digest, decrypts the scenario bundle, deploys PR and baseline side-by-side, runs three tiers of Lua scenarios against both, computes a baseline-relative satisfaction score, appends eval.complete and eval.decision to the git-backed ledger, and emits feedback stripped of any content the authoring agent is not permitted to see.

Fig. 02 · baseline-relative scoring (ALLOW path · REVIEW path · BLOCK path)

artifact digest    sha256:9f42…c0b1
baseline digest    sha256:1d7a…8e44
scenario set       set:auth.v12
rng seed           0xDEADB2B5
control ref        ctrl:2026.04.12
evaluator          [email protected]
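The six values above correspond to the provenance tuple described under reproducibility in § 08. As a rough sketch of why a fixed tuple can identify an evaluation, the snippet below serializes six fields in a fixed order and compares the resulting keys. The field names, the `provenance_key` helper, and the `<evaluator version>` placeholder are assumptions for illustration, not Sigil's ledger schema.

```lua
-- Field order is fixed so the same tuple always serializes the same way.
-- Field names are illustrative, not Sigil's ledger schema.
local FIELDS = { "artifact", "baseline", "scenario_set", "seed", "control", "evaluator" }

-- Join the six fields into one canonical string key.
local function provenance_key(p)
  local parts = {}
  for i, name in ipairs(FIELDS) do
    assert(p[name] ~= nil, "missing provenance field: " .. name)
    parts[i] = name .. "=" .. tostring(p[name])
  end
  return table.concat(parts, "|")
end

local run_a = {
  artifact = "sha256:9f42…c0b1", baseline = "sha256:1d7a…8e44",
  scenario_set = "set:auth.v12", seed = "0xDEADB2B5",
  control = "ctrl:2026.04.12", evaluator = "<evaluator version>",
}

-- Same values, built independently: the key must match.
local run_b = {}
for k, v in pairs(run_a) do run_b[k] = v end

-- A different seed is a different evaluation.
local run_c = {}
for k, v in pairs(run_a) do run_c[k] = v end
run_c.seed = "0x00000001"

print(provenance_key(run_a) == provenance_key(run_b))
print(provenance_key(run_a) == provenance_key(run_c))
```

A key like this only says two runs had the same inputs; whether the scores then match depends on every evaluator step being deterministic under that seed.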

§ 03 · SPEC

Your intent, in markdown.

Scenario generation starts here. A plain markdown spec — acceptance criteria, edge cases, stated invariants, quality bars — becomes a visible/holdout scenario bundle via sigil scenario generate. The document below is the real source for the billing/upgrade scenario shown in the next section.

docs/specs/billing-upgrade.md

INPUT · HUMAN-AUTHORED rev 2026-04-18


---
title: "Billing — self-serve plan upgrade"
owner: "@jordan"
priority: P0
---

# Self-serve plan upgrade

A free-tier user should upgrade to Pro from the dashboard without
talking to sales. The upgrade must charge the card, reflect in the
subscription API, and send a clean receipt.

## Plans and price

| Plan | Monthly | Annual |
|------|---------|--------|
| Free | $0      | —      |
| Pro  | $12/mo  | $99/yr |

Annual is the default selection in the upgrade UI.

## Acceptance criteria

1. A free account hits `POST /api/signup` and receives a `session_token`.
2. With that session, the user opens `/account/billing`, picks Pro
   (annual), and completes Stripe checkout using `4242 4242 4242 4242`
   without leaving the dashboard.
3. After checkout, the Pro plan badge is visible and a "Cancel plan"
   action is available.
4. `GET /api/subscription` returns `{ "plan": "pro", "status": "active" }`
   on the next read.

## Invariants

- `GET /api/subscription` is a **pure read**. N reads in any order
  return the same plan, regardless of `X-Request-Id`.
- Upgrades are **idempotent** at the payment_intent layer: a repeated
  checkout for the same intent must not double-charge.

## Quality bar (LLM-judged)

The receipt email copy is professional, free of typos, and names the
correct plan, price, and next billing date.

$ sigil scenario generate --from docs/specs/billing-upgrade.md --service api

LLM test plan → scenario code → three-stage validation → visible/ + holdout/ bundles.

§ 04 · SCENARIO DSL

Scenarios are plain Lua. Power assertions, rubric doc-comments, property tests.

Three globals are pre-injected — sigil, expect, invariant. Comparisons inside expect are rewritten to capture both sides and render Ariadne-style code frames on failure. Triple-dash comments become rubrics for the Tier-3 judge.

.sigil/scenarios/api/visible/billing/upgrade.lua

VISIBLE · TIER 1·2·3 sha256: 7a3f…b12e


 1  -- .sigil/scenarios/api/visible/billing/upgrade.lua
 2  return {
 3    title    = "Upgrade a free account to the Pro plan via the dashboard",
 4    priority = "P0",
 5    tags     = {"billing", "checkout"},
 6    policy   = { capabilities = {"http", "browser", "intent", "judge", "property"} },
 7
 8    run = function()
 9      -- 1. Seed a fresh free-tier account via the HTTP API.
10      local signup = sigil.post("/api/signup", {
11        email    = sigil.gen.email(),  -- [A]
12        password = sigil.env("SIGNUP_PASSWORD"),  -- [B]
13      })
14      expect(signup.status == 201)
15      local token = signup.json.session_token
16
17      -- 2. Hand the browser session to the agent and let it drive checkout.
18      --    The `---` block is the objective. The LLM uses the declared
19      --    capabilities (browser, http) to accomplish it; capture fields
20      --    with type prefixes are added to the `complete` tool schema.
21      --- Upgrade the account to the Pro plan (annual billing, $99/yr).
22      --- Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
23      --- Confirm the upgrade completed and record the confirmation details.
24      local result = sigil.intent({  -- [D]
25        capabilities = { "browser", "http" },
26        context      = { session_token = token },
27        capture = {
28          order_id    = "string: the order confirmation number",
29          total_cents = "number: the final charged amount in cents",
30          plan        = "string: the plan name shown on the receipt",
31        },
32        max_steps = 20,
33      })
34      expect(result.completed)
35      expect(result.plan == "Pro")
36      expect(result.total_cents == 9900)  -- [C]
37
38      -- 3. Direct browser assertions — getters return strings, actions
39      --    return nil; sessions auto-isolate per scenario ID.
40      sigil.browser.open("/account/billing")  -- [E]
41      sigil.browser.wait({ text = "Pro" })
42      expect(sigil.browser.text("[data-testid=plan-badge]") == "Pro")
43      expect(sigil.browser.visible("[data-testid=cancel-plan]"))
44
45      -- 4. Cross-check the API agrees with the UI.
46      local sub = sigil.get("/api/subscription", nil, {  -- [F]
47        headers = { Authorization = "Bearer " .. token },
48      })
49      expect(sub.json.plan == "pro")
50      expect(sub.json.status == "active")
51
52      -- 5. Property: /api/subscription is a pure read — N reads of the same
53      --    endpoint return the same plan, regardless of request id or order.
54      invariant("GET /api/subscription is idempotent", {  -- [G]
55        cases = 10,
56        for_all = { req_id = sigil.gen.uuid() },
57        check = function(case)
58          local r = sigil.get("/api/subscription", nil, {
59            headers = { Authorization = "Bearer " .. token, ["X-Request-Id"] = case.req_id },
60          })
61          expect(r.json.plan == "pro")
62        end,
63      })
64
65      --- The receipt email copy is clear, professional, and free of typos.
66      --- It names the correct plan, the correct price, and the next billing date.
67      sigil.judge(sub.json.receipt_preview, { min_score = 0.85 })  -- [H]
68    end,
69  }

A sigil.gen.* line 11

generators

Deterministic value generators seeded from the scenario RNG. Emails, UUIDs, ints, strings — every call is reproducible across runs, so property tests, fuzzers, and replays all share one seed chain.

B sigil.env() line 12

env access

Typed access to an evaluator-scoped secret vault. Values are read from the locked environment, never inlined into the scenario source, and never echoed into feedback or the ledger.

C expect line 36

power assertions

Comparisons inside expect are rewritten to capture both sides. On failure, the renderer prints an Ariadne code frame with the value each sub-expression resolved to — this line is the one that fails in §06.

D sigil.intent line 24

agent instruction

Plain-English objective in the `---` block. The LLM drives the declared capabilities via tool-use; typed capture fields become the completion schema, so the agent must return structured values.

E sigil.browser line 40

browser automation

Getters (text, url, visible) return strings; actions (open, click, fill, wait) return nil. Sessions auto-isolate per scenario ID — no cookie bleed between parallel runs.

F sigil.get line 46

http calls

Typed HTTP against the deployed service. Request and response metadata, headers, and JSON bodies are surfaced structured — not as raw strings you have to parse back out.

G invariant line 54

property testing

Generate N cases from the declared generators, run the check against each, and shrink counter-examples on failure. A claim about the service, not a sequence of steps.

H sigil.judge line 67

llm rubric

Plain-English rubric in the preceding `---` block becomes the grading criteria for a Tier-3 judge. The score and rubric digest land in the ledger; the prompt never leaves the evaluator.
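Callouts A and G both rest on one idea: values drawn from a seeded generator are reproducible, so a property check can be replayed case for case. The sketch below shows that idea in plain Lua with no Sigil globals. The `new_rng`, `gen_req_id`, `for_all`, and `read_plan` names are hypothetical, the generator is a textbook Park-Miller LCG rather than whatever Sigil uses internally, and shrinking of counter-examples is left out.

```lua
-- Park-Miller "minimal standard" generator. Products stay below 2^53,
-- so the sequence is identical on Lua 5.1 through 5.4.
local function new_rng(seed)
  local state = seed % 2147483647
  if state == 0 then state = 1 end
  return function(n)              -- integer in [1, n]
    state = (state * 16807) % 2147483647
    return (state % n) + 1
  end
end

-- Hypothetical generator: an 8-letter request id drawn from the seeded RNG.
local function gen_req_id(rng)
  local chars = {}
  for i = 1, 8 do
    chars[i] = string.char(96 + rng(26))   -- 'a'..'z'
  end
  return table.concat(chars)
end

-- Run `check` against `cases` generated inputs; report the first failure.
local function for_all(seed, cases, gen, check)
  local rng = new_rng(seed)
  for i = 1, cases do
    local case = gen(rng)
    local ok, err = pcall(check, case)
    if not ok then
      return false, { index = i, case = case, err = err }
    end
  end
  return true
end

-- Stand-in for the service: a pure read that ignores the request id.
local function read_plan(req_id) return "pro" end

local seen_a, seen_b = {}, {}
local ok_a = for_all(42, 10, gen_req_id, function(id)
  seen_a[#seen_a + 1] = id
  assert(read_plan(id) == "pro")
end)
local ok_b = for_all(42, 10, gen_req_id, function(id)
  seen_b[#seen_b + 1] = id
  assert(read_plan(id) == "pro")
end)

-- Both runs pass, and the same seed produced the same ten cases.
print(ok_a, ok_b, table.concat(seen_a, ",") == table.concat(seen_b, ","))
```

Because the failing case index and value are returned rather than just logged, a replay with the same seed can jump straight to the offending input.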

§ 05 · THE LOOP

Three commands in the loop.

The agent implements, verifies locally, pushes. Sigil runs the full eval in CI. The agent reads the opaque feedback and revises. Visible passes aren't a green light — the holdouts make their own decision.

05 · a agent · local Verify against visible scenarios

Before pushing, the agent spins up an ephemeral environment and runs the visible half of the suite. Holdouts stay encrypted — the private key for this service isn't in the authoring workspace.

$ sigil scenario run --all --deploy --service api
▸ deploying ephemeral environment (docker compose up -d)... ready (4.2s)
▸ running 12 visible scenarios against api@eph-7f3c:8080

  auth/login              pass  (420ms)
  auth/logout             pass  (180ms)
  billing/upgrade         pass  (12.4s)
  billing/cancel          pass  (3.1s)
  ...  8 more             pass
  12/12 passed

▸ holdout scenarios not run — private key not in this workspace
▸ teardown complete · 18.2s total

05 · b sigil · ci Score against baseline, visible + holdout

Sigil CI deploys PR and baseline side-by-side, runs every scenario (visible and holdout), and scores satisfaction relative to the baseline. Everything lands in the ledger: the evaluator makes no decision it cannot back with recorded evidence.

$ sigil eval pull/42/head --service api
▸ resolving pull/42/head → sha256:9f42…c0b1
  baseline (merge-base) → sha256:1d7a…8e44
▸ decrypting scenario bundle · 12 visible · 8 holdout · scenario_set:billing.v7
▸ deploying dual environments · pr@eph-9f42 · baseline@eph-1d7a · ready (8.3s)
▸ running scenarios against both…

               pr     baseline   Δ
  visible     1.00      0.98    +0.02
  holdout     0.82      0.96    −0.14
  overall     0.94      0.97    −0.03

▸ satisfaction: 0.94 — below P0 threshold (0.95)
▸ ledger · eval.complete eval_01HPXG5KQ7J9W4
         · eval.decision REVIEW — regression on 2 holdout scenarios

decision: REVIEW · sigil decide pull/42/head --service api
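The Δ column in the output above is simply the PR score minus the baseline score. A minimal sketch of that arithmetic follows, using the numbers from the run. The `round2`, `delta`, and `meets_threshold` helpers are hypothetical, and the threshold rule is an assumption read off the "below P0 threshold (0.95)" line; how Sigil weights visible and holdout scores into the overall figure is not stated here and is not modeled.

```lua
-- Round to two decimal places to hide floating-point noise.
local function round2(x) return math.floor(x * 100 + 0.5) / 100 end

-- Scores copied from the sigil eval output above.
local rows = {
  { name = "visible", pr = 1.00, baseline = 0.98 },
  { name = "holdout", pr = 0.82, baseline = 0.96 },
  { name = "overall", pr = 0.94, baseline = 0.97 },
}

-- Delta is PR minus baseline.
local function delta(row) return round2(row.pr - row.baseline) end

-- Assumed rule: a score below the priority threshold is not eligible for ALLOW.
local function meets_threshold(score, threshold) return score >= threshold end

for _, row in ipairs(rows) do
  print(string.format("%-8s %.2f  %.2f  %+.2f", row.name, row.pr, row.baseline, delta(row)))
end
print("meets P0 threshold:", meets_threshold(rows[3].pr, 0.95))
```

The visible delta is positive while the holdout delta is negative, which is the pattern the holdout split exists to catch.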

05 · c agent · ci Read back — opaque by design

The agent can pull feedback for the eval, but only the lossy projection. Scenario names and step bodies are opaque for holdouts, but each failure comes with a pointer to the originating spec, so the agent knows where to look without seeing what was tested. That is enough to iterate against docs/specs/billing-upgrade.md, but not enough to reverse-engineer the hidden suite.

$ sigil feedback eval_01HPXG5KQ7J9W4 --service api
scenario: auth/login                    5 pass
scenario: auth/logout                   3 pass
scenario: billing/upgrade               5 pass
scenario: billing/cancel                4 pass
scenario: holdout_001                   3 pass
scenario: holdout_002                   4 pass
scenario: holdout_003                   2 pass · 1 fail
  spec    : docs/specs/billing-upgrade.md
  step_1  : pass
  step_2  : pass
  step_3  : fail
scenario: holdout_005                   1 pass · 1 fail
  spec    : docs/specs/payment-intents.md
  step_1  : pass
  step_2  : fail
scenario: holdout_007                   2 pass

aggregate: 38 pass · 2 fail · decision = REVIEW
wall: holdout/* frames not available to the authoring agent (by design)

§ 06 · EXPECT FAILURES

Rich evidence for the operator. Opaque step labels for the agent.

When an expect fails, the power-assertion renderer shows the full ladder of sub-expressions with the value each one resolved to, the captured tables from any preceding sigil.intent calls, the rubric pulled from the --- block above, and an Ariadne code frame pinning the source span. Visible scenarios surface all of it to the operator. Holdout scenarios only ever emit the step label — the wall does not bend.

OPERATOR VIEW visible

full provenance · rubric · power-assertion tree

× scenario failed: billing/upgrade  [P0]

   ╭─[.sigil/scenarios/api/visible/billing/upgrade.lua:36:5]
   │
36 │     expect(result.total_cents == 9900)
   │     ──────────────────┬───────────────
   │                       ╰── assertion is false
   │
   │  result.total_cents == 9900
   │  │      │           │  │
   │  │      │           │  9900
   │  │      │           false
   │  │      7900
   │  ╰─ <intent result, captured above>
   ·
33 │     result = {
   │       completed   = true,
   │       summary     = "upgraded to Pro plan on monthly billing",
   │       plan        = "Pro",
   │       total_cents = 7900,
   │       order_id    = "ord_01HPXG5KQ7J9W4…",
   │     }
───╯

↳ rubric (sigil.intent objective, lines 21–23):
    Upgrade the account to the Pro plan (annual billing, $99/yr).
    Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
    Confirm the upgrade completed and record the confirmation details.

↳ 3 of 3 preceding expects passed; this is the first failure.
↳ scenario score: 0.00  (P0, blocking)
↳ rolled up: eval.complete → decision = BLOCK

The power-assertion renderer prints the ladder of sub-expressions and the value each resolved to at the moment the assertion ran. result.total_cents is 7900; the literal is 9900; the comparison is false. Below the ladder, the full result table from sigil.intent is dumped so the operator can read the agent's own summary, “upgraded to Pro plan on monthly billing”, and see exactly where the objective was missed.
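The core of a power assertion is keeping both operands rather than only the boolean. The sketch below shows that in plain Lua. It passes the operands explicitly instead of rewriting source as the text describes, and the `capture_eq` and `render` names are hypothetical; the output format is a simplification, not Sigil's Ariadne frame.

```lua
-- Record both operands of an equality check instead of just the boolean.
local function capture_eq(source, lhs, rhs)
  return { source = source, lhs = lhs, rhs = rhs, ok = (lhs == rhs) }
end

-- Render a captured comparison as a short report.
local function render(c)
  if c.ok then return "pass: " .. c.source end
  return table.concat({
    "assertion is false: " .. c.source,
    "  left  = " .. tostring(c.lhs),
    "  right = " .. tostring(c.rhs),
  }, "\n")
end

-- Values taken from the operator view above.
local result = { total_cents = 7900 }
local c = capture_eq("result.total_cents == 9900", result.total_cents, 9900)
print(render(c))
```

A real implementation would have to parse or instrument the expression to recover each sub-expression; this version only covers a single top-level comparison.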

AGENT VIEW lossy · step-label only

∩ holdout = ∅ · no values · no source · no rubric

scenario: billing/upgrade
spec:     docs/specs/billing-upgrade.md

  step_1: pass
  step_2: pass
  step_3: fail
  step_4: skip
  step_5: skip

  5 steps · 1 failure · step bodies, values, and rubric withheld

The authoring agent receives the failure as opaque step labels plus a pointer to the originating spec. The spec is safe to name — it's human-authored and already in the repo. What stays withheld is the step bodies, the rubric text, the power-assertion tree, and the captured values. The agent sees that step 3 failed under billing-upgrade.md; it does not see why, or what the scenario asserted.
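One way to picture the lossy projection is as an allow-list: the feedback record is built by copying only the permitted fields, so anything not named can never leak. The sketch below assumes a hypothetical shape for the evaluator's full result (`body`, `values`, `rubric`) and a hypothetical `project` helper; none of these names come from Sigil.

```lua
-- Hypothetical full result as the evaluator might hold it; the shape is assumed.
local full = {
  scenario = "billing/upgrade",
  spec     = "docs/specs/billing-upgrade.md",
  rubric   = "withheld rubric text",
  steps = {
    { status = "pass", body = "step source", values = { status = 201 } },
    { status = "pass", body = "step source", values = {} },
    { status = "fail", body = "step source", values = { total_cents = 7900 } },
    { status = "skip", body = "step source", values = {} },
    { status = "skip", body = "step source", values = {} },
  },
}

-- Build the agent-facing view by copying only allow-listed fields.
local function project(result)
  local out = { scenario = result.scenario, spec = result.spec, steps = {} }
  for i, step in ipairs(result.steps) do
    out.steps[i] = { label = "step_" .. i, status = step.status }
  end
  return out
end

local view = project(full)
for _, step in ipairs(view.steps) do
  print(step.label .. ": " .. step.status)
end
```

Copying permitted fields, rather than deleting forbidden ones, means a field added to the full result later stays hidden by default.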

§ 07 · EARNED AUTONOMY

Trust is computed, not configured. Services climb one tier at a time.

Every service carries a ledger history. Agreement with human reviewers, incident-free windows, and evaluation count promote a service up the ladder; any override, regression, or safety incident decays it back down. ALLOW is only reachable from the upper two tiers.

I NONE

not enrolled

decision all PRs → REVIEW

promote ↑ register service · sync ledger

decay ↓ —

II SHADOW

observing

decision decisions recorded · human-gated

promote ↑ ≥ 50 evals · κ ≥ 0.80 agreement

decay ↓ any BLOCK override → NONE

III ADVISORY

suggesting

decision ALLOW/REVIEW recommended · merge still human

promote ↑ ≥ 200 evals · 14-day incident-free · scorecard ≥ 0.92

decay ↓ incident tagged eval.regression → SHADOW

IV AUTO

deciding

decision ALLOW merges automatically · BLOCK halts queue

promote ↑ — terminal tier —

decay ↓ 1 incident → ADVISORY · 2 in 30d → SHADOW
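The promote and decay rules above can be read as a small state machine. The sketch below encodes the listed thresholds directly. The metric and event field names (`evals`, `kappa`, `incident_free_days`, `scorecard`, `block_override`, `regression_incident`, `incidents_30d`) are hypothetical, and whether the eval counts are cumulative or per tier is not stated in the text, so the sketch treats them as simple totals.

```lua
-- Promote at most one tier, using the thresholds listed above.
local function next_tier(tier, m)
  if tier == "NONE" then
    if m.registered and m.ledger_synced then return "SHADOW" end
  elseif tier == "SHADOW" then
    if m.evals >= 50 and m.kappa >= 0.80 then return "ADVISORY" end
  elseif tier == "ADVISORY" then
    if m.evals >= 200 and m.incident_free_days >= 14 and m.scorecard >= 0.92 then
      return "AUTO"
    end
  end
  return tier   -- AUTO is terminal; unmet criteria leave the tier unchanged
end

-- Apply the decay rule for the current tier, if any event triggers it.
local function decay(tier, e)
  local incidents = e.incidents_30d or 0
  if tier == "SHADOW" and e.block_override then return "NONE" end
  if tier == "ADVISORY" and e.regression_incident then return "SHADOW" end
  if tier == "AUTO" then
    if incidents >= 2 then return "SHADOW" end
    if incidents >= 1 then return "ADVISORY" end
  end
  return tier
end

print(next_tier("SHADOW", { evals = 50, kappa = 0.80 }))
print(decay("AUTO", { incidents_30d = 2 }))
```

Promotion moves one step at a time, while decay from AUTO can skip a tier, which matches the asymmetry in the table.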

§ 08 · FOUR INVARIANTS

Non-negotiable. Present in every decision the system emits.

The invariants below are not policies to be tuned — they are the contract Sigil enters into with the organizations that deploy it. Violations of any of the four halt the queue and surface as operational incidents.

I

fail-closed

any error downgrades — never to ALLOW

Timeouts, crashes, unreachable ledger, malformed scenarios, judge disagreement — every failure mode is a path to REVIEW or BLOCK. The system is incapable of producing an ALLOW it did not earn.

err ∈ {timeout, crash, oom, …} ⇒ decision ∈ {REVIEW, BLOCK}

II

isolation wall

the agent never sees the holdouts

Scenarios are split into visible and age-encrypted holdouts. The coding agent that authored the PR receives only opaque step labels (step_1, step_2) — no content, no expected values, no rubric text. The dark factory is literal.

feedback ∩ holdout_content = ∅

III

reproducibility

every eval carries a six-field provenance tuple

Artifact digest · baseline digest · scenario set hash · RNG seed · control ref hash · evaluator version. Re-running with the same tuple yields bit-identical scores. Decisions are signed evidence, not opinions.

⟨art, base, scn, seed, ctrl, eval⟩ → same score

IV

freshness gate

the ledger must be current to ALLOW

The append-only git-backed ledger has a staleness window. Outside it, no ALLOW is possible regardless of scores. This closes the loop on incident decay: stale trust is suspect trust.

now − ledger.tip.ts ≤ Δ_fresh
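The fail-closed and freshness invariants compose naturally into a single wrapper around the evaluation. The sketch below is one way that could look. The `decide` function, its parameters, and the choice of REVIEW (rather than BLOCK) as the downgrade target are assumptions; the text only requires that the result is never ALLOW.

```lua
-- Wrap an evaluation so that errors and stale ledgers can never yield ALLOW.
-- `evaluate` is any function returning "ALLOW", "REVIEW", or "BLOCK".
local function decide(evaluate, now, ledger_tip_ts, fresh_window)
  local ok, decision = pcall(evaluate)
  if not ok then
    return "REVIEW"                       -- fail-closed: any error downgrades
  end
  if decision ~= "ALLOW" and decision ~= "REVIEW" and decision ~= "BLOCK" then
    return "REVIEW"                       -- malformed result is treated as an error
  end
  if decision == "ALLOW" and (now - ledger_tip_ts) > fresh_window then
    return "REVIEW"                       -- freshness gate: stale ledger blocks ALLOW
  end
  return decision
end

-- A timeout inside the evaluator downgrades to REVIEW.
print(decide(function() error("timeout") end, 100, 90, 60))
-- A fresh ledger lets an earned ALLOW through.
print(decide(function() return "ALLOW" end, 100, 90, 60))
-- The same ALLOW with a stale ledger is downgraded.
print(decide(function() return "ALLOW" end, 1000, 90, 60))
```

Only ALLOW is gated on freshness here; REVIEW and BLOCK pass through unchanged because they are already safe outcomes.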

§ 09 · WHY DARK FACTORY

A benchmark the agent can read is a benchmark the agent can game.

Hold out what the system is judged on. Reveal nothing the authoring agent could optimize toward.

Coding agents are trained and prompted against visible benchmarks. They are good at producing patches that pass the tests they were shown. This is useful — and it is also the exact failure mode that turns a merge queue into a rubber stamp.

Sigil splits every scenario set into a visible portion (examples, shape, capabilities) and an age-encrypted holdout that is decrypted only inside the evaluator. The agent receives opaque step labels — step_1, step_2 — and pass/fail counts. No messages, no expected values, no rubric text.

The result is a decision that distinguishes understanding the intent from matching the surface. If the patch works, holdouts pass. If it only looks like it works, they don't.

ISOLATION WALL · FIG.03 feedback ∩ holdout = ∅