A human writes the intent. Sigil compiles it into scenarios. The agent implements. Sigil scores and the agent iterates against opaque feedback. A signed decision closes the loop.
sigil eval <pr-ref> is the mechanics of steps 04 and 05 above. It resolves the PR to an image digest, decrypts the scenario bundle, deploys PR and baseline side-by-side, runs three tiers of Lua scenarios against both, computes a baseline-relative satisfaction score, appends eval.complete and eval.decision to the git-backed ledger, and emits feedback stripped of any content the authoring agent is not permitted to see.
Scenario generation starts here. A plain markdown spec — acceptance criteria, edge cases, stated invariants, quality bars — becomes a visible/holdout scenario bundle via sigil scenario generate. The document below is the real source for the billing/upgrade scenario shown in the next section.
    ---
    title: "Billing — self-serve plan upgrade"
    owner: "@jordan"
    priority: P0
    ---

    # Self-serve plan upgrade

    A free-tier user should upgrade to Pro from the dashboard without
    talking to sales. The upgrade must charge the card, reflect in the
    subscription API, and send a clean receipt.

    ## Plans and price

    | Plan | Monthly | Annual |
    |------|---------|--------|
    | Free | $0      | —      |
    | Pro  | $12/mo  | $99/yr |

    Annual is the default selection in the upgrade UI.

    ## Acceptance criteria

    1. A free account hits `POST /api/signup` and receives a `session_token`.
    2. With that session, the user opens `/account/billing`, picks Pro
       (annual), and completes Stripe checkout using `4242 4242 4242 4242`
       without leaving the dashboard.
    3. After checkout, the Pro plan badge is visible and a "Cancel plan"
       action is available.
    4. `GET /api/subscription` returns `{ "plan": "pro", "status": "active" }`
       on the next read.

    ## Invariants

    - `GET /api/subscription` is a **pure read**. N reads in any order
      return the same plan, regardless of `X-Request-Id`.
    - Upgrades are **idempotent** at the payment_intent layer: a repeated
      checkout for the same intent must not double-charge.

    ## Quality bar (LLM-judged)

    The receipt email copy is professional, free of typos, and names the
    correct plan, price, and next billing date.
Three globals are pre-injected — sigil, expect, invariant. Comparisons inside expect are rewritten to capture both sides and render Ariadne-style code frames on failure. Triple-dash comments become rubrics for the Tier-3 judge.
    -- .sigil/scenarios/api/visible/billing/upgrade.lua
    return {
      title = "Upgrade a free account to the Pro plan via the dashboard",
      priority = "P0",
      tags = {"billing", "checkout"},
      policy = { capabilities = {"http", "browser", "intent", "judge", "property"} },

      run = function()
        -- 1. Seed a fresh free-tier account via the HTTP API.
        local signup = sigil.post("/api/signup", {
          email = sigil.gen.email(),
          password = sigil.env("SIGNUP_PASSWORD"),
        })
        expect(signup.status == 201)
        local token = signup.json.session_token

        -- 2. Hand the browser session to the agent and let it drive checkout.
        --    The `---` block is the objective. The LLM uses the declared
        --    capabilities (browser, http) to accomplish it; capture fields
        --    with type prefixes are added to the `complete` tool schema.
        --- Upgrade the account to the Pro plan (annual billing, $99/yr).
        --- Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
        --- Confirm the upgrade completed and record the confirmation details.
        local result = sigil.intent({
          capabilities = { "browser", "http" },
          context = { session_token = token },
          capture = {
            order_id = "string: the order confirmation number",
            total_cents = "number: the final charged amount in cents",
            plan = "string: the plan name shown on the receipt",
          },
          max_steps = 20,
        })
        expect(result.completed)
        expect(result.plan == "Pro")
        expect(result.total_cents == 9900)

        -- 3. Direct browser assertions — getters return strings, actions
        --    return nil; sessions auto-isolate per scenario ID.
        sigil.browser.open("/account/billing")
        sigil.browser.wait({ text = "Pro" })
        expect(sigil.browser.text("[data-testid=plan-badge]") == "Pro")
        expect(sigil.browser.visible("[data-testid=cancel-plan]"))

        -- 4. Cross-check the API agrees with the UI.
        local sub = sigil.get("/api/subscription", nil, {
          headers = { Authorization = "Bearer " .. token },
        })
        expect(sub.json.plan == "pro")
        expect(sub.json.status == "active")

        -- 5. Property: /api/subscription is a pure read — N reads of the same
        --    endpoint return the same plan, regardless of request id or order.
        invariant("GET /api/subscription is idempotent", {
          cases = 10,
          for_all = { req_id = sigil.gen.uuid() },
          check = function(case)
            local r = sigil.get("/api/subscription", nil, {
              headers = { Authorization = "Bearer " .. token, ["X-Request-Id"] = case.req_id },
            })
            expect(r.json.plan == "pro")
          end,
        })

        --- The receipt email copy is clear, professional, and free of typos.
        --- It names the correct plan, the correct price, and the next billing date.
        sigil.judge(sub.json.receipt_preview, { min_score = 0.85 })
      end,
    }
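The rule that triple-dash comments become rubrics can be illustrated with a small extractor. This is a minimal Python sketch, not Sigil's implementation: it assumes a rubric is a run of consecutive lines beginning with `---` and that the run attaches to the first non-blank line after it. The helper name `extract_rubrics` and the record fields are hypothetical.

```python
import re

# Hypothetical helper, not part of Sigil. Assumes a rubric is a run of
# consecutive lines whose first non-space characters are exactly `---`,
# and that it attaches to the first non-blank line that follows the run.
RUBRIC_LINE = re.compile(r"^\s*---(?!-)\s?(.*)$")


def extract_rubrics(lua_source: str) -> list[dict]:
    """Return one {'text', 'target_line', 'target'} record per `---` run."""
    rubrics = []
    pending = []  # rubric lines collected so far for the current run
    for lineno, line in enumerate(lua_source.splitlines(), start=1):
        match = RUBRIC_LINE.match(line)
        if match:
            pending.append(match.group(1).rstrip())
        elif pending and line.strip():
            # First non-blank, non-rubric line: the statement being judged.
            rubrics.append({
                "text": " ".join(pending),
                "target_line": lineno,
                "target": line.strip(),
            })
            pending = []
    return rubrics


SAMPLE = """\
-- ordinary comment, ignored
--- The receipt email copy is clear, professional, and free of typos.
--- It names the correct plan, the correct price, and the next billing date.
sigil.judge(sub.json.receipt_preview, { min_score = 0.85 })
"""

for rubric in extract_rubrics(SAMPLE):
    print(rubric["target_line"], rubric["target"])
    print("  rubric:", rubric["text"])
```

Ordinary `--` comments never match the pattern, so only the two `---` lines end up in the rubric attached to the `sigil.judge` call.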
The agent implements, verifies locally, pushes. Sigil runs the full eval in CI. The agent reads the opaque feedback and revises. Visible passes aren't a green light — the holdouts make their own decision.
    $ sigil scenario run --all --deploy --service api
    ▸ deploying ephemeral environment (docker compose up -d)... ready
    ▸ running 12 visible scenarios against api@eph-7f3c:8080
        auth/login        pass
        auth/logout       pass
        billing/upgrade   pass
        billing/cancel    pass
        ... 8 more        pass
        12/12 passed
    ▸ holdout scenarios not run — private key not in this workspace
    ▸ teardown complete
    $ sigil eval pull/42/head --service api
    ▸ resolving pull/42/head → sha256:9f42…c0b1
      baseline (merge-base) → sha256:1d7a…8e44
    ▸ decrypting scenario bundle · 12 visible · 8 holdout · scenario_set:billing.v7
    ▸ deploying dual environments · pr@eph-9f42 · baseline@eph-1d7a · ready
    ▸ running scenarios against both…
                  pr      baseline   Δ
        visible   1.00    0.98       +0.02
        holdout   0.82    0.96       −0.14
        overall   0.94    0.97       −0.03
    ▸ satisfaction: 0.94 — below P0 threshold (0.95)
    ▸ ledger · eval.complete eval_01HPXG5KQ7J9W4 · eval.decision REVIEW — regression on 2 holdout scenarios
    decision: REVIEW · sigil decide pull/42/head --service api
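The score table in the run above can be read as two small calculations: a per-tier delta (PR minus baseline) and a comparison of the satisfaction score against the P0 threshold. The Python sketch below reproduces only that visible arithmetic. The text does not say how Sigil folds tier scores into `overall`, so the overall scores are taken as inputs; `TierScore` and `below_threshold` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical names; the numbers are copied from the run shown above.
# How tier scores are aggregated into `overall` is not stated in the text,
# so the overall scores are treated as given rather than recomputed.
P0_THRESHOLD = 0.95


@dataclass(frozen=True)
class TierScore:
    tier: str
    pr: float
    baseline: float

    @property
    def delta(self) -> float:
        # Baseline-relative change, rounded to the two decimals the CLI prints.
        return round(self.pr - self.baseline, 2)


def below_threshold(satisfaction: float, threshold: float = P0_THRESHOLD) -> bool:
    """True when the satisfaction score misses the priority threshold."""
    return satisfaction < threshold


scores = [
    TierScore("visible", 1.00, 0.98),
    TierScore("holdout", 0.82, 0.96),
    TierScore("overall", 0.94, 0.97),
]

for s in scores:
    print(f"{s.tier:<8} {s.pr:.2f}  {s.baseline:.2f}  {s.delta:+.2f}")

overall = scores[-1]
print("below P0 threshold:", below_threshold(overall.pr))
```

Keeping the delta separate from the threshold check mirrors the output: the visible tier improves on baseline while the holdout tier regresses, and the threshold is applied to the single satisfaction figure.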
    $ sigil feedback eval_01HPXG5KQ7J9W4 --service api
    scenario: auth/login         5 pass
    scenario: auth/logout        3 pass
    scenario: billing/upgrade    5 pass
    scenario: billing/cancel     4 pass
    scenario: holdout_001        3 pass
    scenario: holdout_002        4 pass
    scenario: holdout_003        2 pass · 1 fail
      spec   : docs/specs/billing-upgrade.md
      step_1 : pass
      step_2 : pass
      step_3 : fail
    scenario: holdout_005        1 pass · 1 fail
      spec   : docs/specs/payment-intents.md
      step_1 : pass
      step_2 : fail
    scenario: holdout_007        2 pass
    aggregate: 38 pass · 2 fail · decision = REVIEW
    wall: holdout/* frames not available to the authoring agent (by design)
When an expect fails, the power-assertion renderer shows the full ladder of sub-expressions with the value each one resolved to, the captured tables from any preceding sigil.intent calls, the rubric pulled from the --- block above, and an Ariadne code frame pinning the source span. Visible scenarios surface all of it to the operator. Holdout scenarios only ever emit the step label — the wall does not bend.
  × scenario failed: billing/upgrade [P0]
    ╭─[.sigil/scenarios/api/visible/billing/upgrade.lua:33:5]
    │
 33 │     expect(result.total_cents == 9900)
    │     ──────────────────┬───────────────
    │                       ╰── assertion is false
    │
    │     result.total_cents == 9900
    │     │      │           │  │
    │     │      │           │  9900
    │     │      │           false
    │     │      7900
    │     ╰─ <intent result, captured above>
    ·
 33 │     result = {
    │       completed = true,
    │       summary = "upgraded to Pro plan on monthly billing",
    │       plan = "Pro",
    │       total_cents = 7900,
    │       order_id = "ord_01HPXG5KQ7J9W4…",
    │     }
 ───╯
    ↳ rubric (sigil.intent objective, lines 22–24):
        Upgrade the account to the Pro plan (annual billing, $99/yr).
        Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
        Confirm the upgrade completed and record the confirmation details.
    ↳ 3 of 3 preceding expects passed; this is the first failure.
    ↳ scenario score: 0.00 (P0, blocking)
    ↳ rolled up: eval.complete → decision = BLOCK

result.total_cents is 7900; the literal is 9900; the comparison is false. Below the ladder, the full result table from sigil.intent is dumped so the operator can read the agent's own summary, “upgraded to Pro plan on monthly billing”, and understand exactly where the objective was missed.
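The ladder of sub-expressions is the core of a power assertion: every piece of the comparison is evaluated and reported with the value it resolved to. The following is a rough Python analogue of that idea, using the standard `ast` module as a stand-in for Sigil's Lua rewriter, whose internals are not described here. The function name `explain` and the `SimpleNamespace` stand-in for the intent result are assumptions.

```python
import ast
from types import SimpleNamespace

# A sketch of the power-assertion idea in Python, not Sigil's Lua rewriter.
# It evaluates each sub-expression of a comparison and records the value it
# resolved to, which is the information the ladder displays.
TRACED = (ast.Compare, ast.Attribute, ast.Name, ast.Constant)


def explain(expr: str, env: dict) -> list[tuple[str, object]]:
    """Return (source_fragment, value) pairs, outermost expression first."""
    tree = ast.parse(expr, mode="eval")
    rows = []
    for node in ast.walk(tree.body):
        if isinstance(node, TRACED):
            fragment = ast.get_source_segment(expr, node)
            # Compile just this sub-expression and evaluate it in isolation.
            code = compile(ast.Expression(body=node), "<expect>", "eval")
            rows.append((fragment, eval(code, {"__builtins__": {}}, env)))
    return rows


# Stand-in for the captured intent result from the failing scenario.
result = SimpleNamespace(completed=True, plan="Pro", total_cents=7900)

for fragment, value in explain("result.total_cents == 9900", {"result": result}):
    print(f"{fragment:<28} -> {value!r}")
```

Evaluating each node separately is what lets a renderer show `7900` next to `result.total_cents` and `false` next to `==`, instead of a bare "assertion failed".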
    scenario: billing/upgrade
    spec: docs/specs/billing-upgrade.md
    step_1: pass
    step_2: pass
    step_3: fail
    step_4: skip
    step_5: skip
    5 steps · 1 failure · step bodies, values, and rubric withheld
The authoring agent sees only that step_3 failed against billing-upgrade.md; it does not see why, or what the scenario asserted.
Every service carries a ledger history. Agreement with human reviewers, incident-free windows, and evaluation count promote a service up the ladder; any override, regression, or safety incident decays it back down. ALLOW is only reachable from the upper two tiers.
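The promotion and decay rules can be pictured as a small state machine. The Python sketch below is illustrative only: the number of tiers, their labels (`T0` to `T3`), the promotion thresholds, and the one-tier decay step are all assumptions. The text specifies only which signals move a service up or down and that ALLOW requires one of the upper two tiers.

```python
from dataclasses import dataclass

# Everything here is illustrative: the tier count and labels, the promotion
# thresholds, and the one-step decay are assumptions. The text only states
# which signals move a service and that ALLOW needs an upper-two tier.
TIERS = ["T0", "T1", "T2", "T3"]  # hypothetical labels, lowest to highest
DECAY_EVENTS = {"override", "regression", "safety_incident"}


@dataclass
class ServiceTrust:
    tier: int = 0                # index into TIERS
    evaluations: int = 0
    agreements: int = 0          # evaluations where the human reviewer agreed
    incident_free_days: int = 0

    def record_eval(self, human_agreed: bool) -> None:
        self.evaluations += 1
        self.agreements += int(human_agreed)

    def record_event(self, event: str) -> None:
        """Any decay event drops the service one tier and resets the window."""
        if event in DECAY_EVENTS:
            self.tier = max(0, self.tier - 1)
            self.incident_free_days = 0

    def maybe_promote(self, min_evals=20, min_agreement=0.9, min_days=14) -> bool:
        """Promote one tier when all three (hypothetical) bars are met."""
        if self.tier >= len(TIERS) - 1 or self.evaluations < min_evals:
            return False
        agreement = self.agreements / self.evaluations
        if agreement >= min_agreement and self.incident_free_days >= min_days:
            self.tier += 1
            return True
        return False

    def can_allow(self) -> bool:
        # ALLOW is only reachable from the upper two tiers.
        return self.tier >= len(TIERS) - 2


svc = ServiceTrust(tier=1, evaluations=40, agreements=38, incident_free_days=30)
svc.maybe_promote()
print(TIERS[svc.tier], "can allow:", svc.can_allow())
svc.record_event("regression")
print(TIERS[svc.tier], "can allow:", svc.can_allow())
```

In this sketch a single regression is enough to take a service out of the ALLOW-eligible tiers, which matches the asymmetry described above: trust accumulates slowly and decays on any adverse event.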
The invariants below are not policies to be tuned — they are the contract Sigil enters into with the organizations that deploy it. Violations of any of the four halt the queue and surface as operational incidents.
Hold out what the system is judged on. Reveal nothing the authoring agent could optimize toward.
Coding agents are trained and prompted against visible benchmarks. They are good at producing patches that pass the tests they were shown. This is useful — and it is also the exact failure mode that turns a merge queue into a rubber stamp.
Sigil splits every scenario set into a visible portion (examples, shape, capabilities) and an age-encrypted holdout that is decrypted only inside the evaluator. The agent receives opaque step labels — step_1, step_2 — and pass/fail counts. No messages, no expected values, no rubric text.
The result is a decision that distinguishes understanding the intent from matching the surface. If the patch works, holdouts pass. If it only looks like it works, they don't.
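The wall between evaluator and authoring agent amounts to a projection: the full scenario result goes in, and only opaque labels and pass/fail counts come out. Here is a minimal Python sketch of that projection, assuming a simple in-memory result shape. The field names (`label`, `steps`, `message`, `expected`, `rubric`) and the function `redact_for_agent` are hypothetical; only the policy is taken from the text.

```python
# A sketch of the visible/holdout wall under an assumed result shape.
# Field names are hypothetical; the policy is from the text: the agent gets
# opaque step labels and pass/fail counts, never messages, expected values,
# or rubric text.


def redact_for_agent(result: dict) -> dict:
    """Project a full scenario result onto what the authoring agent may see."""
    steps = {
        f"step_{i}": step["status"]
        for i, step in enumerate(result["steps"], start=1)
    }
    statuses = list(steps.values())
    return {
        "scenario": result["label"],
        "spec": result["spec"],
        "steps": steps,
        "pass": statuses.count("pass"),
        "fail": statuses.count("fail"),
    }


# Evaluator-side record; the detail strings are placeholders for this example.
full_result = {
    "label": "holdout_003",
    "spec": "docs/specs/billing-upgrade.md",
    "steps": [
        {"status": "pass", "message": "detail A", "expected": "value A", "rubric": "text A"},
        {"status": "pass", "message": "detail B", "expected": "value B", "rubric": "text B"},
        {"status": "fail", "message": "detail C", "expected": "value C", "rubric": "text C"},
    ],
}

feedback = redact_for_agent(full_result)
print(feedback)
```

Building the feedback from an allow-list of fields, rather than deleting sensitive keys from the full record, means a newly added evaluator field stays hidden by default.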