What ten models showed about measuring generated design

Every measurement project reaches a point where the instrument reports on itself rather than on its subject. AHD reached that point twice in three days, and both times what the instrument revealed about itself mattered more than the reading.

AHD, short for Artificial Human Design, names the concrete tells that mark AI-generated design as AI-generated, then lints against them. A deterministic linter reads the HTML and CSS. A vision critic reads the rendered pixels for the tells that only show up there. The premise is deliberately narrow: most generated landing pages fail in the same few dozen ways, and a failure that can be named is a failure that can be linted.

What follows is the first run the project was willing to publish, the two lessons it forced about what an eval actually measures and the run two days later that overturned its result.

The first run worth publishing

Earlier runs used three to five samples per cell. That is enough to prove the pipeline works and nowhere near enough to read a reduction as anything but noise. The 22 April run was the first at thirty samples per cell across ten models, which is enough to separate signal from variance for the one brief and one style token it covered.

The input was a single brief (an editorial product landing page) and a single AHD style token, swiss-editorial, a paper-and-ink Helvetica derivative. Each of ten models generated thirty samples under two conditions: raw, which is the brief alone, and compiled, which is the brief plus the system prompt AHD compiles from the style token. The source-level rule set, thirty-eight rules at the time, then scored all 600 pages.
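The page count follows directly from the design. A minimal sketch of the job grid, assuming placeholder cell labels and a hypothetical `build_jobs` helper (none of these names come from AHD itself):

```python
from itertools import product

# Placeholder labels; the real roster is ten model-plus-serving-path cells.
CELLS = [f"cell-{i}" for i in range(1, 11)]
CONDITIONS = ("raw", "compiled")  # brief alone vs. brief plus compiled system prompt
SAMPLES_PER_CELL = 30

def build_jobs(cells, conditions, samples_per_cell):
    """Enumerate one generation job per (cell, condition, sample index)."""
    return [
        {"cell": cell, "condition": condition, "sample": n}
        for cell, condition, n in product(cells, conditions, range(samples_per_cell))
    ]

jobs = build_jobs(CELLS, CONDITIONS, SAMPLES_PER_CELL)
print(len(jobs))  # 10 cells * 2 conditions * 30 samples = 600
```

Adding an eleventh cell to the same grid gives 660 jobs, which is the size of the later run.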

A cell is one model reached through one serving path. The roster was three frontier cells reached through their subscription command-line tools (Claude Opus 4.7 through Claude Code, GPT-5.4 through Codex, Gemini 3.1 Pro Preview through the Gemini CLI) and seven open-weight cells served by Cloudflare Workers AI.

The result held across nearly the whole roster. The compiled prompt reduced mean tells per page on nine of the ten cells. The biggest drops came where there was the most slop to shed: GPT-OSS-120B fell from 3.50 tells per page to 0.77, a 78 percent cut, and Mistral Small 3.1 dropped from 3.47 to 1.30. Kimi K2.6, Gemini 3.1 Pro and Claude Opus 4.7 followed close behind, each cutting tells by roughly 60 percent. The two cells that barely moved, GPT-5.4 at 19 percent and Qwen3 30B at 9 percent, were already near the floor at baseline.
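The percentages here are relative reductions in the per-cell mean. A small sketch of that arithmetic, using a hypothetical helper name and only the means quoted above:

```python
def relative_reduction(raw_mean: float, compiled_mean: float) -> float:
    """Fraction by which the compiled mean sits below the raw mean.

    Positive means fewer tells under the compiled prompt; negative is a regression.
    """
    if raw_mean <= 0:
        raise ValueError("raw_mean must be positive to compute a relative change")
    return (raw_mean - compiled_mean) / raw_mean

print(f"{relative_reduction(3.50, 0.77):.0%}")  # GPT-OSS-120B: 78%
print(f"{relative_reduction(1.80, 0.73):.0%}")  # Claude Opus 4.7: 59%
```

The same function returns a negative value when the compiled mean is higher than the raw mean, which is how a regression shows up.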

The one cell that moved the wrong way was Llama 3.3 70B, which rose from 0.28 tells raw to 0.60 compiled. Its raw output is already so thin that the compiled prompt pushes it into adding structure it then gets wrong, mostly on line height and grid rules. That is precisely the signal the linter exists to surface. The right response to a regression is to name it, not to bury it inside an aggregate average.

Figure: Mean AHD tells per page, raw versus compiled, on the 22 April swiss-editorial run, shown as a bar chart across ten models. The key defines raw (the brief alone), compiled (the brief plus AHD's system prompt) and regression (compiled scored worse than raw). Nine of ten cells improve under the compiled prompt: GPT-OSS-120B falls from 3.50 to 0.77 and Claude Opus 4.7 from 1.80 to 0.73. Llama 3.3 70B rises from 0.28 to 0.60, the one regression. Numbers are the framework's published per-cell table.

Three caveats traveled with every number, and they matter more than the topline. One brief and one token is a single point in a large space; a different token or a different brief will reorder the table. The scoring was source-only, covering about three quarters of the taxonomy, with the vision tells left for a later pass. And tells per page is a proxy: a thin page has little surface for a rule to catch, so a drop from 3.50 to 0.77 is not the same kind of evidence as a drop from 1.80 to 0.73. Read the delta next to the rendered page, never on its own.

A model is never just a model

Two methodology lessons came out of that same run, and both reduce to the same sentence: a benchmark target is always a model plus its serving path, and a report that names only the model is answering a question most readers did not ask.

The first lesson was about which surface to measure. We started out treating the API as the canonical frontier measurement and the subscription tools as a cost-saving fallback. That is backwards. Most people using Claude, ChatGPT or Gemini never touch the API. They use the subscription tool, and the tool is not a thin wrapper over the model. It wraps the model in a product: a chat template with the provider’s own defaults, agent prompts that load automatically, tool integrations, project-context discovery, session state, default thinking and tool-use settings. Every one of those layers changes the output. An eval that only calls the raw endpoint holds the weights constant and throws away the surface, which is the part the user actually touches. Both numbers are legitimate. The API shows what the model does; the tool shows what the user gets. The methodology lesson is to carry both rows for each frontier cell rather than pretend one of them is the real one.

The second lesson cost a full rerun to learn. Every compiled Kimi K2.6 sample came back as HTTP 200 with no content. No error, long latency, nothing for the linter to score. Cloudflare ships Kimi with thinking mode on by default, and on a system prompt this dense the hidden reasoning phase consumed the entire output-token budget before a single visible token was emitted. The obvious first move was to strengthen the prompt, to tell the model more firmly not to reason. That did nothing, because a prompt and a chat template are not the same layer of the stack. Plain-language instructions in the prompt arrive after the template has already decided whether to enter the reasoning phase. The fix was a chat-template flag, thinking: false, not a stronger sentence. Telling a model “no reasoning” in prose is like leaving a note for a program and hoping the runtime reads it.
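A failure like this is easy to miss because nothing errors. The source does not describe how the harness caught it. The sketch below is one possible guard, with hypothetical names, that separates an empty HTTP 200 from a page the linter can actually score:

```python
from dataclasses import dataclass

@dataclass
class Completion:
    status: int   # HTTP status code returned by the serving path
    content: str  # visible text the model emitted, possibly empty

def classify(completion: Completion) -> str:
    """Sort a response into scoreable or failed before it reaches the linter."""
    if completion.status != 200:
        return "transport_error"
    if not completion.content.strip():
        # A 200 with no visible text is a failed sample, not a clean page.
        return "empty_completion"
    return "scoreable"

print(classify(Completion(200, "")))                # empty_completion
print(classify(Completion(200, "<html>...</html>")))  # scoreable
```

Counting empty completions as failures keeps a page with no markup from being read as a page with no tells.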

The general shape is familiar from any layered system. Instructions at one layer do not cross into another unless the interface is explicit. We know this about software and forget it about chat models, because the prompt feels like the universal interface. It is not. The template, the temperature, the token ceiling and the provider’s quantization all sit underneath the prompt and cannot be overridden from inside it. An eval that hides that layer is not wrong so much as incomplete. The same model served through a different path is, for measurement, a different cell.
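One way to keep that distinction in the results table is to key every row on the model and its serving path together. A minimal sketch, assuming a hypothetical `Cell` record whose field names are illustrative rather than AHD's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    model: str
    serving_path: str
    template_flags: tuple = ()  # e.g. (("thinking", False),)

# Same weights, two serving paths: two separate rows, never merged.
results = {}
results.setdefault(Cell("Claude Opus 4.7", "Claude Code"), [])
results.setdefault(Cell("Claude Opus 4.7", "API"), [])
results.setdefault(Cell("Kimi K2.6", "Workers AI", (("thinking", False),)), [])

print(len(results))  # 3
```

Because the record is frozen and hashable, two entries collapse into one only when the model, the path and the template flags all match.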

The day the result reversed

Two days later a second run reversed that result, and the reason it reversed is the real subject here.

The triangulation run held the brief, the thirty samples per cell and the two conditions fixed, and moved the one variable that mattered: the token went from swiss-editorial to post-digital-green, a terminal aesthetic, single monospace face, OKLCH green palette, an 80-column character grid, rectangles only. If the compiled prompt generalizes, a different token should still produce reductions. That was the question.

Eleven cells ran, the ten from the first run plus GPT-5.5. Eight of them regressed under the compiled prompt. The three that did not regress were the cells with the highest raw baselines: Mistral Small 3.1 and GPT-OSS-120B still reduced, and Llama 4 Scout stayed flat. Everything else got worse, the frontier cells most of all. Claude fired 173 percent more tells under the compiled prompt than under the raw brief. GPT-5.4 fired 79 percent more. Gemini, mildest of the three, fired 10 percent more. The rank order of the cells tracked 22 April. The direction of the intervention did not.

The usual suspects are a compiler bug, a bad prompt or too small a sample. None of the three was the cause. Open the compiled samples and each one reads as a textbook post-digital-green page: the green palette, the character grid, the single Berkeley Mono face, sharp rectangles, no display type. The compiler transmitted the token faithfully. The model followed it faithfully. The page was exactly what the token asked for.

Then the linter punished every sample for it.

Figure: A real compiled post-digital-green page beside the editorial-default rules the linter fired on it. On the left, the rendered page: phosphor green on near-black, single monospace face, character-grid arrows, rectangles only. On the right, three rule cards: require-type-pairing fired on 100 percent of the Claude and GPT-5.4 samples and 83 percent of Gemini's, weight-variety fired nearly as often, and radius-hierarchy fired on roughly half the Claude samples. The page is exactly what the token asked for. The rules assume an editorial default the token was built to reject.

Two rules dominate the regression, and each is correct for the context the taxonomy was written against and wrong for the context the token establishes. ahd/require-type-pairing fired on every Claude and GPT-5.4 compiled sample and 83 percent of Gemini’s; its point is that a page with one font family has no typographic voice, but post-digital-green’s whole voice is one monospace face, and the token says so. ahd/weight-variety fired almost as widely, 86 to 100 percent across the same three cells; it reads two font weights as a lack of voice, when here the restraint is the choice. A third token-mismatch rule, ahd/radius-hierarchy, fired on about half the Claude samples, asking for a sharp-versus-soft contrast on a token that forbids radius outright.

One rule that fired was not a token mismatch at all. ahd/respect-reduced-motion fired far more on compiled output than on raw, because the compiled prompt was adding motion without a prefers-reduced-motion guard. That is a real fault in the prompt, not a quarrel with the token, and it stays on the books as one.
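A check of this kind can be approximated at source level. The sketch below is a deliberately crude stand-in, not AHD's rule: it fires when a stylesheet declares `animation` or `transition` and contains no `prefers-reduced-motion` media query anywhere.

```python
import re

MOTION = re.compile(r"\b(animation|transition)\s*:", re.IGNORECASE)
GUARD = re.compile(r"@media[^{]*prefers-reduced-motion", re.IGNORECASE)

def respect_reduced_motion(css: str) -> bool:
    """Return True when the rule fires: motion is declared with no guard at all."""
    return bool(MOTION.search(css)) and not GUARD.search(css)

unguarded = "a { transition: color 0.2s; }"
guarded = unguarded + " @media (prefers-reduced-motion: reduce) { a { transition: none; } }"

print(respect_reduced_motion(unguarded))  # True
print(respect_reduced_motion(guarded))    # False
```

A real rule would need to match each guard to the motion it covers. This version only shows why the compiled output tripped the check when the raw output did not: the motion was new, and the guard was missing.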

Each of the three token-mismatch rules embeds an editorial-design default as a hard assumption. That assumption is invisible under swiss-editorial because the token shares it. It is loud under post-digital-green because the token was built as its opposite.
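To make the hard assumption concrete, here is a simplified stand-in for a type-pairing rule. It is an illustration, not AHD's implementation: it counts the first-choice family in each `font-family` declaration and fires when it finds fewer than two.

```python
import re

FONT_FAMILY = re.compile(r"font-family\s*:\s*([^;}]+)", re.IGNORECASE)

def primary_families(css: str) -> set:
    """Collect the first-choice family from each font-family declaration."""
    families = set()
    for value in FONT_FAMILY.findall(css):
        first = value.split(",")[0].strip().strip("'\"").lower()
        if first:
            families.add(first)
    return families

def require_type_pairing(css: str) -> bool:
    """Return True when the rule fires: fewer than two distinct families."""
    return len(primary_families(css)) < 2

terminal_css = "body { font-family: 'Berkeley Mono', monospace; } h1 { font-family: 'Berkeley Mono', monospace; }"
editorial_css = "body { font-family: Helvetica, sans-serif; } h1 { font-family: Georgia, serif; }"

print(require_type_pairing(terminal_css))   # True
print(require_type_pairing(editorial_css))  # False
```

The threshold of two is the editorial default. Nothing in the function knows which token produced the page, so a faithful single-face page fires every time.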

The compiler and the linter had drifted apart

The finding was not that AHD makes post-digital-green worse. The finding was that the compiler and the linter had drifted into two independent artifacts. The compiler knows which token is active and transmits its intent. The linter applied the same rule set to every token and never read the token’s own constraints. External-validity testing was supposed to separate those two layers, and on the first try it did.

The mismatch lives at the linter layer, not the model layer. Whether the HTML came from Claude or from an open-weight cell, a page with one monospace face fired require-type-pairing. Eight cells across five providers regressed the same way for the same reason, which is what a structural finding looks like rather than a quirk of one model.

The fix, and what it did

The fix shipped the next day. Each token already declares what it requires, things like palette, type, grid and motion. It now carries a second block naming what it overrides, the rules it tells the linter to stand down on for output produced under it. In post-digital-green.yml that block reads:

lint-overrides:
  disable:
    - id: ahd/require-type-pairing
      reason: "single monospace face; pairing rules do not apply"
    - id: ahd/weight-variety
      reason: "conservative weight palette is the design, not an omission"
    - id: ahd/radius-hierarchy
      reason: "border-radius is zero on every element by mandate"

The reason string is where the accountability lives. The schema requires one on every disabled rule, so a token cannot quietly switch off a rule it finds inconvenient; the opt-out sits on the record next to its justification. The compiler ignores the block. The linter reads it, auto-detecting the active token from the page’s ahd-token meta tag, and stands down on exactly the rules that token disabled and no others.
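The mechanism can be sketched in a few lines. Everything below is an assumption for illustration: the meta tag is taken to be `<meta name="ahd-token" content="...">`, the token config is shown as an already-parsed dictionary, and the function names are hypothetical.

```python
from html.parser import HTMLParser

class _TokenMeta(HTMLParser):
    """Find the ahd-token meta tag (name/content attribute layout is assumed)."""
    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") == "ahd-token":
            self.token = attr.get("content")

def detect_token(html: str):
    parser = _TokenMeta()
    parser.feed(html)
    return parser.token

def disabled_rules(token_config: dict) -> set:
    """Read lint-overrides.disable, refusing any entry without a reason."""
    entries = token_config.get("lint-overrides", {}).get("disable", [])
    disabled = set()
    for entry in entries:
        if not str(entry.get("reason", "")).strip():
            raise ValueError(f"override for {entry.get('id')} has no reason")
        disabled.add(entry["id"])
    return disabled

def active_rules(all_rules, html, tokens):
    """Drop only the rules that the page's own token disabled."""
    skip = disabled_rules(tokens.get(detect_token(html), {}))
    return [rule for rule in all_rules if rule not in skip]

TOKENS = {"post-digital-green": {"lint-overrides": {"disable": [
    {"id": "ahd/require-type-pairing",
     "reason": "single monospace face; pairing rules do not apply"},
]}}}
RULES = ["ahd/require-type-pairing", "ahd/respect-reduced-motion"]
page = '<html><head><meta name="ahd-token" content="post-digital-green"></head></html>'

print(active_rules(RULES, page, TOKENS))  # ['ahd/respect-reduced-motion']
```

A page with no meta tag, or with an unknown token, gets the full rule set, so the default stays strict and the opt-out has to be declared.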

Re-linting the same 660 samples, the same bytes on disk, under the new rules moved the verdict. Three of the eight regressing cells flipped to positive: Gemma, Kimi and Gemini. Together with the three cells that were already positive, six of the eleven now read positive and five still regress. Gemini went from 10 percent worse to 26 percent better. GPT-OSS-120B doubled its reduction to 47 percent and led the run. Nothing about the samples changed. Only the linter did.

Five cells still regress, and that is the more honest half of the result. Claude improved from 173 percent worse to 68 percent worse without crossing zero; GPT-5.4 closed half its gap and stayed negative. The reason is plain: rules outside a token’s override list keep firing, and the reduced-motion gap is real on every token. One of the five, Llama 3.3, sits at a baseline so low that a fifth of a tell reads as a 200 percent swing and says little about the model. Token-aware linting is not a universal pass. It is the linter learning to read one thing it was deaf to before.
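The low-baseline effect is plain arithmetic. In the sketch below the means are illustrative, chosen only so that a rise of a fifth of a tell equals a 200 percent swing; they are not published figures.

```python
def percent_change(raw_mean: float, compiled_mean: float) -> float:
    """Signed change in mean tells per page, as a percentage of the raw mean."""
    return 100.0 * (compiled_mean - raw_mean) / raw_mean

# The same absolute rise of 0.20 tells per page on two different baselines.
print(round(percent_change(0.10, 0.30)))  # 200
print(round(percent_change(3.00, 3.20)))  # 7
```

The absolute change is identical in both lines. Only the denominator differs, which is why a percentage on a near-zero baseline should be read alongside the raw means.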

The override list reaches the editorial-convention rules and is meant to stop there. Some rules hold for any page regardless of token: contrast for legibility, hit-target size, reflow at small viewports, a structure a screen reader can navigate. Those are not the kind a token has any business switching off, and a token that disabled them would be asking for an inaccessible page. A token-aware linter is not a way to get any page to pass. A token that overrode every editorial rule and shipped a flat monochrome page would be making a design statement the linter would correctly score as clean, and whether that statement is any good is a design review, not a lint result.
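The source says the override list is meant to stop at editorial conventions. It does not say the schema enforces that. One way such a boundary could be enforced is sketched below, with rule ids that are placeholders and not AHD's real names:

```python
# Hypothetical ids standing in for the token-independent checks named above.
PROTECTED_RULES = frozenset({
    "example/contrast",
    "example/hit-target-size",
    "example/small-viewport-reflow",
    "example/screen-reader-structure",
})

def reject_protected_overrides(disable_ids):
    """Raise if a token tries to disable a rule that holds for every page."""
    blocked = sorted(set(disable_ids) & PROTECTED_RULES)
    if blocked:
        raise ValueError(f"token may not disable: {', '.join(blocked)}")
    return list(disable_ids)

print(reject_protected_overrides(["ahd/require-type-pairing", "ahd/weight-variety"]))
# ['ahd/require-type-pairing', 'ahd/weight-variety']
```

Under this sketch an editorial opt-out passes through unchanged, while an attempt to disable a protected check fails when the token is loaded rather than at lint time.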

What the fix does not settle is external validity. It shows that AHD can recognize a token that opts out of three editorial defaults and stop flagging output for honoring them. It does not show that AHD compilation improves output in general. A token that opts out of nothing reproduces none of this shift, and the test the run still owes is a different brief on the same token.

What the two failures share

The serving-path lesson and the token-aware lesson are the same lesson told twice. In the first, the measurement collapsed a model into the path that serves it and lost the differences that decide what a user sees. In the second, the linter collapsed a page into the editorial defaults it assumed for every page and lost the intent the token had declared. Both are the same error: treating context as incidental when the context is the thing.

The posture the project committed to on 22 April still holds. Publish what we have, name what is missing, let the record carry both the result and the incomplete scope that produced it. The reruns that strengthen these claims are queued. When they land they will tighten these numbers or move them, the way this run moved the last one, and the record will show whichever happens.
