Testing Agent Skills Systematically with Evals

When you’re iterating on a skill for an agent like Codex, it’s hard to tell whether you’re actually improving it or just changing its behavior. One version feels faster, another seems more reliable, and then a regression slips in: the skill doesn’t trigger, it skips a required step, or it leaves extra files behind.

At its core, a skill is an organized collection of prompts and instructions for an LLM. The most reliable way to improve a skill over time is to evaluate it the same way you would any other prompt for LLM applications.

Evals (short for evaluations) check whether a model’s output, and the steps it took to produce it, match what you intended. Instead of asking “does this feel better?” (or relying on vibes), evals let you ask concrete questions like:

Did the agent invoke the skill?
Did it run the expected commands?
Did it produce outputs that follow the conventions you care about?

Concretely, an eval is: a prompt → a captured run (trace + artifacts) → a small set of checks → a score you can compare over time.
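To make that pipeline concrete, here is a minimal sketch of how one eval result might be recorded and compared across two versions of a skill. The record shape and the names `scoreRun` and `findRegressions` are hypothetical, chosen for illustration; they are not part of Codex.

```javascript
// Hypothetical record for one eval run: which prompt ran, where the trace
// was saved, and the outcome of each check.
const baseline = {
  promptId: "test-01",
  tracePath: "evals/artifacts/test-01.jsonl",
  checks: [
    { id: "ran-npm-install", pass: true },
    { id: "has-package-json", pass: true },
  ],
};

const candidate = {
  promptId: "test-01",
  tracePath: "evals/artifacts/test-01.jsonl",
  checks: [
    { id: "ran-npm-install", pass: false },
    { id: "has-package-json", pass: true },
  ],
};

// Score = fraction of checks that passed (0 when there are no checks).
function scoreRun(run) {
  if (run.checks.length === 0) return 0;
  return run.checks.filter((c) => c.pass).length / run.checks.length;
}

// Ids of checks that passed in the earlier run but fail in the later one.
function findRegressions(before, after) {
  const passedBefore = new Set(
    before.checks.filter((c) => c.pass).map((c) => c.id)
  );
  return after.checks
    .filter((c) => !c.pass && passedBefore.has(c.id))
    .map((c) => c.id);
}

console.log(scoreRun(baseline), scoreRun(candidate)); // 1 0.5
console.log(findRegressions(baseline, candidate)); // [ 'ran-npm-install' ]
```

Keeping per-check results, rather than only the aggregate score, is what lets a harness say which behavior regressed.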

In practice, evals for agent skills look a lot like lightweight end-to-end tests: you run the agent, record what happened, and score the result against a small set of rules.

This post walks through a pattern for doing that with Codex: define success first, then add deterministic checks and rubric-based grading so that improvements (and regressions) are easy to see.

1. Define success before you write the skill

Before writing the skill itself, write down what “success” means in terms you can actually measure. A useful way to think about this is to split your checks into a few categories:

Outcome goals: Did the task complete? Does the app run?
Process goals: Did Codex invoke the skill and follow the tools and steps you intended?
Style goals: Does the output follow the conventions you asked for?
Efficiency goals: Did it get there without thrashing (for example, unnecessary commands or excessive token use)?

Keep this list small and focused on must-pass checks. The goal isn’t to encode every preference up front, but to capture the behaviors you care about most.
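One way to keep the list small and explicit is to write the criteria down as data. The sketch below is illustrative only: the criterion ids, the `mustPass` flag, and the `summarize` helper are assumptions made for this example, not something Codex defines.

```javascript
// Hypothetical success criteria, one per category from the list above.
const criteria = [
  { id: "dev-server-starts", category: "outcome", mustPass: true },
  { id: "skill-invoked", category: "process", mustPass: true },
  { id: "tailwind-classes-only", category: "style", mustPass: false },
  { id: "no-repeated-commands", category: "efficiency", mustPass: false },
];

// results maps a criterion id to true/false; a missing id counts as a failure.
function summarize(criteriaList, results) {
  const byCategory = {};
  let mustPassOk = true;
  for (const c of criteriaList) {
    const passed = results[c.id] === true;
    const bucket = (byCategory[c.category] ??= { passed: 0, total: 0 });
    bucket.total += 1;
    if (passed) bucket.passed += 1;
    if (c.mustPass && !passed) mustPassOk = false;
  }
  return { mustPassOk, byCategory };
}

const summary = summarize(criteria, {
  "dev-server-starts": true,
  "skill-invoked": true,
  "tailwind-classes-only": false,
  "no-repeated-commands": true,
});
console.log(summary.mustPassOk, summary.byCategory.style); // true { passed: 0, total: 1 }
```

Separating must-pass checks from the rest means a style miss shows up in the report without failing the whole run.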

This post, for example, evaluates a skill that sets up a demo app. Some checks are concrete: Did it run npm install? Did it create package.json? Those are paired with a structured style rubric that evaluates conventions and layout.

This mix is intentional. You want fast, targeted signals that surface specific regressions early, rather than a single pass/fail verdict at the end.

2. Create the skill

A Codex skill is a directory with a SKILL.md file. The file starts with YAML front matter (name, description), followed by Markdown instructions that define the skill’s behavior; the directory can also hold optional resources and scripts. The name and description matter more than they might seem. They’re the primary signals Codex uses to decide whether to invoke the skill at all, and when to inject the rest of SKILL.md into the agent’s context. If these are vague or overloaded, the skill won’t trigger reliably.
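Because so much depends on those two fields, a quick lint before running any evals can be useful. The following is a rough sketch, not a Codex feature: `parseFrontMatter` and `lintSkill` are hypothetical helpers, the parser handles only simple `key: value` lines rather than full YAML, and the five-word threshold for a "vague" description is an arbitrary assumption.

```javascript
// Extract simple `key: value` pairs from a leading front matter block.
// This is NOT a full YAML parser; it is enough for name/description.
function parseFrontMatter(markdown) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(markdown);
  if (!match) return null;
  const fields = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}

// Return a list of human-readable problems (empty when nothing is flagged).
function lintSkill(markdown) {
  const fields = parseFrontMatter(markdown);
  if (!fields) return ["missing YAML front matter"];
  const problems = [];
  if (!fields.name) problems.push("missing name");
  if (!fields.description) {
    problems.push("missing description");
  } else if (fields.description.split(/\s+/).length < 5) {
    // Arbitrary heuristic: very short descriptions tend to be vague.
    problems.push("description may be too short to trigger reliably");
  }
  return problems;
}

const sample = [
  "---",
  "name: setup-demo-app",
  "description: Scaffold a Vite + React + Tailwind demo app with a small, consistent project structure.",
  "---",
  "",
  "## When to use this",
].join("\n");

console.log(lintSkill(sample)); // []
console.log(lintSkill("## No front matter here")); // [ 'missing YAML front matter' ]
```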

The fastest way to get started is to use Codex’s built-in skill creator, which is itself a skill. Invoke it by name:

$skill-creator

The creator asks you what the skill does, when it should trigger, and whether it’s instruction-only or script-backed (instruction-only is the default recommendation). To learn more about creating a skill, check out the documentation.

A sample skill

This post uses an intentionally minimal example: a skill that sets up a small React demo app in a predictable, repeatable way.

This skill will:

Scaffold a project using Vite’s React + TypeScript template
Configure Tailwind CSS using the official Vite plugin approach
Enforce a minimal, consistent file structure
Define a clear “definition of done” so success is straightforward to evaluate

Below is a compact draft you can paste into either:

.codex/skills/setup-demo-app/SKILL.md (repo-scoped), or
~/.codex/skills/setup-demo-app/SKILL.md (user-scoped).

---
name: setup-demo-app
description: Scaffold a Vite + React + Tailwind demo app with a small, consistent project structure.
---

## When to use this

Use when you need a fresh demo app for quick UI experiments or reproductions.

## What to build

Create a Vite React TypeScript app and configure Tailwind. Keep it minimal.

Project structure after setup:

- src/
  - main.tsx (entry)
  - App.tsx (root UI)
  - components/
    - Header.tsx
    - Card.tsx
  - index.css (Tailwind import)
- index.html
- package.json

Style requirements:

- TypeScript components
- Functional components only
- Tailwind classes for styling (no CSS modules)
- No extra UI libraries

## Steps

1. Scaffold with Vite using the React TS template:
   npm create vite@latest demo-app -- --template react-ts

2. Install dependencies:
   cd demo-app
   npm install

3. Install and configure Tailwind using the Vite plugin.
   - npm install tailwindcss @tailwindcss/vite
   - Add the tailwind plugin to vite.config.ts
   - In src/index.css, replace contents with:
     @import "tailwindcss";

4. Implement the minimal UI:
   - Header: app title and short subtitle
   - Card: reusable card container
   - App: render Header + 2 Cards with placeholder text

## Definition of done

- npm run dev starts successfully
- package.json exists
- src/components/Header.tsx and src/components/Card.tsx exist

This sample skill takes an opinionated stance on purpose. Without clear constraints, there’s nothing concrete to evaluate.
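Those constraints translate almost directly into checks. As a rough sketch, the function below tests the file-related parts of the definition of done plus the two Tailwind steps. It assumes your harness has already collected a `snapshot` object mapping relative paths to file contents; that shape and the name `checkDefinitionOfDone` are hypothetical. It does not cover "npm run dev starts successfully", which needs a real process.

```javascript
// Files named in the skill's definition of done.
const REQUIRED_FILES = [
  "package.json",
  "src/components/Header.tsx",
  "src/components/Card.tsx",
];

// snapshot: { "relative/path": "file contents", ... } gathered after the run.
function checkDefinitionOfDone(snapshot) {
  const missing = REQUIRED_FILES.filter((f) => !Object.hasOwn(snapshot, f));
  const viteConfig = snapshot["vite.config.ts"] ?? "";
  const indexCss = snapshot["src/index.css"] ?? "";
  return {
    missing,
    // Step 3 of the skill: the Vite plugin is referenced in vite.config.ts.
    tailwindPluginConfigured: viteConfig.includes("@tailwindcss/vite"),
    // Step 3 of the skill: index.css imports tailwindcss.
    tailwindImported: indexCss.includes('@import "tailwindcss"'),
  };
}

// Example snapshot with Card.tsx deliberately left out.
const report = checkDefinitionOfDone({
  "package.json": "{}",
  "vite.config.ts": 'import tailwindcss from "@tailwindcss/vite";',
  "src/index.css": '@import "tailwindcss";',
  "src/components/Header.tsx": "export function Header() { return null; }",
});
console.log(report);
```

Substring checks like these are deliberately crude; they catch a skipped step, while finer questions about how Tailwind is configured are better left to a rubric.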

3. Manually trigger the skill to expose hidden assumptions

Because skill invocation depends so much on the name and description in SKILL.md, the first thing to check is whether the setup-demo-app skill triggers when you expect it to.

Early on, activate the skill explicitly in a real repository or a scratch directory, either via the /skills slash command or by referencing it with the $ prefix, and watch where it breaks. This is where you surface the misses: cases where the skill doesn’t trigger at all, triggers too eagerly, or runs but deviates from the intended steps.

At this stage, you’re not optimizing for speed or polish. You’re looking for hidden assumptions the skill is making, such as:

Triggering assumptions: Prompts like “set up a quick React demo” that should invoke setup-demo-app but don’t, or more generic prompts (“add Tailwind styling”) that unintentionally trigger it.
Environment assumptions: The skill assumes it’s running in an empty directory, or that npm is available and preferred over other package managers.
Execution assumptions: The agent skips npm install because it assumes dependencies are already installed, or configures Tailwind before the Vite project exists.

Once you’re ready to make these runs repeatable, switch to codex exec. It’s designed for automation and CI: it streams progress to stderr and writes only the final result to stdout, which makes runs easier to script, capture, and inspect.

By default, codex exec runs in a restricted sandbox. If your task needs to write files, run it with --full-auto. As a general rule, especially when automating, use the least permissions needed to get the job done.

A basic manual run might look like:

codex exec --full-auto \
  'Use the $setup-demo-app skill to create the project in this directory.'

This first hands-on pass is less about validating correctness and more about discovering edge cases. Every manual fix you make here, such as adding a missing npm install, correcting the Tailwind setup, or tightening the trigger description, is a candidate for a future eval, so you can lock in the intended behavior before evaluating at scale.

4. Use a small, targeted prompt set to catch regressions early

You don’t need a large benchmark to get value from evals. For a single skill, a small set of 10–20 prompts is enough to surface regressions and confirm improvements early.

Start with a small CSV and grow it over time as you encounter real failures during development or usage. Each row should represent a situation where you care whether the setup-demo-app skill does or does not activate, and what success looks like when it does.

For example, an initial evals/setup-demo-app.prompts.csv might look like this:

id,should_trigger,prompt
test-01,true,"Create a demo app named `devday-demo` using the $setup-demo-app skill"
test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"
test-03,true,"Create a small demo app to showcase the Responses API"
test-04,false,"Add Tailwind styling to my existing React app"

Each of these cases is testing something slightly different:

Explicit invocation (test-01)
This prompt names the skill directly. It ensures that Codex can invoke setup-demo-app when asked, and that changes to the skill’s name, description, or instructions don’t break direct usage.
Implicit invocation (test-02)
This prompt describes exactly the scenario the skill targets, setting up a minimal React + Tailwind demo, without mentioning the skill by name. It tests whether the name and description in SKILL.md are strong enough for Codex to select the skill on its own.
Contextual invocation (test-03)
This prompt adds domain context (the Responses API) but still requires the same underlying setup. It checks that the skill triggers in realistic, slightly noisy prompts, and that the resulting app still matches the expected structure and conventions.
Negative control (test-04)
This prompt should not invoke setup-demo-app. It’s a common adjacent request (“add Tailwind to an existing app”) that can unintentionally match the skill’s description (“React + Tailwind demo”). Including at least one should_trigger=false case helps catch false positives, where Codex selects the skill too eagerly and scaffolds a new project when the user wanted an incremental change to an existing one.
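Scoring the trigger behavior across these four kinds of cases can be sketched as a small tally of hits and misses. The example assumes your harness has already decided, per prompt, whether the skill was invoked; how to detect that from a run is not covered here, so `didTrigger` is simply an input. The name `summarizeTriggers` and the sample outcomes are made up for illustration.

```javascript
// Each result pairs the expectation from the prompt set with what happened.
function summarizeTriggers(results) {
  const summary = {
    truePositives: 0,
    trueNegatives: 0,
    falsePositives: [], // triggered when it should not have
    falseNegatives: [], // failed to trigger when it should have
  };
  for (const r of results) {
    if (r.shouldTrigger && r.didTrigger) summary.truePositives += 1;
    else if (!r.shouldTrigger && !r.didTrigger) summary.trueNegatives += 1;
    else if (r.didTrigger) summary.falsePositives.push(r.id);
    else summary.falseNegatives.push(r.id);
  }
  return summary;
}

// Made-up outcomes for the four example prompts.
const triggerSummary = summarizeTriggers([
  { id: "test-01", shouldTrigger: true, didTrigger: true },
  { id: "test-02", shouldTrigger: true, didTrigger: false },
  { id: "test-03", shouldTrigger: true, didTrigger: true },
  { id: "test-04", shouldTrigger: false, didTrigger: true },
]);
console.log(triggerSummary);
```

Reporting the ids of the misses, not just counts, points straight at the prompts whose wording the skill's description fails to handle.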

This mix is intentional. Some evals should confirm that the skill behaves correctly when invoked explicitly; others should check that it activates in real-world prompts where the user never mentions the skill at all.

As you discover misses (prompts that fail to trigger the skill, or cases where the output drifts from your expectations), add them as new rows. Over time, this small CSV becomes a living record of the scenarios the setup-demo-app skill must continue to get right.
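Loading the prompt CSV into a harness takes only a few lines. The sketch below is one possible approach, written without dependencies: `parseCsvLine` and `loadPromptRows` are hypothetical helpers, and the parser handles quoted fields and doubled quotes but not newlines inside a field. In a real harness you would read the text from evals/setup-demo-app.prompts.csv instead of an inline string.

```javascript
// Split one CSV line into fields, honoring double-quoted fields.
function parseCsvLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Turn CSV text with an id,should_trigger,prompt header into row objects.
function loadPromptRows(csvText) {
  const [header, ...lines] = csvText
    .split(/\r?\n/)
    .filter((l) => l.trim() !== "");
  const columns = parseCsvLine(header);
  return lines.map((line) => {
    const values = parseCsvLine(line);
    const row = Object.fromEntries(columns.map((c, i) => [c, values[i] ?? ""]));
    return {
      id: row.id,
      shouldTrigger: row.should_trigger === "true",
      prompt: row.prompt,
    };
  });
}

const csvText = [
  "id,should_trigger,prompt",
  'test-01,true,"Create a demo app named `devday-demo` using the $setup-demo-app skill"',
  'test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"',
  'test-03,true,"Create a small demo app to showcase the Responses API"',
  'test-04,false,"Add Tailwind styling to my existing React app"',
].join("\n");

const rows = loadPromptRows(csvText);
console.log(rows.length, rows[3]);
```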


5. Get started with lightweight deterministic graders

This is the core of the evaluation step: use codex exec --json so your eval harness can score what actually happened, not just whether the final output looks right.

When you enable --json, stdout becomes a JSONL stream of structured events. That makes it straightforward to write deterministic checks tied directly to the behavior you care about, for example:

Did it run npm install?
Did it create package.json?
Did it invoke the expected commands, in the expected order?

These checks are intentionally lightweight. They give you fast, explainable signals before you add any model-based grading.

A minimal Node.js runner

A “good enough” approach looks like this:

For each prompt, run codex exec --json --full-auto "<prompt>"
Save the JSONL trace to disk
Parse the trace and run deterministic checks over the events

// evals/run-setup-demo-app-evals.mjs
import { spawnSync } from "node:child_process";
import { readFileSync, writeFileSync, existsSync, mkdirSync } from "node:fs";
import path from "node:path";

function runCodex(prompt, outJsonlPath) {
  const res = spawnSync(
    "codex",
    [
      "exec",
      "--json", // REQUIRED: emit structured events
      "--full-auto", // Allow file system changes
      prompt,
    ],
    { encoding: "utf8" }
  );

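  // spawnSync reports launch failures (for example, codex not on PATH) via res.error
  if (res.error) throw res.error;

  mkdirSync(path.dirname(outJsonlPath), { recursive: true });

  // stdout is JSONL when --json is enabled
  writeFileSync(outJsonlPath, res.stdout ?? "", "utf8");

  return { exitCode: res.status ?? 1, stderr: res.stderr };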
}

function parseJsonl(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}

// deterministic check: did the agent run `npm install`?
function checkRanNpmInstall(events) {
  return events.some(
    (e) =>
      (e.type === "item.started" || e.type === "item.completed") &&
      e.item?.type === "command_execution" &&
      typeof e.item?.command === "string" &&
      e.item.command.includes("npm install")
  );
}

// deterministic check: did `package.json` get created?
function checkPackageJsonExists(projectDir) {
  return existsSync(path.join(projectDir, "package.json"));
}

// Example single-case run
const projectDir = process.cwd();
const tracePath = path.join(projectDir, "evals", "artifacts", "test-01.jsonl");

const prompt =
  "Create a demo app named demo-app using the $setup-demo-app skill";

runCodex(prompt, tracePath);

const events = parseJsonl(readFileSync(tracePath, "utf8"));

console.log({
  ranNpmInstall: checkRanNpmInstall(events),
  hasPackageJson: checkPackageJsonExists(path.join(projectDir, "demo-app")),
});

The value here is that everything is deterministic and debuggable.

If a check fails, you can open the JSONL file and see exactly what happened. Every command execution appears as an item.* event, in order. That makes regressions straightforward to explain and fix, which is exactly what you want at this stage.
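Because the events arrive in order, the "expected commands, in the expected order" question can be answered with a subsequence check. The sketch below reuses the event shape from the runner above (`item.completed` events whose `item.type` is `command_execution`); the helper names and the sample events are made up for illustration, and matching by substring is a deliberate simplification.

```javascript
// Pull the command strings out of a parsed trace, in execution order.
function commandsFromTrace(events) {
  return events
    .filter(
      (e) =>
        e.type === "item.completed" &&
        e.item?.type === "command_execution" &&
        typeof e.item.command === "string"
    )
    .map((e) => e.item.command);
}

// True when every expected fragment appears, in order, across the commands.
// Fragments chained inside one command ("a && b") are matched left to right.
function ranInOrder(commands, expected) {
  let next = 0;
  for (const cmd of commands) {
    let pos = 0;
    while (next < expected.length) {
      const idx = cmd.indexOf(expected[next], pos);
      if (idx === -1) break;
      pos = idx + expected[next].length;
      next += 1;
    }
  }
  return next === expected.length;
}

// Made-up events shaped like the ones the runner above inspects.
const sampleEvents = [
  { type: "item.completed", item: { type: "command_execution", command: "npm create vite@latest demo-app -- --template react-ts" } },
  { type: "item.completed", item: { type: "command_execution", command: "cd demo-app && npm install" } },
  { type: "item.completed", item: { type: "command_execution", command: "npm install tailwindcss @tailwindcss/vite" } },
];

const commands = commandsFromTrace(sampleEvents);
console.log(ranInOrder(commands, ["npm create vite", "npm install", "@tailwindcss/vite"])); // true
console.log(ranInOrder(commands, ["@tailwindcss/vite", "npm create vite"])); // false
```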

6. Conduct qualitative checks with Codex and rubric-based grading

Deterministic checks answer “did it do the basics?” but they don’t answer “did it do it the way you wanted?”

For skills like setup-demo-app, many requirements are qualitative: component structure, styling conventions, or whether Tailwind follows the intended configuration. These are hard to capture with basic file existence checks or command counts alone.

A pragmatic solution is to add a second, model-assisted step to your eval pipeline:

Run the setup skill (this writes code to disk)
Run a read-only style check against the resulting repository
Require a structured response that your harness can score consistently

Codex supports this directly via --output-schema, which constrains the final response to a JSON Schema you define.

A small rubric schema

Start by defining a small schema that captures the checks you care about. For example, create evals/style-rubric.schema.json:

{
  "type": "object",
  "properties": {
    "overall_pass": { "type": "boolean" },
    "score": { "type": "integer", "minimum": 0, "maximum": 100 },
    "checks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "string" },
          "pass": { "type": "boolean" },
          "notes": { "type": "string" }
        },
        "required": ["id", "pass", "notes"],
        "additionalProperties": false
      }
    }
  },
  "required": ["overall_pass", "score", "checks"],
  "additionalProperties": false
}

This schema gives you stable fields (overall_pass, score, per-check results) that you can combine, diff, and track over time.

The style-check prompt

Next, run a second codex exec that only inspects the repository and emits a rubric-compliant JSON response:

codex exec \
  "Evaluate the demo-app repository against these requirements:
   - Vite + React + TypeScript project exists
   - Tailwind is configured via @tailwindcss/vite and CSS imports tailwindcss
   - src/components contains Header.tsx and Card.tsx
   - Components are functional and styled with Tailwind utility classes (no CSS modules)
   Return a rubric result as JSON with check ids: vite, tailwind, structure, style." \
  --output-schema ./evals/style-rubric.schema.json \
  -o ./evals/artifacts/test-01.style.json

This is where --output-schema is handy. Instead of free-form text that’s hard to parse or compare, you get a predictable JSON object that your eval harness can score across many runs.
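On the harness side, scoring that JSON object might look like the sketch below. It assumes the object has already been parsed (for example, with JSON.parse on the file written by -o); `gradeRubric` is a hypothetical helper, and the example values are invented. The extra check for missing ids is a design choice: the schema constrains the shape of each check but not which ids appear.

```javascript
// Check ids requested in the style-check prompt.
const EXPECTED_IDS = ["vite", "tailwind", "structure", "style"];

// Combine the model's verdict with our own view of the per-check results.
function gradeRubric(result, expectedIds = EXPECTED_IDS) {
  const byId = new Map(result.checks.map((c) => [c.id, c]));
  const missing = expectedIds.filter((id) => !byId.has(id));
  const failed = expectedIds.filter((id) => byId.has(id) && !byId.get(id).pass);
  return {
    pass: result.overall_pass && missing.length === 0 && failed.length === 0,
    score: result.score,
    missing,
    failed,
  };
}

// Example rubric output (made-up values, shaped like the schema above).
const exampleResult = {
  overall_pass: false,
  score: 75,
  checks: [
    { id: "vite", pass: true, notes: "Vite + React + TS project present" },
    { id: "tailwind", pass: true, notes: "@tailwindcss/vite configured" },
    { id: "structure", pass: true, notes: "Header.tsx and Card.tsx found" },
    { id: "style", pass: false, notes: "Card.tsx imports a CSS module" },
  ],
};

console.log(gradeRubric(exampleResult));
```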

If you later move this eval suite into CI, the Codex GitHub Action explicitly supports passing --output-schema through codex-args, so you can enforce the same structured output in automated workflows.

7. Extending your evals as the skill matures

Once you have the core loop in place, you can extend your evals in the directions that matter most for your skill. Start small, then layer in deeper checks only where they add real confidence.

Some examples include:

Command count and thrashing: Count command_execution items in the JSONL trace to catch regressions where the agent starts looping or re-running commands. Token usage is also available in turn.completed events.
Token budget: Track usage.input_tokens and usage.output_tokens to spot accidental prompt bloat and compare efficiency across versions.
Build checks: Run npm run build after the skill completes. This acts as a stronger end-to-end signal and catches broken imports or incorrectly configured tooling.
Runtime smoke checks: Start npm run dev and hit the dev server with curl, or run a lightweight Playwright check if you already have one. Use this selectively. It adds confidence but costs time.
Repository cleanliness: Ensure the run generates no unwanted files and that git status --porcelain is empty (or matches an explicit allow list).
Sandbox and permission regressions: Verify the skill still works without escalating permissions beyond what you intended. Least-privilege defaults matter most once you automate.

The pattern is consistent: begin with fast checks that explain behavior, then add slower, heavier checks only when they reduce risk.
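Three of the fast checks from the list above (command count, token usage, and repository cleanliness) can be sketched as small pure functions. The helper names are hypothetical. The code assumes `usage` sits directly on each `turn.completed` event, and that the porcelain text comes from running git status --porcelain yourself; it ignores renames and quoted paths.

```javascript
// Count finished command executions; item.started is skipped so each
// command is counted once.
function countCommands(events) {
  return events.filter(
    (e) => e.type === "item.completed" && e.item?.type === "command_execution"
  ).length;
}

// Sum token usage over all turn.completed events (assumed shape: e.usage).
function sumTokens(events) {
  let input = 0;
  let output = 0;
  for (const e of events) {
    if (e.type !== "turn.completed") continue;
    input += e.usage?.input_tokens ?? 0;
    output += e.usage?.output_tokens ?? 0;
  }
  return { input, output };
}

// Paths from `git status --porcelain` output that are not on the allow list.
// Each line is two status characters, a space, then the path.
function unexpectedChanges(porcelain, allowedPrefixes) {
  return porcelain
    .split("\n")
    .filter((line) => line.length > 3)
    .map((line) => line.slice(3))
    .filter((p) => !allowedPrefixes.some((prefix) => p.startsWith(prefix)));
}

// Made-up trace and status output for illustration.
const traceEvents = [
  { type: "item.completed", item: { type: "command_execution", command: "npm install" } },
  { type: "item.completed", item: { type: "command_execution", command: "npm run build" } },
  { type: "turn.completed", usage: { input_tokens: 1200, output_tokens: 300 } },
];
const porcelainOutput = "?? demo-app/\n?? debug.log\n M README.md\n";

console.log(countCommands(traceEvents)); // 2
console.log(sumTokens(traceEvents)); // { input: 1200, output: 300 }
console.log(unexpectedChanges(porcelainOutput, ["demo-app/"])); // [ 'debug.log', 'README.md' ]
```

Comparing these numbers against a budget recorded from a known-good run is usually more informative than fixed thresholds.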

8. Key takeaways

This small setup-demo-app example shows the shift from “it feels better” to “proof”: run the agent, record what happened, and grade it with a small set of checks. Once that loop exists, every tweak becomes easier to confirm, and every regression becomes clear. Here are the key takeaways:

Measure what matters. Good evals make regressions clear and failures explainable.
Start from a checkable definition of done. Use $skill-creator to bootstrap, then tighten the instructions until success is unambiguous.
Ground evals in behavior. Capture JSONL with codex exec --json and write deterministic checks against command_execution events.
Use Codex where rules fall short. Add a structured, rubric-based pass with --output-schema to grade style and conventions reliably.
Let real failures drive coverage. Every manual fix is a signal. Turn it into a test so the skill keeps getting it right.