Evaluation best practices | OpenAI API

本页介绍了如何设计 AI 系统的评估(evals),包括评估流程、不同架构下的评估要点、评估类型及处理边缘情况,以应对 AI 输出的可变性。

Generative AI is variable. Models sometimes produce different output from the same input, which makes traditional software testing methods insufficient for AI architectures. Evaluations (evals) are a way to test your AI system despite this variability.

生成式AI存在可变性。模型有时会对相同的输入产生不同的输出,这使得传统的软件测试方法对于AI架构来说不够充分。评估(evals)是一种尽管存在这种可变性,也能测试你的AI系统的方式。

This guide provides high-level guidance on designing evals. To get started with the Evals API, see evaluating model performance.

本指南提供了关于设计评估(evals)的高级指导。要开始使用 Evals API,请参见 评估模型性能

OpenAI is deprecating the Evals platform. Existing evals content remains available during the transition window. Evals will become read-only for existing users on October 31, 2026, and the platform is scheduled to shut down on November 30, 2026. See the deprecations page for the current timeline.

OpenAI 正在弃用 Evals 平台。在过渡期内,现有的 Evals 内容仍然可用。对于现有用户,Evals 将于 2026 年 10 月 31 日转为只读模式,平台计划于 2026 年 11 月 30 日关闭。当前时间线请参阅弃用页面

What are evals?

什么是evals?

Evals are structured tests for measuring a model’s performance. They help ensure accuracy, performance, and reliability, despite the nondeterministic nature of AI systems. They’re also one of the only ways to improve performance of an LLM-based application (through fine-tuning).

评估是用于衡量模型性能的结构化测试。尽管AI系统具有非确定性本质,它们有助于确保准确性、性能和可靠性。它们也是少数几种能够改进基于LLM的应用性能的方式之一(通过微调)。

Types of evals

评估类型

When you see the word “evals,” it could refer to a few things:

当你看到单词“evals”时,它可能指代以下几件事:

  • Industry benchmarks for comparing models in isolation, like MMLU and those listed on HuggingFace’s leaderboard
  • Standard numerical scores—like ROUGE, BERTScore —that you can use as you design evals for your use case
  • Specific tests you implement to measure your LLM application’s performance
  • 用于单独比较模型性能的行业基准,例如 MMLU 以及 HuggingFace 排行榜 上列出的基准
  • 标准数值评分——例如 ROUGEBERTScore——可用于为你的使用场景设计评估时使用
  • 你实施的具体测试,用于衡量 LLM 应用的表现

This guide is about the third type: designing your own evals.

本指南讨论的是第三种类型:设计你自己的评估。

How to read evals

如何阅读评估

You’ll often see numerical eval scores between 0 and 1. There’s more to evals than just scores. Combine metrics with human judgment to ensure you’re answering the right questions.

你经常看到介于0和1之间的数值评估分数。评估不仅仅是分数。将指标与人类判断相结合,以确保你在回答正确的问题。

Evals tips

评估技巧

  • Adopt eval-driven development: Evaluate early and often. Write scoped tests at every stage.
  • Design task-specific evals: Make tests reflect model capability in real-world distributions.
  • Log everything: Log as you develop so you can mine your logs for good eval cases.
  • Automate when possible: Structure evaluations to allow for automated scoring.
  • It’s a journey, not a destination: Evaluation is a continuous process.
  • Maintain agreement: Use human feedback to calibrate automated scoring.
  • 采用评估驱动开发:尽早评估、频繁评估。在每个阶段编写范围明确的测试。
  • 设计任务特定的评估:使测试能够反映模型在真实世界分布中的能力。
  • 记录一切:开发时做好日志记录,以便从中挖掘优质的评估案例。
  • 尽可能自动化:构建评估结构,使其支持自动评分。
  • 这是一个持续的过程,而非终点:评估是持续迭代的工作。
  • 保持一致:利用人工反馈校准自动评分。

Anti-patterns

反模式

  • Overly generic metrics: Relying solely on academic metrics like perplexity or BLEU score.
  • Biased design: Creating eval datasets that don’t faithfully reproduce production traffic patterns.
  • Vibe-based evals: Using “it seems like it’s working” as an evaluation strategy, or waiting until you ship before implementing any evals.
  • Ignoring human feedback: Not calibrating your automated metrics against human evals.
  • 过于通用的指标:仅依赖困惑度或BLEU分数等学术指标。
  • 有偏设计:构建无法忠实还原实际流量模式的评估数据集。
  • 凭感觉评估:将“看起来好像能行”作为评估标准,或等到产品上线前才匆匆忙忙建立评估机制。
  • 忽略人类反馈:未根据人类评估来校准自动化指标。

Design your eval process

设计您的评估流程

There are a few important components of an eval workflow:

评估工作流有几个重要组成部分:

  1. Define eval objective. What’s the success criteria for the eval?
  2. Collect dataset. Which data will help you evaluate against your objective? Consider synthetic eval data, domain-specific eval data, purchased eval data, human-curated eval data, production data, and historical data.
  3. Define eval metrics. How will you check that the success criteria are met?
  4. Run and compare evals. Iterate and improve model performance for your task or system.
  5. Continuously evaluate. Set up continuous evaluation (CE) to run evals on every change, monitor your app to identify new cases of nondeterminism, and grow the eval set over time.
  1. 定义评估目标。评估成功的标准是什么?
  2. 收集数据集。哪些数据有助于你根据目标进行评估?考虑合成评估数据、特定领域评估数据、购买评估数据、人工筛选评估数据、生产数据以及历史数据。
  3. 定义评估指标。你将如何检查成功标准是否达成?
  4. 运行并比较评估。为你的任务或系统迭代并改进模型性能。
  5. 持续评估。建立持续评估机制,对每次变更运行评估,监控应用以识别新的非确定性情况,并随时间推移扩展评估集。

Let’s run through a few examples.

让我们来看几个例子。

Example: Summarizing transcripts

示例:转录文本摘要

To test your LLM-based application’s ability to summarize transcripts, your eval design might be:

要测试你的基于LLM的应用对转录文本的总结能力,你的评估设计可能是:

  1. Define eval objective
    The model should be able to compete with reference summaries for relevance and accuracy.
  2. Collect dataset
    Use a mix of production data (collected from user feedback on generated summaries) and datasets created by domain experts (writers) to determine a “good” summary.
  3. Define eval metrics
    On a held-out set of 1000 reference transcripts → summaries, the implementation should achieve a ROUGE-L score of at least 0.40 and coherence score of at least 80% using G-Eval.
  4. Run and compare evals
    Use the Evals API to create and run evals in the OpenAI dashboard.
  5. Continuously evaluate
    Set up continuous evaluation (CE) to run evals on every change, monitor your app to identify new cases of nondeterminism, and grow the eval set over time.
  1. 定义评估目标
    模型应能够与参考摘要就相关性和准确性展开竞争。
  2. 收集数据集
    混合使用生产数据(从用户对生成摘要的反馈中收集)以及领域专家(写作者)创建的数据集,以确定何谓“好的”摘要。
  3. 定义评估指标
    在包含1000条参考转录文本→摘要的保留集上,该实现应使用G-Eval达到至少0.40的ROUGE-L分数和至少80%的连贯性分数。
  4. 运行并比较评估
    使用Evals API在OpenAI仪表盘中创建并运行评估。
  5. 持续评估
    设置持续评估(CE),以便在每次变更时运行评估,监控您的应用程序以识别新的非确定性案例,并随着时间的推移扩大评估集。

LLMs are better at discriminating between options. Therefore, evaluations should focus on tasks like pairwise comparisons, classification, or scoring against specific criteria instead of open-ended generation. Aligning evaluation methods with LLMs’ strengths in comparison leads to more reliable assessments of LLM outputs or model comparisons.

LLM更擅长区分不同选项。因此,评估应侧重于成对比较、分类或按特定标准评分等任务,而非开放式生成。将评估方法与LLM在比较上的优势对齐,能更可靠地评估LLM的输出或进行模型比较。

Example: Q&A over docs

示例:基于文档的问答

To test your LLM-based application’s ability to do Q&A over docs, your eval design might be:

为了测试你的基于LLM的应用程序在文档上进行问答的能力,你的评估设计可能是:

  1. Define eval objective
    The model should be able to provide precise answers, recall context as needed to reason through user prompts, and provide an answer that satisfies the user’s need.
  2. Collect dataset
    Use a mix of production data (collected from users’ satisfaction with answers provided to their questions), hard-coded correct answers to questions created by domain experts, and historical data from logs.
  3. Define eval metrics
    Context recall of at least 0.85, context precision of over 0.7, and 70+% positively rated answers.
  4. Run and compare evals
    Use the Evals API to create and run evals in the OpenAI dashboard.
  5. Continuously evaluate
    Set up continuous evaluation (CE) to run evals on every change, monitor your app to identify new cases of nondeterminism, and grow the eval set over time.
  1. 定义评估目标
    模型应能够提供精确的答案,根据需要回忆上下文以推理用户的提示,并给出满足用户需求的回答。
  2. 收集数据集
    混合使用生产数据(从用户对其问题所获答案的满意度中收集)、领域专家创建的硬编码正确答案以及历史日志数据。
  3. 定义评估指标
    上下文召回率不低于 0.85,上下文精确率超过 0.7,且 70% 以上的答案获得正面评价。
  4. 运行并比较评估
    使用 Evals API 在 OpenAI 仪表盘中创建并运行评估。
  5. 持续评估
    建立持续评估 (CE) 机制,针对每次变更运行评估,监控应用程序以发现新的非确定性情况,并随时间推移扩展评估集。

When creating an eval dataset,

在创建评估数据集时,

gpt-5.5

is useful for collecting eval examples and edge cases. Consider using it to help you generate a diverse set of test data across various scenarios. Ensure your test data includes typical cases, edge cases, and adversarial cases. Use human expert labellers.

这对收集评估示例和边缘案例很有用。考虑用它来帮助你生成跨各种场景的多样化测试数据集。确保你的测试数据包含典型情况、边缘情况和对抗情况。使用人类专家标注员。

Identify where you need evals

确定你需要评估的地方

Complexity increases as you move from simple to more complex architectures. Here are four common architecture patterns:

随着从简单架构向更复杂架构的演进,复杂性也随之增加。以下是四种常见的架构模式:

Read about each architecture below to identify where nondeterminism enters your system. That’s where you’ll want to implement evals.

阅读下面每种架构,找出非确定性进入你系统的地方。那就是你想要实现评估的地方。

Single-turn model interactions

单轮模型交互

In this kind of architecture, the user provides input to the model, and the model processes these inputs (along with any developer prompts provided) to generate a corresponding output.

在这种架构中,用户向模型提供输入,模型会处理这些输入(以及任何开发者提供的提示词)并生成相应的输出。

Example

示例

As an example, consider an online retail scenario. Your system prompt instructs the model to categorize the customer’s question into one of the following:

作为示例,考虑一个在线零售场景。你的系统提示指示模型将客户的问题归类为以下之一:

  • order_status
  • return_policy
  • technical_issue
  • cancel_order
  • other
  • order_status订单状态
  • return_policy退货政策
  • technical_issue技术问题
  • cancel_order取消订单
  • other其他

To ensure a consistent, efficient user experience, the model should only return the label that matches user intent. Let’s say the customer asks, “What’s the status of my order?”

为了确保一致且高效的用户体验,模型应仅返回与用户意图匹配的标签。假设客户询问:“我的订单状态如何?”

Nondeterminism introducedCorresponding area to evaluateExample eval questions
Inputs provided by the developer and userInstruction following: Does the model accurately understand and act according to the provided instructions? Instruction following: Does the model prioritize the system prompt over a conflicting user prompt?Does the model stay focused on the triage task or get swayed by the user’s question?
Outputs generated by the modelFunctional correctness: Are the model’s outputs accurate, relevant, and thorough enough to fulfill the intended task or objective?Does the model’s determination of intent correctly match the expected intent?

Workflow architectures

工作流架构

As you look to solve more complex problems, you’ll likely transition from a single-turn model interaction to a multistep workflow that chains together several model calls. Workflows don’t introduce any new elements of nondeterminism, but they involve multiple underlying model interactions, which you can evaluate in isolation.

随着您着手解决更复杂的问题,您可能会从单轮模型交互过渡到将多个模型调用串联起来的多步骤工作流。工作流不会引入任何新的不确定性因素,但由于涉及多个底层模型交互,您可以对这些交互进行独立评估。

Example

示例

Take the same example as before, where the customer asks about their order status. A workflow architecture triages the customer request and routes it through a step-by-step process:

沿用之前的例子,客户询问订单状态。工作流架构对客户请求进行分诊,并通过逐步流程进行路由:

  1. Extracting an Order ID
  2. Looking up the order details
  3. Providing the order details to a model for a final response
  1. 提取订单ID
  2. 查找订单详情
  3. 将订单详情提供给模型以生成最终回复

Each step in this workflow has its own system prompt that the model must follow, putting all fetched data into a friendly output.

此工作流程中的每一步都有其自己的系统提示,模型必须遵循该提示,将所有获取的数据转化为友好的输出。

Nondeterminism introducedCorresponding area to evaluateExample eval questions
Inputs provided by the developer and userInstruction following: Does the model accurately understand and act according to the provided instructions? Instruction following: Does the model prioritize the system prompt over a conflicting user prompt?Does the model stay focused on the triage task or get swayed by the user’s question? Does the model follow instructions to attempt to extract an Order ID? Does the final response include the order status, estimated arrival date, and tracking number?
Outputs generated by the modelFunctional correctness: Are the model’s outputs are accurate, relevant, and thorough enough to fulfill the intended task or objective?Does the model’s determination of intent correctly match the expected intent? Does the final response have the correct order status, estimated arrival date, and tracking number?

Single-agent architectures

单智能体架构

Unlike workflows, agents solve unstructured problems that require flexible decision making. An agent has instructions and a set of tools and dynamically selects which tool to use. This introduces a new opportunity for nondeterminism.

不同于工作流,代理能够解决需要灵活决策的非结构化问题。代理拥有指令和一系列工具,并动态选择使用哪个工具。这引入了非确定性的新可能。

Tools are developer defined chunks of code that the model can execute. This can range from small helper functions to API calls for existing services. For example, check_order_status(order_id) could be a tool, where it takes the argument order_id and calls an API to check the order status.

工具是开发者定义的、模型可以执行的代码块。这些代码可以是小型辅助函数,也可以是对现有服务的API调用。例如,check_order_status(order_id) 可以是一个工具,它接受参数 order_id 并调用API来检查订单状态。

Example

示例

Let’s adapt our customer service example to use a single agent. The agent has access to three distinct tools:

  • Order lookup tool
  • Password reset tool
  • Product FAQ tool
  • 订单查询工具
  • 密码重置工具
  • 产品常见问题解答工具

When the customer asks about their order status, the agent dynamically decides to either invoke a tool or respond to the customer. For example, if the customer asks, “What is my order status?” the agent can now follow up by requesting the order ID from the customer. This helps create a more natural user experience.

当客户询问订单状态时,客服人员可动态决定是调用工具还是直接回复客户。例如,若客户提问“我的订单状态如何?”,客服现在可以进一步向客户索要订单ID。这有助于创建更自然的用户体验。

NondeterminismCorresponding area to evaluateExample eval questions
Inputs provided by the developer and userInstruction following: Does the model accurately understand and act according to the provided instructions? Instruction following: Does the model prioritize the system prompt over a conflicting user prompt?Does the model stay focused on the triage task or get swayed by the user’s question? Does the model follow instructions to attempt to extract an Order ID?
Outputs generated by the modelFunctional correctness: Are the model’s outputs are accurate, relevant, and thorough enough to fulfill the intended task or objective?Does the model’s determination of intent correctly match the expected intent?
Tools chosen by the modelTool selection: Evaluations that test whether the agent is able to select the correct tool to use. Data precision: Evaluations that verify the agent calls the tool with the correct arguments. Typically these arguments are extracted from the conversation history, so the goal is to validate this extraction was correct.When the user asks about their order status, does the model correctly recommend invoking the order lookup tool? Does the model correctly extract the user-provided order ID to the lookup tool?

Multi-agent architectures

多智能体架构

As you add tools and tasks to your single-agent architecture, the model may struggle to follow instructions or select the correct tool to call. Multi-agent architectures help by creating several distinct agents who specialize in different areas. This triaging and handoff among multiple agents introduces a new opportunity for nondeterminism.

当您向单智能体架构添加工具和任务时,模型可能会难以遵循指令或选择正确的工具进行调用。多智能体架构通过创建多个专注于不同领域的独特智能体来提供帮助。这种多个智能体之间的分类和交接引入了新的不确定性机会。

The decision to use a multi-agent architecture should be driven by your evals. Starting with a multi-agent architecture adds unnecessary complexity that can slow down your time to production.

采用多智能体架构的决定应由你的评估结果来驱动。一开始就使用多智能体架构会增加不必要的复杂性,从而拖慢你投产的速度。

Example

示例

Splitting the single-agent example into a multi-agent architecture, we’ll have four distinct agents:

将单代理示例拆分为多代理架构,我们将拥有四个不同的代理:

  1. Triage agent
  2. Order agent
  3. Account management agent
  4. Sales agent
  1. 分流代理
  2. 订单代理
  3. 账户管理代理
  4. 销售代理

When the customer asks about their order status, the triage agent may hand off the conversation to the order agent to look up the order. If the customer changes the topic to ask about a product, the order agent should hand the request back to the triage agent, who then hands off to the sales agent to fetch product information.

当客户询问订单状态时,分流代理可将对话转交给订单代理以查询订单。若客户转而询问产品信息,订单代理应将请求交回分流代理,再由分流代理转交给销售代理获取产品详情。

NondeterminismCorresponding area to evaluateExample eval questions
Inputs provided by the developer and userInstruction following: Does the model accurately understand and act according to the provided instructions? Instruction following: Does the model prioritize the system prompt over a conflicting user prompt?Does the model stay focused on the triage task or get swayed by the user’s question? Assuming the lookup_order call returned, does the order agent return a tracking number and delivery date (doesn’t have to be the correct one)?
Outputs generated by the modelFunctional correctness: Are the model’s outputs are accurate, relevant, and thorough enough to fulfill the intended task or objective?Does the model’s determination of intent correctly match the expected intent? Assuming the lookup_order call returned, does the order agent provide the correct tracking number and delivery date in its response? Does the order agent follow system instructions to ask the customer their reason for requesting a return before processing the return?
Tools chosen by the modelTool selection: Evaluations that test whether the agent is able to select the correct tool to use. Data precision: Evaluations that verify the agent calls the tool with the correct arguments. Typically these arguments are extracted from the conversation history, so the goal is to validate this extraction was correct.Does the order agent correctly call the lookup order tool? Does the order agent correctly call the refund_order tool? Does the order agent call the lookup order tool with the correct order ID? Does the account agent correctly call the reset_password tool with the correct account ID?
Agent handoffAgent handoff accuracy: Evaluations that test whether each agent can appropriately recognize the decision boundary for triaging to another agentWhen a user asks about order status, does the triage agent correctly pass to the order agent? When the user changes the subject to talk about the latest product, does the order agent hand back control to the triage agent?

Create and combine different types of evaluators

创建并组合不同类型的评估器

As you design your own evals, there are several specific evaluator types to choose from. Another way to think about this is what role you want the evaluator to play.

在设计自己的评估时,有几种特定的评估者类型可供选择。另一种思考方式是,你希望评估者扮演什么角色。

Metric-based evals

基于指标的评估

Quantitative evals provide a numerical score you can use to filter and rank results. They provide useful benchmarks for automated regression testing.

定量评估提供一个数值评分,可用于筛选和排名结果。它们为自动化回归测试提供了有用的基准。

  • Examples: Exact match, string match, ROUGE/BLEU scoring, function call accuracy, executable evals (executed to assess functionality or behavior—e.g., text2sql)
  • Challenges: May not be tailored to specific use cases, may miss nuance
  • 示例:精确匹配、字符串匹配、ROUGE/BLEU评分、函数调用准确率、可执行评估(通过执行来评估功能或行为——例如 text2sql)
  • 挑战:可能无法针对特定用例进行定制,可能遗漏细微差异

Human evals

人工评估

Human judgment evals provide the highest quality but are slow and expensive.

人工判断评估质量最高,但速度慢且成本高。

  • Examples: Skim over system outputs to get a sense of whether they look better or worse; create a randomized, blinded test in which employees, contractors, or outsourced labeling agencies judge the quality of system outputs (e.g., ranking a small set of possible outputs, or giving each a grade of 1-5)
  • Challenges: Disagreement among human experts, expensive, slow
  • Recommendations:
    • Conduct multiple rounds of detailed human review to refine the scorecard
      • Implement a “show rather than tell” policy by providing examples of different score levels (e.g., 1, 3, and 8 out of 10)
      • Include a pass/fail threshold in addition to the numerical score
      • A simple way to aggregate multiple reviewers is to take consensus votes
  • 示例:快速浏览系统输出,判断其优劣;建立随机盲测,由员工、承包商或外包标注机构评估系统输出的质量(例如,对少量可能的输出进行排序,或为每个输出打分1-5分)
  • 挑战:人类专家意见不一致,成本高,速度慢
  • 建议
    • 进行多轮详细的人工审查以完善评分卡
      • 实施“展示而非告知”策略,提供不同分数等级的例子(例如,满分10分中的1分、3分和8分)
      • 除数字评分外,还应包含合格/不合格阈值
      • 汇总多个评审结果的简单方法是进行共识投票

LLM-as-a-judge and model graders

LLM 作为裁判与模型评分器

Using models to judge output is cheaper to run and more scalable than human evaluation. Start with gpt-5.5 when you need a strong LLM judge, then validate agreement against your human labels before optimizing for cost or latency.

使用模型来评判输出比人工评估成本更低、更具扩展性。当你需要一个强大的LLM评判器时,可以从gpt-5.5开始,然后对照你的人工标签验证一致性,再针对成本或延迟进行优化。

  • Examples:
    • Pairwise comparison: Present the judge model with two responses and ask it to determine which one is better based on specific criteria
      • Single answer grading: The judge model evaluates a single response in isolation, assigning a score or rating based on predefined quality metrics
      • Reference-guided grading: Provide the judge model with a reference or “gold standard” answer, which it uses as a benchmark to evaluate the given response
  • Challenges: Position bias (response order), verbosity bias (preferring longer responses)
  • Recommendations:
    • Use pairwise comparison or pass/fail for more reliability
      • Use the most capable model to grade if you can. Start with gpt-5.5, then validate whether a specialized reasoning model performs better for your rubric or reference-answer set
      • Control for response lengths as LLMs bias towards longer responses in general
      • Add reasoning and chain-of-thought as reasoning before scoring improves eval performance
      • Once the LLM judge reaches a point where it’s faster, cheaper, and consistently agrees with human annotations, scale up
      • Structure questions to allow for automated grading while maintaining the integrity of the task—a common approach is to reformat questions into multiple choice formats
      • Ensure eval rubrics are clear and detailed
  • 示例
    • 成对比较:向评判模型展示两个回答,要求其根据特定标准判断哪个更好
      • 单答案评分:评判模型单独评估一个回答,基于预定义的质量指标分配分数或评级
      • 参考引导评分:为评判模型提供参考或“黄金标准”答案,以此作为基准来评估给定回答
  • 挑战:位置偏差(回答顺序)、冗长偏差(偏好更长的回答)
  • 建议
    • 使用成对比较或通过/失败来提高可靠性
      • 若条件允许,使用能力最强的模型进行评分。从 gpt-5.5 开始,然后验证专用推理模型是否在你的评分标准或参考答案集上表现更佳
      • 控制回答长度,因为大语言模型普遍偏向于更长的回答
      • 在评分前加入推理和思维链(chain-of-thought),可提升评估性能
      • 一旦大语言模型评判器达到更快、更便宜且与人工标注高度一致时,即可扩大规模
      • 结构化问题,使其既能进行自动评分又能保持任务完整性——常见方法是将问题改写成多项选择格式
      • 确保评估标准清晰且详细

No strategy is perfect. The quality of LLM-as-Judge varies depending on problem context while using expert human annotators to provide ground-truth labels is expensive and time-consuming.

没有策略是完美的。LLM-as-Judge 的质量因问题背景而异,而使用专家人工标注者提供真实标签既昂贵又耗时。

Handle edge cases

处理边缘情况

While your evaluations should cover primary, happy-path scenarios for each architecture, real-world AI systems frequently encounter edge cases that challenge system performance. Evaluating these edge cases is important for ensuring reliability and a good user experience.

虽然你的评估应涵盖每种架构的主要、快乐路径场景,但现实世界中的AI系统经常会遇到挑战系统性能的边缘情况。评估这些边缘情况对于确保可靠性和良好的用户体验至关重要。

We see these edge cases fall into a few buckets:

我们看到这些边界情况分为几个类别:

Input variability

输入变异性

Because users provide input to the model, our system must be flexible to handle the different ways our users may interact, like:

由于用户向模型提供输入,我们的系统必须足够灵活,以处理用户可能采取的不同交互方式,例如:

  • Non-English or multilingual inputs
  • Formats other than input text (e.g., XML, JSON, Markdown, CSV)
  • Input modalities (e.g., images)
  • 非英语或多语言输入
  • 输入文本以外的格式(例如:XML、JSON、Markdown、CSV)
  • 输入模态(例如:图片)

Your evals for instruction following and functional correctness need to accommodate inputs that users might try.

你对指令遵循和功能正确性的评估需要适应用户可能尝试的输入。

Contextual complexity

上下文复杂度

Many LLM-based applications fail due to poor understanding of the context of the request. This context could be from the user or noise in the past conversation history.

许多基于大语言模型的应用因对请求上下文的理解不足而失败。这些上下文可能来自用户或过往对话历史中的噪声。

Examples include:

示例包括:

  • Multiple questions or intents in a single request
  • Typos and misspellings
  • Short requests with minimal context (e.g., if a user just says: “returns”)
  • Long context or long-running conversations
  • Tool calls that return data with ambiguous property names (e.g., "on: 123", where “on” is the order number)
  • Multiple tool calls, sometimes leading to incorrect arguments
  • Multiple agent handoffs, sometimes leading to circular handoffs
  • 单个请求中包含多个问题或意图
  • 拼写与用词错误
  • 上下文极短的请求(例如用户只说了“退货”)
  • 长上下文或长时间持续的对话
  • 返回数据中属性名含歧义的工具调用(例如 "on: 123" 中“on”指订单号)
  • 多次工具调用,有时导致参数错误
  • 多次智能体转交,有时导致循环转交

Personalization and customization

个性化与定制化

While AI improves UX by adapting to user-specific requests, this flexibility introduces many edge cases. Clearly define evals for use cases you want to specifically support and block:

虽然人工智能通过适应用户特定请求来改善用户体验,但这种灵活性会引入许多边缘情况。请为你希望明确支持或阻止的用例定义清晰的评估标准:

  • Jailbreak attempts to get the model to do something different
  • Formatting requests (e.g., format as JSON, or use bullet points)
  • Cases where user prompts conflict with your system prompts
  • 尝试越狱以使模型执行不同的操作
  • 格式请求(例如,格式化为 JSON,或使用项目符号)
  • 用户提示与系统提示冲突的情况

Use evals to improve performance

使用评估来提升性能

When your evals reach a level of maturity that consistently measures performance, shift to using your evals data to improve your application’s performance.

当您的评估成熟到能够持续衡量性能时,请转而利用这些评估数据来改进应用的表现。

Learn more about reinforcement fine-tuning to create a data flywheel.

了解有关强化微调的更多信息,以创建数据飞轮。

Other resources

其他资源

For more inspiration, visit the OpenAI Cookbook, which contains example code and links to third-party resources, or learn more about our tools for evals:

如需更多灵感,请访问 OpenAI Cookbook,其中包含示例代码和第三方资源链接,或进一步了解我们的评估工具: