2026-06-19

OpenAI Evals API Reference 2026: Endpoints, Graders, Runs

A canonical OpenAI Evals API reference: creating evals, eval runs, data source configs, graders, sampling, reading results, and the oaieval CLI.

OpenAI Evals API Reference 2026

This page is a reference for the OpenAI Evals framework and the Evals API as of 2026. It documents the objects, endpoints, parameters, grader types, and command-line entry points used to define, run, and read evaluations of model and prompt behavior. It is organized for lookup rather than narrative reading: each section describes one object or operation, lists its parameters, and shows a minimal runnable example using the openai Python client.

The Evals API is built around two resources: the eval, a reusable definition of a dataset schema plus one or more testing criteria (graders), and the eval run, which is nested under an eval and executes that definition against a data source to produce graded results. A third surface, the oaieval command-line tool from the open-source openai/evals repository, provides a registry-based way to run evals without writing API calls. All three are covered below.

For conceptual background and a tutorial-style walkthrough, see the OpenAI Evals complete guide. This page assumes you already understand what an eval is and focuses on the exact shape of the API. Readers comparing frameworks may also reference our Ragas RAG evaluation metrics guide and the DeepEval vs Ragas comparison.

Authentication and Client Setup

All API calls require an API key supplied through the OPENAI_API_KEY environment variable or passed explicitly to the client constructor. The Evals API is accessed through the client.evals namespace of the official Python SDK.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# The evals namespace exposes evals and nested runs.
client.evals          # eval objects
client.evals.runs     # eval run objects

The base REST path is /v1/evals. Runs are nested under an eval at /v1/evals/{eval_id}/runs. Every object returned includes an id, an object type string, and a created_at Unix timestamp.

The Eval Object

An eval defines the schema of the data your test items will provide and the testing criteria used to grade model output. It does not itself contain data or invoke a model; that happens at run time. Creating an eval registers a reusable template.

client.evals.create()

Parameter	Type	Required	Description
name	string	No	Human-readable label for the eval.
data_source_config	object	Yes	Declares the schema of each test item, including item fields and optional sampling output schema.
testing_criteria	array	Yes	One or more grader objects applied to each sampled item.
metadata	map	No	Up to 16 key-value string pairs for your own bookkeeping.

The data_source_config has a type of either custom (you define an item_schema JSON Schema describing each row) or logs (the source is stored completion logs). The include_sample_schema flag, when true, tells the API that runs will sample model output to be graded, exposing a sample namespace to graders.

eval_obj = client.evals.create(
    name="qa-accuracy",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "reference": {"type": "string"},
            },
            "required": ["question", "reference"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "string_check",
            "name": "exact-match",
            "input": "{{ sample.output_text }}",
            "reference": "{{ item.reference }}",
            "operation": "eq",
        }
    ],
)

print(eval_obj.id)  # e.g. "eval_68a..."

Template strings use double-brace syntax. {{ item.<field> }} references a field from a test item; {{ sample.output_text }} references the model output produced during the run.
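Because templates are plain strings, a reference to a field that the item_schema does not declare can be caught locally before any request is sent. The sketch below is illustrative only: find_undeclared_item_fields is a hypothetical helper, not part of the SDK, and its regular expression assumes simple {{ item.<field> }} references with no nested paths.

```python
import re

# Matches simple references such as "{{ item.question }}".
# Assumption: templates only reference top-level item fields.
ITEM_REF = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def find_undeclared_item_fields(templates, item_schema):
    """Return item fields used in templates but absent from item_schema."""
    declared = set(item_schema.get("properties", {}))
    used = set()
    for text in templates:
        used.update(ITEM_REF.findall(text))
    return sorted(used - declared)


schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "reference": {"type": "string"},
    },
    "required": ["question", "reference"],
}
templates = ["{{ item.question }}", "{{ item.reference }}", "{{ item.missing }}"]
print(find_undeclared_item_fields(templates, schema))  # ['missing']
```

References in the sample namespace are ignored by this check, since they are filled in at run time rather than taken from the item.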

Other eval endpoints

Operation	Method	Path	Notes
Create eval	POST	/v1/evals	Returns the eval object.
Retrieve eval	GET	/v1/evals/{eval_id}	Fetch a single eval by id.
List evals	GET	/v1/evals	Supports order, limit, after pagination.
Update eval	POST	/v1/evals/{eval_id}	Update name or metadata.
Delete eval	DELETE	/v1/evals/{eval_id}	Removes the eval and its runs.

client.evals.retrieve(eval_obj.id)
client.evals.list(limit=20, order="desc")
client.evals.update(eval_obj.id, metadata={"team": "search"})
client.evals.delete(eval_obj.id)

The Eval Run Object

An eval run executes an eval against a concrete data source. The run supplies the rows, optionally samples model completions for each row, applies the testing criteria, and aggregates pass/fail counts and per-criterion scores. A single eval can have many runs, which is how you compare prompts, models, or dataset versions over time.

client.evals.runs.create()

Parameter	Type	Required	Description
eval_id	string	Yes (path)	The id of the parent eval.
name	string	No	Label for this run.
data_source	object	Yes	Where rows come from and how output is sampled.
metadata	map	No	Key-value pairs for bookkeeping.

The data_source object has a type that is usually completions or responses, meaning the API will call the named model to generate the output to be graded. It also carries a source describing the rows and a model plus input_messages template describing how to build each prompt.

run = client.evals.runs.create(
    eval_id=eval_obj.id,
    name="gpt-5-baseline",
    data_source={
        "type": "completions",
        "model": "gpt-5",
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "role": "developer",
                    "content": "Answer the question concisely and factually.",
                },
                {"role": "user", "content": "{{ item.question }}"},
            ],
        },
        "source": {
            "type": "file_content",
            "content": [
                {"item": {"question": "Capital of France?", "reference": "Paris"}},
                {"item": {"question": "2 + 2?", "reference": "4"}},
            ],
        },
    },
)

print(run.id, run.status)  # "evalrun_...", "queued"

Run lifecycle and endpoints

A run starts as queued, moves to in_progress, and ends in one of three terminal statuses: completed, failed, or canceled. Poll the run until it reaches a terminal status, then read the run object and its output items for results.

Operation	Method	Path
Create run	POST	/v1/evals/{eval_id}/runs
Retrieve run	GET	/v1/evals/{eval_id}/runs/{run_id}
List runs	GET	/v1/evals/{eval_id}/runs
Cancel run	POST	/v1/evals/{eval_id}/runs/{run_id}
List output items	GET	/v1/evals/{eval_id}/runs/{run_id}/output_items
Retrieve output item	GET	/v1/evals/{eval_id}/runs/{run_id}/output_items/{id}
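Polling without a deadline can hang a script if a run stalls. One possible shape for a bounded wait is sketched below. wait_for_status, its parameters, and the scripted status sequence are all hypothetical and exist only for illustration; the status lookup is passed in as a callable so the sketch runs without network access.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_status(fetch_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for a real status lookup: replays a scripted sequence of statuses.
scripted = iter(["queued", "in_progress", "in_progress", "completed"])
final = wait_for_status(lambda: next(scripted), interval_s=0.0)
print(final)  # completed
```

Against the real API, fetch_status could be a small lambda that retrieves the run and returns its status field.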

Data Source Configuration Reference

The source inside a run's data_source declares where rows come from. Three source types are supported.

Source type	Description	Key fields
file_content	Inline rows passed directly in the request.	content (array of {item: {...}})
file_id	Rows from a previously uploaded JSONL file.	id (uploaded file id)
stored_completions	Rows derived from logged completions.	metadata filters, limit

To use an uploaded file, first upload a JSONL file where each line is a JSON object with an item key, then reference its id.

# Open the file in a context manager so the handle is closed after upload.
with open("golden.jsonl", "rb") as fh:
    file = client.files.create(file=fh, purpose="evals")

run = client.evals.runs.create(
    eval_id=eval_obj.id,
    name="from-file",
    data_source={
        "type": "completions",
        "model": "gpt-5-mini",
        "input_messages": {
            "type": "template",
            "template": [
                {"role": "user", "content": "{{ item.question }}"},
            ],
        },
        "source": {"type": "file_id", "id": file.id},
    },
)

Each line of golden.jsonl looks like:

{"item": {"question": "Capital of Japan?", "reference": "Tokyo"}}
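A file in this shape can be produced from ordinary Python dictionaries. The following is a minimal sketch using only the standard library; write_eval_jsonl is a hypothetical helper name, and the rows are sample data.

```python
import json
import os
import tempfile


def write_eval_jsonl(rows, path):
    """Write one {"item": {...}} JSON object per line and return the row count."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps({"item": row}, ensure_ascii=False) + "\n")
    return len(rows)


rows = [
    {"question": "Capital of Japan?", "reference": "Tokyo"},
    {"question": "Capital of France?", "reference": "Paris"},
]
# A temporary directory keeps the example from touching the working directory.
path = os.path.join(tempfile.mkdtemp(), "golden.jsonl")
write_eval_jsonl(rows, path)

with open(path, encoding="utf-8") as fh:
    first_line = fh.readline().strip()
print(first_line)
```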

Testing Criteria: Grader Reference

Testing criteria are the graders applied to each item. The Evals API supports four kinds of grader, listed below; the model grader comes in two variants with separate type values, label_model and score_model. Every grader has a type, a name, and type-specific parameters. Graders that produce a numeric score also accept a pass_threshold.

Grader type	type value	Use case	Output
String check	string_check	Deterministic exact or substring comparison.	pass/fail
Text similarity	text_similarity	Fuzzy closeness to a reference string.	score + pass/fail
Model grader	label_model / score_model	LLM-as-judge classification or scoring.	label or score
Python grader	python	Arbitrary custom logic in sandboxed Python.	score + pass/fail

String check grader

Compares two templated strings with a deterministic operation. Supported operations are eq, ne, like, and ilike.

string_grader = {
    "type": "string_check",
    "name": "contains-answer",
    "input": "{{ sample.output_text }}",
    "reference": "{{ item.reference }}",
    "operation": "ilike",
}
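To reason about which operation fits a dataset, it can help to mimic the comparison locally. The sketch below is an approximation, not the API's implementation: it assumes like and ilike are case-sensitive and case-insensitive substring checks of the reference inside the input, and string_check here is a hypothetical local function.

```python
def string_check(input_text, reference, operation):
    """Approximate a string_check grader locally (assumed semantics)."""
    operations = {
        "eq": lambda a, b: a == b,
        "ne": lambda a, b: a != b,
        # Assumption: "like" means the input contains the reference.
        "like": lambda a, b: b in a,
        # Assumption: "ilike" is the same check, ignoring case.
        "ilike": lambda a, b: b.lower() in a.lower(),
    }
    if operation not in operations:
        raise ValueError(f"unsupported operation: {operation!r}")
    return operations[operation](input_text, reference)


print(string_check("Paris", "Paris", "eq"))                     # True
print(string_check("The capital is paris.", "Paris", "like"))   # False
print(string_check("The capital is paris.", "Paris", "ilike"))  # True
```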

Text similarity grader

Scores how close the output is to a reference using a chosen evaluation metric such as fuzzy_match, bleu, rouge_l, or cosine. A pass_threshold converts the score into pass/fail.

similarity_grader = {
    "type": "text_similarity",
    "name": "rouge-overlap",
    "input": "{{ sample.output_text }}",
    "reference": "{{ item.reference }}",
    "evaluation_metric": "rouge_l",
    "pass_threshold": 0.6,
}
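The role of pass_threshold can be illustrated with a local stand-in metric. In this sketch, difflib's ratio substitutes for the API's similarity metrics (it is not fuzzy_match or rouge_l), and the comparison assumes that a score equal to the threshold counts as a pass. Both function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(output, reference):
    """Stand-in similarity in [0, 1]; not one of the API's own metrics."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


def passes(score, pass_threshold):
    # Assumption: a score exactly at the threshold passes.
    return score >= pass_threshold


score = similarity_score("Paris", "paris")
print(score, passes(score, 0.6))  # 1.0 True
```

Trying a few thresholds against known good and bad outputs from your own data is a reasonable way to pick a starting value before committing it to an eval.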

Model grader (label and score)

A model grader uses an evaluator LLM to judge the output. The label_model variant classifies output into one of a fixed set of labels and passes when the chosen label is in passing_labels. The score_model variant returns a numeric score compared against a threshold.

label_grader = {
    "type": "label_model",
    "name": "is-correct",
    "model": "gpt-4o",
    "input": [
        {
            "role": "developer",
            "content": "Classify whether the answer is correct given the reference.",
        },
        {
            "role": "user",
            "content": "Answer: {{ sample.output_text }}\nReference: {{ item.reference }}",
        },
    ],
    "labels": ["correct", "incorrect"],
    "passing_labels": ["correct"],
}

Python grader

A Python grader runs sandboxed code that receives the item and sample and returns a numeric result. It must define a function named grade taking sample and item arguments.

python_grader = {
    "type": "python",
    "name": "length-penalty",
    # Each fragment ends with a real newline (\n) so the source is valid Python.
    "source": (
        "def grade(sample, item):\n"
        "    out = sample['output_text']\n"
        "    ref = item['reference']\n"
        "    if ref.lower() not in out.lower():\n"
        "        return 0.0\n"
        "    # Penalize verbosity beyond 200 chars.\n"
        "    return 1.0 if len(out) <= 200 else 0.5\n"
    ),
    "pass_threshold": 0.75,
}
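Grader source is just a string, so it is worth exercising locally before attaching it to an eval. The sketch below runs the same grade logic with exec in a fresh namespace. load_grade_function is a hypothetical helper, and local execution is not the API's sandbox, so this only checks the logic, not the sandbox's limits.

```python
GRADER_SOURCE = '''
def grade(sample, item):
    out = sample["output_text"]
    ref = item["reference"]
    if ref.lower() not in out.lower():
        return 0.0
    # Penalize verbosity beyond 200 chars.
    return 1.0 if len(out) <= 200 else 0.5
'''


def load_grade_function(source):
    """Execute grader source in a fresh namespace and return its grade()."""
    namespace = {}
    exec(source, namespace)
    grade = namespace.get("grade")
    if not callable(grade):
        raise ValueError("grader source must define a callable named 'grade'")
    return grade


grade = load_grade_function(GRADER_SOURCE)
item = {"reference": "Paris"}
print(grade({"output_text": "Paris"}, item))                # 1.0
print(grade({"output_text": "London"}, item))               # 0.0
print(grade({"output_text": "Paris. " + "x" * 300}, item))  # 0.5
```

With a pass_threshold of 0.75, only the first of these three outputs would pass.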

Sampling Configuration

When a run's data source has type completions or responses, the API samples model output for each row before grading. Sampling parameters control the generation.

Field	Type	Description
model	string	Model id used to generate the output to be graded.
input_messages	object	Template that builds the prompt from item fields.
sampling_params	object	Optional generation controls: temperature, max_completion_tokens, top_p, seed.

data_source = {
    "type": "completions",
    "model": "gpt-5",
    "input_messages": {
        "type": "template",
        "template": [{"role": "user", "content": "{{ item.question }}"}],
    },
    "sampling_params": {
        "temperature": 0.0,
        "max_completion_tokens": 256,
        "seed": 42,
    },
    "source": {"type": "file_id", "id": "file_abc123"},
}

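Set temperature to 0.0 and pin a seed to make runs as repeatable as possible, which matters when comparing model versions. Seeded sampling is best-effort rather than guaranteed, so small differences between otherwise identical runs can still occur.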

Reading Results

After a run completes, read aggregate counts from the run object and per-item detail from the output items endpoint. The run object exposes result_counts with passed, failed, errored, and total, plus per_testing_criteria_results breaking pass/fail down by grader.

import time

# Poll until the run finishes.
while True:
    run = client.evals.runs.retrieve(eval_id=eval_obj.id, run_id=run.id)
    if run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(5)

print(run.result_counts)
# {"total": 100, "passed": 87, "failed": 12, "errored": 1}

for crit in run.per_testing_criteria_results:
    print(crit.testing_criteria, crit.passed, crit.failed)

# Per-item detail, including the sampled output and each grader's verdict.
items = client.evals.runs.output_items.list(
    eval_id=eval_obj.id,
    run_id=run.id,
    limit=50,
)
for item in items.data:
    print(item.id, item.status, item.results)

Each output item contains the original datasource_item, the sampled completion, and a results array with one entry per testing criterion, where each entry reports passed, score, and the grader name.
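When a run has several graders, a per-criterion pass rate is often easier to scan than raw counts. The sketch below computes one from plain dictionaries that mirror the testing_criteria, passed, and failed fields described above. criterion_pass_rates is a hypothetical helper and the numbers are made up for illustration.

```python
def criterion_pass_rates(per_criteria):
    """Map each testing criterion name to passed / (passed + failed)."""
    rates = {}
    for crit in per_criteria:
        graded = crit["passed"] + crit["failed"]
        # None marks a criterion that graded nothing, avoiding division by zero.
        rates[crit["testing_criteria"]] = crit["passed"] / graded if graded else None
    return rates


# Illustrative numbers only, not real results.
sample_results = [
    {"testing_criteria": "exact-match", "passed": 87, "failed": 13},
    {"testing_criteria": "is-correct", "passed": 45, "failed": 5},
    {"testing_criteria": "unused", "passed": 0, "failed": 0},
]
rates = criterion_pass_rates(sample_results)
print(rates)
```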

The oaieval CLI and openai-python

The open-source openai/evals repository provides a registry-based workflow as an alternative to the API. Evals are defined as YAML files in a registry, and the oaieval command runs a named eval against a named model. This path predates the hosted Evals API and remains useful for local, reproducible, file-based eval definitions.

Install the package and run an eval from the registry:

pip install evals
oaieval gpt-4o my-eval-name --max_samples 100 --record_path results.jsonl

A registry eval is described by a YAML file referencing a dataset and a grading class.

# evals/registry/evals/my_eval.yaml
my-eval-name:
  id: my-eval-name.dev.v0
  metrics: [accuracy]
my-eval-name.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my_eval/samples.jsonl

Each line of samples.jsonl provides the prompt and the ideal answer:

{"input": [{"role": "user", "content": "Capital of France?"}], "ideal": "Paris"}
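If the same golden rows feed both the hosted API and oaieval, the item format can be converted into this sample format mechanically. The following is a small sketch. item_to_oaieval_sample is a hypothetical helper, and it assumes items carry question and reference fields as in the earlier examples.

```python
import json


def item_to_oaieval_sample(item, system_prompt=None):
    """Convert an Evals API item into an oaieval sample with input and ideal."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": item["question"]})
    return {"input": messages, "ideal": item["reference"]}


item = {"question": "Capital of France?", "reference": "Paris"}
sample_line = json.dumps(item_to_oaieval_sample(item))
print(sample_line)
```

Writing one such line per item yields a samples.jsonl that a Match-style registry eval can read.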

oaieval flag	Description
--max_samples N	Limit the run to the first N samples.
--record_path PATH	Write per-sample records to a JSONL file.
--seed N	Set the sampling seed for reproducibility.
--completion_args	Pass model parameters such as temperature.

Choose the hosted Evals API when you want runs stored, graded, and compared in the platform with managed sampling; choose oaieval when you want fully local, version-controlled eval definitions in your own repository.

Error Handling and Status Reference

Both eval creation and run execution can fail, and the API surfaces failures at two levels: request-level errors returned synchronously, and item-level errors recorded inside a completed run. Understanding the difference prevents you from treating a partial failure as a total one.

A request-level error is returned immediately when the eval or run object is malformed. Common causes include a testing_criteria template that references a field absent from the item_schema, an invalid grader type, or a missing required parameter. These raise an exception in the SDK before any sampling occurs.

from openai import BadRequestError

try:
    run = client.evals.runs.create(
        eval_id=eval_obj.id,
        data_source={
            "type": "completions",
            "model": "gpt-5",
            "input_messages": {
                "type": "template",
                "template": [{"role": "user", "content": "{{ item.missing }}"}],
            },
            "source": {"type": "file_id", "id": "file_abc123"},
        },
    )
except BadRequestError as exc:
    print("Run rejected:", exc.message)

Item-level errors occur when an individual row fails during sampling or grading, for example a context-length overflow on one oversized prompt. The run still reaches completed status, but result_counts.errored is non-zero and the affected output items carry an error payload. Always inspect errored before trusting an aggregate pass rate.
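One way to act on this is to compute the pass rate over graded items only and to flag the run when too many items errored. The sketch below works on a plain dictionary with the result_counts fields. summarize_counts and the 2% default tolerance are hypothetical choices, not API behavior.

```python
def summarize_counts(result_counts, max_error_rate=0.02):
    """Summarize a run's counts and flag whether the pass rate can be trusted."""
    total = result_counts["total"]
    errored = result_counts["errored"]
    graded = result_counts["passed"] + result_counts["failed"]
    error_rate = errored / total if total else 0.0
    return {
        # Errored items are excluded so they do not count as failures.
        "pass_rate": result_counts["passed"] / graded if graded else None,
        "error_rate": error_rate,
        "trustworthy": total > 0 and error_rate <= max_error_rate,
    }


summary = summarize_counts({"total": 100, "passed": 87, "failed": 12, "errored": 1})
print(summary["error_rate"], summary["trustworthy"])  # 0.01 True
```

A CI job could fail the build when trustworthy is false, rather than comparing a pass rate computed from an incomplete run.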

Run status	Meaning	Next action
queued	Accepted, not yet started.	Poll until in_progress.
in_progress	Sampling and grading underway.	Continue polling.
completed	All items processed (some may have errored).	Read result_counts and output items.
failed	The run could not execute.	Inspect the error field on the run.
canceled	Canceled via the cancel endpoint.	None.

Pagination and Listing Parameters

The list endpoints for evals, runs, and output items share a common cursor-based pagination scheme. Rather than page numbers, you pass the id of the last item you saw as the after cursor to fetch the next page. This keeps listings stable even as new objects are created.

Parameter	Type	Default	Description
limit	integer	20	Number of objects to return, 1 to 100.
order	string	asc	Sort by created_at, asc or desc.
after	string	none	Cursor: return objects after this id.
status	string	none	Filter runs or items by status.

# Iterate every run of an eval, page by page.
all_runs = []
cursor = None
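while True:
    # Pass `after` only once there is a cursor from a previous page.
    kwargs = {"after": cursor} if cursor else {}
    page = client.evals.runs.list(
        eval_id=eval_obj.id,
        limit=100,
        order="asc",
        **kwargs,
    )
    all_runs.extend(page.data)
    # Stop on the last page, or on an empty page to avoid indexing into it.
    if not page.has_more or not page.data:
        break
    cursor = page.data[-1].id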

print(f"Total runs: {len(all_runs)}")

The same pattern applies to client.evals.list() and client.evals.runs.output_items.list(). The status filter on the output-items list is particularly useful for pulling only the failing items from a large run so you can triage regressions without downloading every passing row.
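The cursor logic can be factored into a generator that works for any of these listings. The sketch below is a generic illustration that runs against an in-memory stand-in rather than the API. iterate_pages, fake_fetch, and the (items, has_more) return shape are all hypothetical.

```python
def iterate_pages(fetch_page):
    """Yield every object from a cursor-paginated listing.

    fetch_page(after) must return (items, has_more); each item needs an "id".
    """
    cursor = None
    while True:
        items, has_more = fetch_page(cursor)
        yield from items
        if not has_more or not items:
            return
        # The id of the last object seen becomes the next "after" cursor.
        cursor = items[-1]["id"]


# In-memory stand-in for a list endpoint, three objects per page.
FAKE_RUNS = [{"id": f"run_{i}"} for i in range(7)]


def fake_fetch(after, page_size=3):
    start = 0
    if after is not None:
        start = next(i for i, r in enumerate(FAKE_RUNS) if r["id"] == after) + 1
    page = FAKE_RUNS[start:start + page_size]
    return page, start + page_size < len(FAKE_RUNS)


ids = [run["id"] for run in iterate_pages(fake_fetch)]
print(len(ids))  # 7
```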

Frequently Asked Questions

What is the difference between an eval and an eval run in the OpenAI Evals API?

An eval is a reusable definition: it declares the schema of each test item through a data_source_config and the testing_criteria used to grade output. An eval run executes that definition against a concrete data source, samples model completions, applies the graders, and produces pass/fail counts. One eval can have many runs, which is how you compare prompts and models over time.

Which grader type should I use for exact-answer questions?

Use the string_check grader with the eq or ilike operation when the correct answer is a deterministic string. For answers that are correct but phrased differently, use text_similarity with a metric like rouge_l or fuzzy_match and a pass_threshold. For open-ended correctness that needs judgment, use a label_model grader with passing_labels.

How do I make eval runs reproducible?

Set sampling_params.temperature to 0.0 and pin a seed in the data source, and pin the exact model version rather than a floating alias. Seeded sampling is best-effort, but with these settings the same dataset produces largely stable output, so a meaningful change in scores is far more likely to reflect a real change in the prompt or model than sampling noise.

Can I run OpenAI Evals without the API using a file-based workflow?

Yes. The open-source openai/evals repository provides the oaieval CLI, which runs registry-defined YAML evals against a named model entirely locally. This is ideal when you want version-controlled eval definitions in your own repository. The hosted Evals API is preferable when you want runs stored and compared in the platform with managed sampling.

How do I read per-item results from a completed run?

Poll the run with client.evals.runs.retrieve until status is completed, then read run.result_counts for aggregates and run.per_testing_criteria_results for a per-grader breakdown. For full per-item detail, call the output_items list endpoint, which returns each item's sampled output and a results array with the verdict from every testing criterion.

What data source types does an eval run support?

The source inside a run's data_source can be file_content for inline rows, file_id for a previously uploaded JSONL file where each line has an item key, or stored_completions for rows derived from logged completions. The surrounding data_source type is typically completions or responses, which tells the API to sample the named model for each row.

How does a Python grader receive the model output?

A Python grader defines a function named grade that takes sample and item arguments. The sample dictionary contains the model output, accessible as sample['output_text'], and item contains the original test fields such as item['reference']. The function returns a numeric score that is compared against the grader's pass_threshold to yield pass or fail.

Can I use model graders with a different model than the one being evaluated?

Yes, and it is recommended. The model field inside a label_model or score_model grader specifies the judge model, which is independent of the model under test set in the run's data_source. A common pattern is to sample output from a smaller or candidate model while grading with a strong judge such as gpt-4o for reliable verdicts.

Conclusion

The OpenAI Evals API organizes evaluation into two objects, evals and runs, graded by four grader types, and complemented by the local oaieval CLI. Use this reference to look up the exact parameters for creating evals, configuring data sources, choosing graders, controlling sampling, and reading results, then pin your models and seeds for reproducible comparisons across versions.

To go deeper, read the OpenAI Evals complete guide for a full tutorial, and browse the skills directory for ready-made evaluation workflows you can drop into your own agents and CI pipelines.