Evaluations

Evaluations are regression datasets for agents. Each evaluation set contains cases with inputs, expected outputs, rubrics, weights, and metadata. A run executes against the latest published version of the set unless you explicitly pass a version ID.

Create versioned evaluation sets, run them against agents, and track score, pass rate, and CT cost.

Evaluation model

An evaluation set is a versioned dataset. Publishing a version makes that dataset the default for future runs, which keeps unsaved editor changes out of production-quality evaluation results.
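
The default-version rule can be sketched in a few lines. This is only an illustration of the behavior described above, not the service's implementation: the `Version` type, its fields, and the `resolve_version` helper are hypothetical, and "latest" is assumed to mean the highest version number among published versions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for one evaluation dataset version."""
    id: str
    number: int       # assumed to increase with each new version
    published: bool   # drafts stay False until explicitly published


def resolve_version(versions: list[Version], version_id: Optional[str] = None) -> Version:
    """Pick the version a run executes against."""
    if version_id is not None:
        # An explicit version ID always wins over the published default.
        for version in versions:
            if version.id == version_id:
                return version
        raise KeyError(f"unknown version id: {version_id}")

    # Otherwise fall back to the latest published version; drafts are ignored.
    published = [v for v in versions if v.published]
    if not published:
        raise ValueError("no published version; publish one or pass a version ID")
    return max(published, key=lambda v: v.number)
```

Whether an explicit ID may point at an unpublished draft is not stated in this page; the sketch allows it.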

Use cases to define prompts, expected outputs, rubrics, weights, and structured metadata.

Publish the dataset version that should be used by scheduled jobs, CI checks, and fine-tuning jobs.

Runs return average score, pass rate, status, linked agent, linked computer, version, and CT cost.

Creating a new evaluation version should reset charts and run analytics for that version.
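
To make the case fields and run metrics above concrete, here is a minimal sketch of how a case might be modeled and how average score and pass rate could be derived from per-case scores. The field names, the weighted mean, and the 0.7 pass threshold are all assumptions made for illustration; this page does not state how the service actually computes either metric.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Case:
    """Hypothetical shape of one evaluation case."""
    input: str
    expected_output: str
    rubric: str = ""
    weight: float = 1.0
    metadata: dict[str, Any] = field(default_factory=dict)


def summarize(cases: list[Case], scores: list[float], pass_threshold: float = 0.7) -> dict[str, float]:
    """Aggregate per-case scores (0.0 to 1.0) into run-level metrics."""
    if not cases or len(cases) != len(scores):
        raise ValueError("need one score per case and at least one case")
    total_weight = sum(case.weight for case in cases)
    if total_weight <= 0:
        raise ValueError("case weights must sum to a positive number")

    # Assumption: the average score is a weight-weighted mean.
    average = sum(c.weight * s for c, s in zip(cases, scores)) / total_weight
    # Assumption: a case passes when its score meets the threshold,
    # and pass rate is unweighted.
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {"average_score": average, "pass_rate": passed / len(scores)}
```

With two cases weighted 1 and 3 and scores of 1.0 and 0.5, this sketch yields an average score of 0.625 and a pass rate of 0.5.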

Create, publish, and run evaluations

Create a set, save and publish a dataset version, then run the published dataset against an agent and computer.

JavaScript uses client.evaluations.create(), createVersion(), publishVersion(), run(), listRuns(), and getRun().

Python uses client.evaluations.create(), create_version(), publish_version(), run(), list_runs(), and get_run().

Direct HTTP uses /v1/evaluations, /v1/evaluations/{id}/versions, and /v1/evaluations/{id}/runs.

Run an evaluation
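
The create, publish, and run sequence might look like the following in Python. Only the method names come from this page; the argument names, the `.id` attributes on returned objects, and the example set name are assumptions, so check them against the SDK reference before relying on them. The `client` is passed in because this page does not show how it is constructed.

```python
from typing import Any


def create_publish_and_run(client: Any, agent_id: str, computer_id: str, cases: list[dict]) -> Any:
    """Create a set, publish a dataset version, start a run, and fetch it."""
    evals = client.evaluations

    # Assumed signatures: keyword arguments and `.id` attributes are guesses.
    evaluation = evals.create(name="checkout-regressions")
    version = evals.create_version(evaluation.id, cases=cases)
    evals.publish_version(evaluation.id, version.id)

    # No version ID is passed, so the run should use the published version.
    run = evals.run(evaluation.id, agent_id=agent_id, computer_id=computer_id)

    # Fetch the run again to read its status, score, pass rate, and CT cost.
    return evals.get_run(evaluation.id, run.id)
```

To pin a run to a specific dataset version, the explicit version ID mentioned earlier would be passed to `run()`; the parameter name for it is not given here.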

Evaluations over HTTP
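
As a rough sketch, the three documented paths can be called with nothing but the Python standard library. The paths are taken from this page; the host `api.example.com`, the bearer-token header, the use of POST, and every JSON field name are placeholders assumed for illustration. The helpers only build requests, so nothing is sent until you pass one to `urllib.request.urlopen`.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.example.com"  # hypothetical host, replace with the real one


def build_request(method: str, path: str, api_key: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def create_evaluation_request(api_key: str, name: str) -> urllib.request.Request:
    # Body field `name` is an assumption.
    return build_request("POST", "/v1/evaluations", api_key, {"name": name})


def create_version_request(api_key: str, evaluation_id: str, cases: list) -> urllib.request.Request:
    # Body field `cases` is an assumption.
    return build_request("POST", f"/v1/evaluations/{evaluation_id}/versions", api_key, {"cases": cases})


def run_request(api_key: str, evaluation_id: str, agent_id: str, computer_id: str,
                version_id: Optional[str] = None) -> urllib.request.Request:
    # Body field names are assumptions; version_id is omitted so the
    # service can fall back to the latest published version.
    body = {"agent_id": agent_id, "computer_id": computer_id}
    if version_id is not None:
        body["version_id"] = version_id
    return build_request("POST", f"/v1/evaluations/{evaluation_id}/runs", api_key, body)
```

The publish step is left out because this page does not list an HTTP path for it.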