Result Artifact Contract

AgentV writes each eval invocation as a portable run bundle. The bundle is the source of truth for Dashboard, reports, compare/trend tooling, CI gates, and external adapters.

The contract is run-centric:

  • summary.json owns aggregate run facts.
  • .internal/index.jsonl owns per-run row discovery and filtering.
  • Per-case sidecars own detailed payloads such as grading, metrics, transcripts, generated files, and raw provider evidence.
  • Dashboard, search, SQLite, HTML reports, and vendor exports are rebuildable projections over the bundle.

The default local layout is:

.agentv/results/
  .indexes/
    runs.jsonl  # rebuildable cross-run run catalog
    cases.jsonl  # rebuildable cross-run case catalog
  .cache/
  <run_id>/
    summary.json
    .internal/
      index.jsonl  # one row per case/result in this run
      progress.json
      events.jsonl
      bundle.json
    <case-or-allocation>/
      summary.json  # optional per-case aggregate, especially repeats
      test/  # optional generated test bundle
        EVAL.yaml
        targets.yaml
        files/
        graders/
      sample-1/
        result.json
        grading.json
        metrics.json
        target-execution.json  # optional target runtime envelope
        stdout.txt  # optional captured target stdout
        stderr.txt  # optional captured target stderr
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
      sample-2/
        result.json
        grading.json
        metrics.json
        target-execution.json
        stdout.txt
        stderr.txt
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff

<run_id> is the only path component of a run bundle that the contract commits to. It gives completed runs a predictable location, but readers must not infer meaning from folder names. Use fields in summary.json and index.jsonl for experiment, target, variant, attempt, eval path, case identity, timing, scores, and artifact paths.

The run bundle does not add target, model, variant, or cases/ folders below <run_id>. Per-result directories are allocated from row identity, usually with a readable test-id or slug prefix plus a short hash suffix, and remain opaque to consumers.

experiment is metadata: it is how users label a condition such as baseline, candidate, with_skills, or without_skills. It is recorded in summary.json and rows, not as a parent directory and not as a runtime-policy object. If a bundle is copied, combined, published, or imported under a different directory, its metadata still carries the facts consumers should query.

Top-level dot-prefixed directories such as .indexes/ and .cache/ are reserved for rebuildable local state and are skipped by run discovery.
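The skip rule for reserved directories can be sketched as a small filter. This is a minimal illustration, not AgentV's own discovery code: the function name `discoverableRunIds` is hypothetical, and it assumes the caller has already listed the directory names directly under .agentv/results/.

```typescript
// Hypothetical helper: the name and signature are not part of AgentV.
// Input: names of directories found directly under .agentv/results/.
export function discoverableRunIds(entryNames: string[]): string[] {
  return entryNames
    // Dot-prefixed entries (.indexes/, .cache/, ...) are reserved local state.
    .filter((name) => !name.startsWith("."))
    // Sort for a stable listing order; the order itself carries no meaning.
    .sort();
}
```

The returned names are only candidates. Run metadata still comes from each run's summary.json, not from the folder name.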

| File or field | Owns | Use it for |
| --- | --- | --- |
| summary.json | Aggregate run metadata and rollups: run id, experiment metadata, counts, pass rate, score summaries, duration, token/cost totals, and writer metadata. | Listing runs, CI summaries, quick dashboards, trend cards, and validating that a run is complete enough to inspect. |
| .internal/index.jsonl | Canonical per-run row index: one row per case/result aggregate, with identity fields, filter metadata, scores, status, and explicit run-relative paths to sidecars. | Filtering, compare/trend inputs, Dashboard detail routing, rerun/resume lookup, export adapters, and artifact discovery. |
| result.json | Compact per-attempt manifest for one attempt directory, including AgentV execution_status and verdict. | Loading one attempt without scanning the whole run index. |
| grading.json | Grader outputs, assertion_results, rubric evidence, execution-metric grader facts, and scoring provenance. | Explaining why a row passed or failed. |
| metrics.json | Duration, token usage, cost, execution status, trajectory, and derived executor behavior such as tool calls, files touched, shell commands, errors, turns, and output sizes. | Dashboard behavior views, cost/latency reporting, metric-style graders, adapter projections, and lightweight analysis. |
| target-execution.json | Provider-neutral target runtime envelope, including command, cwd, timeout, exit code or signal, error kind, timestamps, log truncation metadata, and artifact paths. | Distinguishing target task failures, target crashes, timeouts, cancellation, malformed provider output, and sandbox/runner failures from AgentV orchestrator failures. |
| stdout.txt / stderr.txt | Captured target process logs when the runtime exposes them, with truncation metadata recorded in target-execution.json. | Debugging crashed, timed-out, cancelled, or malformed target runs without treating raw logs as the canonical row index. |
| outputs/file_changes.diff | Full unified diff of workspace file changes when file changes are captured. | Human review and external artifact inspection; LLM and script graders still receive the same full diff through file_changes. |
| transcript.json | AgentV-normalized transcript/timeline document with canonical tool_name values and transcript_summary. | Portable human review, transcript-aware graders, and tool-trajectory analysis. |
| transcript-raw.jsonl | Native provider or harness evidence when available. | Parser debugging, forensic review, and preserving source bytes without making provider schemas public AgentV fields. |
| test/ | Generated test bundle for the exact eval slice and target settings that produced a row. | Audit, external review, and rerun workflows that should not depend on a mutable source checkout. |
| artifact_pointers | Offload indirection for large detached payload bytes. | Finding payloads published outside the primary metadata/control-plane branch, such as transcript bytes on agentv/artifacts/v1. |

summary.json and .internal/index.jsonl are complementary, not redundant. A run list should not scan every row just to show pass rate or total duration, and a row reader should not parse aggregate summary structures to find one case’s grading or transcript. Keep aggregate questions on summary.json; keep row and artifact discovery on .internal/index.jsonl.

Each per-attempt grading.json uses assertion_results for the public per-criterion rows. The internal grader API and eval YAML still use assertions; the sidecar converts those rows at the artifact boundary.

{
  "score": 0.5,
  "verdict": "fail",
  "assertion_results": [
    {
      "text": "Answer cites the changed file",
      "passed": true,
      "evidence": "The answer cites src/refunds.ts.",
      "score": 1,
      "verdict": "pass"
    },
    {
      "text": "Tests were updated",
      "passed": false,
      "evidence": "No test file path or diff was provided.",
      "score": 0,
      "verdict": "fail"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  },
  "graders": [
    {
      "name": "implementation_review",
      "type": "llm-rubric",
      "score": 0.5,
      "verdict": "fail",
      "assertion_results": []
    }
  ]
}

score values are normalized to the 0..1 range. verdict is pass, fail, or skip at the artifact level, and pass or fail on individual assertion rows. Evidence stays in grading.json so the sidecar remains useful without loading traces.
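A consumer that wants to explain a failure can read the assertion rows directly. The following is a minimal sketch that assumes only the fields visible in the grading.json example; the interface names and `explainFailures` are hypothetical, and real artifacts may carry more fields than these types list.

```typescript
// Hypothetical types covering only the fields shown in the example artifact.
interface AssertionResult {
  text: string;
  passed: boolean;
  evidence?: string;
  score: number; // normalized to the 0..1 range
  verdict: "pass" | "fail";
}

interface GradingArtifact {
  score: number; // normalized to the 0..1 range
  verdict: "pass" | "fail" | "skip";
  assertion_results: AssertionResult[];
}

// Return one human-readable line per failed assertion, with its evidence.
export function explainFailures(grading: GradingArtifact): string[] {
  return grading.assertion_results
    .filter((row) => !row.passed)
    .map((row) => `${row.text}: ${row.evidence ?? "no evidence recorded"}`);
}
```

Because evidence lives in grading.json, this works without opening the transcript or metrics sidecars.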

Aggregate grading artifacts use the same top-level score and verdict fields. Their score is the mean normalized score across non-execution-error attempts or cases, while verdict reflects the execution status already derived for those quality results rather than being recomputed against a default threshold.
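The aggregate score rule can be illustrated with a short calculation. This sketch is an assumption-laden illustration rather than AgentV's implementation: the `executionError` flag is a hypothetical stand-in for however an attempt is classified as an execution error, and returning `undefined` when nothing is scorable is also an assumption.

```typescript
// Hypothetical input shape: one entry per attempt or case.
interface AttemptScore {
  score: number; // normalized 0..1
  executionError: boolean; // assumed flag; the real classification is not shown here
}

// Mean of normalized scores, ignoring attempts that ended in an execution error.
export function aggregateScore(attempts: AttemptScore[]): number | undefined {
  const scored = attempts.filter((attempt) => !attempt.executionError);
  if (scored.length === 0) return undefined; // nothing to average (assumed behavior)
  const total = scored.reduce((sum, attempt) => sum + attempt.score, 0);
  return total / scored.length;
}
```

With scores 1 and 0 from two graded attempts plus one execution error, the mean is taken over the two graded attempts only.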

Each index.jsonl line is a JSON object. The exact field set grows as AgentV adds providers and projections, but stable rows follow these rules:

  • Field names are snake_case.
  • Identity and filter fields live on the row, not only in directory names.
  • Sidecar references are explicit path fields, relative to the run directory.
  • Large detached payloads may also have artifact_pointers, but ordinary sidecars should still be discoverable through path fields.
  • Unknown fields should be preserved by adapters when they rewrite or project rows.
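The last rule is easy to break when an adapter maps rows into a fixed type. One way to keep unknown fields, sketched here with a hypothetical `patchRowLine` helper and a made-up `export_batch` field, is to spread the parsed row and layer changes on top of it.

```typescript
type Row = Record<string, unknown>;

// Hypothetical helper: rewrite one index.jsonl line, keeping every field the
// adapter does not know about and overriding only the keys present in `patch`.
export function patchRowLine(line: string, patch: Row): string {
  const row = JSON.parse(line) as Row;
  return JSON.stringify({ ...row, ...patch });
}
```

Fields added by a newer AgentV version pass through untouched, because the adapter never enumerates the row's keys.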

Example row:

{
  "timestamp": "2026-06-30T08:15:00.000Z",
  "run_id": "2026-06-30T08-15-00-000Z",
  "experiment": "with_skills",
  "tags": { "experiment": "with_skills", "team": "support" },
  "eval_path": "evals/support/refunds.eval.yaml",
  "test_id": "refund-eligibility",
  "target": "codex-gpt5",
  "variant": "skills-v2",
  "sample_index": 1,
  "retry_index": 0,
  "provenance": "native",
  "execution_status": "ok",
  "score": 0.92,
  "duration_ms": 184200,
  "result_dir": "refund-eligibility--4f9a7c2d1b6e",
  "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json",
  "grading_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/grading.json",
  "metrics_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/metrics.json",
  "target_execution_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/target-execution.json",
  "stdout_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/stdout.txt",
  "stderr_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/stderr.txt",
  "target_execution": {
    "schema_version": "agentv.target_execution.v1",
    "status": "success",
    "provider_kind": "cli",
    "target_id": "codex-gpt5"
  },
  "transcript_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/transcript.json",
  "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/transcript-raw.jsonl",
  "transcript_summary": {
    "total_turns": 4,
    "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 },
    "files_read": ["src/refunds.ts"],
    "files_modified": ["src/refunds.ts"],
    "shell_commands": ["bun test refunds.test.ts"],
    "web_fetches": [],
    "errors": [],
    "thinking_blocks": 1
  },
  "output_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/answer.md",
  "answer_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/answer.md",
  "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/file_changes.diff",
  "test_dir": "refund-eligibility--4f9a7c2d1b6e/test"
}

Rows can represent repeated attempts, multi-target runs, imported suites, manual prepare/grade attempts, or imported provider sessions. That is why experiment, eval_path, test_id, target, variant, sample_index, retry_index, and source metadata belong in index.jsonl: tools can filter dynamically without requiring every run to be pre-split into semantic folders.

When a run resolves a promptfoo-shaped tags map (from suite tags, project config tags, or --tag key=value), the resolved map is emitted as tags on each row and as summary.json.metadata.tags. Its reserved experiment key matches the row experiment field, so trend/compare views can group by tags.experiment.
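Grouping by tags.experiment might look like the sketch below. It is an illustration under stated assumptions, not the logic of the built-in trend or compare views: `meanScoreByExperiment` is a hypothetical name, falling back to the row's experiment field when tags is absent is an assumption, and so is the "(untagged)" bucket.

```typescript
// Only the row fields this sketch needs; real rows carry many more.
interface IndexRow {
  experiment?: string;
  tags?: Record<string, string>;
  score?: number;
}

// Mean score per experiment label, keyed by tags.experiment.
export function meanScoreByExperiment(rows: IndexRow[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const row of rows) {
    if (typeof row.score !== "number") continue; // skip rows without a score
    // Assumed fallback order: tags.experiment, then experiment, then a placeholder.
    const key = row.tags?.experiment ?? row.experiment ?? "(untagged)";
    const bucket = totals[key] ?? { sum: 0, count: 0 };
    bucket.sum += row.score;
    bucket.count += 1;
    totals[key] = bucket;
  }
  const means: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    const bucket = totals[key];
    if (bucket) means[key] = bucket.sum / bucket.count;
  }
  return means;
}
```

Rows from several runs can be concatenated before grouping, since the label travels with each row rather than with the directory.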

Use repeat for authoring configuration and samples for produced executions. The sample-1/, sample-2/, and later folders under a result directory are artifact folders for those produced executions. Do not treat those folder names as the comparison dimension. Repeated stochastic samples should be represented by explicit metadata such as sample_index and sample_count; infrastructure retries should use retry metadata such as retry_index, retry_count, and retry_reason when available.
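To keep samples tied to metadata rather than folder names, a reader can group rows by identity fields and collect sample_index values. This is a sketch only: the choice of identity fields in `caseKey` is an assumption made for illustration, and both function names are hypothetical.

```typescript
// Only the identity fields this sketch uses.
interface SampleRow {
  experiment?: string;
  eval_path?: string;
  test_id?: string;
  target?: string;
  variant?: string;
  sample_index?: number;
}

// Hypothetical identity key: which fields define "the same case" is an assumption.
function caseKey(row: SampleRow): string {
  return JSON.stringify([row.experiment, row.eval_path, row.test_id, row.target, row.variant]);
}

// Collect the sample_index values seen for each case identity, in ascending order.
export function sampleIndexesByCase(rows: SampleRow[]): Record<string, number[]> {
  const groups: Record<string, number[]> = {};
  for (const row of rows) {
    if (typeof row.sample_index !== "number") continue; // no sample metadata
    const key = caseKey(row);
    const list = groups[key] ?? [];
    list.push(row.sample_index);
    groups[key] = list;
  }
  for (const key of Object.keys(groups)) {
    const list = groups[key];
    if (list) list.sort((a, b) => a - b);
  }
  return groups;
}
```

Nothing here reads sample-1/ or sample-2/ from a path, so the grouping still holds if the artifact folders are renamed or relocated.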

Consumers should read a bundle in this order:

  1. Resolve the run directory from either a directory path or an index.jsonl path.
  2. Load summary.json for aggregate metadata and run-level display.
  3. Stream index.jsonl for row identity, filters, status, scores, and sidecar paths.
  4. Resolve sidecar paths relative to the run directory.
  5. Rebuild any local cache, search index, SQLite table, static report, or vendor projection from summary.json, index.jsonl, and sidecars.

Do not reconstruct paths from suite, name, test_id, target, or directory names. result_dir is readable when possible, but it is still an opaque run-local allocation that may be suffixed or otherwise changed to avoid collisions.
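Steps 1 and 4 of the reading order can be sketched with plain path handling. The helpers below are hypothetical and rest on two assumptions: the index sits at .internal/index.jsonl inside the run directory, as in the default layout, and an index.jsonl found elsewhere is treated as living directly in the run directory. The escape check in `resolveSidecar` is an extra precaution added for this example, not something the contract requires.

```typescript
import path from "node:path";

// Step 1: accept either a run directory or a path to its index.jsonl.
export function resolveRunDir(input: string): string {
  const normalized = path.normalize(input);
  if (path.basename(normalized) !== "index.jsonl") return normalized;
  const parent = path.dirname(normalized);
  // Assumed layout: <run_dir>/.internal/index.jsonl; otherwise use the parent.
  return path.basename(parent) === ".internal" ? path.dirname(parent) : parent;
}

// Step 4: resolve a row's path field relative to the run directory.
export function resolveSidecar(runDir: string, relativePath: string): string {
  const root = path.resolve(runDir);
  const resolved = path.resolve(root, relativePath);
  // Defensive check (an addition in this sketch): stay inside the run directory.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`sidecar path escapes the run directory: ${relativePath}`);
  }
  return resolved;
}
```

The relative path always comes from a row field such as grading_path. The helpers never assemble it from test_id, target, or result_dir.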

Do not treat derived artifacts as canonical:

  • Dashboard indexes are caches over the run bundle.
  • Search indexes are caches over rows and sidecars.
  • SQLite databases are query accelerators.
  • HTML reports are renderings.
  • Vendor-neutral projection bundles are adapter handoffs.
  • Phoenix, Langfuse, Opik, or other backend views are external projections or correlations, not AgentV’s source of truth.

Run an eval and inspect the portable bundle:

agentv eval evals/support/refunds.eval.yaml --experiment with_skills
ls .agentv/results/<run_id>
cat .agentv/results/<run_id>/summary.json
cat .agentv/results/<run_id>/.internal/index.jsonl

Find failed rows without loading every sidecar:

jq -r 'select(.execution_status != "ok" or .score < 0.5) |
  [.eval_path, .test_id, .target, .grading_path] | @tsv' \
  .agentv/results/<run_id>/.internal/index.jsonl

Compare two completed runs by their row indexes:

agentv results compare \
  .agentv/results/<baseline-run-id>/.internal/index.jsonl \
  .agentv/results/<candidate-run-id>/.internal/index.jsonl

Generate a shareable report from the same canonical bundle:

agentv results report .agentv/results/<run_id>

An adapter that exports run results should treat index.jsonl as the row catalog:

import { createReadStream } from "node:fs";
import path from "node:path";
import { createInterface } from "node:readline";

// Stream the run's row index one JSON object at a time.
export async function* rows(runDir: string) {
  const rl = createInterface({
    input: createReadStream(path.join(runDir, ".internal/index.jsonl"), "utf8"),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (!line.trim()) continue; // skip blank lines
    yield JSON.parse(line) as Record<string, unknown>;
  }
}

// Print each row's grading sidecar, resolved relative to the run directory.
const runDir = ".agentv/results/2026-run";
for await (const row of rows(runDir)) {
  const gradingPath = row.grading_path;
  if (typeof gradingPath === "string") {
    console.log(path.join(runDir, gradingPath));
  }
}

Adapter guidance:

  • Preserve unknown row fields when possible.
  • Prefer path fields such as grading_path, metrics_path, transcript_path, and transcript_raw_path over ad hoc path construction.
  • Use artifact_pointers only for detached payload lookup; do not make pointers the discovery path for ordinary sidecars that are present in the run tree.
  • If you build a database or search index, store enough source metadata to rebuild it from the run bundle and invalidate it when summary.json or index.jsonl changes.
  • Keep backend-specific anonymization, upload, and schema mapping in the adapter layer. AgentV’s canonical bundle remains backend-neutral.
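The invalidation guidance for databases and search indexes can be sketched as a stamp comparison. Everything named here is hypothetical: the `FileStamp` shape, the idea of stamping with size and modification time, and `isCacheStale` itself. A content hash would be a stricter alternative.

```typescript
// Hypothetical stamp: size and modification time of one canonical input file.
interface FileStamp {
  size: number;
  mtimeMs: number;
}

// Stamps for the two files the derived index was built from.
interface SourceStamp {
  summary: FileStamp; // summary.json
  index: FileStamp; // .internal/index.jsonl
}

function sameFile(a: FileStamp, b: FileStamp): boolean {
  return a.size === b.size && a.mtimeMs === b.mtimeMs;
}

// A derived cache is stale when it has no stored stamp or either input changed.
export function isCacheStale(stored: SourceStamp | undefined, current: SourceStamp): boolean {
  if (!stored) return true;
  return !sameFile(stored.summary, current.summary) || !sameFile(stored.index, current.index);
}
```

An adapter would store the stamp next to its derived data and rebuild from the run bundle whenever the check reports a change.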