Result Artifact Contract

AgentV writes each eval invocation as a portable run bundle. The bundle is the source of truth for Dashboard, reports, compare/trend tooling, CI gates, and external adapters.

The contract is run-centric:

summary.json owns aggregate run facts.
.internal/index.jsonl owns per-run row discovery and filtering.
Per-case sidecars own detailed payloads such as grading, metrics, transcripts, generated files, and raw provider evidence.
Dashboard, search, SQLite, HTML reports, and vendor exports are rebuildable projections over the bundle.

Directory Layout

The default local layout is:

.agentv/results/
  .indexes/
    runs.jsonl               # rebuildable cross-run run catalog
    cases.jsonl              # rebuildable cross-run case catalog
  .cache/
  <run_id>/
    summary.json
    .internal/
      index.jsonl             # one row per case/result in this run
      progress.json
      events.jsonl
      bundle.json
    <case-or-allocation>/
      summary.json            # optional per-case aggregate, especially repeats
      test/                   # optional generated test bundle
        EVAL.yaml
        targets.yaml
        files/
        graders/
      sample-1/
        result.json
        grading.json
        metrics.json
        target-execution.json     # optional target runtime envelope
        stdout.txt                # optional captured target stdout
        stderr.txt                # optional captured target stderr
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
      sample-2/
        result.json
        grading.json
        metrics.json
        target-execution.json
        stdout.txt
        stderr.txt
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff

<run_id> is the only path component whose identity the contract commits to. It gives AgentV a predictable place to put completed runs, but readers must not infer semantic truth from folder names. Use fields in summary.json and index.jsonl for experiment, target, variant, attempt, eval path, case identity, timing, scores, and artifact paths.

The run bundle does not add target, model, variant, or cases/ folders below <run_id>. Per-result directories are allocated from row identity, usually with a readable test-id or slug prefix plus a short hash suffix, and remain opaque to consumers.

experiment is metadata: it is how users label a condition such as baseline, candidate, with_skills, or without_skills. It is recorded in summary.json and rows, not as a parent directory and not as a runtime-policy object. If a bundle is copied, combined, published, or imported under a different directory, its metadata still carries the facts consumers should query.

Top-level dot-prefixed directories such as .indexes/ and .cache/ are reserved for rebuildable local state and are skipped by run discovery.
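The discovery rule can be sketched as follows. This is a minimal illustration, not AgentV's implementation: the names isReservedName and discoverRuns are hypothetical, and treating "contains summary.json" as the test for a run bundle is an assumption.

```typescript
import { existsSync, readdirSync } from "node:fs";
import path from "node:path";

// Hypothetical helper: top-level dot-prefixed entries such as .indexes/ and
// .cache/ hold rebuildable local state and are never run bundles.
export function isReservedName(name: string): boolean {
  return name.startsWith(".");
}

// Hypothetical helper: list candidate run directories under a results root.
// Assumption: a directory counts as a run bundle only if it has summary.json.
export function discoverRuns(resultsRoot: string): string[] {
  return readdirSync(resultsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && !isReservedName(entry.name))
    .map((entry) => path.join(resultsRoot, entry.name))
    .filter((runDir) => existsSync(path.join(runDir, "summary.json")))
    .sort();
}
```

The function returns directory paths only. Anything semantic about a run still has to come from its summary.json and index.jsonl.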

File Roles

File or field	Owns	Use it for
`summary.json`	Aggregate run metadata and rollups: run id, experiment metadata, counts, pass rate, score summaries, duration, token/cost totals, and writer metadata.	Listing runs, CI summaries, quick dashboards, trend cards, and validating that a run is complete enough to inspect.
`.internal/index.jsonl`	Canonical per-run row index: one row per case/result aggregate, with identity fields, filter metadata, scores, status, and explicit run-relative paths to sidecars.	Filtering, compare/trend inputs, Dashboard detail routing, rerun/resume lookup, export adapters, and artifact discovery.
`result.json`	Compact per-attempt manifest for one attempt directory, including AgentV `execution_status` and `verdict`.	Loading one attempt without scanning the whole run index.
`grading.json`	Grader outputs, `assertion_results`, rubric evidence, execution-metric grader facts, and scoring provenance.	Explaining why a row passed or failed.
`metrics.json`	Duration, token usage, cost, execution status, trajectory, and derived executor behavior such as tool calls, files touched, shell commands, errors, turns, and output sizes.	Dashboard behavior views, cost/latency reporting, metric-style graders, adapter projections, and lightweight analysis.
`target-execution.json`	Provider-neutral target runtime envelope, including command, cwd, timeout, exit code or signal, error kind, timestamps, log truncation metadata, and artifact paths.	Distinguishing target task failures, target crashes, timeouts, cancellation, malformed provider output, and sandbox/runner failures from AgentV orchestrator failures.
`stdout.txt` / `stderr.txt`	Captured target process logs when the runtime exposes them, with truncation metadata recorded in `target-execution.json`.	Debugging crashed, timed-out, cancelled, or malformed target runs without treating raw logs as the canonical row index.
`outputs/file_changes.diff`	Full unified diff of workspace file changes when file changes are captured.	Human review and external artifact inspection; LLM and script graders still receive the same full diff through `file_changes`.
`transcript.json`	AgentV-normalized transcript/timeline document with canonical `tool_name` values and `transcript_summary`.	Portable human review, transcript-aware graders, and tool-trajectory analysis.
`transcript-raw.jsonl`	Native provider or harness evidence when available.	Parser debugging, forensic review, and preserving source bytes without making provider schemas public AgentV fields.
`test/`	Generated test bundle for the exact eval slice and target settings that produced a row.	Audit, external review, and rerun workflows that should not depend on a mutable source checkout.
`artifact_pointers`	Offload indirection for large detached payload bytes.	Finding payloads published outside the primary metadata/control-plane branch, such as transcript bytes on `agentv/artifacts/v1`.

summary.json and .internal/index.jsonl are complementary, not redundant. A run list should not scan every row just to show pass rate or total duration, and a row reader should not parse aggregate summary structures to find one case’s grading or transcript. Keep aggregate questions on summary.json; keep row and artifact discovery on .internal/index.jsonl.

Grading Contract

Each per-attempt grading.json uses assertion_results for the public per-criterion rows. The internal grader API and eval YAML still use assertions; those rows are converted to assertion_results at the artifact boundary, when the sidecar is written.

{
  "score": 0.5,
  "verdict": "fail",
  "assertion_results": [
    {
      "text": "Answer cites the changed file",
      "passed": true,
      "evidence": "The answer cites src/refunds.ts.",
      "score": 1,
      "verdict": "pass"
    },
    {
      "text": "Tests were updated",
      "passed": false,
      "evidence": "No test file path or diff was provided.",
      "score": 0,
      "verdict": "fail"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  },
  "graders": [
    {
      "name": "implementation_review",
      "type": "llm-rubric",
      "score": 0.5,
      "verdict": "fail",
      "assertion_results": []
    }
  ]
}

score values are normalized to the 0..1 range. verdict is pass, fail, or skip at the artifact level, and pass or fail on individual assertion rows. Evidence stays in grading.json so the sidecar remains useful without loading traces.

Aggregate grading artifacts use the same top-level score and verdict fields. Their score is the mean normalized score across the attempts or cases that did not end in an execution error. Their verdict reflects the execution status already derived for those quality results; it is not recomputed against a default threshold.
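The arithmetic behind the summary block and the aggregate score can be sketched as below. Both functions are hypothetical helpers. Two assumptions are not confirmed by the contract text: that pass_rate is passed divided by total, and that "execution_error" is the status value to exclude (the contract only shows "ok").

```typescript
interface AssertionResult {
  text: string;
  passed: boolean;
  evidence?: string;
  score?: number;
  verdict?: "pass" | "fail";
}

interface AssertionSummary {
  passed: number;
  failed: number;
  total: number;
  pass_rate: number;
}

// Recompute the summary block from assertion_results.
// Assumption: pass_rate is passed / total, and 0 when there are no rows.
export function summarizeAssertions(rows: AssertionResult[]): AssertionSummary {
  const passed = rows.filter((row) => row.passed).length;
  const total = rows.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

interface AttemptScore {
  score: number; // normalized to 0..1
  execution_status: string; // "ok" in the documented example
}

// Mean normalized score across attempts that did not hit an execution error.
// Assumption: "execution_error" is the excluded status value; check the real
// status vocabulary before relying on this.
export function aggregateScore(attempts: AttemptScore[]): number | null {
  const scored = attempts.filter((a) => a.execution_status !== "execution_error");
  if (scored.length === 0) return null;
  return scored.reduce((sum, a) => sum + a.score, 0) / scored.length;
}
```

Returning null when every attempt errored keeps "no quality signal" distinct from a real score of 0.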

Row Contract

Each index.jsonl line is a JSON object. The exact field set grows as AgentV adds providers and projections, but stable rows follow these rules:

Field names are snake_case.
Identity and filter fields live on the row, not only in directory names.
Sidecar references are explicit path fields, relative to the run directory.
Large detached payloads may also have artifact_pointers, but ordinary sidecars should still be discoverable through path fields.
Unknown fields should be preserved by adapters when they rewrite or project rows.

Example row:

{
  "timestamp": "2026-06-30T08:15:00.000Z",
  "run_id": "2026-06-30T08-15-00-000Z",
  "experiment": "with_skills",
  "tags": { "experiment": "with_skills", "team": "support" },
  "eval_path": "evals/support/refunds.eval.yaml",
  "test_id": "refund-eligibility",
  "target": "codex-gpt5",
  "variant": "skills-v2",
  "sample_index": 1,
  "retry_index": 0,
  "provenance": "native",
  "execution_status": "ok",
  "score": 0.92,
  "duration_ms": 184200,
  "result_dir": "refund-eligibility--4f9a7c2d1b6e",
  "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json",
  "grading_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/grading.json",
  "metrics_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/metrics.json",
  "target_execution_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/target-execution.json",
  "stdout_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/stdout.txt",
  "stderr_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/stderr.txt",
  "target_execution": {
    "schema_version": "agentv.target_execution.v1",
    "status": "success",
    "provider_kind": "cli",
    "target_id": "codex-gpt5"
  },
  "transcript_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/transcript.json",
  "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/transcript-raw.jsonl",
  "transcript_summary": {
    "total_turns": 4,
    "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 },
    "files_read": ["src/refunds.ts"],
    "files_modified": ["src/refunds.ts"],
    "shell_commands": ["bun test refunds.test.ts"],
    "web_fetches": [],
    "errors": [],
    "thinking_blocks": 1
  },
  "output_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/answer.md",
  "answer_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/answer.md",
  "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/sample-1/outputs/file_changes.diff",
  "test_dir": "refund-eligibility--4f9a7c2d1b6e/test"
}
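A reader might type such a row as sketched below. The IndexRow interface covers only a subset of the fields in the example above, and parseRow is a hypothetical helper. Treating run_id and test_id as required strings is an assumption made for this sketch, not a documented guarantee.

```typescript
// Assumed subset of the row shape, taken from the example row. The index
// signature keeps fields this sketch does not know about.
export interface IndexRow {
  run_id: string;
  test_id: string;
  experiment?: string;
  eval_path?: string;
  target?: string;
  variant?: string;
  sample_index?: number;
  retry_index?: number;
  execution_status?: string;
  score?: number;
  grading_path?: string;
  metrics_path?: string;
  transcript_path?: string;
  [field: string]: unknown;
}

// Hypothetical parser: validates only the identity fields this sketch needs
// and returns the whole object so unknown fields survive a round trip.
export function parseRow(line: string): IndexRow {
  const value: unknown = JSON.parse(line);
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("index.jsonl line is not a JSON object");
  }
  const row = value as Record<string, unknown>;
  for (const key of ["run_id", "test_id"]) {
    if (typeof row[key] !== "string") {
      throw new Error(`row is missing string field ${key}`);
    }
  }
  return row as IndexRow;
}
```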

Rows can represent repeated attempts, multi-target runs, imported suites, manual prepare/grade attempts, or imported provider sessions. That is why experiment, eval_path, test_id, target, variant, sample_index, retry_index, and source metadata belong in index.jsonl: tools can filter dynamically without requiring every run to be pre-split into semantic folders.

When a run resolves a promptfoo-shaped tags map (from suite tags, project config tags, or --tag key=value), the resolved map is emitted as tags on each row and as summary.json.metadata.tags. Its reserved experiment key matches the row experiment field, so trend/compare views can group by tags.experiment.

Use the term repeat for authoring configuration and the term samples for produced executions. The sample-1/, sample-2/, and later folders under a result directory are artifact folders for those produced executions. Do not treat those folder names as the comparison dimension. Repeated stochastic samples should be represented by explicit metadata such as sample_index and sample_count; infrastructure retries should use retry metadata such as retry_index, retry_count, and retry_reason when available.
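Grouping repeated samples by row metadata rather than folder names might look like the sketch below. It assumes the index holds one row per produced sample and that experiment, eval_path, test_id, target, and variant together identify a case. The names caseKey and meanScoreByCase are hypothetical.

```typescript
interface SampleRow {
  experiment?: string;
  eval_path?: string;
  test_id?: string;
  target?: string;
  variant?: string;
  sample_index?: number;
  score?: number;
}

// Build a grouping key from row metadata, never from folder names.
// Assumption: these five fields are the comparison identity for a case.
export function caseKey(row: SampleRow): string {
  return JSON.stringify([
    row.experiment ?? null,
    row.eval_path ?? null,
    row.test_id ?? null,
    row.target ?? null,
    row.variant ?? null,
  ]);
}

// Average the scores of rows that share a case key, so rows that differ
// only by sample_index collapse into one mean.
export function meanScoreByCase(rows: SampleRow[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const row of rows) {
    if (typeof row.score !== "number") continue; // skip unscored rows
    const key = caseKey(row);
    const entry = sums.get(key) ?? { total: 0, count: 0 };
    entry.total += row.score;
    entry.count += 1;
    sums.set(key, entry);
  }
  const means = new Map<string, number>();
  sums.forEach(({ total, count }, key) => means.set(key, total / count));
  return means;
}
```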

Reader Rules

Consumers should read a bundle in this order:

Resolve the run directory from either a directory path or an index.jsonl path.
Load summary.json for aggregate metadata and run-level display.
Stream index.jsonl for row identity, filters, status, scores, and sidecar paths.
Resolve sidecar paths relative to the run directory.
Rebuild any local cache, search index, SQLite table, static report, or vendor projection from summary.json, index.jsonl, and sidecars.

Do not reconstruct paths from suite, name, test_id, target, or directory names. result_dir is readable when possible, but it is still an opaque run-local allocation that may be suffixed or otherwise changed to avoid collisions.
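Steps 1 and 4 of the reading order can be sketched with plain path handling. Both helpers are hypothetical. The escape check in resolveSidecar is a defensive choice for this sketch rather than something the contract requires.

```typescript
import path from "node:path";

// Hypothetical helper: accept either a run directory or a path to its
// index.jsonl and return the run directory.
export function resolveRunDir(input: string): string {
  const normalized = path.normalize(input);
  if (path.basename(normalized) === "index.jsonl") {
    const indexDir = path.dirname(normalized);
    // Assumption: the index normally lives at <run_dir>/.internal/index.jsonl.
    return path.basename(indexDir) === ".internal"
      ? path.dirname(indexDir)
      : indexDir;
  }
  return normalized;
}

// Resolve a row's path field against the run directory, rejecting values
// that would point outside it.
export function resolveSidecar(runDir: string, relativePath: string): string {
  const root = path.resolve(runDir);
  const resolved = path.resolve(root, relativePath);
  const rel = path.relative(root, resolved);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`sidecar path escapes run directory: ${relativePath}`);
  }
  return resolved;
}
```

Callers pass path fields from the row, such as grading_path, straight into resolveSidecar and never assemble a path from test_id or result_dir naming patterns.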

Do not treat derived artifacts as canonical:

Dashboard indexes are caches over the run bundle.
Search indexes are caches over rows and sidecars.
SQLite databases are query accelerators.
HTML reports are renderings.
Vendor-neutral projection bundles are adapter handoffs.
Phoenix, Langfuse, Opik, or other backend views are external projections or correlations, not AgentV’s source of truth.

User Examples

Run an eval and inspect the portable bundle:

agentv eval evals/support/refunds.eval.yaml --experiment with_skills
ls .agentv/results/<run_id>
cat .agentv/results/<run_id>/summary.json
cat .agentv/results/<run_id>/.internal/index.jsonl

Find failed rows without loading every sidecar:

jq -r 'select(.execution_status != "ok" or .score < 0.5) |
  [.eval_path, .test_id, .target, .grading_path] | @tsv' \
  .agentv/results/<run_id>/.internal/index.jsonl
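The same predicate can be written in TypeScript for adapters that already stream rows. This is a sketch with a hypothetical isFailedRow helper. The 0.5 threshold mirrors the jq command and is not a contract value. Counting a missing score as failing is an assumption chosen to match jq, where null sorts below every number.

```typescript
interface FilterRow {
  execution_status?: string;
  score?: number;
}

// A row fails when execution did not finish "ok" or the score is under the
// threshold. A missing score is treated as failing (assumption; matches jq).
export function isFailedRow(row: FilterRow, threshold = 0.5): boolean {
  if (row.execution_status !== "ok") return true;
  return typeof row.score !== "number" || row.score < threshold;
}
```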

Compare two completed runs by their row indexes:

agentv results compare \
  .agentv/results/<baseline-run-id>/.internal/index.jsonl \
  .agentv/results/<candidate-run-id>/.internal/index.jsonl

Generate a shareable report from the same canonical bundle:

agentv results report .agentv/results/<run_id>

Integration Author Examples

An adapter that exports run results should treat index.jsonl as the row catalog:

import { createReadStream } from "node:fs";
import path from "node:path";
import { createInterface } from "node:readline";

export async function* rows(runDir: string) {
  const rl = createInterface({
    input: createReadStream(path.join(runDir, ".internal/index.jsonl"), "utf8"),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    if (!line.trim()) continue;
    yield JSON.parse(line) as Record<string, unknown>;
  }
}

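// Example run directory; substitute a real <run_id>.
const runDir = ".agentv/results/2026-run";

for await (const row of rows(runDir)) {
  // Path fields on a row are relative to the run directory.
  const gradingPath = row.grading_path;
  if (typeof gradingPath === "string") {
    console.log(path.join(runDir, gradingPath));
  }
}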

Adapter guidance:

Preserve unknown row fields when possible.
Prefer path fields such as grading_path, metrics_path, transcript_path, and transcript_raw_path over ad hoc path construction.
Use artifact_pointers only for detached payload lookup; do not make pointers the discovery path for ordinary sidecars that are present in the run tree.
If you build a database or search index, store enough source metadata to rebuild it from the run bundle and invalidate it when summary.json or index.jsonl changes.
Keep backend-specific anonymization, upload, and schema mapping in the adapter layer. AgentV’s canonical bundle remains backend-neutral.
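Two of these points, preserving unknown fields and invalidating derived state, can be sketched as follows. All names are hypothetical. Using file size and modification time as the change signal is an assumption; a content hash would be stricter.

```typescript
import { statSync } from "node:fs";
import path from "node:path";

type Row = Record<string, unknown>;

// Hypothetical projection: add adapter-owned fields without dropping
// anything the adapter does not recognize. Keys in `extra` win on conflict,
// so callers should namespace them.
export function projectRow(row: Row, extra: Row): Row {
  return { ...row, ...extra };
}

// Hypothetical cache key built from the two canonical files.
// Assumption: size plus mtime is a good-enough change signal.
export function bundleFingerprint(runDir: string): string {
  return ["summary.json", path.join(".internal", "index.jsonl")]
    .map((relative) => {
      const stat = statSync(path.join(runDir, relative));
      return `${relative}:${stat.size}:${stat.mtimeMs}`;
    })
    .join("|");
}

// A derived database or search index is stale when the fingerprint stored
// at build time no longer matches the bundle on disk.
export function isCacheStale(runDir: string, storedFingerprint: string): boolean {
  return bundleFingerprint(runDir) !== storedFingerprint;
}
```

Storing the fingerprint next to the derived index lets an adapter decide cheaply whether to rebuild from summary.json, index.jsonl, and the sidecars.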