Retry-safe eval workers with stable row IDs

Summary

Goal: Make interrupted eval workers resume onto the same experiment rows instead of appending duplicates.

Features: upsert_id (TypeScript Eval), experiment.log() with stable id, _object_delete insert flag, attemptId metadata tagging.

Configuration steps

Step 1: Set a stable row ID in TypeScript Eval

Set upsert_id on each EvalCase, deriving it from a value that stays the same across retries (for example, a sample ID or a hash of the input).

data: examples.map((example) => ({
  input: example.input,
  expected: example.expected,
  metadata: { sample_id: example.id },
  upsert_id: `eval-row-${example.id}`,
}))

Retries that supply the same upsert_id land on the same logical row in the UI.
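When no sample ID is available, the input-hash option mentioned above could be sketched as follows. This is a minimal illustration, assuming the input is JSON-compatible; canonicalJson and stableRowId are hypothetical helper names, not part of any SDK, and the 16-character truncation is an arbitrary choice.

```typescript
import { createHash } from "node:crypto";

// Serialize a JSON-compatible value with object keys sorted, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical helper: derive a retry-stable row ID from the case input.
function stableRowId(input: unknown): string {
  const digest = createHash("sha256").update(canonicalJson(input)).digest("hex");
  return `eval-row-${digest.slice(0, 16)}`;
}
```

The result can be passed as upsert_id (or as id when logging manually). Two cases with identical inputs hash to the same ID and would share a row, so prefer an explicit sample ID when the dataset may contain duplicate inputs.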

Step 2: Set a stable row ID when logging manually

If you are using lower-level experiment logging instead of the Eval API, pass the stable key as id.

experiment.log({
  id: `eval-row-${example.id}`,
  input: example.input,
  output: result,
  scores: { ... },
})

This applies to both TypeScript and Python when calling experiment.log() directly.

Step 3: Understand Python Eval limitations

Eval() in Python does not expose an equivalent upsert_id field on EvalCase. Options:

Use lower-level experiment.log() with a stable id (see Step 2).
Implement resume logic in your worker to skip already-completed cases before starting the eval.
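The resume option above amounts to filtering the case list before the eval starts. The sketch below shows that filter in TypeScript, for consistency with the other examples; the same logic ports directly to a Python worker. EvalCaseLike and pendingCases are hypothetical names, and how the set of completed row IDs is obtained (a local checkpoint file, a query against the experiment) is an assumption left to the worker.

```typescript
// Hypothetical shape of an eval case that carries a stable sample ID.
interface EvalCaseLike {
  id: string;
  input: unknown;
}

// Return only the cases whose stable row ID has not been completed yet.
// Row IDs follow the same `eval-row-${id}` scheme used in Steps 1 and 2.
function pendingCases<T extends EvalCaseLike>(
  allCases: readonly T[],
  completedRowIds: ReadonlySet<string>,
): T[] {
  return allCases.filter((c) => !completedRowIds.has(`eval-row-${c.id}`));
}
```

A restarted worker would call pendingCases with the full dataset and the IDs it already finished, then run the eval over the returned subset only.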

Step 4: Tag every span with a unique attemptId

Stable row IDs do not automatically remove child spans from a prior interrupted attempt. Tag every span (root and children) with an attempt-scoped value so stale spans can be identified later.

const attemptId = workerRunId; // unique per worker restart

// Root span
rootSpan.log({
  metadata: { evalId, iteration, attemptId },
});

// Each child span must also include attemptId
const llmSpan = rootSpan.startSpan({ name: "llm_call" });
llmSpan.log({
  input: prompt,
  output: response,
  metadata: { attemptId, model: "gpt-4" },
});
llmSpan.end(); // spans created with startSpan must be ended explicitly

const toolSpan = rootSpan.startSpan({ name: "tool_execution" });
toolSpan.log({
  input: toolArgs,
  output: toolResult,
  metadata: { attemptId, tool: "search" },
});
toolSpan.end();
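The snippet above takes workerRunId as given. One possible way to produce an attempt-scoped value and apply it consistently is sketched here; generating a UUID at process start and the withAttempt helper are assumptions for illustration, not something the SDK provides.

```typescript
import { randomUUID } from "node:crypto";

// Generated once at module load, so every worker restart gets a new value.
const attemptId: string = randomUUID();

// Hypothetical helper: merge the attempt tag into any span's metadata.
// attemptId is spread last so a stale value in `metadata` cannot override it.
function withAttempt(
  metadata: Record<string, unknown> = {},
): Record<string, unknown> {
  return { ...metadata, attemptId };
}
```

Calling span.log({ metadata: withAttempt({ model: "gpt-4" }) }) at every logging site keeps the tag from being forgotten on a child span.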

Step 5: Delete stale child spans from a prior attempt

Query for spans tagged with the old attemptId, then delete them by posting events with _object_delete set to true to the insert endpoint.

POST /v1/experiment/{experiment_id}/insert

{
  "events": [
    { "id": "stale-child-span-id", "_object_delete": true },
    { "id": "another-stale-span-id", "_object_delete": true }
  ]
}

Repeat for each stale child span from the interrupted attempt.
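Building the request bodies by hand gets tedious with many stale spans. The sketch below turns a list of queried spans into batched delete payloads. The SpanRow shape, the function names, and the default batch size of 100 are assumptions; only the events array and the _object_delete flag come from the request format shown above.

```typescript
// Hypothetical shape of a span row returned by whatever query you use.
interface SpanRow {
  id: string;
  metadata?: { attemptId?: string };
}

interface DeleteEvent {
  id: string;
  _object_delete: true;
}

// Select spans tagged with the stale attempt and turn them into delete events.
function buildDeleteEvents(
  spans: readonly SpanRow[],
  staleAttemptId: string,
): DeleteEvent[] {
  return spans
    .filter((span) => span.metadata?.attemptId === staleAttemptId)
    .map((span) => ({ id: span.id, _object_delete: true as const }));
}

// Split items into fixed-size batches.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build one JSON body per batch for POST /v1/experiment/{experiment_id}/insert.
function buildDeleteRequests(
  spans: readonly SpanRow[],
  staleAttemptId: string,
  batchSize = 100,
): string[] {
  return chunk(buildDeleteEvents(spans, staleAttemptId), batchSize).map(
    (events) => JSON.stringify({ events }),
  );
}
```

Each returned string is a complete request body. Sending it (base URL, authentication header) is deliberately left out, since those details depend on your deployment.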

Step 6: Alternative — use a fresh experiment per attempt

If managing stale span cleanup is too complex, create a new experiment for each retry attempt and designate the final successful experiment as canonical. This avoids orphan span cleanup entirely.

Behavior notes

Experiment inserts are append-only. Reusing a stable id does not rewrite history; it creates a new version. The UI displays the latest version per row.
Reusing a stable row id on retry does not delete child spans from the previous attempt. Orphan spans must be removed explicitly using _object_delete.
Do not manually set span_id or root_span_id as the primary deduplication strategy. The high-level Eval runner manages these internally, and overriding them can produce rows with multiple traces.
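The first two notes can be pictured with a toy in-memory model. This is only a mental model of the behavior described above, not how the service stores data; AppendOnlyLog and its methods are hypothetical.

```typescript
interface RowEvent {
  id: string;
  output?: unknown;
  _object_delete?: boolean;
}

// Toy append-only log: every insert is kept, nothing is overwritten.
class AppendOnlyLog {
  private readonly events: RowEvent[] = [];

  insert(event: RowEvent): void {
    this.events.push(event);
  }

  // Number of stored versions, including superseded ones.
  get versionCount(): number {
    return this.events.length;
  }

  // Latest version per id, omitting ids whose latest version is a delete.
  latestRows(): RowEvent[] {
    const latest = new Map<string, RowEvent>();
    for (const event of this.events) {
      latest.set(event.id, event);
    }
    return Array.from(latest.values()).filter((e) => !e._object_delete);
  }
}
```

In this model, re-inserting the same row id adds a version but leaves a single visible row, while a child span logged under a different id stays visible until a delete event for that id is inserted.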
