Datasets

Datasets are versioned collections of test cases that you use to run evaluations and track improvements over time. Build datasets from production logs, user feedback, or manual curation, or generate them with Loop. Key advantages:

Versioned: Every change is tracked, so experiments can pin to specific versions.
Integrated: Use directly in evaluations and populate from production.
Scalable: Stored in a modern data warehouse without storage limits.

Dataset structure

Each record has four top-level fields:

input: Data to recreate the example in your application (required).
expected: Ideal output or ground truth (optional but recommended for evaluation).
metadata: Key-value pairs for filtering and grouping (optional).
tags: Labels for organizing and filtering records (optional).
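As a rough illustration of the record shape described above, the following Python sketch models a record with the four fields and shows how metadata and tags might be used for filtering. The `DatasetRecord` class and the helper functions are hypothetical names made up for this example; they are not part of any SDK, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetRecord:
    """Hypothetical in-memory model of one dataset record (not an SDK type)."""

    # Required: data needed to recreate the example in your application.
    input: Any
    # Optional but recommended: ideal output or ground truth.
    expected: Optional[Any] = None
    # Optional: key-value pairs for filtering and grouping.
    metadata: dict[str, Any] = field(default_factory=dict)
    # Optional: labels for organizing and filtering records.
    tags: list[str] = field(default_factory=list)


def filter_by_tag(records: list[DatasetRecord], tag: str) -> list[DatasetRecord]:
    """Return the records that carry the given tag."""
    return [r for r in records if tag in r.tags]


def filter_by_metadata(
    records: list[DatasetRecord], key: str, value: Any
) -> list[DatasetRecord]:
    """Return the records whose metadata maps `key` to `value`."""
    return [r for r in records if r.metadata.get(key) == value]


# Illustrative sample data: one fully populated record, one with only input and metadata.
records = [
    DatasetRecord(
        input={"question": "What is 2 + 2?"},
        expected="4",
        metadata={"source": "manual"},
        tags=["math"],
    ),
    DatasetRecord(
        input={"question": "Name a primary color."},
        metadata={"source": "production"},
    ),
]

math_records = filter_by_tag(records, "math")
production_records = filter_by_metadata(records, "source", "production")
```

Only `input` is required here, mirroring the field list above; the other fields default to empty values so that a record without an expected output or labels is still valid.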

Where to go from here

Create datasets from uploads, the SDK, production logs, user feedback, traces, or Loop.
Build dataset pipelines to transform project logs into dataset rows in bulk.
Manage datasets: tag and star, save snapshots, define schemas, customize table views, and edit records.
Use in evaluations by passing datasets to Eval(), assigning them to environments, or converting experiment results.
Track performance to see which experiments used a dataset and how each row performs.

For human review workflows on dataset records, see Human review and Custom views.
