Dataset Management
Curate and manage your golden test cases for evaluation. Datasets are the foundation of experiments: each row is a test case, and EvalsHub runs your prompt over the dataset and scores results with your chosen scorers.
Dataset types
EvalsHub supports two dataset types. Choose based on whether you have reference answers or only inputs.
Input-only
Each row has at least an Input (e.g. user question). During the experiment, the model generates an output for each input. Scorers evaluate the generated output (e.g. LLM-as-a-judge for quality, or regex for format). Use this when you don't have gold answers or when you care about quality/style rather than exact match.
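As an illustration of scoring without a reference answer, here is a minimal sketch of a regex-style format scorer applied to a generated output. The scorer, the house-style rule it checks, and the stand-in model response are all hypothetical, not part of EvalsHub itself:

```python
import re

def format_scorer(output: str) -> float:
    """Illustrative format check: fail any output that contains a
    markdown heading (a hypothetical house-style rule)."""
    has_header = re.search(r"^#{1,6}\s", output, flags=re.MULTILINE)
    return 0.0 if has_header else 1.0

# Input-only rows carry only the user question; the model supplies the output.
rows = [{"input": "How do I reset my password?"}]
for row in rows:
    generated = "Go to Settings, then Security, and click Reset."  # stand-in for a model call
    print(format_scorer(generated))  # scores the generated output alone
```

Because there is no expected answer, the scorer can only judge properties of the output itself (format, length, tone), which is exactly the input-only use case.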
Input + Output
Each row has Input and Output (expected answer). You can also map a column to Expected for reference. Scorers can compare model output to the expected value (e.g. exact match, semantic similarity, or LLM judge with reference). Use this when you have labeled or golden test cases.
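A reference-based scorer for this dataset type can be as simple as a normalized exact match. The function below is an illustrative sketch, not the EvalsHub scorer implementation; the normalization choice (trim plus lowercase) is an assumption:

```python
def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the model output matches the expected answer
    after trimming whitespace and lowercasing (a common normalization)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# An input+output row: the Output column holds the expected answer.
row = {"input": "What is the refund window?", "output": "30 days"}
print(exact_match("30 Days ", row["output"]))  # matches after normalization
```

Semantic similarity or an LLM judge with reference follows the same shape: model output and expected value in, score out.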
Column mapping
When you create or upload a dataset, your columns (e.g. from a CSV) are mapped to EvalsHub fields. The mapper recognizes common names automatically; you can adjust any mapping.
- Input — The prompt or user message(s) for the model. Required for every dataset.
- Output — The expected model output for input+output datasets; scorers use it as the reference when comparing.
- Expected — Reference or gold answer. Optional; used by scorers that need a reference (e.g. similarity to expected).
- Metadata — Optional key-value or JSON column for filtering or scorer context.
- Tags — Optional labels (e.g. category, difficulty) for grouping or filtering rows.
Columns you don’t need can be mapped to "— (skip)". Names like input, prompt, and question map to Input; output and answer map to Output; expected maps to Expected.
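The name-based auto-mapping described above can be sketched as a simple alias lookup. This is a hypothetical reconstruction of the behavior, not the actual mapper; the alias table and the `skip` fallback are assumptions:

```python
# Hypothetical alias table: common header names to EvalsHub fields.
# Unrecognized columns fall through to "skip", mirroring "— (skip)".
ALIASES = {
    "input": "Input", "prompt": "Input", "question": "Input",
    "output": "Output", "answer": "Output",
    "expected": "Expected",
    "metadata": "Metadata", "tags": "Tags",
}

def auto_map(headers):
    """Map CSV headers to fields, case-insensitively."""
    return {h: ALIASES.get(h.strip().lower(), "skip") for h in headers}

print(auto_map(["Question", "Answer", "notes"]))
# {'Question': 'Input', 'Answer': 'Output', 'notes': 'skip'}
```

In the UI you can override any suggestion, so the lookup is only a starting point.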
Creating datasets
You can create datasets in three ways from the Datasets area.
Upload CSV
Upload a CSV with headers. On the next step, map each column to an EvalsHub field (Input, Output, Expected, etc.). Supports quoted fields and commas inside values. Download the sample CSV from the create flow for the expected format.
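To see what "quoted fields and commas inside values" means in practice, here is a minimal CSV with a header row, parsed with Python's standard `csv` module. The column names are illustrative; download the sample CSV from the create flow for the exact expected format:

```python
import csv
import io

# One header row, then one data row whose quoted field contains a comma.
raw = 'input,expected\n"What is 2, plus 2?",4\n'
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
print(rows[0]["input"])     # What is 2, plus 2?
print(rows[0]["expected"])  # 4
```

The quoting keeps the embedded comma inside a single field, so the row still maps cleanly to Input and Expected during column mapping.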
Add rows manually
Create a new dataset, choose type (input-only or input+output), then add rows one by one. Useful for small golden sets or when you’re building test cases by hand.
Generate from production
Use Generate dataset to create synthetic or sampled data. Provide context about your app and, optionally, example query types; EvalsHub can generate input-only rows for you. You can also send traces to a dataset via the SDK (databaseID in options); traces appear in the dataset’s "Setup tracing" tab and can be used to seed or extend datasets.
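Sampling production traffic into an input-only dataset can be sketched as below. The trace records and field names are hypothetical stand-ins for whatever your SDK sends; only the sampling pattern is the point:

```python
import random

# Hypothetical production traces; in practice these would arrive via the
# SDK (databaseID in options) rather than being built in code.
traces = [{"query": f"user question {i}"} for i in range(100)]

random.seed(0)  # reproducible sample for the example
sample = random.sample(traces, k=10)

# Input-only rows: keep just the user query as Input.
dataset_rows = [{"input": t["query"]} for t in sample]
print(len(dataset_rows))  # 10
```

Random sampling is a reasonable default for seeding; you can later extend the dataset with targeted rows for slices the sample missed.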
Best practices
- Keep datasets focused: one dataset per use case or slice (e.g. "FAQ answers", "refund policy") so experiments and baselines are interpretable.
- For input/output datasets, ensure expected outputs are consistent (same format and level of detail) so scorers behave predictably.
- Use tags or metadata to segment rows (e.g. difficulty, category) so you can filter or analyze subsets in experiments.
- Start with a small representative set (e.g. 20–50 rows) for fast iteration, then expand once your prompt and scorers are stable.
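The tag-based segmentation suggested above can be sketched as a simple filter over rows. The rows and tag values here are hypothetical:

```python
# Hypothetical rows with a Tags column; slice out the "hard" subset
# so it can be analyzed separately in an experiment.
rows = [
    {"input": "q1", "tags": ["easy", "faq"]},
    {"input": "q2", "tags": ["hard", "refunds"]},
    {"input": "q3", "tags": ["hard", "faq"]},
]
hard = [r for r in rows if "hard" in r["tags"]]
print(len(hard))  # 2
```

Consistent tag vocabularies (for example a fixed set of difficulty levels) keep these slices comparable across datasets and experiments.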