# Best Practices
Recommendations for evaluation design, dataset curation, production tracing, and CI so you get reliable, actionable signals from EvalsHub.
## Evaluation design
- One experiment, one question. Keep each experiment focused on a single prompt version and dataset. Use multiple experiments to compare versions or datasets rather than overloading one run.
- Choose scorers that match your goal. Use reference-based scorers (e.g. similarity to expected) when you have gold answers; use LLM-as-a-judge or format checks when you care about quality or structure without a single right answer.
- Set a baseline before shipping. Run the experiment on the version you’re about to deploy and set that as the CI baseline. Future runs will pass or fail relative to it.
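To make the reference-based case concrete, here is a minimal sketch of what a "similarity to expected" scorer can look like: token-level F1 between the model output and the gold answer. The function name and scoring method are illustrative assumptions, not EvalsHub's built-in scorer.

```typescript
// Hypothetical reference-based scorer: token-level F1 between the model
// output and the expected (gold) answer. Illustrates the shape of a
// gold-answer comparison; EvalsHub's built-in scorers may differ.
function tokenF1(output: string, expected: string): number {
  const toks = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const out = toks(output);
  const exp = toks(expected);
  if (out.length === 0 || exp.length === 0) return 0;

  // Count expected tokens, then consume them as they appear in the output.
  const expCounts = new Map<string, number>();
  for (const t of exp) expCounts.set(t, (expCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of out) {
    const c = expCounts.get(t) ?? 0;
    if (c > 0) {
      overlap++;
      expCounts.set(t, c - 1);
    }
  }

  const precision = overlap / out.length;
  const recall = overlap / exp.length;
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}
```

A scorer like this only makes sense when the expected column really is a gold answer; for open-ended quality questions, an LLM-as-a-judge or format check is the better fit, as noted above.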
## Dataset curation
- Slice by use case. Separate datasets by feature or slice (e.g. "FAQ", "refunds", "edge cases") so you can see which area regressed and run smaller, faster CI jobs per slice if needed.
- Keep expected outputs consistent. For input/output datasets, use a consistent format and level of detail in the expected column so scorers behave predictably.
- Start small, then grow. Begin with 20–50 representative rows to iterate quickly on prompts and scorers. Add more rows (including edge cases and failures from production) once the pipeline is stable.
- Use tags and metadata. Tag rows by difficulty, category, or source so you can filter and analyze subsets in the dashboard.
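As a sketch of how tags enable subset analysis, the snippet below filters rows by tag before a run. The row shape (`input`, `expected`, `tags`, `metadata`) is an assumed schema for illustration; EvalsHub's actual dataset format may differ.

```typescript
// Hypothetical dataset row shape; the real EvalsHub schema may differ.
interface DatasetRow {
  input: string;
  expected: string;
  tags: string[];                    // e.g. ["refunds", "hard"]
  metadata?: Record<string, string>; // e.g. { source: "production" }
}

// Keep only rows carrying every required tag, e.g. to run a smaller,
// faster job on just the hard refund cases.
function filterByTags(rows: DatasetRow[], required: string[]): DatasetRow[] {
  return rows.filter((r) => required.every((t) => r.tags.includes(t)));
}
```

The same tagging also pays off later: when a slice regresses, you can re-run only that slice instead of the whole dataset.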
## Production tracing
- Use fire-and-forget by default. Leave `flush: false` so tracing does not add latency. Set `flush: true` only when you need to guarantee the trace was sent before continuing (e.g. in a short serverless function).
- Set `promptVersionId` when you know it. If your app already resolves the prompt version (e.g. from A/B or config), pass it so the SDK doesn’t need to fetch A/B config and so the dashboard attributes traces correctly.
- Use `databaseID` to feed datasets. Sending traces to a dataset (via `databaseID`) lets you inspect them in Setup tracing and turn production samples into evaluation rows. Use a dedicated dataset per project or use case.
- Group agentic calls with `sessionId`. For multi-step or agent flows, pass a stable `sessionId` and optional `parentLogId`/`spanOrder` so the timeline view shows a single trace.
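To show why a stable `sessionId` and `spanOrder` matter, here is a sketch of the grouping a timeline view performs: logs sharing a `sessionId` become one trace, ordered by `spanOrder`. The `TraceLog` shape is an assumption that mirrors the SDK options above, not the actual payload.

```typescript
// Hypothetical trace log shape; field names mirror the SDK options
// discussed above, but the real payload may differ.
interface TraceLog {
  logId: string;
  sessionId: string;
  parentLogId?: string;
  spanOrder: number;
}

// Rebuild per-session timelines: group logs by sessionId, then sort
// each group by spanOrder so multi-step flows read as one trace.
function buildTimelines(logs: TraceLog[]): Map<string, TraceLog[]> {
  const sessions = new Map<string, TraceLog[]>();
  for (const log of logs) {
    const group = sessions.get(log.sessionId) ?? [];
    group.push(log);
    sessions.set(log.sessionId, group);
  }
  for (const group of sessions.values()) {
    group.sort((a, b) => a.spanOrder - b.spanOrder);
  }
  return sessions;
}
```

If each step of an agent run generates a fresh `sessionId`, this grouping produces many single-entry "traces" instead of one timeline, which is exactly the failure mode a stable ID avoids.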
## CI/CD
- Run the experiment manually first. Before wiring the webhook, run the chosen experiment from the dashboard and confirm it completes and scores make sense. Then set the baseline and add the webhook.
- Use thresholds that match your bar. Set pass/fail thresholds (absolute or delta) so that real regressions fail the run but normal variance doesn’t. Tune after a few runs.
- Keep datasets and prompts stable for CI. Avoid changing the CI experiment’s dataset or prompt version too often; use CI to catch regressions on a fixed baseline, and use the dashboard for ad-hoc comparison.
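The absolute-versus-delta distinction can be sketched as a small gating function: a run passes only if the score clears an absolute floor and has not dropped more than an allowed delta below the baseline. The threshold names here are illustrative, not EvalsHub's actual config keys.

```typescript
// Hedged sketch of CI gating logic; names are illustrative, not
// EvalsHub's real configuration keys.
interface Thresholds {
  minScore: number; // absolute floor, e.g. 0.7
  maxDelta: number; // max allowed drop below baseline, e.g. 0.05
}

// Pass only if the run clears the absolute floor AND hasn't regressed
// more than maxDelta relative to the stored baseline.
function ciPasses(score: number, baseline: number, t: Thresholds): boolean {
  return score >= t.minScore && baseline - score <= t.maxDelta;
}
```

Tuning `maxDelta` to sit just above your run-to-run variance is what lets real regressions fail the build while normal noise passes.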
## Red teaming and A/B tests
- Run Red Team evals before release. Add a Red Team experiment to your pre-release checklist or CI so safety regressions are caught early.
- Wait for significance in A/B tests. Don’t declare a winner on the first day. Use the dashboard’s confidence and sample-size guidance, and avoid switching 100% of traffic until results are stable.
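As a back-of-the-envelope illustration of why sample size matters, the sketch below applies a standard two-proportion z-test to a pass-rate metric. The dashboard's own significance guidance may use a different method; this is only a sanity check, and the numbers are made up.

```typescript
// Two-proportion z-test on a pass-rate metric: how many standard errors
// apart are variant A and variant B? Only a rough sanity check.
function zScore(passA: number, nA: number, passB: number, nB: number): number {
  const pA = passA / nA;
  const pB = passB / nB;
  const pooled = (passA + passB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return se === 0 ? 0 : (pA - pB) / se;
}

// |z| >= 1.96 roughly corresponds to p < 0.05 (two-sided).
const isSignificant = (z: number) => Math.abs(z) >= 1.96;
```

Note the sample-size effect: an 86% vs 80% pass rate is significant at 500 samples per arm but not at 50, which is exactly why day-one "winners" often evaporate.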