EvalsHub AI

A/B Testing

Compare prompt versions and models in real-time with live traffic. Route a percentage of requests to each variant and track quality, latency, and cost in the dashboard.

Introduction

A/B testing for AI measures output quality, latency, and cost across different prompt versions or model configurations. In EvalsHub you create an A/B test in the dashboard (e.g. 50% of traffic to version A, 50% to version B). In your app, you use the SDK to get the assigned variant for each request, run your LLM with that variant's prompt, then send the trace with the same variant ID. The dashboard shows per-variant metrics so you can compare the two and promote a winner.

How it works

  1. Dashboard: Create an A/B test: pick a prompt and two versions (A and B) to compare, set traffic split (e.g. 50/50), then start the test.
  2. Your app: Before each LLM call that should participate in the test, call the SDK to get the assigned variant (a prompt version ID).
  3. Use the variant: Run your LLM with the prompt and config that correspond to that version (e.g. from your own config keyed by version ID, or from EvalsHub).
  4. Trace with the variant: After the LLM call, send the trace to EvalsHub with that same promptVersionId so the trace is attributed to the correct variant.

The dashboard then shows quality score, cost, and latency per variant so you can decide when to stop the test and promote a winner.
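EvalsHub assigns variants server-side when you call getVariant(), so you never implement the split yourself. For intuition only, a deterministic 50/50 split can be sketched by hashing a stable request key (here a hypothetical user ID), so the same user always lands on the same variant:

```typescript
// Illustration only: EvalsHub does this assignment for you via getVariant().
// A deterministic split hashes a stable key so assignment is sticky per user.
function assignVariant(userId: string, variants: string[]): string {
  // Simple djb2-style string hash; any stable hash works.
  let hash = 5381;
  for (const ch of userId) {
    hash = ((hash * 33) ^ ch.charCodeAt(0)) >>> 0;
  }
  return variants[hash % variants.length];
}

const variants = ["version-a-uuid", "version-b-uuid"];
const assigned = assignVariant("user-123", variants); // same user, same variant
```

Sticky assignment matters because a user who sees variant A on one request and variant B on the next would contaminate both samples.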

How to use the SDK for A/B tests

Install the EvalsHub SDK, then in the code path where you call your LLM: call getVariant() first to get the assigned prompt version ID, use that ID to select which prompt and config to send to the LLM, then pass the same ID into trace() after the call. You must use the variant to choose the prompt; otherwise every user gets the same prompt and the A/B test is invalid.

1. Install and configure

npm install evalshub openai

Set EVALSHUB_API_KEY and EVALSHUB_PROJECT_ID (from your EvalsHub project settings). Optionally EVALSHUB_BASE_URL (defaults to https://evalshub.ai; use http://localhost:3000 for local dev).

2. Get variant, run LLM, then trace

Before the LLM call: const { promptVersionId } = await getVariant(). Use promptVersionId to look up the prompt and config for that variant (e.g. a map from version ID to system prompt + model). When there is no active A/B test, promptVersionId is null; use your default prompt. After the LLM responds, call trace(payload, { promptVersionId }) with the same ID so the trace is attributed to the correct variant.

import OpenAI from "openai";
import { getVariant, trace } from "evalshub";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const userInput = "..."; // the user's message for this request

// Fallback config used when no A/B test is active
const defaultConfig = { systemPrompt: "You are helpful...", model: "gpt-4o" };

// 1. Get the assigned variant for this request (or null if no active A/B test)
const { promptVersionId } = await getVariant();
// With an active test, promptVersionId is either version A's or version B's ID.

// 2. Map version IDs to your prompt content (you define this, e.g. from EvalsHub or your config)
const promptsByVersionId = {
  "version-a-uuid": { systemPrompt: "You are helpful...", model: "gpt-4o" },
  "version-b-uuid": { systemPrompt: "You are concise...", model: "gpt-4o" },
};
const config = promptVersionId ? promptsByVersionId[promptVersionId] : defaultConfig;
if (!config) throw new Error(`Unknown variant: ${promptVersionId}`);

// 3. Call your LLM with that variant's prompt
const start = Date.now();
const response = await openai.chat.completions.create({
  model: config.model,
  messages: [
    { role: "system", content: config.systemPrompt },
    { role: "user", content: userInput },
  ],
});
const latencyMs = Date.now() - start;

// 4. Trace with the same promptVersionId so the trace is attributed to the variant
trace(
  {
    model: config.model,
    input: [
      { role: "system", content: config.systemPrompt },
      { role: "user", content: userInput },
    ],
    output: response.choices?.[0]?.message?.content ?? "",
    latencyMs,
  },
  { promptVersionId: promptVersionId ?? undefined }
);

Instead of environment variables, you can pass the API key and project ID in the options to getVariant() and trace(). When there is no active A/B test, getVariant() returns { promptVersionId: null }: use your default prompt, and you can omit promptVersionId when tracing.

Real-time analytics

As requests are traced, EvalsHub runs scorers in the background and aggregates by variant. The A/B test view shows:

  • Quality Score: e.g. 0–10 from your scorers
  • Cost: avg. per 1k tokens
  • Latency: e.g. P95 response time
  • Drift: semantic variance, if measured

Statistical significance

EvalsHub computes confidence intervals and can indicate when a variant is ahead with statistical significance. Wait for a sufficient sample size before declaring a winner; the dashboard helps you see whether results are stable or still noisy.
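For a rough feel of what "enough sample size" means (the dashboard does this for you), a standard per-variant estimate for detecting a difference of delta in mean score, assuming a two-sided test at 95% confidence and 80% power, can be sketched as:

```typescript
// Rough per-variant sample size under a normal approximation. Illustration only;
// this is not the EvalsHub API.
// sigma: expected standard deviation of the score
// delta: smallest difference in mean score worth detecting
function sampleSizePerVariant(sigma: number, delta: number): number {
  const zAlpha = 1.96; // 95% confidence, two-sided
  const zBeta = 0.84; // 80% power
  return Math.ceil((2 * (zAlpha + zBeta) ** 2 * sigma ** 2) / delta ** 2);
}

sampleSizePerVariant(2, 0.5); // detect a 0.5-point shift when score SD is 2 → 251 traces per variant
```

The quadratic dependence on delta is why small quality differences need far more traffic than large ones.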

Confidence intervals

We use standard statistical methods to estimate confidence intervals for your A/B tests. Avoid switching traffic based on early, low-sample results; use the suggested sample size or wait until confidence intervals no longer overlap before promoting a variant.
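EvalsHub computes these intervals server-side; purely as an illustration of the idea, a ~95% confidence interval for the difference in mean quality score between two variants can be sketched with a normal approximation:

```typescript
// Illustration only: a ~95% CI for mean(a) - mean(b), Welch-style.
// Reasonable once each variant has a large sample; not the EvalsHub API.
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// Returns [low, high]; if the interval excludes 0, one variant is ahead
// with roughly 95% confidence.
function diffCI95(a: number[], b: number[]): [number, number] {
  const diff = mean(a) - mean(b);
  const se = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
  return [diff - 1.96 * se, diff + 1.96 * se];
}
```

This is the sense in which "confidence intervals no longer overlap" is a stopping signal: once the interval on the difference excludes zero, the observed gap is unlikely to be noise.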