EvalsHub AI

SDK Reference

Integrate EvalsHub into your application with the evalshub package. Supports OpenAI via wrapOpenAI and any provider via trace().

Install

npm add evalshub openai

The openai package is only needed if you use the OpenAI client; for other providers, install evalshub alone and use trace().

Environment variables

The SDK reads these by default. You can override them per call via the options object.

  • EVALSHUB_API_KEY (required) — API key from your project settings.
  • EVALSHUB_PROJECT_ID (required) — Project ID.
  • EVALSHUB_BASE_URL (optional) — Defaults to https://evalshub.ai. Set to http://localhost:3000 for local dev.
  • EVALSHUB_DATABASE_ID (optional) — Dataset ID to associate traces with (e.g. for "Setup tracing" and building datasets from production).
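You can also supply these values per call instead of via the environment; a minimal sketch, pointing at a local dev server (the key and ID values are placeholders):

```typescript
import { trace } from "evalshub";

// Per-call override of the env-var defaults, e.g. for local development.
trace(
  { model: "gpt-4o-mini", input: "ping", output: "pong" },
  {
    apiKey: "your-api-key",        // overrides EVALSHUB_API_KEY
    projectId: "project-uuid",     // overrides EVALSHUB_PROJECT_ID
    baseUrl: "http://localhost:3000", // overrides EVALSHUB_BASE_URL
  }
);
```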

wrapOpenAI(client, options?)

Wraps an OpenAI client so that every chat.completions.create call is traced. The SDK sends model, messages, response text, and latency. The wrapper preserves the original API; your code stays the same.

import { wrapOpenAI } from "evalshub";
import OpenAI from "openai";

const openai = wrapOpenAI(new OpenAI());

// Same API as before; each call is traced
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Hello" }],
});

You can pass the same options as for trace() (e.g. databaseID, promptVersionId, flush).
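For example, to attribute every call made through the wrapped client to a dataset and prompt version (the IDs below are placeholders), something like:

```typescript
import { wrapOpenAI } from "evalshub";
import OpenAI from "openai";

// All chat.completions.create calls through this client are traced
// into the given dataset and attributed to the given prompt version.
const openai = wrapOpenAI(new OpenAI(), {
  databaseID: "dataset-uuid",
  promptVersionId: "version-uuid",
});
```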

trace(payload, options?)

Provider-agnostic: send one LLM call to EvalsHub. Use this for Anthropic, custom APIs, or when you don’t use the OpenAI SDK. Payload is the raw input and output you want stored and scored.

import { trace } from "evalshub";

// After your LLM call
trace(
  {
    model: "gpt-4o-mini",
    input: [{ role: "user", content: prompt }],
    output: responseText,
    latencyMs: Date.now() - start,
    // optional:
    provider: "openai",
    metadata: { userId: "123" },
    databaseID: "dataset-uuid",
    promptVersionId: "version-uuid",
    sessionId: "session-uuid",
  },
  { apiKey, projectId, baseUrl, flush: false }
);

TracePayload fields

  • model — Model identifier (e.g. gpt-4o, claude-3-sonnet).
  • input — Raw input: messages array, prompt string, or any JSON-serializable object.
  • output — Raw response: completion text or any JSON-serializable object.
  • provider, latencyMs, metadata — Optional.
  • databaseID, promptVersionId, sessionId, parentLogId, spanOrder — Optional; can also be set in the second-argument options.
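As a sketch of the provider-agnostic path, tracing an Anthropic call might look like this (assuming the @anthropic-ai/sdk client; field names follow the payload list above):

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { trace } from "evalshub";

const anthropic = new Anthropic();
const start = Date.now();

const msg = await anthropic.messages.create({
  model: "claude-3-sonnet-20240229",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello" }],
});

// Send the raw input and output to EvalsHub; fire-and-forget by default.
trace({
  model: "claude-3-sonnet-20240229",
  provider: "anthropic",
  input: [{ role: "user", content: "Hello" }],
  output: msg.content[0].type === "text" ? msg.content[0].text : "",
  latencyMs: Date.now() - start,
});
```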

Options (EvalsHubOptions)

Pass as the second argument to trace() or wrapOpenAI(). Each option overrides the corresponding environment variable.

  • apiKey, projectId, baseUrl — Same as env vars.
  • databaseID — Dataset ID. Traces show up in that dataset’s Setup tracing tab and can be used to build datasets from production.
  • promptVersionId — Attribute this trace to a prompt version for per-version scoring and A/B analytics.
  • sessionId — Group multiple spans into one trace (e.g. one conversation). Use with spanOrder for ordering.
  • parentLogId, spanOrder — For agentic/chained calls: parent span ID and order of this span in the session.
  • flush — If true, the function returns a Promise that resolves after the ingest request completes. Default false (fire-and-forget) so latency is not added to your app.
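In short-lived environments (e.g. serverless handlers) the process may exit before a fire-and-forget request completes; awaiting with flush: true avoids losing the trace. A sketch:

```typescript
import { trace } from "evalshub";

// In a serverless handler, await the ingest so the runtime
// doesn't freeze or exit before the trace is delivered.
await trace(
  { model: "gpt-4o-mini", input: "Hello", output: "Hi there" },
  { flush: true }
);
```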

getVariant(options?)

For A/B testing: returns the assigned prompt version ID for the current request. Call getVariant() before your LLM call, use that version's prompt when calling the LLM, then pass the same promptVersionId into trace(). Returns { promptVersionId: string | null }; when there is no active A/B test, promptVersionId is null. See A/B Testing for the full integration flow.
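The flow above might be sketched as follows, assuming getVariant() is asynchronous; the prompt lookup and LLM call are hypothetical helpers, since mapping a version ID to its prompt text is app-specific:

```typescript
import { getVariant, trace } from "evalshub";

// 1. Ask EvalsHub which prompt version this request is assigned to.
const { promptVersionId } = await getVariant();

// 2. Use that version's prompt; null means no active A/B test.
const prompt = promptVersionId
  ? promptForVersion(promptVersionId) // hypothetical lookup helper
  : DEFAULT_PROMPT;

const responseText = await callLLM(prompt); // your LLM call

// 3. Attribute the trace to the same variant.
trace({
  model: "gpt-4o-mini",
  input: prompt,
  output: responseText,
  promptVersionId: promptVersionId ?? undefined,
});
```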

A/B test attribution (trace only)

If you call trace() without promptVersionId, the SDK may fetch the active A/B test and assign a variant for that trace only. The trace is then attributed to that variant. For real A/B testing you should use getVariant() first so your app can run the LLM with the correct prompt for that variant, then pass the same promptVersionId into trace().