EvalsHub AI

Documentation

Welcome to the EvalsHub AI documentation. Learn how to build, test, and monitor your AI applications with precision and scale.

What is EvalsHub?

EvalsHub is an observability and evaluation platform for AI applications. You instrument your app once with our lightweight SDK—wrapping your OpenAI client or sending raw input/output for any provider—and every LLM call is traced to EvalsHub. From there you can run evaluations on datasets, compare prompt versions with A/B tests, automate checks in CI/CD, and run red-team safety scans.

  • Tracing — Capture model, messages, response text, and latency with minimal code.
  • Evaluations — Run experiments against datasets using built-in and custom scorers (e.g. LLM-as-a-judge).
  • CI/CD — Trigger evals on every push via GitHub webhooks and fail builds when scores regress.
  • Red Teaming — Automated safety scans for prompt injection, jailbreaking, and other adversarial tests.
  • A/B Testing — Route live traffic between prompt versions and compare quality, cost, and latency in real time.
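To make the tracing bullet concrete, here is a minimal sketch of the wrap-and-trace pattern: a client is wrapped once, and every call records model, messages, response text, and latency without changing call sites. The client shape, `wrapClient`, and `sendTrace` below are illustrative stand-ins, not the actual EvalsHub SDK; see the SDK Reference for the real `wrapOpenAI` and `trace()` APIs.

```typescript
// Illustrative sketch of the wrap-and-trace pattern; names are assumptions.
interface ChatRequest { model: string; messages: { role: string; content: string }[] }
interface ChatResponse { text: string }
interface ChatClient { chat(req: ChatRequest): Promise<ChatResponse> }

interface Trace {
  model: string
  messages: ChatRequest["messages"]
  response: string
  latencyMs: number
}

const traces: Trace[] = []
// Stand-in for the SDK's exporter, which would send the trace to EvalsHub.
function sendTrace(t: Trace): void { traces.push(t) }

// Wrap any client so every call is captured transparently.
function wrapClient(client: ChatClient): ChatClient {
  return {
    async chat(req: ChatRequest): Promise<ChatResponse> {
      const start = Date.now()
      const res = await client.chat(req)
      sendTrace({
        model: req.model,
        messages: req.messages,
        response: res.text,
        latencyMs: Date.now() - start,
      })
      return res
    },
  }
}

// Usage with a mock client standing in for a real provider:
const mock: ChatClient = { async chat() { return { text: "hello" } } }
const client = wrapClient(mock)
client.chat({ model: "gpt-4o", messages: [{ role: "user", content: "hi" }] })
  .then((r) => console.log(r.text))
```

The point of the pattern is that instrumentation lives in one place: application code keeps calling `chat()` as before, and the wrapper handles capture and export.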

Key concepts

Projects

A project groups prompts, datasets, experiments, and traces. You get an API key and project ID when you create a project; the SDK uses these to send traces to the right place.
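Since the SDK needs the API key and project ID to route traces, a typical setup reads them from the environment. The variable names and `loadConfig` helper below are assumptions for illustration; check the SDK Reference for the actual environment variables.

```typescript
// Hypothetical configuration shape; the env var names are assumptions.
interface EvalsHubConfig { apiKey: string; projectId: string }

function loadConfig(env: Record<string, string | undefined>): EvalsHubConfig {
  const apiKey = env["EVALSHUB_API_KEY"]
  const projectId = env["EVALSHUB_PROJECT_ID"]
  if (!apiKey || !projectId) {
    throw new Error("Missing EvalsHub credentials: set EVALSHUB_API_KEY and EVALSHUB_PROJECT_ID")
  }
  return { apiKey, projectId }
}
```

Failing fast on missing credentials keeps misconfigured deployments from silently dropping traces.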

Datasets

Datasets are collections of golden test cases: rows of inputs, each optionally paired with an expected output. A dataset can be input-only (the model generates output during the eval) or input/output (model output is compared to the expected answer). You create datasets by uploading a file, entering rows manually, or generating them from production traces.
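The two dataset shapes can be sketched as follows; the field names are assumptions for illustration, not the platform's actual schema.

```typescript
// Illustrative dataset row shape; field names are assumptions.
interface DatasetRow {
  input: string
  expected?: string // absent in input-only datasets
}

// Input-only: the model generates the output during the eval.
const inputOnly: DatasetRow[] = [
  { input: "Summarize this support ticket in one sentence." },
]

// Input/output: model output is compared to the expected answer.
const golden: DatasetRow[] = [
  { input: "What is 2 + 2?", expected: "4" },
]
```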

Experiments

An experiment runs a chosen prompt (and model) over a dataset and scores each row with your configured scorers. Results show pass/fail, scores, and latency. Experiments are what you run in CI and use to compare versions.
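Conceptually, an experiment is a loop: generate an output for each row, score it, and record pass/fail and latency. The sketch below shows that loop with a simple exact-match scorer; the function names and result shape are assumptions, not the platform's API.

```typescript
// Conceptual sketch of an experiment run; names are illustrative.
interface Row { input: string; expected?: string }
interface Result {
  input: string
  output: string
  score: number
  passed: boolean
  latencyMs: number
}

// A scorer maps a model output (and its row) to a numeric score.
type Scorer = (output: string, row: Row) => number

// Built-in-style exact-match scorer: 1 on match, 0 otherwise.
const exactMatch: Scorer = (output, row) =>
  row.expected !== undefined && output.trim() === row.expected.trim() ? 1 : 0

async function runExperiment(
  rows: Row[],
  generate: (input: string) => Promise<string>, // the prompt + model under test
  scorer: Scorer,
  passThreshold = 1,
): Promise<Result[]> {
  const results: Result[] = []
  for (const row of rows) {
    const start = Date.now()
    const output = await generate(row.input)
    const latencyMs = Date.now() - start
    const score = scorer(output, row)
    results.push({ input: row.input, output, score, passed: score >= passThreshold, latencyMs })
  }
  return results
}
```

The same loop works for any scorer, including an LLM-as-a-judge, since a scorer is just a function from output to score.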

Prompts & versions

Prompts are versioned. Each version has a system prompt and model config. Evals and A/B tests run against specific versions so you can track quality over time.
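A prompt version bundles a system prompt with its model config, so evals and A/B tests can pin an exact configuration. The shape below is an assumption for illustration, not the platform's actual schema.

```typescript
// Illustrative shape of a versioned prompt; fields are assumptions.
interface PromptVersion {
  version: number
  system: string      // the system prompt
  model: string       // model config pinned to this version
  temperature: number
}

const summarizerV2: PromptVersion = {
  version: 2,
  system: "You are a concise summarizer. Reply in one sentence.",
  model: "gpt-4o-mini",
  temperature: 0.2,
}
```

Because each version is immutable, a regression between v1 and v2 can be traced to a specific change in the system prompt or model config.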

Where to start

  • Quick Start — Install the SDK, wrap your client, and run your first eval in under 5 minutes.
  • SDK Reference — Environment variables, wrapOpenAI, trace(), and options for dataset ID, prompt version, and session grouping.
  • Dataset Management — Types, column mapping, creating and generating datasets.
  • Prompt Management — Versioning and using prompts in evals and A/B tests.
  • CI/CD Integration — Webhook setup, baselines, and failing the build on regressions.
  • Red Teaming — Safety categories and automated scans.
  • A/B Testing — Routing traffic and reading analytics.
  • Best Practices — Evaluation design, dataset curation, and production tracing.

Can't find what you're looking for?

Our support team is always ready to help you with any questions or integration challenges.

Contact Support