CI/CD Integration
Automate your evaluation pipelines and catch regressions before they hit production. Run evals on every push without writing a GitHub Actions workflow.
Overview
The EvalsHub CI/CD integration runs your evaluation on every push via a GitHub webhook. EvalsHub receives the push event, runs the experiment, and records the run with commit SHA and branch. You set a baseline after a good run; later runs pass or fail by comparison. This keeps prompt and model changes from shipping when they regress quality.
Step-by-step: GitHub webhook
EvalsHub uses a webhook so you don’t need a GitHub Actions workflow. Configure it once per repo.
- In EvalsHub, go to Settings → CI/CD (or your project’s CI settings). Pick the project and experiment that should run on each push. Create a CI key and copy the webhook URL and secret.
- In your GitHub repo: Settings → Webhooks → Add webhook.
- Set Payload URL to the URL from EvalsHub (it includes your project and experiment IDs).
- Set Content type to application/json.
- Paste your EvalsHub CI key into Secret so we can verify the request.
- Under "Which events would you like to trigger this webhook?" choose Just the push event (or leave the default if you’re okay with more events). Save the webhook.
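The Secret you paste is used for GitHub’s standard delivery verification: GitHub computes an HMAC-SHA256 digest of each request body with the secret and sends it in the `X-Hub-Signature-256` header, and the receiver recomputes and compares it. A minimal sketch of that check (this is GitHub’s documented mechanism, not EvalsHub-specific code):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it,
    in constant time, to the digest GitHub sent in X-Hub-Signature-256."""
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

A delivery whose header does not match the recomputed digest should be rejected; this is what stops third parties from triggering runs by posting forged payloads at the webhook URL.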
On each push, GitHub sends the event to EvalsHub; we run the experiment and record the run (commit and branch from the payload). In the dashboard you’ll see CI run history with status (pass/fail), scores, and links to the commit.
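The commit and branch come from standard fields of GitHub’s push event payload: `after` holds the head commit SHA and `ref` holds the branch as `refs/heads/<name>`. A small sketch of extracting them (the helper name is illustrative):

```python
def commit_info(payload: dict) -> tuple[str, str]:
    """Pull the head commit SHA and branch name from a GitHub push payload.

    "after" is the SHA the branch points to after the push; "ref" is the
    full Git ref, e.g. "refs/heads/main"."""
    sha = payload["after"]
    branch = payload["ref"].removeprefix("refs/heads/")
    return sha, branch
```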
Baselines and failing the build
After your first successful run, set it as the baseline. Future runs are compared to that baseline. You can configure the evaluation to fail the CI run (and optionally block the merge) when:
- Overall accuracy (or chosen metric) falls below a fixed threshold.
- Score drops by more than a configured delta vs the baseline.
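The two failure conditions above amount to a simple gate: an absolute floor plus a maximum allowed drop against the baseline. A sketch with hypothetical threshold values (your configured values will differ):

```python
def run_passes(score: float, baseline: float,
               min_score: float = 0.80, max_drop: float = 0.05) -> bool:
    """Gate a CI eval run: fail if the score is below the absolute
    threshold, or if it dropped more than max_drop vs the baseline."""
    if score < min_score:
        return False          # below the fixed threshold
    if baseline - score > max_drop:
        return False          # regressed too far from the baseline
    return True
```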
Failed runs show up in the CI runs list with the reason (e.g. score below threshold). Use detailed experiment results in the dashboard to see which rows regressed.
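To make “which rows regressed” concrete, here is a hypothetical helper that diffs per-row scores between a baseline run and the current run (the data shape is illustrative, not EvalsHub’s API):

```python
def regressed_rows(baseline: dict[str, float], current: dict[str, float],
                   tolerance: float = 0.0) -> list[str]:
    """Row IDs whose score dropped by more than `tolerance` vs baseline.
    Rows absent from the current run are ignored."""
    return [row_id for row_id, base_score in baseline.items()
            if row_id in current and base_score - current[row_id] > tolerance]
```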
What gets run
The CI run executes the exact experiment you selected: same dataset, prompt version, and scorers. Only the trigger changes (push instead of manual). Make sure the experiment runs cleanly when triggered manually before relying on it in CI.
Pro tip
Run evaluations in parallel with your unit tests to minimize latency in your CI pipeline. EvalsHub handles concurrency, and the webhook responds after the run is recorded, so you can poll for results or report commit statuses via the GitHub Status API separately.
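If you do wire up the GitHub Status API yourself, the commit status you POST to `/repos/{owner}/{repo}/statuses/{sha}` is a small JSON body. A sketch of building it (the `evalshub/eval` context name and helper are just examples, not something EvalsHub provides):

```python
def status_payload(state: str, score: float, run_url: str) -> dict:
    """Body for GitHub's create-commit-status endpoint.

    `state` must be one of "success", "failure", "pending", or "error";
    `target_url` gives reviewers a link back to the eval run."""
    return {
        "state": state,
        "context": "evalshub/eval",   # example context name
        "description": f"eval score {score:.2f}",
        "target_url": run_url,
    }
```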