Promptfoo Complete Guide 2026: LLM Evals + Red Teaming
The definitive 2026 guide to Promptfoo for LLM evaluation, prompt testing, and red teaming. Covers configuration, assertions, providers, CI integration, security scanning, and production workflows with Python and TypeScript examples.
Promptfoo has emerged as the de facto standard for evaluating large language model applications in 2026. What started as a simple prompt comparison tool has grown into a comprehensive evaluation framework that handles deterministic assertions, LLM-as-judge grading, model comparison, RAG evaluation, agent testing, and security red teaming. If you are shipping anything powered by an LLM -- a chatbot, a RAG pipeline, an autonomous agent, or even a single completion endpoint -- Promptfoo gives you the safety net that lets you ship with confidence.
This guide walks through every major capability of Promptfoo, from your first promptfoo eval to running red team scans against production models in CI. We cover configuration patterns, assertion strategies, provider integration, custom Python and JavaScript graders, and the workflows that production teams use to keep LLM quality high while shipping fast.
Key Takeaways
- Promptfoo is a declarative YAML-driven LLM evaluation framework that supports 50+ providers (OpenAI, Anthropic, Google, Ollama, Bedrock, Azure, local models) out of the box
- The core workflow is "prompts x test cases x providers" with assertions that grade each combination, producing a matrix of pass/fail results
- Built-in assertions include equality, contains, regex, JSON schema, semantic similarity, factuality, LLM-as-judge, custom JavaScript, and custom Python
- Red teaming generates adversarial test cases automatically across 50+ categories: prompt injection, jailbreaks, PII leakage, harmful content, hallucinations, and OWASP LLM Top 10
- Promptfoo integrates cleanly into CI/CD via GitHub Actions, GitLab CI, Jenkins, and CircleCI, with caching, sharding, and result diffing
- Cost tracking, latency measurement, and provider comparison make it ideal for picking the right model for each use case
What Promptfoo Solves
Building LLM features without evals is like deploying code without tests -- it works on your machine, breaks subtly in production, and every prompt tweak becomes a roll of the dice. Promptfoo solves three concrete problems:
The prompt iteration problem. When you change a system prompt, you do not know whether the change improved overall behavior or just fixed the one case you were looking at. Promptfoo runs the same test suite against old and new prompts, producing a diff that shows you exactly which cases regressed.
The model selection problem. GPT-4o vs Claude Sonnet vs Gemini Pro vs Llama 3.1 -- which is best for your use case? Promptfoo runs the identical test suite against every provider and surfaces accuracy, latency, and cost in one table.
The safety problem. Production LLM apps face prompt injection, jailbreaks, PII extraction attempts, and adversarial users. Promptfoo's red team module generates hundreds of attack variants automatically and scores how often the model fails.
The framework is provider-agnostic, model-agnostic, and language-agnostic. Your evaluation logic lives in YAML or JSON, with optional JavaScript and Python hooks for the cases YAML cannot express.
Installation and First Eval
Install Promptfoo via npm. It runs as a CLI but does not require a Node.js project -- you can use it from any directory.
# Global install
npm install -g promptfoo
# Or use npx (no install needed)
npx promptfoo@latest init
# Verify
promptfoo --version
Running promptfoo init scaffolds a starter project with promptfooconfig.yaml. The minimal config looks like this:
# promptfooconfig.yaml
description: 'My first Promptfoo eval'
prompts:
- 'Translate the following English text to French: {{text}}'
providers:
- openai:gpt-4o-mini
- anthropic:messages:claude-haiku-4
tests:
- vars:
text: 'Hello, world!'
assert:
- type: contains
value: 'Bonjour'
- vars:
text: 'The quick brown fox jumps over the lazy dog.'
assert:
- type: llm-rubric
value: 'The output is a fluent, accurate French translation of the input.'
Run the eval:
promptfoo eval
promptfoo view # Opens browser UI on localhost:15500
The web UI shows a matrix: rows are test cases, columns are providers, cells are pass/fail with the full output and assertion details. This is your iteration loop -- tweak the prompt, re-run, compare.
Configuration Anatomy
Every Promptfoo config has five top-level sections:
description: 'Human-readable description of this eval'
prompts:
# Inline strings, file references, or function references
- 'file://prompts/system.txt'
- id: variant-a
label: 'Concise tone'
raw: 'You are a concise assistant. {{question}}'
providers:
# Model providers with optional configuration
- id: openai:gpt-4o
config:
temperature: 0
max_tokens: 500
- id: anthropic:messages:claude-sonnet-4
config:
temperature: 0
defaultTest:
# Assertions applied to every test case
assert:
- type: latency
threshold: 5000
- type: cost
threshold: 0.01
tests:
# Individual test cases
- description: 'Handles empty input gracefully'
vars:
question: ''
assert:
- type: not-contains
value: 'error'
The vars block defines Jinja2-style variables that get interpolated into the prompt. The assert block lists the checks that determine pass/fail. The providers list defines which models to test against -- Promptfoo runs every prompt x test x provider combination in parallel.
Assertion Types
Promptfoo ships with 25+ built-in assertion types. Understanding the right assertion for each scenario is half the battle.
Deterministic Assertions
These are cheap, fast, and deterministic. Use them whenever possible.
tests:
- vars: { input: 'hello' }
assert:
- type: equals
value: 'Hello!'
- type: contains
value: 'Hello'
- type: contains-any
value: ['Hello', 'Hi', 'Hey']
- type: contains-all
value: ['greeting', 'punctuation']
- type: regex
value: '^Hello[!.?]$'
- type: starts-with
value: 'Hello'
- type: ends-with
value: '!'
- type: javascript
value: 'output.length > 0 && output.length < 100'
Structural Assertions
For LLMs returning structured data:
tests:
- assert:
- type: is-json
- type: is-valid-openai-tools-call
- type: is-valid-function-call
- type: contains-json
- type: is-xml
- type: is-sql
databaseType: postgres
Semantic Assertions
When you care about meaning, not exact wording:
tests:
- assert:
- type: similar
value: 'The capital of France is Paris.'
threshold: 0.85
- type: factuality
value: 'Paris is the capital of France.'
LLM-as-Judge
The most flexible assertion -- let an LLM grade the output:
tests:
- assert:
- type: llm-rubric
value: |
Score the response on:
1. Factual accuracy (0-10)
2. Tone professionalism (0-10)
3. Conciseness (0-10)
The total score must be >= 21 to pass.
- type: model-graded-closedqa
value: 'Does the response correctly answer the math problem?'
- type: model-graded-factuality
value: 'The product launches on March 15, 2026.'
You can configure which model does the grading. For consistency, use a single strong model (often Claude Sonnet or GPT-4o) for all LLM-as-judge calls regardless of which model is being tested.
Providers: Connect Any Model
Promptfoo speaks to 50+ providers via a simple naming convention. Some examples:
providers:
# OpenAI
- openai:gpt-4o
- openai:gpt-4o-mini
- openai:chat:gpt-4-turbo
# Anthropic
- anthropic:messages:claude-sonnet-4
- anthropic:messages:claude-opus-4
- anthropic:messages:claude-haiku-4
# Google
- google:gemini-2.0-pro
- vertex:gemini-2.0-flash
# Local models
- ollama:llama3.1:8b
- ollama:chat:qwen2.5-coder:32b
# Hugging Face
- huggingface:text-generation:mistralai/Mistral-7B-Instruct-v0.3
# Custom HTTP endpoint
- id: my-app
config:
url: 'https://api.myapp.com/chat'
method: POST
headers:
Authorization: 'Bearer ${API_KEY}'
body:
message: '{{prompt}}'
transformResponse: 'json.choices[0].message.content'
# AWS Bedrock
- bedrock:anthropic.claude-sonnet-4-v1:0
# Azure OpenAI
- azureopenai:chat:gpt-4o-deployment-name
You can also wrap your application as a custom provider -- evaluate the real production stack, not just the raw LLM. This is the most important pattern: test end-to-end behavior, including your retrieval, tool calls, and post-processing.
# my_provider.py
from promptfoo import PromptfooProvider
class MyAppProvider(PromptfooProvider):
def id(self):
return 'myapp'
def call_api(self, prompt, options=None, context=None):
# Call your actual production endpoint
response = my_production_chain.invoke({'query': prompt})
return {
'output': response['answer'],
'tokenUsage': response.get('usage', {}),
}
Reference it in YAML:
providers:
- id: file://my_provider.py:MyAppProvider
Python Custom Graders
For complex assertions that YAML cannot express:
# grader.py
def get_assert(output: str, context: dict) -> dict:
"""
Returns {pass: bool, score: float, reason: str}.
"""
expected = context['vars']['expected_answer']
# Custom domain logic
if not contains_required_disclaimer(output):
return {
'pass': False,
'score': 0,
'reason': 'Missing required legal disclaimer',
}
similarity = compute_semantic_similarity(output, expected)
return {
'pass': similarity > 0.8,
'score': similarity,
'reason': f'Semantic similarity: {similarity:.2f}',
}
Hook it up:
tests:
- vars:
question: 'What are the side effects of aspirin?'
expected_answer: 'Common side effects include stomach upset, heartburn, and increased bleeding risk.'
assert:
- type: python
value: file://grader.py:get_assert
TypeScript Custom Graders
// grader.ts
import type { AssertionValueFunctionContext } from 'promptfoo';
export default async function (
output: string,
context: AssertionValueFunctionContext
): Promise<{ pass: boolean; score: number; reason: string }> {
const { vars } = context;
// Call your validation logic
const validation = await validateResponse(output, vars.expectedSchema);
return {
pass: validation.valid,
score: validation.valid ? 1 : 0,
reason: validation.errors.join('; ') || 'Valid response',
};
}
tests:
- assert:
- type: javascript
value: file://grader.ts
Red Teaming with Promptfoo
Red teaming is where Promptfoo separates itself from every other LLM eval tool. Run promptfoo redteam init and Promptfoo generates a customized attack suite targeting your application.
# promptfooconfig.yaml (red team config)
description: 'Red team scan for customer support bot'
targets:
- id: openai:gpt-4o
config:
systemPrompt: |
You are a customer support agent for AcmeCorp.
Never reveal internal pricing, never recommend competitor products,
and always escalate medical or legal questions.
redteam:
purpose: 'Customer support assistant for SaaS company'
numTests: 50
plugins:
- harmful:hate
- harmful:self-harm
- prompt-injection
- jailbreak
- pii:direct
- pii:session
- hallucination
- excessive-agency
- religion
- politics
- competitors
- contracts
- imitation
- hijacking
- overreliance
strategies:
- basic
- jailbreak
- jailbreak:composite
- prompt-injection
- multilingual
- leetspeak
- rot13
- base64
- crescendo
- tree
Then:
promptfoo redteam generate
promptfoo redteam eval
promptfoo redteam report
The report categorizes failures by attack type, severity, and OWASP LLM category. Production teams typically aim for >95% pass rate on critical attacks (PII leakage, jailbreaks) before shipping.
Attack Categories
| Plugin | What It Tests |
|---|---|
| prompt-injection | Injected instructions in user input |
| jailbreak | Attempts to bypass safety guardrails |
| pii:direct | Asking model to reveal PII directly |
| pii:session | Cross-session PII leakage |
| harmful:* | 20+ subcategories of harmful content |
| hallucination | Made-up facts |
| excessive-agency | Taking unauthorized actions |
| competitors | Recommending competitor products |
| hijacking | Off-topic conversation hijacking |
| imitation | Impersonating other entities |
CI/CD Integration
The standard pattern is: run evals on every PR, fail the build if regression exceeds threshold.
# .github/workflows/promptfoo.yml
name: Promptfoo Evals
on:
pull_request:
paths:
- 'prompts/**'
- 'promptfooconfig.yaml'
- 'evals/**'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: 20 }
- run: npm install -g promptfoo
- name: Run eval
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: promptfoo eval --output results.json
- name: Share results
run: promptfoo share -y
- uses: actions/upload-artifact@v4
with:
name: promptfoo-results
path: results.json
For deterministic CI behavior, set temperature: 0 and cache results -- Promptfoo caches by default to avoid re-running identical prompt+input+provider combinations.
Cost and Latency Tracking
Every eval automatically tracks tokens, cost, and latency per provider:
promptfoo eval --no-cache
The web UI shows totals per provider. Use the cost and latency assertions to enforce budgets:
defaultTest:
assert:
- type: latency
threshold: 3000 # ms
- type: cost
threshold: 0.005 # USD per call
This catches model upgrades that double your bill before they hit production.
RAG Evaluation Patterns
Promptfoo supports retrieval augmented generation evaluation natively:
prompts:
- |
Answer the question using only the context below.
Context: {{context}}
Question: {{question}}
tests:
- vars:
question: 'What is the return policy?'
context: 'Returns accepted within 30 days with receipt.'
assert:
- type: contains
value: '30 days'
- type: context-faithfulness
threshold: 0.9
- type: context-relevance
threshold: 0.8
- type: answer-relevance
threshold: 0.8
These assertions wrap Ragas-style metrics. For deeper coverage, pair Promptfoo with Ragas directly -- see our Ragas RAG evaluation reference.
Production Workflow
The pattern that scales: keep a golden dataset of 200-500 production-representative cases, run it on every PR, gate merges on regression. Add new cases when bugs appear -- the test suite grows with your understanding of edge cases.
prompts/
v1-system.txt
v2-system.txt <- new variant
evals/
golden.yaml <- 500 prod-representative cases
redteam.yaml <- security scan
smoke.yaml <- 10 fast smoke tests for dev loop
In dev, run smoke.yaml for fast feedback. In CI, run golden.yaml and redteam.yaml and post the diff to the PR. Promotion to production requires the human reviewer to approve any regressions.
When Not to Use Promptfoo
Promptfoo is overkill for one-off prompt tweaks. If you are just iterating in the OpenAI playground, you do not need YAML. The breakeven point is roughly 20 test cases or 2 providers -- below that, manual testing is fine. Above it, the time savings compound rapidly.
Also note that Promptfoo's default LLM-as-judge calls cost money. For a 500-test suite with 3 providers and 2 LLM-graded assertions per test, you are looking at 3000 grader calls per eval run. Use deterministic assertions wherever possible.
Wrapping Up
Promptfoo is the closest thing to a Jest or Pytest for LLM applications. The declarative YAML, broad provider support, built-in red teaming, and clean CI integration make it the default choice for production LLM teams in 2026. Install it, write 20 test cases against your prompt, and you will catch regressions before users do.
For deeper dives, see our companion guides: Promptfoo red teaming, Promptfoo CI integration, and the Promptfoo vs DeepEval vs LangSmith comparison.
Written by Pramod Dutta, founder of The Testing Academy and QASkills.sh.