LangChain Evaluators Complete Guide 2026
Master LangChain evaluators in 2026. Complete guide to string evaluators, comparison evaluators, trajectory evaluators, custom evaluators, and integration with LangSmith for production tracking.
LangChain shipped its evaluator system to let teams measure the quality of chains, agents, and RAG pipelines without leaving the LangChain ecosystem. The evaluators integrate with LangSmith for production tracking, support both reference-free and reference-based evaluation, and cover string outputs, paired comparisons, and agent trajectories. If you build on LangChain, the evaluator module is the natural choice for quality measurement: it speaks the same data types as your chains and uses the same model wrappers.
This guide is a complete tour of LangChain evaluators in 2026. We cover the three evaluator categories (string, comparison, trajectory), every built-in evaluator type, custom evaluator authoring, the integration with LangSmith, and how LangChain evaluators compare to alternatives like Ragas and OpenAI Evals. Code samples are full Python that you can run as-is. By the end you should be able to instrument your LangChain application with comprehensive evaluation in an afternoon. Use this as a reference whenever you set up new LangChain evals.
Key Takeaways
- LangChain ships three evaluator categories: string, comparison, and trajectory. Each targets a different evaluation need.
- String evaluators score a single output (criteria, embedding distance, JSON validity, regex, exact match, etc.).
- Comparison evaluators score two outputs against each other (pairwise preference, head-to-head).
- Trajectory evaluators score multi-step agent trajectories (tool-use correctness, step efficiency).
- All evaluators integrate with LangSmith for production monitoring, dataset versioning, and dashboards.
- Custom evaluators are simple to write by extending the base evaluator class.
Why LangChain Evaluators
If your application is built on LangChain, using LangChain evaluators provides three concrete advantages. First, the data types match. String evaluators take your chain's input and output strings directly, and the trajectory evaluator takes the agent's intermediate steps in the form the agent returns them. No conversion required. Second, the LLM wrappers are the same. The evaluator uses the same ChatOpenAI or ChatAnthropic instance you already configured. Third, LangSmith integration is short. A single evaluate call runs your evaluators over a dataset and logs the results to LangSmith.
For non-LangChain applications, the evaluator module can still work but loses some of the seamlessness. In those cases, Ragas or OpenAI Evals may be a better fit. But within the LangChain ecosystem, the evaluators are the path of least resistance.
Installation
pip install langchain langchain-openai langsmith
export OPENAI_API_KEY=sk-...
export LANGCHAIN_API_KEY=ls__... # for LangSmith integration
export LANGCHAIN_TRACING_V2=true
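The evaluator module is in langchain.evaluation. The base evaluators work without LangSmith, but you lose dashboards and dataset versioning. For serious use, LangSmith is recommended. Two evaluators need extra packages: string_distance depends on rapidfuzz and JsonSchemaEvaluator depends on jsonschema. The import paths in this guide assume a release that still ships langchain.evaluation; on LangChain 1.x the legacy modules moved to the langchain-classic package, so check whether the module is exposed as langchain_classic.evaluation on your install.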
String Evaluators
String evaluators score a single string output against a reference or a criterion. They are the most common evaluator type.
Criteria evaluator
The criteria evaluator scores an output against a list of criteria using an LLM judge. Built-in criteria include conciseness, relevance, correctness, coherence, harmfulness, and helpfulness.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0)
evaluator = load_evaluator("criteria", criteria="conciseness", llm=llm)
result = evaluator.evaluate_strings(
    prediction="The capital of France is Paris, which is a city in Europe famous for many things including the Eiffel Tower.",
    input="What is the capital of France?",
)
print(result)
The result is a dict with reasoning, value ("Y" or "N"), and score (1 or 0). The judge LLM explains its decision in reasoning, which helps debug grader behavior.
You can define custom criteria.
custom_criteria = {
    "professionalism": "Is the response written in a professional tone appropriate for customer service?"
}
evaluator = load_evaluator("criteria", criteria=custom_criteria, llm=llm)
Labeled criteria
Labeled criteria evaluators compare the prediction to a reference answer. Use this when you have ground truth.
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)
result = evaluator.evaluate_strings(
    prediction="The capital is Paris.",
    input="What is the capital of France?",
    reference="The capital of France is Paris.",
)
The judge sees the reference and scores the prediction relative to it. Useful when paraphrase is acceptable but you want to catch wrong answers.
Embedding distance
The embedding distance evaluator computes the distance between the prediction and reference embeddings, using cosine distance by default. The score is a float where lower means more similar; a score near 0 means the two embeddings point in almost the same direction.
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("embedding_distance")
result = evaluator.evaluate_strings(
    prediction="Paris is the capital.",
    reference="The capital is Paris.",
)
Embedding distance is fast and cheap. Use it as a pre-filter; if the distance is large, the prediction probably needs review.
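Because the evaluator reports a distance, it helps to see how a cosine distance behaves on small inputs. The following is a minimal plain-Python sketch, not LangChain's implementation; the three-dimensional vectors are made up for illustration and stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, for illustration only).
reference = [0.9, 0.1, 0.0]
close_prediction = [0.8, 0.2, 0.0]
far_prediction = [0.0, 0.1, 0.9]

close = cosine_distance(reference, close_prediction)
far = cosine_distance(reference, far_prediction)
print(f"close={close:.3f} far={far:.3f}")
```

The prediction that resembles the reference gets the smaller number, which is why a pre-filter on this metric should flag high scores rather than low ones.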
String distance
The string distance evaluator uses edit distance metrics (Levenshtein, Damerau-Levenshtein, Jaro-Winkler, etc.). Useful for tasks where the output is a specific string with potential typos or variations.
evaluator = load_evaluator(
    "string_distance",
    distance="damerau_levenshtein",
)
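To make the idea of an edit distance concrete, here is a small plain-Python sketch of the classic Levenshtein distance with a length-normalized variant. It is an illustration only: it implements plain Levenshtein rather than the Damerau-Levenshtein variant selected above, and it is not the code the evaluator runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions needed."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit count to [0, 1] using the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_distance("Paris", "Pariss"))
```

Normalizing by length keeps scores comparable between short and long outputs, which matters when one threshold is applied across a whole dataset.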
JSON evaluators
For structured output, JSON evaluators check schema compliance and content equality.
from langchain.evaluation import JsonValidityEvaluator, JsonEqualityEvaluator
validity = JsonValidityEvaluator()
result = validity.evaluate_strings(prediction='{"name": "Paris"}')
equality = JsonEqualityEvaluator()
result = equality.evaluate_strings(
    prediction='{"name": "Paris"}',
    reference='{"name": "Paris"}',
)
JsonSchemaEvaluator validates against a JSON Schema.
from langchain.evaluation import JsonSchemaEvaluator
evaluator = JsonSchemaEvaluator()
result = evaluator.evaluate_strings(
    prediction='{"name": "Paris"}',
    reference='{"type": "object", "required": ["name"]}',
)
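The distinction between validity and equality is easier to see with the standard library alone. This sketch assumes that "equality" means equality of the parsed values, so key order and whitespace do not matter; that is an assumption about intent, not a copy of the LangChain evaluators.

```python
import json


def is_valid_json(prediction: str) -> bool:
    """True when the string parses as JSON."""
    try:
        json.loads(prediction)
        return True
    except json.JSONDecodeError:
        return False


def json_equal(prediction: str, reference: str) -> bool:
    """Compare two JSON documents by parsed value, not by raw text."""
    return json.loads(prediction) == json.loads(reference)


print(is_valid_json('{"name": "Paris"}'))
print(json_equal('{"a": 1, "b": 2}', '{ "b": 2, "a": 1 }'))
```

Comparing parsed values avoids false failures when a model reorders keys or changes spacing, which a raw string comparison would flag.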
Regex match
evaluator = load_evaluator("regex_match")
result = evaluator.evaluate_strings(
    prediction="Order ID: 12345",
    reference=r"Order ID: \d+",
)
Exact match
evaluator = load_evaluator("exact_match")
result = evaluator.evaluate_strings(prediction="Paris", reference="Paris")
| String Evaluator | Best For | Reference Required |
|---|---|---|
| criteria | Quality dimensions | No |
| labeled_criteria | Reference comparison | Yes |
| embedding_distance | Semantic similarity | Yes |
| string_distance | Lexical similarity | Yes |
| json_validity | Structure check | No |
| json_equality | Exact JSON match | Yes |
| json_schema | Schema validation | Yes (schema) |
| regex_match | Pattern matching | Yes (pattern) |
| exact_match | Deterministic match | Yes |
Comparison Evaluators
Comparison evaluators score two predictions against the same input. Useful for A/B testing prompts or models.
Pairwise string
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("pairwise_string", llm=llm)
result = evaluator.evaluate_string_pairs(
    prediction="The capital of France is Paris.",
    prediction_b="Paris is in France.",
    input="What is the capital of France?",
)
The judge picks the better response. The result includes value ("A" or "B"), score (1 if the first prediction is preferred, 0 if the second), and reasoning.
Labeled pairwise
evaluator = load_evaluator("labeled_pairwise_string", llm=llm)
result = evaluator.evaluate_string_pairs(
    prediction="...",
    prediction_b="...",
    input="...",
    reference="...",
)
The labeled variant gives the judge a reference answer to compare both predictions against.
Pairwise embedding distance
evaluator = load_evaluator("pairwise_embedding_distance")
Computes the embedding distance between two predictions; lower means more similar.
Pairwise evaluators are the foundation for preference learning and side-by-side comparisons. Useful when you want to know "which is better" rather than "is this good."
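A single pairwise verdict says little on its own; an A/B decision usually comes from tallying verdicts across a dataset. The sketch below assumes each result dict carries a "value" key holding "A", "B", or None for no preference. That shape is an assumption for illustration, and `win_rates` is a hypothetical helper, not part of LangChain.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional


def win_rates(results: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, float]:
    """Return the share of comparisons won by A, won by B, or tied."""
    counts = Counter(result.get("value") or "tie" for result in results)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairwise results to aggregate")
    return {label: counts.get(label, 0) / total for label in ("A", "B", "tie")}


# Hand-written results standing in for real judge output.
sample_results = [{"value": "A"}, {"value": "A"}, {"value": "B"}, {"value": None}]
rates = win_rates(sample_results)
print(rates)
```

Reporting ties separately keeps an indecisive judge from being read as a narrow win for either side.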
Trajectory Evaluators
Trajectory evaluators score multi-step agent trajectories. They take the full sequence of tool calls and reasoning as input, not just the final answer.
Trajectory evaluator
from langchain.evaluation import load_evaluator
from langchain_core.agents import AgentAction
evaluator = load_evaluator("trajectory", llm=llm)
result = evaluator.evaluate_agent_trajectory(
    prediction="The weather in Paris is sunny.",
    input="What's the weather in Paris?",
    # Each step is an (AgentAction, observation) pair, not a string.
    agent_trajectory=[
        (AgentAction(tool="weather_api", tool_input="Paris", log=""), "sunny, 22 C"),
    ],
)
The judge scores whether the agent's trajectory is reasonable: were the tools necessary, were arguments correct, were the right number of steps taken.
For agent quality, trajectory evaluation is essential. A final answer of "the weather is sunny" looks good in a string evaluator but might come from an agent that called five wrong tools first.
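Some trajectory problems can be caught without an LLM judge at all. The following sketch runs three deterministic checks (unknown tools, repeated calls, step budget) over a simplified trajectory. The `Step` tuple, `trajectory_report`, and its parameters are hypothetical names for illustration; a real run would first map its (AgentAction, observation) pairs into this shape.

```python
from typing import Sequence, Set, Tuple

# (tool_name, tool_input, observation): a simplified stand-in for the
# (AgentAction, observation) pairs an agent run records.
Step = Tuple[str, str, str]


def trajectory_report(steps: Sequence[Step], allowed_tools: Set[str], max_steps: int) -> dict:
    """Summarize cheap, deterministic signals about an agent trajectory."""
    unknown = [tool for tool, _, _ in steps if tool not in allowed_tools]
    calls = [(tool, tool_input) for tool, tool_input, _ in steps]
    repeated = len(calls) - len(set(calls))  # identical tool + input seen again
    return {
        "unknown_tools": unknown,
        "repeated_calls": repeated,
        "within_budget": len(steps) <= max_steps,
    }


steps = [
    ("search", "Paris weather", "no results"),
    ("weather_api", "Paris", "sunny, 22 C"),
    ("weather_api", "Paris", "sunny, 22 C"),
]
report = trajectory_report(steps, allowed_tools={"weather_api"}, max_steps=2)
print(report)
```

Checks like these are a cheap first pass; the LLM-judged trajectory evaluator remains the tool for questions such as whether a step was actually necessary.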
Custom Evaluators
Custom evaluators extend the base StringEvaluator or PairwiseStringEvaluator class. The contract is small.
from typing import Any, Optional
from langchain.evaluation import StringEvaluator
class HasCitationEvaluator(StringEvaluator):
    @property
    def requires_input(self) -> bool:
        return False
    @property
    def requires_reference(self) -> bool:
        return False
    @property
    def evaluation_name(self) -> str:
        return "has_citation"
    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        has_citation = "Source:" in prediction or "[1]" in prediction
        return {"score": 1 if has_citation else 0}
Custom evaluators integrate with LangSmith the same way as built-in evaluators. They appear in the dashboard with their evaluation_name.
LangSmith Integration
LangSmith is where the LangChain evaluator system shines. With a single evaluate call, evaluation results stream to a dashboard with historical trends, regression detection, and dataset versioning.
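from langsmith.evaluation import LangChainStringEvaluator, evaluate
def target_chain(inputs: dict) -> dict:
    # my_chain is your own chain; "question" must match the dataset's input key.
    return {"prediction": my_chain.invoke(inputs["question"])}
results = evaluate(
    target_chain,
    data="my-dataset",
    # Wrap the LangChain evaluator so evaluate() can feed it runs and examples.
    evaluators=[
        LangChainStringEvaluator("criteria", config={"criteria": "conciseness", "llm": llm})
    ],
    experiment_prefix="prompt-v2",
)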
The evaluate function runs the chain against every example in the dataset, applies the evaluators, and logs results to LangSmith. The dashboard shows results grouped by experiment, with diff views between experiments.
Datasets in LangSmith are versioned. You can add, remove, or modify examples and the framework tracks which dataset version each experiment ran against. This avoids confusion when datasets change.
Production Monitoring
LangSmith production monitoring tracks evaluator scores on real traffic. Configure online evaluators that run on a sampled share of production traces.
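from langsmith import Client
client = Client()
# NOTE: illustrative pseudocode. `create_rule` and these parameters are not a
# confirmed part of the langsmith SDK; online evaluators are normally set up
# as automation rules in the LangSmith UI. Check the current SDK before use.
client.create_rule(
    name="conciseness-online",
    project_name="prod",
    evaluator_id="criteria-conciseness",
    sampling_rate=0.1,  # 10% of traces
)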
Online evaluators run on a sample of traces and feed scores to the dashboard. Set sampling rates to control cost: 10% is usually enough to detect regressions without paying full price.
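If you ever need to apply a sampling rate on the application side, for example before sending traces to a custom evaluator, a hash of the trace ID gives a stable decision per trace. This is a minimal sketch under that assumption; `should_evaluate` is a hypothetical helper and says nothing about how LangSmith samples internally.

```python
import hashlib


def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Deterministically select roughly `sampling_rate` of all trace IDs."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


# Count how many of 10,000 synthetic trace IDs fall in a 10% sample.
selected = sum(should_evaluate(f"trace-{i}", 0.1) for i in range(10_000))
print(f"selected {selected} of 10000 traces")
```

Hashing instead of calling a random number generator means the same trace is always either in or out of the sample, which makes reruns and debugging reproducible.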
Alerts fire when an online evaluator score drops below a threshold. The alert includes the failing trace so you can investigate.
Comparison to Alternatives
| Framework | LangChain Integration | Reference-Free | Production Monitoring | Best For |
|---|---|---|---|---|
| LangChain Evaluators | Native | Yes | Via LangSmith | Teams on LangChain |
| Ragas | Adapter | Yes | Via export | RAG-specific |
| OpenAI Evals | Adapter | Yes | Yes | Agents, broad coverage |
| TruLens | Adapter | Yes | Yes | Custom feedback functions |
| DeepEval | None | Yes | Via export | Pytest-native |
LangChain evaluators are the natural choice if your application is on LangChain. The integration is seamless and LangSmith provides excellent production tooling. For non-LangChain applications, the seamlessness is lost and other frameworks become competitive.
Common Patterns
Pattern 1: chain quality check. Wrap your chain in a test that runs a small dataset through it and applies multiple evaluators (correctness, conciseness, relevance). Fail CI if any evaluator drops below threshold.
Pattern 2: model comparison. Use pairwise evaluators to compare two versions of a chain. Useful when changing models or prompts.
Pattern 3: production sampling. Configure online evaluators on production traces with a low sampling rate. Track quality over time without paying full eval cost.
Pattern 4: agent trajectory analysis. For agents, use trajectory evaluators alongside string evaluators. Trajectory catches process issues; string catches output issues.
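Pattern 1 comes down to a threshold gate over collected scores. The sketch below assumes you have already gathered per-evaluator score lists from a test run; `check_thresholds`, the evaluator names, and the threshold values are all hypothetical.

```python
from statistics import mean
from typing import Dict, List


def check_thresholds(scores: Dict[str, List[float]], thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message per evaluator whose mean score is too low."""
    failures = []
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        if not values:
            failures.append(f"{name}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{name}: mean {average:.2f} is below {minimum:.2f}")
    return failures


scores = {"correctness": [1, 1, 0, 1], "conciseness": [1, 0, 0, 0]}
thresholds = {"correctness": 0.7, "conciseness": 0.5, "relevance": 0.8}
failures = check_thresholds(scores, thresholds)
print(failures)
```

In a pytest suite the last step would be `assert not failures, failures`, so the CI log lists every evaluator that regressed rather than stopping at the first one.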
Cost Optimization
Each evaluator that uses an LLM judge costs money. For large datasets, costs add up.
Use cheaper judges where possible. GPT-4o-mini is acceptable for low-stakes graders.
Cache results by input hash. If the same example has been evaluated before with the same code, reuse the score.
Sample production traces rather than evaluating all. 10% sampling is usually enough.
Run light suites on PRs and full suites nightly.
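Caching by input hash can be sketched with the standard library. The example assumes an in-memory dict as the cache and a version tag that you bump whenever the judge prompt or model changes; `EVALUATOR_VERSION`, `cache_key`, and `cached_evaluate` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict

EVALUATOR_VERSION = "v1"  # hypothetical tag; bump when the prompt or judge changes


def cache_key(evaluator_name: str, payload: dict) -> str:
    """Hash the evaluator name, version, and inputs into a stable key."""
    canonical = json.dumps(
        {"evaluator": evaluator_name, "version": EVALUATOR_VERSION, "payload": payload},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_evaluate(
    cache: Dict[str, dict],
    evaluator_name: str,
    payload: dict,
    evaluate_fn: Callable[[dict], dict],
) -> dict:
    """Call evaluate_fn only when this exact evaluation has not been seen before."""
    key = cache_key(evaluator_name, payload)
    if key not in cache:
        cache[key] = evaluate_fn(payload)
    return cache[key]
```

A typical call passes a closure around the real evaluator, for example `lambda p: evaluator.evaluate_strings(**p)`. Including the version in the key is what keeps stale scores from surviving a prompt change.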
Common Pitfalls
Inconsistent judge models. The default judge for criteria evaluators varies by LangChain version. Pin the judge explicitly for reproducibility.
Treating embedding distance as ground truth. Embeddings capture meaning, not correctness. A confident lie can have high embedding similarity to the truth.
Using the unlabeled criteria evaluator to judge correctness. Without a reference, the judge falls back to its own knowledge, which may be wrong. When you have ground truth, use labeled_criteria and pass a reference.
Forgetting to filter LangSmith experiments. Once you run many experiments, the dashboard gets noisy. Use tags to filter.
Migrating from Other Frameworks
If you currently use Ragas or OpenAI Evals and want to move to LangChain evaluators:
Datasets convert easily. Ragas datasets are HF Datasets; OpenAI Evals datasets are JSONL. Both map to LangSmith datasets.
Graders need to be reimplemented. Most have direct equivalents. Custom graders need to be ported to the StringEvaluator interface.
Production monitoring is more turnkey with LangSmith. If you previously exported scores to your own dashboard, LangSmith may save you the work.
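The dataset step of a migration is mostly a reshaping exercise. The sketch below turns JSONL records into input/output pairs ready to be uploaded as dataset examples. The source field names ("input", "ideal") and the target keys ("question", "answer") are assumptions for illustration; adjust them to match your files and your chain.

```python
import json
from typing import Iterable, List


def jsonl_to_examples(
    lines: Iterable[str],
    input_key: str = "input",   # assumed source field name
    output_key: str = "ideal",  # assumed source field name
) -> List[dict]:
    """Convert JSONL records into dicts with separate inputs and outputs."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        examples.append({
            "inputs": {"question": record[input_key]},
            "outputs": {"answer": record[output_key]},
        })
    return examples


lines = [
    '{"input": "What is the capital of France?", "ideal": "Paris"}',
    "",
    '{"input": "What is 2 + 2?", "ideal": "4"}',
]
examples = jsonl_to_examples(lines)
print(examples)
```

Keeping the conversion as a pure function makes it easy to unit test before any data is uploaded.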
Further Resources
- LangChain evaluator documentation.
- LangSmith dashboard tour at /blog (LangSmith Evaluation Platform Guide).
- Browse LLM evaluation skills at /skills.
Conclusion
LangChain evaluators are the path of least resistance for LangChain users. Three evaluator categories, dozens of built-in types, and seamless LangSmith integration cover most evaluation needs without leaving the ecosystem. Start with the criteria evaluator on a small dataset, layer in pairwise comparisons as you iterate on prompts, and configure online evaluators when you ship to production. For deeper resources, browse /skills and the /blog.