Arize Phoenix LLM Evaluation Complete Guide 2026
Master Arize Phoenix for LLM observability and evaluation. Tracing, datasets, evaluations, RAG metrics, embeddings analysis, and self-hosting in 2026.
Arize Phoenix LLM Evaluation Complete Guide 2026
Arize Phoenix is the open-source LLM observability platform that runs locally on your laptop or in your cloud. By 2026 it has become the go-to choice for teams that want LangSmith-level features without sending data to a third party. Phoenix includes tracing, dataset management, evaluations, RAG-specific metrics, and embeddings analysis. The platform is built by Arize AI, which also offers a hosted enterprise product, but Phoenix itself is free, self-hostable, and powerful enough to be the only platform many teams need.
This guide covers Phoenix end to end: installation, OpenTelemetry-based tracing, dataset creation, running evaluations, RAG metrics, embeddings analysis, and integrating with your CI pipeline. We include Python samples for every step and a setup checklist for new teams. By the end you should be able to spin up Phoenix locally or in your cloud and use it to measure quality across LLM projects. The guide assumes basic Python familiarity and that you already use OpenAI or another LLM provider.
Key Takeaways
- Phoenix is an open-source LLM observability and evaluation platform from Arize AI; it self-hosts in one command.
- Tracing uses OpenTelemetry, so any OTel-compatible LLM library plugs in without modification.
- Built-in evaluators cover faithfulness, relevance, hallucination, toxicity, and custom criteria.
- RAG metrics include context relevance, hallucination detection, and response faithfulness.
- Embeddings analysis surfaces clusters and outliers in your input distribution.
- For teams that need self-hosted observability and evaluation without subscription cost, Phoenix is the strongest option.
Why Phoenix
The Arize Phoenix sweet spot is teams that want LangSmith-level functionality without:
Sending data to a third party.
Paying per-trace pricing at scale.
Committing to a specific LLM framework.
Phoenix runs on your laptop for development and in your cloud for production. It uses OpenTelemetry for tracing, so any instrumented LLM library works. The evaluation suite covers the common cases and is extensible.
Compared to other open-source options (Langfuse, Helicone), Phoenix leans more into evaluation. Compared to LangSmith and Weave, Phoenix is self-hosted and open source.
Installation
pip install arize-phoenix arize-phoenix-evals openai
Launch Phoenix in a notebook or as a standalone service.
import phoenix as px
px.launch_app()
This opens the Phoenix UI in your browser (default port 6006). For production, run Phoenix as a Docker container.
docker run -p 6006:6006 -i -t arizephoenix/phoenix
The container runs Phoenix on port 6006. Point your applications at it for tracing.
Tracing
Phoenix uses OpenTelemetry-based tracing. Instrument your LLM calls with the Phoenix instrumenters.
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from phoenix.otel import register
tracer_provider = register(project_name="my-llm-app")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
The instrumenter wraps the OpenAI client so every call is traced. Traces flow to the Phoenix UI.
For other providers (Anthropic, Cohere, Mistral), use the matching instrumenter from the openinference package.
For LangChain:
from openinference.instrumentation.langchain import LangChainInstrumentor
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
For LlamaIndex:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
The instrumenters cover the common LLM libraries. For custom code, use the OpenTelemetry SDK directly.
Datasets
Phoenix datasets are versioned collections of inputs and expected outputs.
import phoenix as px
import pandas as pd
df = pd.DataFrame({
"question": ["What is the capital of France?", "What is 7 * 8?"],
"expected_answer": ["Paris", "56"],
})
dataset = px.Client().upload_dataset(
dataset_name="qa-v1",
dataframe=df,
input_keys=["question"],
output_keys=["expected_answer"],
)
You can create datasets from traces. Sample 100 production traces, label the good ones, and convert to a dataset.
Evaluations
A Phoenix evaluation runs a function over a dataset and applies evaluators.
from phoenix.experiments import run_experiment
def my_app(input):
return {"answer": call_llm(input["question"])}
def correctness_eval(input, output):
return output["answer"].lower() == input["expected_answer"].lower()
experiment = run_experiment(
dataset=dataset,
task=my_app,
evaluators=[correctness_eval],
experiment_name="prompt-v1",
)
The experiment runs the task on every example, scores the output, and logs to the Phoenix UI. The dashboard shows per-example scores, aggregate metrics, and comparisons across experiments.
Built-in Evaluators
Phoenix ships evaluators for the common LLM quality dimensions.
from phoenix.evals import (
HALLUCINATION_PROMPT_TEMPLATE,
QA_PROMPT_TEMPLATE,
RELEVANCE_PROMPT_TEMPLATE,
TOXICITY_PROMPT_TEMPLATE,
llm_classify,
OpenAIModel,
)
model = OpenAIModel(model="gpt-4o", temperature=0)
# QA correctness evaluator
qa_results = llm_classify(
dataframe=df_with_outputs,
template=QA_PROMPT_TEMPLATE,
model=model,
rails=["correct", "incorrect"],
)
The built-in prompt templates handle the common eval prompts (QA correctness, hallucination, relevance, toxicity). Each runs the judge LLM and returns a label per example.
| Built-in Evaluator | Measures |
|---|---|
| QA_PROMPT_TEMPLATE | Correctness of answer to question |
| RELEVANCE_PROMPT_TEMPLATE | Document relevance to query |
| HALLUCINATION_PROMPT_TEMPLATE | Whether response is grounded |
| TOXICITY_PROMPT_TEMPLATE | Harmful content detection |
| SUMMARIZATION_PROMPT_TEMPLATE | Summary quality |
| HUMAN_VS_AI_PROMPT_TEMPLATE | Preference grading |
RAG-Specific Metrics
Phoenix has built-in support for RAG evaluation.
from phoenix.evals import (
HALLUCINATION_PROMPT_TEMPLATE,
RAG_RELEVANCY_PROMPT_TEMPLATE,
llm_classify,
OpenAIModel,
)
# Evaluate document relevance per query
relevance = llm_classify(
dataframe=df_with_retrievals,
template=RAG_RELEVANCY_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-4o"),
rails=["relevant", "irrelevant"],
)
# Evaluate hallucination per response
hallucination = llm_classify(
dataframe=df_with_responses,
template=HALLUCINATION_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-4o"),
rails=["hallucinated", "factual"],
)
The RAG evaluators integrate with the tracing data. Phoenix can pull retrieval spans from traces and run relevance evaluation automatically.
Embeddings Analysis
Phoenix has an embeddings analysis view that visualizes input distributions, clusters them, and identifies outliers. Useful for understanding what your model is being asked.
from phoenix import Client
client = Client()
client.log_evaluations(...)
The embeddings view shows a 2D projection (UMAP or t-SNE) of your queries colored by quality scores. Clusters of low-quality queries are visible as colored regions. Drill into a cluster to see the examples.
This view is particularly useful for finding systematic failure modes. If a cluster of queries is consistently scoring low, the cluster usually corresponds to a topic your retriever or model handles poorly.
Online Evaluations
For production, run evaluators on a sample of live traces.
from phoenix.evals import run_evals
from phoenix.session.client import Client
client = Client()
traces = client.get_traces(project_name="prod", since="1d")
sample = traces.sample(0.1)
results = run_evals(
dataframe=sample,
evaluators=[hallucination_eval, relevance_eval],
)
client.log_evaluations(results)
Schedule the evaluation script (cron, Airflow) to run periodically. Results feed back into the Phoenix UI for trend tracking.
Self-Hosting
For production, run Phoenix in a Docker container or Kubernetes pod. The configuration is minimal.
# docker-compose.yml
services:
phoenix:
image: arizephoenix/phoenix:latest
ports:
- "6006:6006"
environment:
PHOENIX_WORKING_DIR: /data
volumes:
- phoenix-data:/data
volumes:
phoenix-data:
Persistent storage is required to keep traces and datasets across restarts. Configure backups for production.
Integration with CI
Run experiments on every PR.
# .github/workflows/evals.yml
name: Phoenix Evals
on: [pull_request]
jobs:
evals:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install -r requirements.txt
- run: python scripts/run_evals.py
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The script runs experiments and checks scores against thresholds. Exit code reflects pass/fail.
Comparison to Alternatives
| Platform | Open Source | Self-Host | RAG Metrics | Embeddings Analysis |
|---|---|---|---|---|
| Arize Phoenix | Yes | Yes | Yes | Yes |
| LangSmith | No | Enterprise | Via integrations | No |
| W&B Weave | Partial | Enterprise | Yes | No |
| Helicone | Yes | Yes | Limited | No |
| Langfuse | Yes | Yes | Yes | No |
Phoenix's embeddings analysis is unique among open-source platforms. The RAG-specific evaluators are mature and well-tested.
When to Choose Phoenix
Choose Phoenix if:
You need self-hosted observability and evaluation.
You build RAG and want first-class RAG metrics.
You want to analyze input distributions with embeddings.
You prefer OpenTelemetry standards.
You want to avoid per-trace pricing.
Avoid Phoenix if:
You want a managed service with zero ops.
You need LangChain-specific deep integration.
Your team is not comfortable with Python.
Setup Checklist
Install Phoenix and launch the UI.
Install the appropriate OTel instrumenters for your LLM libraries.
Verify traces appear in the UI.
Create a dataset from 50 production samples.
Run a built-in evaluator (QA, hallucination, relevance).
Inspect results in the experiment view.
Configure online evaluation for production sampling.
Self-host with Docker for production use.
Add the Phoenix URL to your team wiki.
Common Patterns
Pattern 1: laptop development. Phoenix runs locally for fast iteration. Developers see their traces immediately.
Pattern 2: shared team instance. One Phoenix instance for the team, shared dashboards.
Pattern 3: prod monitoring. Phoenix runs in the cloud with online evaluators on 10% of traffic.
Pattern 4: embeddings-driven dataset growth. Identify clusters of failures via embeddings analysis, add representative examples to the dataset, re-evaluate.
Common Pitfalls
Skipping the instrumenters. Manual span creation works but the auto-instrumenters cover more cases and require less code.
Forgetting persistent storage. Phoenix in Docker without a volume loses data on restart.
Untrusted evaluators. Calibrate built-in evaluator prompts on your data before trusting scores.
No backups. Trace and dataset storage is on you when self-hosting. Back up the volume.
Stale datasets. Update datasets from recent production traces; old datasets stop being representative.
Further Resources
- Phoenix documentation at docs.arize.com/phoenix.
- OpenInference instrumenters on GitHub.
- Browse LLM evaluation skills at /skills.
- Related guides on /blog: LangSmith Platform Guide, W&B Weave Guide, Helicone Monitoring Guide.
Conclusion
Arize Phoenix is the leading open-source LLM observability and evaluation platform in 2026. Self-hostable, OpenTelemetry-based, with rich evaluators and unique embeddings analysis. For teams that prioritize self-hosting and want a turnkey eval platform without subscription cost, Phoenix is the strongest choice. Start with local development, scale to cloud deployment, and integrate evaluations into CI. Browse /skills for related tools and the /blog for deeper guides.