AI Testing

2026-05-13

TruLens LLM Evaluation Framework Complete Guide 2026

Master TruLens for LLM evaluation in 2026. Feedback functions, the RAG triad, multi-turn evaluation, dashboards, and integration with LangChain and LlamaIndex.

TruLens is an open-source LLM evaluation framework, originally developed by TruEra and maintained under Snowflake since its 2024 acquisition of TruEra. It is built around feedback functions: programmatic checks that score LLM outputs along quality dimensions. In 2026 it remains a strong fit for teams that want maximum flexibility in defining quality. The framework popularized the RAG triad (context relevance, groundedness, answer relevance), which has become standard vocabulary in RAG evaluation.

This guide covers TruLens end to end: installation, feedback functions, the RAG triad, multi-turn evaluation, integration with LangChain and LlamaIndex, dashboards, and how TruLens compares to Ragas, OpenAI Evals, and other frameworks. Python samples cover every key API. By the end you should be able to instrument any LLM application with TruLens and use the resulting feedback to improve quality. The guide assumes basic Python familiarity and an existing LLM application.

Key Takeaways

TruLens centers on feedback functions, customizable programmatic checks that score LLM outputs.
The RAG triad (context relevance, groundedness, answer relevance) is TruLens's signature methodology.
Feedback functions can use any provider as a judge: OpenAI, Anthropic, HuggingFace, even local models.
The dashboard provides per-record scores, aggregate trends, and the ability to drill into failing examples.
Integration with LangChain, LlamaIndex, and arbitrary Python is supported via context managers.
TruLens excels at custom feedback dimensions; out of the box it has fewer built-in metrics than Ragas.

Why TruLens

TruLens differentiates on flexibility. Other frameworks ship a fixed set of metrics; TruLens ships a feedback function system that lets you define any metric. The same framework that scores faithfulness for a RAG app can score code quality for a code-gen app or accuracy for a math tutor.

This flexibility is a tradeoff. Teams that want plug-and-play metrics may prefer Ragas (for RAG specifically) or LangChain evaluators (for LangChain integration). Teams that want to define custom quality dimensions in Python prefer TruLens.

The RAG triad popularized by TruLens has become standard. Even teams that do not use TruLens often think in those terms: context relevance, groundedness, answer relevance.

Installation

pip install trulens openai
export OPENAI_API_KEY=sk-...

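Providers and app integrations ship as separate packages. The examples in this guide need the OpenAI provider and the LangChain integration; install the others only if you use them:

pip install trulens-providers-openai trulens-apps-langchain
pip install trulens-providers-huggingface trulens-providers-cortex trulens-providers-litellm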

Initialize TruLens at the start of your application.

from trulens.core import TruSession

session = TruSession()

This creates a local SQLite database to store records and feedback. For production, configure a Postgres backend.

Feedback Functions

A feedback function is a Python function that takes inputs and returns a score. TruLens provides a standard library plus a system for custom functions.

from trulens.core.feedback import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

f_relevance = Feedback(provider.relevance).on_input_output()

The Feedback object wraps a callable. The on_input_output() chain tells TruLens to apply the function to the input and output of each record.

Custom feedback functions:

def has_citation(output: str) -> float:
    return 1.0 if "[" in output and "]" in output else 0.0

f_citation = Feedback(has_citation).on_output()

Any function that returns a float between 0 and 1 is a valid feedback function. TruLens calls the function on every record and stores the score.
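A custom feedback function does not have to be binary. The sketch below is a graded variant of the citation check: it returns the fraction of sentences that carry a bracketed numeric citation. The name citation_coverage, the [n] citation style, and the sentence-splitting rule are all assumptions made for illustration, not part of TruLens.

```python
import re

# Assumption: citations look like [1] or [12]; adjust the pattern to your format.
_CITATION = re.compile(r"\[\d+\]")


def citation_coverage(output: str) -> float:
    """Return the fraction of sentences containing a [n]-style citation."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if _CITATION.search(s))
    return cited / len(sentences)
```

Because it takes the output string and returns a float between 0 and 1, it can be wrapped the same way as has_citation, with Feedback(citation_coverage).on_output().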

The RAG Triad

The RAG triad measures three dimensions of RAG quality.

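import numpy as np

from trulens.apps.langchain import TruChain
from trulens.core.feedback import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

# Selector for the retrieved chunks. Assumes my_langchain_chain is your
# LangChain RAG chain (the same object wrapped in the Recording section).
context_selector = TruChain.select_context(my_langchain_chain)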

# Context relevance: are retrieved chunks relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context_selector)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the context?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selector.collect())
    .on_output()
)

# Answer relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)

feedbacks = [f_context_relevance, f_groundedness, f_answer_relevance]

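The context_selector is a selector, similar to a JSON path, that points TruLens at the retrieved chunks within your application's recorded data. For LangChain apps, TruChain.select_context(chain) builds one for you. TruLens also supports more complex selectors for nested data.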

The triad covers the three main failure modes of a RAG system: bad retrieval, ungrounded generation, and irrelevant answers. A low score on one dimension tells you which stage of the pipeline to fix first.
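One way to act on triad scores is to map them to the pipeline stage most likely at fault. The following is a minimal sketch of that triage logic, not a TruLens API: the function name diagnose_rag and the 0.7 threshold are hypothetical, and retrieval is checked first on the assumption that upstream failures tend to cause the downstream ones.

```python
def diagnose_rag(
    context_relevance: float,
    groundedness: float,
    answer_relevance: float,
    threshold: float = 0.7,  # hypothetical cut-off; tune against your own data
) -> str:
    """Name the first pipeline stage whose triad score falls below threshold."""
    # Check retrieval first: irrelevant context usually degrades the later stages.
    if context_relevance < threshold:
        return "retrieval"
    if groundedness < threshold:
        return "grounding"
    if answer_relevance < threshold:
        return "answer relevance"
    return "ok"
```

A record that scores well on context relevance but poorly on groundedness points at the generation step rather than the retriever.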

Recording

Wrap your application to record inputs, outputs, and intermediate state.

from trulens.apps.langchain import TruChain

tru_chain = TruChain(
    my_langchain_chain,
    app_id="customer-support-v1",
    feedbacks=feedbacks,
)

with tru_chain as recording:
    response = my_langchain_chain.invoke({"question": "How do I rotate my API key?"})

The recording captures the question, the retrieved contexts, and the generated answer, then applies all feedback functions. Scores are stored in the session database and appear in the TruLens dashboard.

For non-LangChain applications, use TruCustomApp or TruBasicApp.

from trulens.apps.custom import TruCustomApp, instrument

class MyApp:
    @instrument
    def retrieve(self, query: str) -> list[str]:
        ...

    @instrument
    def generate(self, query: str, contexts: list[str]) -> str:
        ...

    @instrument
    def query(self, q: str) -> str:
        contexts = self.retrieve(q)
        return self.generate(q, contexts)

app = MyApp()
tru_app = TruCustomApp(app, app_id="custom-v1", feedbacks=feedbacks)

with tru_app as recording:
    response = app.query("How do I rotate my API key?")

Dashboard

Launch the TruLens dashboard.

from trulens.dashboard import run_dashboard

run_dashboard()

The dashboard is a Streamlit app that opens in your browser. run_dashboard() prints the URL it is serving on; pass port=... to pin a specific port. Views include:

App leaderboard: compare all your apps by feedback scores.

Records: list of all recorded interactions with scores per feedback.

Evaluation: drill into a single record to see the full trace and feedback scoring.

Quality vs cost: scatterplot of feedback scores against latency and cost.

The dashboard is the primary review interface. A common practice is to link it from the team wiki and check it on a regular schedule.

Multi-Turn Evaluation

For multi-turn conversations and agents, TruLens records the full trajectory.

from trulens.apps.custom import TruCustomApp, instrument

class ChatApp:
    @instrument
    def chat(self, message: str, history: list[dict]) -> str:
        # ... call your model with the history plus the new message
        response = ...
        return response

# Each turn is recorded separately but linked by app_id

The dashboard shows multi-turn conversations as threads. Feedback applies per turn; you can also define aggregate feedback that runs over the full conversation.
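Conversation-level aggregation can be sketched in plain Python, independent of how the per-turn scores were produced. The helper below is illustrative only: conversation_score is a hypothetical name, and the choice of mean, worst turn, and final turn is an assumption about which summaries are useful.

```python
from statistics import mean


def conversation_score(turn_scores: list[float]) -> dict[str, float]:
    """Summarize per-turn feedback scores for one conversation."""
    if not turn_scores:
        raise ValueError("need at least one turn score")
    return {
        "mean": mean(turn_scores),    # overall quality across turns
        "worst": min(turn_scores),    # a single bad turn can sink a conversation
        "final": turn_scores[-1],     # did the conversation end well?
    }
```

Tracking the worst turn alongside the mean helps catch conversations that look fine on average but contain one badly handled message.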

Providers

TruLens supports many providers for the judge model.

from trulens.providers.openai import OpenAI
from trulens.providers.huggingface import Huggingface
from trulens.providers.cortex import Cortex
from trulens.providers.litellm import LiteLLM

openai_provider = OpenAI(model_engine="gpt-4o")
hf_provider = Huggingface()
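# Cortex needs an existing Snowflake session (snowpark_session here);
# the exact argument has changed between releases, so check your version.
cortex_provider = Cortex(snowpark_session, model_engine="llama3.1-70b")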
litellm_provider = LiteLLM(model_engine="claude-3-5-sonnet-20240620")  # LiteLLM model identifier

The LiteLLM provider gives access to dozens of LLMs through a single interface. Use a strong judge (GPT-4 class or equivalent) for reliable feedback.

Provider	Strength
OpenAI	Default, well-tested
Anthropic via LiteLLM	Equivalent to OpenAI
Huggingface	Open-source judges
Cortex (Snowflake)	Enterprise data isolation
AWS Bedrock	AWS-native

Aggregating Feedback

When a feedback function runs several times within one record, for example once per retrieved chunk, aggregate the scores by mean, median, percentile, or a custom aggregator.

import numpy as np

f_context_relevance = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context_selector)  # no .collect(): one score per retrieved chunk
    .aggregate(np.mean)    # combine per-chunk scores into one per-query score
)

Aggregation matters for retrieval-related feedback because each query has multiple retrieved chunks. The aggregator decides how to combine per-chunk scores into a single per-query score.
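The choice of aggregator changes what the per-query score means. The sketch below compares a few options on one set of per-chunk scores. The scores are made up, top_k_mean is a hypothetical helper, and it assumes an aggregator is any callable that reduces a sequence of floats to a single float, as np.mean does.

```python
from statistics import mean

# Hypothetical per-chunk context relevance scores for a single query.
chunk_scores = [0.9, 0.8, 0.1]


def top_k_mean(scores, k=2):
    """Average the k highest scores, ignoring the low-scoring tail."""
    top = sorted(scores, reverse=True)[:k]
    return mean(top)


summary = {
    "mean": mean(chunk_scores),          # penalizes every irrelevant chunk
    "min": min(chunk_scores),            # strict: the worst chunk sets the score
    "max": max(chunk_scores),            # lenient: was any chunk relevant?
    "top_2_mean": top_k_mean(chunk_scores, k=2),
}
```

Mean is a reasonable default when irrelevant chunks cost tokens and risk distracting the model; max or a top-k mean fits better when the generator is robust to a noisy tail.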

Comparison to Alternatives

Framework	Custom Feedback	Built-in Metrics	Dashboard	LangChain Integration
TruLens	Yes (core feature)	Limited	Yes	Native
Ragas	Custom metrics	Comprehensive	Via export	Adapter
OpenAI Evals	Yes	Comprehensive	Yes	Adapter
Phoenix	Yes	Comprehensive	Yes	Yes
LangChain Evaluators	Yes	Comprehensive	Via LangSmith	Native

TruLens wins on flexibility. Ragas wins on out-of-the-box RAG metrics. Both can complement each other.

When to Choose TruLens

Choose TruLens if:

You want maximum flexibility in defining quality.

The RAG triad methodology fits your application.

You build on LangChain or LlamaIndex.

You want a self-hosted, open-source platform.

Avoid TruLens if:

You want ready-to-use RAG metrics with no configuration; Ragas is faster to start.

You build agents with complex tool calls; OpenAI Evals has better agent support.

You need a managed service; TruLens is open source and self-hosted only.

Production Setup

For production, configure persistent storage and run the dashboard as a service.

from trulens.core import TruSession

session = TruSession(
    database_url="postgresql://user:pass@host/trulens_db",
)

The Postgres backend supports concurrent writes and survives restarts.
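A small helper can keep the database choice out of application code and guard against running production on SQLite by accident. This is a sketch under stated assumptions: the environment variable names TRULENS_DATABASE_URL and APP_ENV are hypothetical (TruLens does not read them itself), and the SQLite default is only a placeholder for local development.

```python
import os

# Assumption: a local SQLite file is acceptable for development only.
DEFAULT_URL = "sqlite:///default.sqlite"


def resolve_database_url(env=None) -> str:
    """Pick the TruLens database URL from environment-style settings."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines.
    url = env.get("TRULENS_DATABASE_URL", DEFAULT_URL)
    if env.get("APP_ENV") == "production" and url.startswith("sqlite"):
        raise RuntimeError("production requires a non-SQLite database_url")
    return url
```

The result can be passed straight through as TruSession(database_url=resolve_database_url()), which also gives dev and prod separate databases by construction.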

Run the dashboard as a long-running process behind a reverse proxy. Configure authentication via your proxy.

CI Integration

Run feedback evaluation in CI on every PR.

# test_quality.py
from trulens.core import TruSession

THRESHOLDS = {"Context Relevance": 0.7, "Groundedness": 0.85, "Answer Relevance": 0.8}

def test_quality(tmp_path):
    # database_url takes a SQLAlchemy URL; use a throwaway SQLite file per run.
    session = TruSession(database_url=f"sqlite:///{tmp_path / 'trulens.sqlite'}")
    # ... run app with TruLens
    leaderboard = session.get_leaderboard(app_ids=["customer-support-v1"])
    for metric, threshold in THRESHOLDS.items():
        assert leaderboard[metric].mean() >= threshold, metric

The temporary database keeps CI runs out of your production TruLens DB, and each run starts from a fresh file. Make sure feedback evaluation has finished before reading the leaderboard; otherwise scores may be missing.
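A bare assert stops at the first failing metric and reports little context. The helper below is one possible way to collect every threshold violation into readable messages. The name failing_metrics is hypothetical, and it assumes you have already reduced the leaderboard to a plain dict of metric name to mean score.

```python
def failing_metrics(
    scores: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        if metric not in scores:
            # A missing metric usually means the feedback never ran.
            failures.append(f"{metric}: missing")
        elif scores[metric] < minimum:
            failures.append(f"{metric}: {scores[metric]:.2f} < {minimum:.2f}")
    return failures
```

In a test, assert not failing_metrics(scores, THRESHOLDS) then prints the full list of regressions in the failure output instead of only the first one.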

Common Patterns

Pattern 1: RAG triad as default. Every RAG app starts with the three triad feedbacks. Add custom feedbacks as needed.

Pattern 2: per-app feedback. Different apps need different feedbacks. A summarization app needs conciseness; a translation app needs fluency.

Pattern 3: feedback evolution. Start with a few feedbacks; add more as you find failure modes.

Pattern 4: dashboard-driven reviews. Weekly team review of the dashboard. Top failing records get triaged.

Common Pitfalls

Weak judge. Using GPT-3.5 or weaker for feedback judges produces noisy scores. Use GPT-4 class.

Vague feedback definitions. A feedback prompt like "is the response good?" produces noise. Define criteria.

Untrusted scores. Calibrate feedback functions against human judgments before trusting.

Ignoring the dashboard. Data is collected but if nobody looks, it does not help.

Mixing dev and prod. Separate sessions for dev and prod; mixing pollutes both.
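Calibrating a feedback function against human judgments can start with a simple agreement rate. The sketch below is a minimal version under stated assumptions: judge_agreement is a hypothetical helper, human labels are pass/fail booleans, and the 0.5 cut-off for turning a judge score into pass/fail is arbitrary.

```python
def judge_agreement(
    judge_scores: list[float],
    human_labels: list[bool],
    threshold: float = 0.5,  # hypothetical pass/fail cut-off for judge scores
) -> float:
    """Fraction of records where the thresholded judge score matches the human label."""
    if not judge_scores or len(judge_scores) != len(human_labels):
        raise ValueError("need equally sized, non-empty score and label lists")
    matches = sum(
        (score >= threshold) == bool(label)
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)
```

If agreement on a labeled sample is low, tighten the feedback criteria or switch to a stronger judge before relying on the scores.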

Migrating from Other Frameworks

If you currently use Ragas and want to move to TruLens:

The RAG triad in TruLens covers similar ground to Ragas's faithfulness, context precision, and answer relevancy metrics.

Custom Ragas metrics rewrite as TruLens feedback functions.

Dataset formats convert easily.

If you currently use TruLens and want to move to Ragas:

Ragas has more built-in RAG metrics out of the box.

Custom TruLens feedbacks become Ragas custom metrics.

The migration is straightforward for standard RAG metrics; custom feedbacks need rewriting.

Further Resources

TruLens documentation and examples.
Ragas comparison at /blog (LLM Evals Comparison guide).
Browse LLM evaluation skills at /skills.

Conclusion

TruLens is the framework of choice when flexibility matters more than out-of-the-box metrics. The feedback function system lets you define any quality dimension, the RAG triad provides a starting methodology, and the dashboard makes results visible to the team. For teams committed to LangChain or LlamaIndex who want to define custom quality dimensions, TruLens is a strong choice. Browse /skills for related evaluation tools and the /blog for deeper guides.