Skills Hub

Best LLM Evaluation Skills 2026

Curated skills for evaluating LLM-powered applications: OpenAI Evals, Promptfoo, Ragas, DeepEval, LangSmith, LangChain Evaluators, Helicone, Arize Phoenix, TruLens, Evidently AI, Weights & Biases. Install with a single command for Claude Code, Cursor, Copilot, and 30+ agents.

Install in 5 seconds:

# Install LLM eval skills
npx @qaskills/cli add promptfoo-llm-evals
npx @qaskills/cli add openai-evals
npx @qaskills/cli add ragas-rag-evals

Multi-provider

Promptfoo, OpenAI Evals, Ragas, DeepEval — covers OpenAI, Anthropic, local models.

RAG-ready

Context precision, recall, faithfulness, answer relevance metrics built in.

Red-teaming

Promptfoo red-team skill covers jailbreaks, prompt injection, PII leakage.

Top 7 Skills

Ranked by install count. All quality-scored 0-100.

RAG Regression Testing

by thetestingacademy

Gate RAG pipelines in CI with versioned golden eval sets, per-metric thresholds, baseline drift detection, and a build that fails when retrieval or answer quality regresses.

Promptfoo LLM Red Teaming

by thetestingacademy

Evaluate and red-team LLM applications with promptfoo, declarative YAML evals, assertions, model comparisons, and automated adversarial scans for prompt injection, jailbreaks, PII leaks, and unsafe outputs in CI.

Langfuse LLM Observability Testing

by thetestingacademy

Instrument LLM apps with Langfuse tracing, then use traces, scores, and datasets to test in production, run evaluations on real traffic, catch regressions, and close the loop from incident to golden dataset.

Ragas RAG Evaluation

by thetestingacademy

Evaluate RAG pipelines with Ragas, measuring faithfulness, answer relevancy, context precision and recall, building golden datasets, and wiring threshold gates into CI for retrieval regressions.

OpenAI Evals Trace Grading

by thetestingacademy

Grade LLM and agent traces with OpenAI Evals - build datasets, configure string/python/model graders, run eval suites, and gate agent behavior changes in CI.

DeepEval LLM Evaluation

by thetestingacademy

Test LLM applications with DeepEval, pytest-style unit tests for LLM outputs using G-Eval, answer relevancy, faithfulness, hallucination and custom metrics, with CI quality gates and dataset-driven regression runs.

RAG Evaluation Metrics

by thetestingacademy

Measure RAG pipeline quality with context precision/recall, faithfulness, answer relevancy, and groundedness using Ragas and DeepEval, with golden datasets and pass/fail thresholds.

Deep-Dive Articles

3000+ word references for each topic.

OpenAI Evals Complete Guide 2026 OpenAI Evals Best Practices Promptfoo Complete Guide 2026 Promptfoo Red Teaming LLM Applications Ragas RAG Evaluation Metrics Ragas Context Precision/Recall/Faithfulness LLM Evals: OpenAI vs Promptfoo vs Ragas DeepEval pytest LLM Testing LangSmith Evaluation Platform Guide LangChain Evaluators Complete Guide Arize Phoenix LLM Evaluation TruLens LLM Evaluation Framework

Frequently Asked Questions

What is LLM evaluation?

Measuring how well an LLM-powered application performs across correctness, faithfulness, factuality, safety, latency, cost. Critical for production LLM apps.

OpenAI Evals or Promptfoo?

Promptfoo for multi-provider + CI + red-teaming. OpenAI Evals for OpenAI-only Python research workflows.

What is Ragas?

RAG-focused eval library. Measures context precision, context recall, faithfulness, answer relevance for retrieval-augmented generation pipelines.

Red teaming?

Adversarial testing — jailbreaks, prompt injection, PII leakage, hallucination probes. Promptfoo has first-class red-team support.

How do skills install?

Run `npx @qaskills/cli add <skill-name>`. CLI detects your agent and writes the skill to the correct path.

Free?

Yes — MIT licensed skills. The underlying tools (Promptfoo, Ragas, DeepEval) are also OSS.

Pytest integration?

DeepEval is pytest-style natively. Ragas and Promptfoo can run inside pytest with a custom fixture.

Ready to ship better tests?

Install your first skill in 5 seconds. Browse all 500+ skills or jump straight into the recommended starter.

Browse 500+ Skills Start with Promptfoo LLM Evals