Best LLM Evaluation Skills 2026
Curated skills for evaluating LLM-powered applications: OpenAI Evals, Promptfoo, Ragas, DeepEval, LangSmith, LangChain Evaluators, Helicone, Arize Phoenix, TruLens, Evidently AI, Weights & Biases. Install with a single command for Claude Code, Cursor, Copilot, and 30+ agents.
Install in 5 seconds:
# Install LLM eval skills
npx @qaskills/cli add promptfoo-llm-evals
npx @qaskills/cli add openai-evals
npx @qaskills/cli add ragas-rag-evalsMulti-provider
Promptfoo, OpenAI Evals, Ragas, DeepEval — covers OpenAI, Anthropic, local models.
RAG-ready
Context precision, recall, faithfulness, answer relevance metrics built in.
Red-teaming
Promptfoo red-team skill covers jailbreaks, prompt injection, PII leakage.
Top 3 Skills
Ranked by install count. All quality-scored 0-100.
RAG Regression Testing
by thetestingacademy
Gate RAG pipelines in CI with versioned golden eval sets, per-metric thresholds, baseline drift detection, and a build that fails when retrieval or answer quality regresses.
OpenAI Evals Trace Grading
by thetestingacademy
Grade LLM and agent traces with OpenAI Evals - build datasets, configure string/python/model graders, run eval suites, and gate agent behavior changes in CI.
RAG Evaluation Metrics
by thetestingacademy
Measure RAG pipeline quality with context precision/recall, faithfulness, answer relevancy, and groundedness using Ragas and DeepEval, with golden datasets and pass/fail thresholds.
Deep-Dive Articles
3000+ word references for each topic.
Frequently Asked Questions
What is LLM evaluation?
Measuring how well an LLM-powered application performs across correctness, faithfulness, factuality, safety, latency, cost. Critical for production LLM apps.
OpenAI Evals or Promptfoo?
Promptfoo for multi-provider + CI + red-teaming. OpenAI Evals for OpenAI-only Python research workflows.
What is Ragas?
RAG-focused eval library. Measures context precision, context recall, faithfulness, answer relevance for retrieval-augmented generation pipelines.
Red teaming?
Adversarial testing — jailbreaks, prompt injection, PII leakage, hallucination probes. Promptfoo has first-class red-team support.
How do skills install?
Run `npx @qaskills/cli add <skill-name>`. CLI detects your agent and writes the skill to the correct path.
Free?
Yes — MIT licensed skills. The underlying tools (Promptfoo, Ragas, DeepEval) are also OSS.
Pytest integration?
DeepEval is pytest-style natively. Ragas and Promptfoo can run inside pytest with a custom fixture.
Ready to ship better tests?
Install your first skill in 5 seconds. Browse all 500+ skills or jump straight into the recommended starter.