Best LLM Evaluation Skills 2026
Curated skills for evaluating LLM-powered applications: OpenAI Evals, Promptfoo, Ragas, DeepEval, LangSmith, LangChain Evaluators, Helicone, Arize Phoenix, TruLens, Evidently AI, Weights & Biases. Install with a single command for Claude Code, Cursor, Copilot, and 30+ agents.
Install in 5 seconds:
# Install LLM eval skills
npx @qaskills/cli add promptfoo-llm-evals
npx @qaskills/cli add openai-evals
npx @qaskills/cli add ragas-rag-evalsMulti-provider
Promptfoo, OpenAI Evals, Ragas, DeepEval — covers OpenAI, Anthropic, local models.
RAG-ready
Context precision, recall, faithfulness, answer relevance metrics built in.
Red-teaming
Promptfoo red-team skill covers jailbreaks, prompt injection, PII leakage.
Top 0 Skills
Ranked by install count. All quality-scored 0-100.
No matching skills yet. Browse all skills or publish the first one.
Deep-Dive Articles
3000+ word references for each topic.
Frequently Asked Questions
What is LLM evaluation?
Measuring how well an LLM-powered application performs across correctness, faithfulness, factuality, safety, latency, cost. Critical for production LLM apps.
OpenAI Evals or Promptfoo?
Promptfoo for multi-provider + CI + red-teaming. OpenAI Evals for OpenAI-only Python research workflows.
What is Ragas?
RAG-focused eval library. Measures context precision, context recall, faithfulness, answer relevance for retrieval-augmented generation pipelines.
Red teaming?
Adversarial testing — jailbreaks, prompt injection, PII leakage, hallucination probes. Promptfoo has first-class red-team support.
How do skills install?
Run `npx @qaskills/cli add <skill-name>`. CLI detects your agent and writes the skill to the correct path.
Free?
Yes — MIT licensed skills. The underlying tools (Promptfoo, Ragas, DeepEval) are also OSS.
Pytest integration?
DeepEval is pytest-style natively. Ragas and Promptfoo can run inside pytest with a custom fixture.
Ready to ship better tests?
Install your first skill in 5 seconds. Browse all 500+ skills or jump straight into the recommended starter.