Skip to main content
Skills Hub

Best LLM Evaluation Skills 2026

Curated skills for evaluating LLM-powered applications: OpenAI Evals, Promptfoo, Ragas, DeepEval, LangSmith, LangChain Evaluators, Helicone, Arize Phoenix, TruLens, Evidently AI, Weights & Biases. Install with a single command for Claude Code, Cursor, Copilot, and 30+ agents.

Install in 5 seconds:

# Install LLM eval skills
npx @qaskills/cli add promptfoo-llm-evals
npx @qaskills/cli add openai-evals
npx @qaskills/cli add ragas-rag-evals

Multi-provider

Promptfoo, OpenAI Evals, Ragas, DeepEval — covers OpenAI, Anthropic, local models.

RAG-ready

Context precision, recall, faithfulness, answer relevance metrics built in.

Red-teaming

Promptfoo red-team skill covers jailbreaks, prompt injection, PII leakage.

Top 0 Skills

Ranked by install count. All quality-scored 0-100.

No matching skills yet. Browse all skills or publish the first one.

Frequently Asked Questions

What is LLM evaluation?

Measuring how well an LLM-powered application performs across correctness, faithfulness, factuality, safety, latency, cost. Critical for production LLM apps.

OpenAI Evals or Promptfoo?

Promptfoo for multi-provider + CI + red-teaming. OpenAI Evals for OpenAI-only Python research workflows.

What is Ragas?

RAG-focused eval library. Measures context precision, context recall, faithfulness, answer relevance for retrieval-augmented generation pipelines.

Red teaming?

Adversarial testing — jailbreaks, prompt injection, PII leakage, hallucination probes. Promptfoo has first-class red-team support.

How do skills install?

Run `npx @qaskills/cli add <skill-name>`. CLI detects your agent and writes the skill to the correct path.

Free?

Yes — MIT licensed skills. The underlying tools (Promptfoo, Ragas, DeepEval) are also OSS.

Pytest integration?

DeepEval is pytest-style natively. Ragas and Promptfoo can run inside pytest with a custom fixture.

Ready to ship better tests?

Install your first skill in 5 seconds. Browse all 500+ skills or jump straight into the recommended starter.