AI Evals

2026-06-15

GAIA Benchmark Explained: AI Agent Evaluation (2026)

Q: How do I access and run the GAIA dataset?

GAIA is openly available but gated on Hugging Face at gaia-benchmark/GAIA. You log in with huggingface-cli login, accept the dataset terms, then load it with the datasets library. Iterate over items, download any attached file referenced by file_name, run your own tool-using agent to produce a final answer string, and score it against the validation answers with the GAIA exact-match scorer.

The GAIA benchmark explained for 2026: real-world questions easy for humans but hard for AI. Difficulty levels, exact-match scoring, and how to run it.

GAIA Benchmark Explained: AI Agent Evaluation (2026)

GAIA (General AI Assistants) is a benchmark for testing whether an AI assistant can answer real-world questions that are conceptually simple for humans but hard for AI. Each question has a single, unambiguous factual answer, yet solving it usually requires a combination of capabilities — multi-step reasoning, web browsing, reading attached files (images, spreadsheets, PDFs, audio), tool use, and code execution. Answers are scored by exact match after normalization, so grading is cheap and objective. The headline result is the gap: humans score around 92% while a plain LLM without tools scores far lower. Closing that gap is the whole point.

This guide explains GAIA's design philosophy, the structure of a GAIA item, the three difficulty levels, exact-match scoring (and why answer format matters), the validation-versus-test split, and how to run the benchmark against your own agent. GAIA's dataset and leaderboard evolve, so treat the code here as accurate usage patterns and confirm exact field names and the current leaderboard rules against the official GAIA paper and dataset card before pinning anything.

What GAIA actually measures

Most LLM benchmarks chase difficulty by asking questions that are hard for humans too — graduate math, obscure trivia, expert exams. GAIA deliberately inverts that. It was introduced by researchers from Meta-FAIR, Hugging Face, and the AutoGPT community with a different thesis: a benchmark that is "conceptually simple for humans yet challenging for most advanced AIs" is a better measure of a general assistant, because the things humans find trivial (look something up, open the attached file, do the arithmetic, check a second source) are exactly the things current models still fumble.

A GAIA question reads like something you'd actually ask an assistant. "In the attached spreadsheet, what was the total revenue for the products that shipped in Q3?" "How many studio albums did this artist release between two dates, according to a specific source?" A capable person solves these in a minute or two with a browser and the file. A bare language model — no browsing, no file reading, no calculator — often can't, because the answer isn't in its weights; it's in a live artifact it has to go fetch and process.

That design has two consequences worth internalizing:

It is resistant to memorization. Because answers depend on reading real files and querying live sources through tools, you can't ace GAIA by having seen the answer during pretraining. The benchmark tests the act of finding and assembling the answer, not recall.
It rewards real agents, not raw models. A high GAIA score requires a scaffold with browsing, file reading, and code execution wired up. The model is necessary but not sufficient. For a conceptual primer on how agent scoring works across tools, see our AI agent evaluation overview.

The human-versus-AI gap

The original GAIA paper reported that human respondents score about 92% on the benchmark, while GPT-4 equipped with plugins scored roughly 15% at the time of release. That spread is the headline finding, and it is intentional. The questions are designed so that a competent human nearly always succeeds, which means any score well below human level is a clean signal of missing assistant capability — usually a tool the agent didn't use, a file it didn't read, or a step it skipped.

Two cautions when you cite numbers. First, the ~92% human figure is approximate and comes from the paper; treat it as a reference point, not a precise constant. Second, model-and-scaffold scores have moved a lot since release and live on the public leaderboard — so read the current leaderboard rather than quoting a stale figure for "GPT-4" or any specific system. The methodology is the durable part; the scoreboard changes weekly.

Anatomy of a GAIA item

Every GAIA task is small and rigid by design. A single item has three parts:

A question — a natural-language prompt with one correct, factual answer.
Optional attached file(s) — referenced by a file_name field. The attachment can be an image, a spreadsheet (XLSX/CSV), a PDF, an audio clip, code, or another artifact you must download and read to answer.
A single ground-truth answer string — short and unambiguous: a number, a name, a date, a short phrase, or a small list. There is no essay, no rubric, no partial credit.

In the dataset, items are stored as JSON Lines (historically metadata.jsonl, with Parquet also available in recent revisions). A simplified record looks like this:

{
  "task_id": "c61d22de-5f6c-4958-a7f6-5e9707bd3466",
  "Question": "In the attached spreadsheet, what is the total revenue for products that shipped in Q3? Express your answer in USD with no symbols or commas.",
  "Level": 2,
  "file_name": "q3_orders.xlsx",
  "Final answer": "184250"
}

The shape matters: a Question, a difficulty Level, a file_name that may be empty, and a Final answer string present in the validation split but withheld in the test split. Your agent's job is to produce a final answer string that matches Final answer exactly under normalization.

The three difficulty levels

GAIA sorts every question into one of three levels by how many steps and tools a competent solver needs. The levels are the most useful lens for reading results, because they separate "the model can follow one instruction" from "the agent can run a long, multi-tool plan."

Level	Steps	Tools required	Feels like	Example shape
Level 1	Few (often one)	Usually none or a single tool	A quick lookup or single-file read	"Read this image and report the value on the gauge."
Level 2	Several	Multiple tools combined	A short research task	"Find a figure on the web, then compute a value from the attached spreadsheet."
Level 3	Many, long-horizon	Long sequences of tools and actions	A real general-assistant task	"Browse multiple sources, reconcile them, read two attachments, and derive a final number."

The progression is the point. Level 1 should be breakable by a very good LLM with minimal tooling — short reasoning, maybe one tool. Level 2 demands chaining several steps and more than one tool, so a model with no browsing or no file reading will start to fail here. Level 3 is the strong jump: long, complex sequences of actions across multiple tools, much closer to what a capable human assistant does over many minutes. Reporting one blended accuracy hides all of this; per-level accuracy tells you where your agent breaks.

Scoring: exact match, and why format wins or loses points

GAIA scores answers with exact match after normalization — sometimes called quasi-exact match. The scorer takes your agent's final answer string and the ground-truth string, normalizes both, and checks equality. There is no LLM-as-judge and no partial credit. This is a deliberate trade: because every answer is a short factoid, a deterministic string comparison is cheap, objective, and reproducible — no grader model, no rubric drift, no flaky judgments.

Normalization typically smooths over differences that don't change the answer's meaning:

Case — Paris vs paris.
Surrounding punctuation and whitespace — trailing periods, stray spaces.
Number formatting — commas and currency symbols ($1,234 vs 1234), so you usually output the bare number.
List formatting — for list answers, the separator and ordering conventions the question specifies.

Here is the crucial, often-missed implication: a correct answer in the wrong format still fails. If the question says "answer in USD with no symbols or commas" and your agent returns $184,250, exact match marks it wrong even though the reasoning was perfect. In practice a large share of agent failures on GAIA are formatting failures, not reasoning failures — the agent found the right value but emitted it in a shape the scorer doesn't accept.

The practical defenses are simple and they belong in your scaffold:

Read the format instruction in the question and obey it literally (units, decimals, symbols, casing).
Have the agent emit a clearly delimited final answer — for example a trailing FINAL ANSWER: <value> line you parse — so you submit the answer and nothing else.
Strip the chatter. Submit the answer string only, never "The answer is 184250 because…".

A conceptual exact-match check, mirroring how the official scorer behaves:

import re

def normalize(s: str) -> str:
    s = s.strip().lower()
    s = s.replace(",", "").replace("$", "")
    s = re.sub(r"[.\s]+$", "", s)   # drop trailing punctuation/space
    return s

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

# Right value, wrong format -> still passes after normalization
assert exact_match("$184,250", "184250")
# Extra prose -> fails: submit ONLY the answer string
assert not exact_match("The total is 184250 USD", "184250")

The second assertion is the lesson: extract the final answer and submit it alone.

Validation set vs. test set

GAIA ships as two splits, and the difference is central to keeping the benchmark honest.

Validation (dev) set — public, with ground-truth answers. You use this to develop and debug your agent: run it locally, score with the GAIA exact-match scorer, read per-level accuracy, and iterate. The validation set is on the order of ~165 questions.
Test set — public questions but private, withheld answers (on the order of ~300 questions). You cannot score it yourself. Instead you generate predictions and submit them to the GAIA leaderboard, which scores them against the held-out answers server-side.

The reason for withholding test answers is contamination resistance. If the test answers were public, they would eventually leak into training data and inflate scores without reflecting real capability. By keeping them private and scoring through a leaderboard (hosted as a Hugging Face Space), GAIA preserves a clean signal: your test number reflects an agent solving unseen questions, not a model regurgitating answers it absorbed.

The workflow that follows from this split: tune on validation, report on test. Use the validation set freely during development; submit to the leaderboard only when you want an official, contamination-resistant number.

Running GAIA against your own agent

GAIA is openly available — but gated — on Hugging Face at gaia-benchmark/GAIA. You must be logged in and accept the dataset's terms (which include not resharing it outside a gated or private repository) before you can download it. Skipping this step is the single most common "it won't load" error.

# 1. Authenticate so the gated dataset is accessible
pip install datasets huggingface_hub
huggingface-cli login   # paste a token from your HF account

Then load a split and iterate over items, downloading any attached file:

from datasets import load_dataset

# Load the validation split (has ground-truth answers)
ds = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for item in ds:
    question = item["Question"]
    level = item["Level"]
    gold = item["Final answer"]        # present on validation only

    # Some items attach a file you MUST read to answer
    attachment_path = item.get("file_name") or None

    # Hand the question + any file to YOUR agent / scaffold
    prediction = my_agent.run(question, file=attachment_path)

    correct = exact_match(prediction, gold)
    print(f"L{level} | correct={correct} | pred={prediction!r} gold={gold!r}")

The line that does the real work is my_agent.run(...). GAIA does not ship an agent — it ships questions, files, and answers. You bring the scaffold, and to score above the floor it needs, at minimum:

Web browsing / search for live facts.
File reading for images, spreadsheets, PDFs, and audio (OCR, table parsing, transcription as needed).
A code/Python tool for arithmetic, data manipulation, and parsing.
A multi-step planning loop so the model can call tools, read results, and act again until it has a final answer.

Score the validation split locally with the exact-match scorer, inspect failures per level, and only then generate predictions for the test split and submit them to the leaderboard. Predictions are typically submitted as a JSON Lines file mapping each task_id to your model's answer:

{"task_id": "c61d22de-5f6c-4958-a7f6-5e9707bd3466", "model_answer": "184250"}
{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "model_answer": "Tokyo"}

If you maintain a tool-using agent and want it wired for browsing, file reading, and code execution out of the box, install-ready scaffolds and evaluation skills for AI coding agents are at /skills.

Interpreting your results

A single aggregate accuracy is the least useful number GAIA gives you. Read it in layers:

Per-level accuracy first. If Level 1 is strong but Level 2 and 3 collapse, your reasoning is fine and your tooling or planning loop is the bottleneck — the agent isn't chaining steps or isn't using a needed tool. If even Level 1 is weak, suspect format mismatches or a broken file-reading path.
Measure against the human bar, not zero. ~92% human accuracy is the ceiling that gives the score meaning. "40%" is abstract; "40% where humans get ~92%" tells you how much assistant capability is still missing.
Remember a high score needs a real agent. A bare model with no tools will score low by construction. GAIA is testing the system, so a good number is evidence your scaffold works, not just that the model is smart.
Beware comparing scaffolds. Two GAIA numbers are only comparable if the agents had similar tools and constraints. A result with unlimited browsing and a strong code tool is not comparable to a single bare model call. When you read the leaderboard, look at how each entry was built before ranking them. The same care applies when you line up any eval tools side by side — see our evaluation tool comparison hub.

Common pitfalls and troubleshooting

Gated-access errors loading the dataset. load_dataset("gaia-benchmark/GAIA") fails until you log in and accept the terms on the dataset page. Run huggingface-cli login and click "Agree" on the dataset card first.
Ignoring attachments. Many items are unsolvable from the question text alone — the answer lives in the attached spreadsheet, image, PDF, or audio. If your accuracy is near zero on file-bearing items, your agent isn't downloading or reading the file_name artifact.
Answer-format mismatches. The most common silent failure: the agent computes the right value but emits it with a currency symbol, commas, extra words, or the wrong casing, and exact match rejects it. Follow the question's format instruction literally and submit only the final answer string.
Browsing and tool reliability. Level 2 and 3 items chain many live actions; a flaky browser, a rate-limited search, or a tool that times out mid-plan will tank long-horizon questions. Add retries and per-step timeouts, and inspect transcripts to see where the loop stalled.
Scoring on the test split locally. You can't — test answers are withheld. Develop and score on validation; use the leaderboard for the test number.
Over-reading a single run. Agents are stochastic. If a Level 3 result swings between runs, average several attempts before drawing conclusions, and lower temperature for steps that should be deterministic.

When GAIA is the right benchmark

Reach for GAIA when you are building or evaluating a general-purpose, tool-using assistant and want one objective, contamination-resistant yardstick for "can it actually get real-world answers?" Its strengths are a clean human baseline, cheap deterministic scoring, and difficulty levels that localize where an agent breaks. It is not the right tool for grading open-ended generation (summaries, tone, creativity) — those need rubric or LLM-as-judge scoring — and it won't replace a domain-specific suite for your own product's tasks. Many teams use GAIA as a capability smoke test for their agent scaffold alongside their own private evals. Browse install-ready evaluation and agent skills for AI coding agents at /skills.

Frequently Asked Questions

What is the GAIA benchmark?

GAIA (General AI Assistants) is a benchmark of real-world questions that are conceptually simple for humans but hard for AI, because each one requires a combination of reasoning, web browsing, reading attached files, tool use, and code execution. Every question has a single unambiguous factual answer scored by exact match after normalization. It was introduced by researchers from Meta-FAIR, Hugging Face, and the AutoGPT community to measure general-assistant capability rather than human-level difficulty.

How is the GAIA benchmark scored?

GAIA uses exact match after normalization, sometimes called quasi-exact match. The scorer normalizes case, surrounding punctuation, number formatting (commas and currency symbols), and list formatting, then checks whether your agent's final answer string equals the ground-truth string. There is no partial credit and no LLM judge, which makes scoring cheap and objective — but it also means a correct value in the wrong format fails.

What are the three levels in GAIA?

Level 1 questions take few steps and usually no tool or a single tool, so a very good LLM can often solve them. Level 2 questions need several steps and multiple tools combined, such as browsing plus reading a spreadsheet. Level 3 questions require long, complex sequences of actions across many tools, much closer to a real general assistant working over many minutes. Reading accuracy per level shows exactly where an agent breaks.

Why do humans score so much higher than AI on GAIA?

The questions are deliberately designed to be easy for a competent person — look something up, open the attached file, do the arithmetic — while still requiring an AI to actively use tools it often doesn't. Humans score about 92% in the original paper, while a plain model without tools scores far lower. That gap is the benchmark's signal: it shows how much real assistant capability is still missing.

How do I access and run the GAIA dataset?

GAIA is openly available but gated on Hugging Face at gaia-benchmark/GAIA. You log in with huggingface-cli login, accept the dataset terms, then load it with the datasets library. Iterate over items, download any attached file referenced by file_name, run your own tool-using agent to produce a final answer string, and score it against the validation answers with the GAIA exact-match scorer.

What is the difference between the GAIA validation and test sets?

The validation (dev) set is public and includes ground-truth answers, so you use it to develop and score your agent locally. The test set has public questions but private, withheld answers, so you submit predictions to the GAIA leaderboard (a Hugging Face Space) for server-side scoring. Test answers are withheld to prevent contamination — keeping them out of training data preserves a clean, honest measure of capability on unseen questions.