How is testing different for agentic AI systems?

Traditional tests assert an exact output for an exact input. Agentic systems are probabilistic; the same prompt can yield different valid responses. So QA shifts from deterministic assertions to evals: golden datasets, rubric scoring, and LLM-as-judge suites that grade whether behaviour stays inside an acceptable boundary. This is the Eval phase of Pooya Golchian's AIDLC method.

What is the QA engineer's role when agents write the code?

More important, not less. The QA engineer designs the eval harness that gates every change, decides which behaviours must be guarded, and curates the golden datasets that catch regressions. In AIDLC, evals are the safety net that lets agent speed translate into shipped systems instead of faster bugs.

What is an eval and how is it different from a unit test?

A unit test asserts an exact value; an eval grades a probabilistic output against a rubric or a golden reference, often using LLM-as-judge scoring. Evals also test the agent's decisions, not just its output: did it call the right tool, in the right order, and stop when it had enough information? Those decision failures are the ones that matter most in agentic systems.

When do evals run in an agentic development workflow?

On every change, the way unit tests used to, gating each diff the agent generates. In Pooya Golchian's AIDLC method, evals are not an end-of-cycle step. They run in CI so that a behavioural regression gets caught before it reaches a user, and the QA engineer owns that gate.

How the QA Engineer Role Changed Under Agentic AI Development (AIDLC 2026)

Deterministic test suites cannot judge probabilistic systems, so the QA engineer's job moved from asserting exact outputs to designing evals. When the same prompt can return two correct answers, a string assertion fails good responses and passes bad ones; the new craft is grading behaviour against a golden reference.

QA used to have a clear contract with the codebase: given this input, assert that output. Cover the edge cases, automate the regression suite, and a green build meant the behaviour you tested still held.

That contract assumes determinism. Agentic systems break it. Ask an agent the same question twice and you may get two different, both-correct answers. A test that asserts an exact string now fails on a perfectly good response, and a test that asserts nothing meaningful passes on a bad one.
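A minimal sketch of that failure mode, in Python. The refund-policy responses and both helper functions are hypothetical, invented here for illustration; they are not output from a real agent.

```python
# Two responses an agent might plausibly give to the same refund question.
# All three strings are made up for this illustration.
RESPONSE_A = "Refunds are available within 30 days of purchase."
RESPONSE_B = "You can get a refund if you ask within 30 days of buying."
BAD_RESPONSE = "Refunds are available within 90 days of purchase."


def exact_match(response: str, expected: str) -> bool:
    """Traditional assertion: pass only on an identical string."""
    return response == expected


def non_empty(response: str) -> bool:
    """A check so loose that it asserts nothing meaningful."""
    return len(response) > 0


print(exact_match(RESPONSE_A, RESPONSE_A))  # True
print(exact_match(RESPONSE_B, RESPONSE_A))  # False: a correct answer is rejected
print(non_empty(BAD_RESPONSE))              # True: a wrong answer is accepted
```

Neither check looks at what the response actually claims, which is the gap an eval is meant to close.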

From assertions to evals

In the AIDLC method, this is the Eval phase, and it is where QA's leverage exploded. Instead of asserting exact outputs, the QA engineer designs evals: golden datasets of input-and-acceptable-output pairs, rubric-based scoring, and LLM-as-judge suites that grade whether a response stays inside the boundary the spec defined.
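One way a golden case and a rubric score could look, as a rough Python sketch. The `GoldenCase` fields and the `rubric_score` function are assumptions made for this example, not part of AIDLC or any library. The keyword check is a deliberately simple stand-in; a real suite might hand the same rubric to an LLM judge instead.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-dataset entry: a prompt plus the boundary an answer must stay inside."""
    prompt: str
    must_mention: list[str]  # facts an acceptable answer contains
    must_not_mention: list[str] = field(default_factory=list)  # claims that make it wrong


def rubric_score(response: str, case: GoldenCase) -> float:
    """Return the fraction of rubric criteria the response satisfies (0.0 to 1.0)."""
    text = response.lower()
    checks = [term.lower() in text for term in case.must_mention]
    checks += [term.lower() not in text for term in case.must_not_mention]
    return sum(checks) / len(checks) if checks else 1.0


case = GoldenCase(
    prompt="What is the refund window?",
    must_mention=["30 days", "refund"],
    must_not_mention=["90 days"],
)

good = "You can get a refund if you ask within 30 days of buying."
bad = "Refunds are available within 90 days of purchase."

print(rubric_score(good, case))            # 1.0
print(round(rubric_score(bad, case), 2))   # 0.33
```

The differently worded correct answer scores full marks, while the answer with the wrong window satisfies only one of three criteria.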

This is harder and more valuable than writing assertions. Deciding what "good enough" means for a probabilistic system, encoding it so a machine can grade it, and keeping the dataset honest as the product evolves is a senior discipline. It is the difference between a system that ships confidently and one that regresses silently.

The new gate

Evals are not a phase you run at the end. In AIDLC they run on every change, the same way unit tests used to, gating each diff the agent generates. A behavioural regression that would have reached a user gets caught in CI instead. The QA engineer owns that gate, and owning it makes them one of the highest-leverage people on the team.
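A gate of this kind can be sketched as a function that turns eval scores into an exit code. The `gate` function, its thresholds, and the case names below are illustrative assumptions; the right numbers depend on the product and on how noisy the scoring is.

```python
def gate(scores: dict[str, float], min_score: float = 0.8,
         min_pass_rate: float = 0.95) -> int:
    """Return a process exit code: 0 lets the change through, 1 blocks it."""
    if not scores:
        return 1  # no evals ran, so nothing vouches for the change
    failed = sorted(name for name, score in scores.items() if score < min_score)
    pass_rate = 1 - len(failed) / len(scores)
    for name in failed:
        print(f"FAIL {name}: {scores[name]:.2f} < {min_score:.2f}")
    print(f"pass rate {pass_rate:.0%} (required {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# Hypothetical scores from one eval run against the golden dataset.
exit_code = gate({"refund_window": 1.0, "shipping_cost": 0.33})
print(exit_code)  # 1
```

In a CI job the return value would typically be passed to `sys.exit`, so a non-zero code fails the build and the diff does not merge.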

Evals also test the agent's decisions, not just its output. Did it call the right tool, in the right order, and stop when it had enough information? Those are the failures that matter in agentic systems, and catching them is the new craft.
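Checking decisions means grading the tool-call trace rather than the final text. The sketch below assumes the agent run is logged as an ordered list of tool names; the tool names and both functions are hypothetical.

```python
def tools_in_order(trace: list[str], expected: list[str]) -> bool:
    """True if the expected tools appear in the trace in that order.

    Other calls may sit in between; only the relative order is checked.
    """
    remaining = iter(trace)
    # `tool in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(tool in remaining for tool in expected)


def check_trajectory(trace: list[str], expected: list[str],
                     max_calls: int) -> dict[str, bool]:
    """Grade one agent run on tool choice, ordering, and knowing when to stop."""
    return {
        "right_tools_in_order": tools_in_order(trace, expected),
        "stopped_in_time": len(trace) <= max_calls,
    }


expected = ["get_order", "issue_refund"]

ok_run = ["search_orders", "get_order", "issue_refund"]
wrong_order = ["issue_refund", "get_order"]
looping = ["get_order"] * 6 + ["issue_refund"]

print(check_trajectory(ok_run, expected, max_calls=4))
# {'right_tools_in_order': True, 'stopped_in_time': True}
print(check_trajectory(wrong_order, expected, max_calls=4))
# {'right_tools_in_order': False, 'stopped_in_time': True}
print(check_trajectory(looping, expected, max_calls=4))
# {'right_tools_in_order': True, 'stopped_in_time': False}
```

The third run reaches the right answer but keeps calling a tool it no longer needs, a failure that output-only grading would miss.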

If your team ships agent features with no eval suite, you are running probabilistic software with a deterministic safety net that no longer fits. The mismatch is where your incidents come from.

The QA engineers who win

They build eval harnesses, not just test suites. They curate golden datasets like an asset. They grade behaviour and decisions, not just strings. And they measure their work in regressions caught before release, which is the only metric that survives the move to probabilistic systems.


If you want an eval harness that gates agent behaviour the way tests used to gate code, book a discovery call and we will design it.
