Lesson 6
60 min

Evaluation & Testing

Test non-deterministic agents with proper strategies

The Non-Determinism Problem

Normal software is deterministic: add(2, 2) always returns 4.

Agents are non-deterministic. All of these are correct responses to "What's 2+2?":

  • "4"
  • "The answer is 4"
  • "2+2 equals 4"
  • "Four"

A traditional assert output == expected check is too brittle: every one of those correct answers except the first would fail it.

What Can Go Wrong

Agents can fail in many ways:

  1. Wrong tool selection: Called a tool that doesn't fit the task, or skipped one it needed
  2. Wrong arguments: Called the right tool with incorrect parameters
  3. Bad reasoning: Drew incorrect conclusions from tool results
  4. Safety issues: Generated harmful or inappropriate content
  5. Inefficiency: Used too many tokens or API calls

Testing Strategy: Instead of testing exact outputs, test the following (a short sketch follows this list):

  • Behavior: Did it call the right tools?
  • Structure: Is the output in the right format?
  • Quality: Is the answer helpful? (using another LLM)
  • Safety: Is the response appropriate?
  • Efficiency: Token and API call counts
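
For example, here is a minimal sketch (assuming a hypothetical calculator agent; both tests call the real model) contrasting a brittle exact-match assertion with behavior- and structure-level checks:

test_strategy.py
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', instructions='Help with calculations.')

def test_exact_output_is_brittle():
    result = agent.run_sync('What is 2+2?')
    # Fragile: fails whenever the model answers "The answer is 4" or "Four"
    assert result.output == '4'

def test_behavior_and_structure():
    result = agent.run_sync('What is 2+2?')
    # Behavior/quality: any phrasing that contains the correct value passes
    assert '4' in result.output or 'four' in result.output.lower()
    # Structure: the agent produced a non-empty string
    assert isinstance(result.output, str) and result.output.strip()

The rest of this lesson shows how to drop the real API call entirely (TestModel, FunctionModel) and how to judge quality when a simple substring check isn't enough.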

TestModel - Deterministic Testing

PydanticAI provides TestModel for testing without making real API calls. Useful for:

  • Verifying tool availability
  • Testing agent configuration
  • Fast unit tests (no API calls)
test_model.py
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

agent = Agent(
    'openai:gpt-4o',
    instructions='Help with calculations.'
)

@agent.tool_plain
def calculate(expression: str) -> str:
    try:
        return str(eval(expression))  # eval is demo-only; guarded so synthetic test args don't raise
    except Exception as exc:
        return f'Error: {exc}'

def test_agent_has_calculate_tool():
    """Test that the calculate tool is available."""
    test_model = TestModel()

    result = agent.run_sync(
        'What is 2+2?',
        model=test_model  # Use test model instead of real LLM
    )

    # Check that tool is available
    tools = test_model.last_model_request_parameters.function_tools
    tool_names = [t.name for t in tools]

    assert 'calculate' in tool_names

def test_agent_returns_output():
    """Test that agent produces output."""
    test_model = TestModel()

    result = agent.run_sync('Hello', model=test_model)

    assert result.output is not None
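
TestModel also helps when the agent call is buried inside application code and you can't pass model= at the call site: agent.override swaps the model for everything run inside a with block. A minimal sketch, reusing the agent defined above:

test_override.py
from pydantic_ai.models.test import TestModel

def test_run_with_override():
    """Any run inside the block uses TestModel instead of the real LLM."""
    with agent.override(model=TestModel()):
        result = agent.run_sync('What is 2+2?')

    assert result.output is not None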

FunctionModel - Custom Mocking

For more control, use FunctionModel to define exactly what the "LLM" returns. Useful for:

  • Testing specific scenarios
  • Simulating tool calls (see the second example below)
  • Testing error handling
function_model.py
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

agent = Agent('openai:gpt-4o')

def test_with_custom_model():
    received_messages = []

    def mock_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        # Record what was sent
        received_messages.extend(messages)
        # Return a fixed response
        return ModelResponse(parts=[TextPart(content='Mocked response')])

    result = agent.run_sync('Hello', model=FunctionModel(mock_model))

    assert len(received_messages) > 0
    assert result.output == 'Mocked response'
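
The bullets above also mention simulating tool calls. A FunctionModel function can return a ToolCallPart on its first turn and a plain TextPart once the tool result is back in the message history. A sketch, assuming a hypothetical get_price tool:

function_model_tools.py
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

agent = Agent('openai:gpt-4o')

@agent.tool_plain
def get_price(item: str) -> float:
    return {'apple': 1.50}.get(item, 0.0)

def fake_llm(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
    if len(messages) == 1:
        # First turn: pretend the LLM decided to call the tool
        return ModelResponse(parts=[ToolCallPart(tool_name='get_price', args={'item': 'apple'})])
    # Second turn: the tool result is in the history, so give a final answer
    return ModelResponse(parts=[TextPart(content='An apple costs $1.50')])

def test_simulated_tool_call():
    result = agent.run_sync('How much is an apple?', model=FunctionModel(fake_llm))
    assert result.output == 'An apple costs $1.50'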

Testing Tools Directly

Test your tool functions independently of the agent:

test_tools.py
import pytest
from pydantic_ai import Agent, RunContext
from dataclasses import dataclass, field
from unittest.mock import Mock

@dataclass
class TestDeps:
    call_log: list[int] = field(default_factory=list)

agent = Agent('openai:gpt-4o', deps_type=TestDeps)

@agent.tool
def fetch_data(ctx: RunContext[TestDeps], item_id: int) -> dict:
    ctx.deps.call_log.append(item_id)
    return {'id': item_id, 'name': f'Item {item_id}'}

def test_fetch_data_tool():
    """Test tool function directly."""
    deps = TestDeps()

    # Create mock context
    mock_ctx = Mock()
    mock_ctx.deps = deps

    # Call the tool function directly (the @agent.tool decorator returns the plain function)
    result = fetch_data(mock_ctx, 42)

    assert result == {'id': 42, 'name': 'Item 42'}
    assert deps.call_log == [42]

LLM-as-Judge

Use another LLM to evaluate response quality:

llm_judge.py
from pydantic import BaseModel
from pydantic_ai import Agent

class Evaluation(BaseModel):
    score: float  # 0 to 1
    reasoning: str
    suggestions: list[str]

evaluator = Agent(
    'openai:gpt-4o',
    output_type=Evaluation,
    instructions='''Evaluate the AI response.
    Score from 0 (terrible) to 1 (perfect).
    Consider: accuracy, helpfulness, clarity, safety.
    Provide reasoning and suggestions for improvement.'''
)

async def evaluate_response(question: str, response: str) -> Evaluation:
    result = await evaluator.run(f'''
    Question: {question}
    Response: {response}

    Evaluate this response.
    ''')
    return result.output

# Usage (inside an async function or event loop)
eval_result = await evaluate_response(
    question="What is the capital of France?",
    response="Paris is the capital of France."
)
print(f"Score: {eval_result.score}")  # e.g. Score: 0.95
print(f"Reasoning: {eval_result.reasoning}")

Regression Testing

Track quality over time to catch regressions:

regression_tracker.py
import json
from datetime import datetime

class EvaluationTracker:
    def __init__(self, filepath: str):
        self.filepath = filepath
        # Load previous runs so regressions can be detected across sessions
        try:
            with open(filepath) as f:
                self.history = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.history = []

    def record(self, test_name: str, score: float, details: dict):
        self.history.append({
            'timestamp': datetime.now().isoformat(),
            'test': test_name,
            'score': score,
            'details': details
        })
        self._save()

    def _save(self):
        with open(self.filepath, 'w') as f:
            json.dump(self.history, f, indent=2)

    def check_regression(self, test_name: str, current_score: float, threshold: float = 0.1):
        """Alert if score dropped significantly."""
        previous = [h for h in self.history if h['test'] == test_name]
        if previous:
            last_score = previous[-1]['score']
            if current_score < last_score - threshold:
                raise AssertionError(
                    f"Regression detected: {test_name} dropped from {last_score} to {current_score}"
                )
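
Putting the pieces together, a regression test checks the new score against history before recording it (reusing the evaluator and the hypothetical support_agent from the sketches above):

test_regression.py
tracker = EvaluationTracker('eval_history.json')

def test_no_quality_regression():
    question = 'What is the capital of France?'
    answer = support_agent.run_sync(question).output  # hypothetical agent under test
    evaluation = evaluator.run_sync(
        f'Question: {question}\nResponse: {answer}\n\nEvaluate this response.'
    ).output

    # Compare against history first, then record the new score
    tracker.check_regression('capital_question', evaluation.score)
    tracker.record('capital_question', evaluation.score, {'reasoning': evaluation.reasoning})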

Key Takeaways

  1. Test behavior, not exact output. Agents are non-deterministic.
  2. TestModel for unit tests. Fast, no API calls, predictable.
  3. FunctionModel for mocking. Full control over "LLM" responses.
  4. LLM-as-judge for quality. Let another LLM evaluate responses.
  5. Track regressions. Catch quality drops before they reach production.