The Non-Determinism Problem
Normal software is deterministic: add(2, 2) always returns 4.
Agents are non-deterministic. All of these are correct responses to "What's 2+2?":
- "4"
- "The answer is 4"
- "2+2 equals 4"
- "Four"
Traditional assert output == expected tests break here: every one of those phrasings is acceptable, but only one of them matches the expected string.
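As a minimal illustration, the same answer can be checked on substance rather than exact wording. The normalize helper and the canned agent_answer below are hypothetical stand-ins for a real agent call:

def normalize(text: str) -> str:
    return text.lower().strip().rstrip('.')

agent_answer = 'The answer is 4'  # any of the phrasings above could come back

# Brittle: breaks whenever the phrasing changes
# assert agent_answer == '4'

# More robust: assert on the substance, not the exact string
assert '4' in normalize(agent_answer) or 'four' in normalize(agent_answer)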
What Can Go Wrong
Agents can fail in many ways:
- Wrong tool selection: Called a tool that doesn't fit the task
- Wrong arguments: Called the right tool with the wrong parameters
- Bad reasoning: Drew incorrect conclusions from tool results
- Safety issues: Generated harmful or inappropriate content
- Inefficiency: Used too many tokens or API calls
Testing Strategy: Instead of testing exact outputs, test:
- Behavior: Did it call the right tools?
- Structure: Is the output in the right format?
- Quality: Is the answer helpful? (using another LLM)
- Safety: Is the response appropriate?
- Efficiency: Token and API call counts (see the sketch after this list)
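For the efficiency check in particular, here is a minimal sketch. It assumes PydanticAI run results expose a usage() summary with request and token counts, reuses an already-configured agent, and the budget numbers are illustrative assumptions, not recommendations:

def test_stays_within_budget():
    result = agent.run_sync('What is 2+2?')  # hits the configured model
    usage = result.usage()
    assert usage.requests <= 3           # no runaway tool-call loops
    assert usage.total_tokens <= 2000    # answer stayed reasonably small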
TestModel - Deterministic Testing
PydanticAI provides TestModel for testing without making real API calls. Useful for:
- Verifying tool availability
- Testing agent configuration
- Fast unit tests (no API calls)
test_model.py
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

agent = Agent(
    'openai:gpt-4o',
    instructions='Help with calculations.'
)

@agent.tool_plain
def calculate(expression: str) -> str:
    # Demo only: eval is unsafe on untrusted input
    return str(eval(expression))

def test_agent_has_calculate_tool():
    """Test that the calculate tool is available."""
    test_model = TestModel()
    result = agent.run_sync(
        'What is 2+2?',
        model=test_model  # Use the test model instead of a real LLM
    )

    # Check which tools were presented to the model
    tools = test_model.last_model_request_parameters.function_tools
    tool_names = [t.name for t in tools]
    assert 'calculate' in tool_names

def test_agent_returns_output():
    """Test that the agent produces output."""
    test_model = TestModel()
    result = agent.run_sync('Hello', model=test_model)
    assert result.output is not None
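TestModel can also be swapped in with agent.override, a context manager PydanticAI provides for tests: everything run inside the with block uses the test model, which is convenient in pytest fixtures so individual tests don't need to pass model= explicitly. A minimal sketch reusing the agent above:

def test_with_override():
    with agent.override(model=TestModel()):
        result = agent.run_sync('What is 2+2?')
    assert result.output is not None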
FunctionModel - Custom Mocking
For more control, use FunctionModel to define exactly what the "LLM" returns. Useful for:
- Testing specific scenarios
- Simulating tool calls
- Testing error handling
function_model.py
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

agent = Agent('openai:gpt-4o')

def test_with_custom_model():
    received_messages = []

    def mock_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        # Record what the agent sent to the "LLM"
        received_messages.extend(messages)
        # Return a fixed response
        return ModelResponse(parts=[TextPart(content='Mocked response')])

    result = agent.run_sync('Hello', model=FunctionModel(mock_model))

    assert len(received_messages) > 0
    assert result.output == 'Mocked response'
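To exercise the "simulating tool calls" case, the function can return a tool call on the first round and a final answer once the tool result comes back. This is a sketch under the assumption that ToolCallPart is importable from pydantic_ai.messages and accepts a dict of arguments; the lookup_user tool and its data are hypothetical:

from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

tool_agent = Agent('openai:gpt-4o')

@tool_agent.tool_plain
def lookup_user(user_id: int) -> str:
    return f'User {user_id}: Alice'

def call_tool_then_answer(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
    if len(messages) == 1:
        # First round: the fake "LLM" decides to call the tool
        return ModelResponse(parts=[ToolCallPart(tool_name='lookup_user', args={'user_id': 1})])
    # Second round: the tool return is now in `messages`; produce the final answer
    return ModelResponse(parts=[TextPart(content='User 1 is Alice.')])

def test_simulated_tool_call():
    result = tool_agent.run_sync('Who is user 1?', model=FunctionModel(call_tool_then_answer))
    assert result.output == 'User 1 is Alice.'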
Testing Tools Directly
Test your tool functions independently of the agent:
test_tools.py
from dataclasses import dataclass, field
from unittest.mock import Mock

from pydantic_ai import Agent, RunContext

@dataclass
class TestDeps:
    call_log: list = field(default_factory=list)

agent = Agent('openai:gpt-4o', deps_type=TestDeps)

@agent.tool
def fetch_data(ctx: RunContext[TestDeps], item_id: int) -> dict:
    ctx.deps.call_log.append(item_id)
    return {'id': item_id, 'name': f'Item {item_id}'}

def test_fetch_data_tool():
    """Test the tool function directly."""
    deps = TestDeps()

    # Create a mock RunContext carrying the dependencies
    mock_ctx = Mock()
    mock_ctx.deps = deps

    # @agent.tool registers the tool and returns the original function,
    # so it can be called directly here, bypassing the agent
    result = fetch_data(mock_ctx, 42)

    assert result == {'id': 42, 'name': 'Item 42'}
    assert deps.call_log == [42]
LLM-as-Judge
Use another LLM to evaluate response quality:
llm_judge.py
from pydantic import BaseModel
from pydantic_ai import Agent

class Evaluation(BaseModel):
    score: float  # 0 to 1
    reasoning: str
    suggestions: list[str]

evaluator = Agent(
    'openai:gpt-4o',
    output_type=Evaluation,
    instructions='''Evaluate the AI response.
Score from 0 (terrible) to 1 (perfect).
Consider: accuracy, helpfulness, clarity, safety.
Provide reasoning and suggestions for improvement.'''
)

async def evaluate_response(question: str, response: str) -> Evaluation:
    result = await evaluator.run(f'''
Question: {question}
Response: {response}
Evaluate this response.
''')
    return result.output

# Usage (inside an async context, e.g. asyncio.run(...) or an async test)
eval_result = await evaluate_response(
    question="What is the capital of France?",
    response="Paris is the capital of France."
)
print(f"Score: {eval_result.score}")  # e.g. 0.95
print(f"Reasoning: {eval_result.reasoning}")
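Inside a test suite, the judge's output becomes an ordinary assertion. The 0.7 cutoff below is an arbitrary assumption to tune per use case, and the async marker assumes an async pytest plugin such as anyio or pytest-asyncio is installed:

import pytest

@pytest.mark.anyio
async def test_capital_question_quality():
    evaluation = await evaluate_response(
        question='What is the capital of France?',
        response='Paris is the capital of France.',
    )
    assert evaluation.score >= 0.7, evaluation.reasoning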
Regression Testing
Track quality over time to catch regressions:
regression_tracker.py
import json
from datetime import datetime

class EvaluationTracker:
    def __init__(self, filepath: str):
        self.filepath = filepath
        # Load previous results so regressions can be checked across runs
        try:
            with open(self.filepath) as f:
                self.history = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.history = []

    def record(self, test_name: str, score: float, details: dict):
        self.history.append({
            'timestamp': datetime.now().isoformat(),
            'test': test_name,
            'score': score,
            'details': details
        })
        self._save()

    def _save(self):
        with open(self.filepath, 'w') as f:
            json.dump(self.history, f, indent=2)

    def check_regression(self, test_name: str, current_score: float, threshold: float = 0.1):
        """Raise if the score dropped significantly since the last run."""
        previous = [h for h in self.history if h['test'] == test_name]
        if previous:
            last_score = previous[-1]['score']
            if current_score < last_score - threshold:
                raise AssertionError(
                    f"Regression detected: {test_name} dropped from {last_score} to {current_score}"
                )
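A sketch of how the tracker could be wired into an evaluation run; the file path, test name, and score are illustrative:

tracker = EvaluationTracker('eval_history.json')

score = 0.82  # e.g. eval_result.score from the LLM judge above
tracker.check_regression('capital_question', score)  # raises if it dropped by more than 0.1
tracker.record('capital_question', score, details={'model': 'gpt-4o'})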
Key Takeaways
1. Test behavior, not exact output. Agents are non-deterministic.
2. TestModel for unit tests. Fast, no API calls, predictable.
3. FunctionModel for mocking. Full control over "LLM" responses.
4. LLM-as-judge for quality. Let another LLM evaluate responses.
5. Track regressions. Catch quality drops before they reach production.