Prerequisites
- MCP Testing Framework installed
- An MCP server to test (we’ll use a demo server)
1. Create a Server Configuration
Use the CLI command to create your first server configuration and follow the interactive prompts. For this tutorial, you can also create the configuration manually. The example server configuration lives in configs/servers/my-first-server.json; it configures a test server that provides Hacker News functionality via MCP.
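As an illustrative sketch only (the field names, transport, and URL below are assumptions, not the framework's actual schema — the CLI's generated file is authoritative), a minimal server configuration might resemble:

```json
{
  "name": "my-first-server",
  "description": "Demo server exposing Hacker News functionality over MCP",
  "transport": "http",
  "url": "http://localhost:3000/mcp",
  "timeout_seconds": 30
}
```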
2. Create a Test Suite
Use the CLI command to create your first test suite and follow the interactive prompts. For this tutorial, you can also create the configuration manually in configs/suites/my-first-suite.json.
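As an illustrative sketch (the field names here are assumptions for illustration; the CLI's generated schema is authoritative), a minimal suite with one test case might look like:

```json
{
  "name": "my-first-suite",
  "server": "my-first-server",
  "tests": [
    {
      "id": "greeting",
      "prompt": "Hi! What can you do?",
      "success_criteria": "Agent should respond politely and explain Hacker News capabilities",
      "max_turns": 3,
      "carry_context": false,
      "category": "happy-path",
      "priority": "high"
    }
  ]
}
```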
Understanding Test Case Structure
Each test case includes:
- A unique identifier for the test
- What the AI agent will say to your server
- A natural language description of what constitutes success
- The maximum number of conversation turns before a timeout
- Whether context carries between turns
- Optional categorization and priority
Writing Good Success Criteria
Success criteria should be specific but flexible.

Good examples:
- “Agent should respond politely and explain Hacker News capabilities”
- “Agent should list at least 2 available tools and explain their purpose”
- “Agent should successfully fetch and display story titles with URLs”

Examples to avoid:
- Too vague: “Agent should work correctly”
- Too specific: “Agent must respond with exactly ‘Welcome to Hacker News!’”
- Too technical: “Agent should make HTTP GET request to /stories endpoint”
3. Run Your First Test
Execute the test suite with the CLI. You’ll see a summary of test results showing which tests passed or failed, along with confidence scores from the LLM judge.
4. Analyze Your Results
Test Verdict Meanings
- ✅ PASS: Test met the success criteria
- ❌ FAIL: Test did not meet the success criteria
- ⚠️ TIMEOUT: Test exceeded maximum turns or time limit
- 🔥 ERROR: Technical error occurred (server unreachable, API issues)
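Once you have results in hand, a quick tally by verdict summarizes a run at a glance. This is a hypothetical sketch for your own tooling; the framework's actual result format may differ, and the `verdict` key is an assumption:

```python
from collections import Counter

# Hypothetical result records; the framework's actual output format may differ.
results = [
    {"test": "greeting", "verdict": "PASS"},
    {"test": "list-tools", "verdict": "PASS"},
    {"test": "fetch-stories", "verdict": "FAIL"},
]

# Count how many tests ended with each verdict.
tally = Counter(r["verdict"] for r in results)
print(f"{tally['PASS']} passed, {tally['FAIL']} failed")  # 2 passed, 1 failed
```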
Judge Reasoning
The LLM judge analyzes each conversation and provides:
- Verdict: Pass/fail decision
- Reasoning: Detailed explanation of why the test passed or failed
- Confidence score: How confident the judge is (0.0-1.0)
- Conversation quality: How natural and helpful the interaction was
Interpreting Confidence Scores
- 0.9-1.0: Very confident - clear pass/fail
- 0.7-0.89: Confident - good evidence for verdict
- 0.5-0.69: Somewhat confident - borderline cases
- Below 0.5: Low confidence - may need better success criteria
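The bands above translate directly into a small helper; this is a sketch for your own reporting, not part of the framework:

```python
def confidence_band(score: float) -> str:
    """Map an LLM-judge confidence score (0.0-1.0) to a descriptive band."""
    if score >= 0.9:
        return "very confident"
    if score >= 0.7:
        return "confident"
    if score >= 0.5:
        return "somewhat confident"
    return "low confidence - consider tightening success criteria"

print(confidence_band(0.95))  # very confident
print(confidence_band(0.72))  # confident
```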
5. View Detailed Results
For more detailed analysis, use the CLI to list recent test runs with their results and completion status.
6. Iterate and Improve
Adding More Test Cases
Add more test cases to your suite to cover different scenarios.
Running with Verbose Output
Enable verbose output to get more detail during test execution.
Testing Different Scenarios
Create suites for different types of testing:
- Happy path: Normal user workflows
- Edge cases: Unusual requests or error conditions
- Security: Authentication, input validation
- Performance: Response times, handling multiple requests
Common First-Time Issues
Server Connection Failures
If your server is unreachable, try these solutions:
- Verify the server URL is correct and accessible
- Check that the server is running and responding to the MCP protocol
- Test the server manually with curl or a browser
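As a quick manual check, MCP servers using the HTTP transport speak JSON-RPC 2.0, so you can POST an initialize request. The URL below is a placeholder for your server's endpoint, and the client name/version are arbitrary:

```shell
# A JSON-RPC 2.0 "initialize" request that an MCP server over HTTP should answer.
PAYLOAD='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"manual-check","version":"0.0.0"}}}'
echo "$PAYLOAD"

# Send it with curl (uncomment and substitute your server's endpoint):
# curl -s -X POST http://localhost:3000/mcp \
#   -H 'Content-Type: application/json' \
#   -d "$PAYLOAD"
```

A healthy server should answer with a JSON-RPC result; connection errors or timeouts point to the server rather than the test framework.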
Authentication Errors
If authentication fails, try these solutions:
- Add authentication to server configuration
- Verify API keys or credentials are correct
- Check server authentication requirements
Next Steps
Now that you’ve created and run your first test:
- Learn Core Concepts - Understand why this testing approach works
- Explore Test Types - Learn about different testing strategies
- Master the CLI - Discover more powerful commands
- Advanced Configuration - Learn about templates and advanced options