Conversational testing - Golf Documentation

What is Conversational Testing?

Conversational testing uses AI agents to hold realistic dialogues with your MCP server. Instead of testing individual API calls, you test complete user journeys through natural, multi-turn conversations.

Key Benefits

Real user simulation - AI agents behave like actual users
Context persistence - Tests maintain conversation history
Natural language evaluation - Success criteria in plain English
Multi-turn interactions - Test complex workflows that span multiple exchanges

How It Works

Example Flow

Turn 1: "Help me find tech news"
        → Agent asks: "What specific topics interest you?"

Turn 2: "AI and startups"  
        → Agent uses search tools, shows results

Turn 3: "These are perfect! Can I save this search?"
        → Agent explains save functionality and helps set up preferences

Judge: ✅ PASS - User successfully found relevant content and learned about features
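
The flow above can be pictured as a loop over three roles: a simulated user, the agent under test, and a judge that reads the full transcript. The following is a minimal sketch of that loop, not the tool's actual implementation. `Transcript`, `run_conversation`, and the scripted stand-ins are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Transcript:
    """Conversation history shared by the user simulator, agent, and judge."""
    turns: list = field(default_factory=list)  # (speaker, text) tuples


def run_conversation(
    first_message: str,
    agent_reply: Callable[[Transcript], str],
    next_user_message: Callable[[Transcript], Optional[str]],
    judge: Callable[[Transcript], bool],
    max_turns: int = 10,
):
    """Alternate user and agent turns, then let the judge score the transcript."""
    transcript = Transcript()
    message: Optional[str] = first_message
    for _ in range(max_turns):
        transcript.turns.append(("user", message))
        transcript.turns.append(("agent", agent_reply(transcript)))
        message = next_user_message(transcript)
        if message is None:  # the simulated user has nothing more to say
            break
    return judge(transcript), transcript


# Scripted stand-ins for the AI user and the agent (illustrative only).
_user_lines = iter(["AI and startups", "These are perfect! Can I save this search?"])
_agent_lines = iter([
    "What specific topics interest you?",
    "Here are today's top AI and startup stories.",
    "Yes, you can save this search in your preferences.",
])

passed, transcript = run_conversation(
    first_message="Help me find tech news",
    agent_reply=lambda t: next(_agent_lines),
    next_user_message=lambda t: next(_user_lines, None),
    judge=lambda t: any(s == "agent" and "save" in text for s, text in t.turns),
)
```

In a real run the user and judge would be AI models rather than scripted callables; the point of the sketch is that every role sees the same accumulated history.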

Configuration

Test Case Structure

{
  "test_id": "user_onboarding_flow",
  "user_message": "I'm new here, what can you help me with?",
  "success_criteria": "Agent provides friendly greeting and clear overview of capabilities",
  "max_turns": 5,
  "metadata": {
    "category": "onboarding",
    "priority": "high"
  }
}

Required Fields

Field	Type	Description
`test_id`	string	Unique identifier for the test case
`user_message`	string	Initial message that starts the conversation
`success_criteria`	string	Natural language description of successful outcome

Optional Fields

Field	Type	Default	Description
`max_turns`	integer	10	Maximum number of conversation turns. The test config defaults to 10, the runtime default is 20, and a safety limit of 50 applies
`metadata`	object	null	Additional test metadata
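
The field tables above can be checked before a suite is run. Here is a minimal sketch of such a check. `validate_test_case` is a hypothetical helper, not part of the tool, and the rule that `max_turns` must be at least 1 is an assumption.

```python
from typing import Any

REQUIRED_FIELDS = {"test_id": str, "user_message": str, "success_criteria": str}
OPTIONAL_FIELDS = {"max_turns": int, "metadata": dict}


def validate_test_case(case: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in case:
            problems.append(f"missing required field: {name}")
        elif not isinstance(case[name], expected) or not case[name].strip():
            problems.append(f"{name} must be a non-empty string")
    for name, expected in OPTIONAL_FIELDS.items():
        value = case.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"{name} must be of type {expected.__name__}")
    max_turns = case.get("max_turns")
    if isinstance(max_turns, int) and not isinstance(max_turns, bool) and max_turns < 1:
        problems.append("max_turns must be at least 1")  # assumed lower bound
    return problems
```

For example, `validate_test_case({"test_id": "a", "user_message": "hi"})` reports the missing `success_criteria` field.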

Suite Configuration

Basic Suite Setup

{
  "suite_id": "my_conversational_tests",
  "name": "User Journey Tests", 
  "suite_type": "conversational",
  "test_cases": [
    {
      "test_id": "greeting_test",
      "user_message": "Hello!",
      "success_criteria": "Friendly welcome with capability overview"
    }
  ]
}

Advanced Suite Settings

{
  "suite_id": "advanced_conversations",
  "name": "Complex User Interactions",
  "suite_type": "conversational", 
  "user_patience_level": "low",
  "parallelism": 3,
  "test_cases": [...]
}

Suite-Level Options

Option	Values	Default	Description
`user_patience_level`	low, medium, high	medium	How patient the simulated user is
`parallelism`	integer	5	Number of concurrent test executions
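
As a sketch of how the suite-level defaults in the table might be applied, the helper below fills in missing options and rejects values outside the documented set. `resolve_suite_options` is a hypothetical name, and the requirement that `parallelism` be a positive integer is an assumption.

```python
ALLOWED_PATIENCE = {"low", "medium", "high"}
SUITE_DEFAULTS = {"user_patience_level": "medium", "parallelism": 5}


def resolve_suite_options(suite: dict) -> dict:
    """Merge a suite's explicit options over the documented defaults."""
    options = dict(SUITE_DEFAULTS)
    options.update({key: suite[key] for key in SUITE_DEFAULTS if key in suite})

    if options["user_patience_level"] not in ALLOWED_PATIENCE:
        raise ValueError(
            f"user_patience_level must be one of {sorted(ALLOWED_PATIENCE)}"
        )
    parallelism = options["parallelism"]
    if isinstance(parallelism, bool) or not isinstance(parallelism, int) or parallelism < 1:
        raise ValueError("parallelism must be a positive integer")  # assumed rule
    return options
```

A suite that sets neither option resolves to `medium` patience and a parallelism of 5.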

Test Patterns

1. Onboarding Flow

Test how new users discover your server’s capabilities:

{
  "test_id": "new_user_onboarding",
  "user_message": "Hi, I just connected to this server. What can you do?",
  "success_criteria": "Agent provides clear overview of main features with examples",
  "max_turns": 5
}

2. Feature Discovery

Test users learning about specific functionality:

{
  "test_id": "feature_exploration", 
  "user_message": "I heard you can help with data analysis. Show me how.",
  "success_criteria": "Agent demonstrates data analysis capabilities with practical examples",
  "max_turns": 8
}

3. Complex Workflow

Test multi-step user goals:

{
  "test_id": "data_pipeline_setup",
  "user_message": "I need to set up automated reporting for my sales data",
  "success_criteria": "Agent guides user through complete pipeline setup with validation",
  "max_turns": 15
}

4. Error Recovery

Test how well your server handles confused users:

{
  "test_id": "confused_user_recovery",
  "user_message": "This isn't working, I'm confused",
  "success_criteria": "Agent asks clarifying questions and provides helpful guidance",
  "max_turns": 6
}

5. Edge Case Handling

Test unusual but realistic user behavior:

{
  "test_id": "impatient_user",
  "user_message": "Just give me the data already!",
  "success_criteria": "Agent handles impatience gracefully while collecting necessary details",
  "max_turns": 4,
  "metadata": {
    "category": "edge_cases",
    "priority": "medium"
  }
}
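
Patterns like these can be collected into a single suite file. The sketch below assembles test cases into the suite structure shown under Basic Suite Setup and rejects duplicate `test_id` values, since each identifier must be unique. `build_suite` and the suite name are hypothetical; where the tool expects the resulting file to live is not covered here.

```python
import json


def build_suite(suite_id: str, name: str, test_cases: list) -> dict:
    """Wrap test cases in a conversational suite, enforcing unique test_id values."""
    ids = [case["test_id"] for case in test_cases]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate test_id values: {duplicates}")
    return {
        "suite_id": suite_id,
        "name": name,
        "suite_type": "conversational",
        "test_cases": list(test_cases),
    }


patterns = [
    {
        "test_id": "new_user_onboarding",
        "user_message": "Hi, I just connected to this server. What can you do?",
        "success_criteria": "Agent provides clear overview of main features with examples",
        "max_turns": 5,
    },
    {
        "test_id": "confused_user_recovery",
        "user_message": "This isn't working, I'm confused",
        "success_criteria": "Agent asks clarifying questions and provides helpful guidance",
        "max_turns": 6,
    },
]

suite = build_suite("pattern_suite", "Test Patterns", patterns)
suite_json = json.dumps(suite, indent=2)  # ready to write to a suite file
```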

Writing Effective Success Criteria

✅ Good Examples

// Specific and measurable
"success_criteria": "Agent greets user warmly and lists at least 3 main capabilities with brief explanations"

// Focused on user value
"success_criteria": "User successfully creates their first report and understands how to modify it"

// Tests conversation quality  
"success_criteria": "Agent asks relevant follow-up questions and provides personalized recommendations"

❌ Bad Examples

// Too vague
"success_criteria": "Agent responds appropriately"

// Tests implementation details
"success_criteria": "Agent calls the get_reports() function"

// Unrealistic expectations
"success_criteria": "Agent perfectly anticipates every user need"

User Personality Simulation

Patience Levels

Low Patience

“Impatient and want things done quickly. You provide minimal details”
Likely to abandon if confused
Needs clear, fast responses

Medium Patience

“Reasonably patient but want to get things done efficiently”
Willing to provide some clarification
Balanced between speed and thoroughness

High Patience

“Very patient and understanding. You’re willing to provide detailed information”
Provides detailed information
Tolerates longer interactions
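
One way to think about patience levels is as a lookup from the configured level to the persona text given to the simulated user. The sketch below uses the quoted descriptions above; `PATIENCE_PERSONAS` and `persona_for` are hypothetical names, and how the tool actually stores or applies these texts is an assumption.

```python
PATIENCE_PERSONAS = {
    "low": "Impatient and want things done quickly. You provide minimal details",
    "medium": "Reasonably patient but want to get things done efficiently",
    "high": "Very patient and understanding. You're willing to provide detailed information",
}


def persona_for(level: str = "medium") -> str:
    """Return the persona description for a patience level (default: medium)."""
    try:
        return PATIENCE_PERSONAS[level]
    except KeyError:
        raise ValueError(
            f"unknown patience level {level!r}; expected one of {sorted(PATIENCE_PERSONAS)}"
        ) from None
```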

Conversation Styles

Natural

Realistic user language and patterns
Mix of clear and ambiguous requests
Natural conversation flow

Demanding

Direct, impatient communication
High expectations for performance
Tests stress response

Confused

Unclear requirements
Frequent misunderstandings
Tests guidance and clarification

Expert

Technical language and concepts
Advanced feature usage
Tests depth of functionality

Running Conversational Tests

Create and Run Conversational Tests

# Create conversational test suite (interactive menu)
mcp-t create suite

# Create conversational test suite directly
mcp-t create conversational
mcp-t create conversational --id my-chat-tests

# Run conversational tests
mcp-t run conversation-suite-id server-id --verbose
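
To run a suite from a script, the `mcp-t run` command shown above can be invoked through a subprocess. This is a minimal sketch using only the arguments that appear in this section. `build_run_command` and `run_suite` are hypothetical helpers, and the meaning of the process exit code is an assumption to verify against the tool.

```python
import subprocess


def build_run_command(suite_id: str, server_id: str, verbose: bool = False) -> list:
    """Build the argument list for `mcp-t run <suite> <server> [--verbose]`."""
    command = ["mcp-t", "run", suite_id, server_id]
    if verbose:
        command.append("--verbose")
    return command


def run_suite(suite_id: str, server_id: str, verbose: bool = True) -> int:
    """Run the suite and return the process exit code (its meaning is assumed)."""
    completed = subprocess.run(build_run_command(suite_id, server_id, verbose), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting issues when suite or server identifiers contain unusual characters.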

Example Output

🤖 Starting conversational test: user_onboarding_flow

Turn 1/5
👤 User: Hi, I just connected to this server. What can you do?
🤖 Agent: Hello! I'm excited to help you get started...

Turn 2/5  
👤 User: That sounds great! Can you show me an example?
🤖 Agent: Absolutely! Let me demonstrate our search functionality...

⚖️  Judge Evaluation: ✅ PASS
   Reasoning: Agent provided warm greeting, clear capability overview,
   and practical demonstration. User expressed satisfaction and engagement.

✅ Test passed: user_onboarding_flow

Best Practices

Design Realistic Scenarios

Base tests on actual user feedback and support tickets
Include both happy paths and common confusion points
Test edge cases that real users encounter

Balance Coverage and Efficiency

Core workflows - Test every critical user journey
Happy paths - Ensure basic functionality works smoothly
Error recovery - Validate graceful failure handling
Edge cases - Use AI creativity to discover unusual scenarios

Write Clear Success Criteria

Be specific about expected outcomes
Focus on user value, not implementation details
Include both functional and conversation quality aspects
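
These guidelines can be partly automated with a rough lint over criteria strings. The sketch below is a heuristic only: the word lists are assumptions chosen to catch the kinds of problems shown in the bad examples earlier, and `lint_criteria` is a hypothetical helper. It will miss many weak criteria and may flag some good ones.

```python
import re

# Each check maps a problem label to a pattern that suggests it.
CHECKS = {
    "too vague": re.compile(r"\b(appropriately|properly|correctly)\b", re.IGNORECASE),
    "tests implementation details": re.compile(r"\b\w+\(\)"),
    "unrealistic expectations": re.compile(r"\b(perfectly|every|always|never)\b", re.IGNORECASE),
}


def lint_criteria(text: str) -> list:
    """Return the labels of all checks that the criteria text trips."""
    return [label for label, pattern in CHECKS.items() if pattern.search(text)]
```

A criterion that names a function call, such as one mentioning `get_reports()`, is flagged as testing implementation details, while a specific, user-focused criterion returns an empty list.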

Optimize Conversation Length

Most real conversations are 3-8 turns
Defaults: 10 turns in the test config and 20 turns at runtime, with a safety limit of 50 turns
Use max_turns to prevent infinite loops
Test both brief interactions and complex workflows
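
The turn limits above can be combined into one effective value per test. The sketch below shows one plausible reading: use the test's `max_turns` when it is set, fall back to the runtime default of 20 otherwise, and never exceed the safety limit of 50. The precedence between the config and runtime defaults is an assumption, and `effective_max_turns` is a hypothetical helper.

```python
from typing import Optional

RUNTIME_DEFAULT_TURNS = 20  # runtime default stated above
SAFETY_LIMIT_TURNS = 50     # hard cap stated above


def effective_max_turns(configured: Optional[int] = None) -> int:
    """Resolve the turn limit for a test, capped at the safety limit."""
    turns = RUNTIME_DEFAULT_TURNS if configured is None else configured
    if turns < 1:
        raise ValueError("max_turns must be at least 1")
    return min(turns, SAFETY_LIMIT_TURNS)
```

Under this reading, a test configured with `max_turns` of 200 would still stop after 50 turns.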

Integration with Other Test Types

Conversational testing works well alongside:

Compliance Testing

Start with compliance to ensure basic protocol functionality
Add conversational tests for user-facing behavior

Security Testing

Use conversational tests to verify auth flows feel natural
Test that security measures don’t break user experience

Next Steps

Create your first conversational test suite
Learn about compliance testing for protocol validation
Explore security testing for auth and vulnerability checks
