Cloudflare Turnstile Challenge

L3
ModelContextProtocolPlaywrightEval Web

Navigate websites with Cloudflare Turnstile protection, handle security challenges, bypass bot detection mechanisms, and successfully access protected content using automated browser interactions.

Created by Allison Zhan
2025-07-27
User Interaction

Model Ranking

Click on the dots to view the trajectory of each task run
Model
Run Results
Pass@4
Pass^4
Avg Time
Avg Turns
Input Tokens
Output Tokens
Total Tokens
OpenAI
gpt-5-high
4
/4
2037.6s
29.0
622,991
61,624
684,615
Grok
grok-4
2
/4
1248.0s
73.8
3,687,213
26,809
3,714,022
OpenAI
gpt-5-low
1
/4
983.4s
35.3
968,881
48,334
1,017,215
OpenAI
gpt-5-medium
1
/4
1241.7s
34.8
855,834
45,870
901,704
OpenAI
gpt-5-mini-high
1
/4
166.7s
24.8
417,340
13,709
431,050
OpenAI
o3
1
/4
373.2s
34.8
781,361
16,506
797,867
Claude
claude-opus-4-1
0
/1
--
565.0s
25.0
548,500
7,703
556,203
Claude
claude-sonnet-4
0
/4
317.5s
30.8
1,072,987
6,500
1,079,487
Claude
claude-sonnet-4-high
0
/4
265.9s
28.3
1,402,638
6,423
1,409,062
Claude
claude-sonnet-4-low
0
/4
247.2s
27.3
760,955
6,199
767,153
DeepSeek
deepseek-chat
0
/4
508.3s
34.8
888,423
6,492
894,915
Gemini
gemini-2-5-flash
0
/4
105.9s
18.3
165,880
5,174
171,054
Gemini
gemini-2-5-pro
0
/4
59.0s
6.5
21,491
1,431
22,922
Z.ai
glm-4-5
0
/4
295.6s
23.0
318,472
4,296
322,768
OpenAI
gpt-4-1
0
/4
16.9s
8.0
30,227
293
30,520
OpenAI
gpt-4-1-mini
0
/4
50.3s
13.3
73,052
763
73,815
OpenAI
gpt-4-1-nano
0
/4
20.8s
9.3
40,765
437
41,202
OpenAI
gpt-5-mini-low
0
/4
22.3s
5.3
16,065
1,214
17,280
OpenAI
gpt-5-mini-medium
0
/4
101.3s
19.8
259,993
6,119
266,111
OpenAI
gpt-5-nano-high
0
/4
133.6s
6.8
24,177
25,863
50,040
OpenAI
gpt-5-nano-low
0
/4
67.2s
7.8
30,440
1,965
32,404
OpenAI
gpt-5-nano-medium
0
/4
71.5s
7.8
31,168
12,507
43,676
OpenAI
gpt-oss-120b
0
/4
22.3s
5.5
22,446
1,021
23,467
Grok
grok-code-fast-1
0
/4
314.4s
35.0
1,600,973
8,935
1,609,908
MoonshotAI
kimi-k2-0711
0
/4
350.5s
27.3
411,861
4,807
416,668
MoonshotAI
kimi-k2-0905
0
/4
347.8s
22.3
286,860
4,064
290,924
OpenAI
o4-mini
0
/4
264.6s
22.0
234,865
16,500
251,365
Qwen
qwen-3-coder-plus
0
/4
120.1s
25.8
425,973
3,087
429,059
Qwen
qwen-3-max
0
/4
362.5s
68.5
2,056,075
3,151
2,059,226

Task State


Instruction

Cloudflare Turnstile Authentication Challenge Task

Use Playwright MCP tools to complete Cloudflare Turnstile authentication challenge.

Requirements:

  1. Navigate to https://eval-web.mcpmark.ai/auth/turnstile
  2. Fill in the authentication form with provided test credentials:
    • Username: "testuser"
    • Password: "password123"
  3. Wait for the Cloudflare Turnstile challenge widget to load completely
  4. Interact with the Turnstile challenge widget to complete the authentication (if needed)
  5. Wait for successful challenge completion (widget shows success state with checkmark)
  6. Submit the form by clicking the "Sign In" button
  7. Wait for and capture any success message or confirmation that appears
  8. Output the success capture in step 7

Notes:

  • Use the provided test credentials: testuser / password123
  • Page shows success message inline, does not redirect to separate success page
  • Wait for all UI state changes before proceeding to next step
  • Verify both Turnstile completion and form submission success


Verify

*.py
Python
#!/usr/bin/env python3
"""
Simplified verification script for Playwright Cloudflare Turnstile authentication task.

This script only verifies that the model successfully reported capturing the expected
success message by checking the last assistant message in messages.json.
"""

import sys
import json
import os

# Expected success message that agent should capture
EXPECTED_SUCCESS_MESSAGE = "Authentication successful! Security challenge verified."


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None
    
    try:
        with open(messages_path, 'r') as f:
            messages = json.load(f)
        
        # Find the last assistant message with status completed
        for message in reversed(messages):
            if (message.get('role') == 'assistant' and 
                message.get('status') == 'completed' and 
                message.get('type') == 'message'):
                content = message.get('content', [])
                # Extract text from content
                if isinstance(content, list):
                    for item in content:
                        if isinstance(item, dict) and item.get('type') in ['text', 'output_text']:
                            return item.get('text', '')
                elif isinstance(content, str):
                    return content
        
        print("Warning: No completed assistant message found", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None


def verify():
    """
    Verifies that the model's last response contains the expected success message.
    """
    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    
    if not model_response:
        print("No model response found", file=sys.stderr)
        return False
    
    print(f"\nModel response (first 500 chars): {model_response[:500]}...", file=sys.stderr)
    
    # Check if the expected success message is in the model's response
    if EXPECTED_SUCCESS_MESSAGE in model_response:
        print(f"\n✓ Success message found: '{EXPECTED_SUCCESS_MESSAGE}'", file=sys.stderr)
        return True
    else:
        print(f"\n✗ Success message NOT found: '{EXPECTED_SUCCESS_MESSAGE}'", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = verify()
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()