Cloudflare Turnstile Challenge

PlaywrightEval Web

Navigate websites with Cloudflare Turnstile protection, handle security challenges, bypass bot detection mechanisms, and successfully access protected content using automated browser interactions.

Created by Allison Zhan

2025-07-27

User Interaction

Model Ranking

Click on the dots to view the trajectory of each task run

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
gpt-5-2-high	4 /4			1910.5s	55.5	5,045,507	61,528	5,107,036
gpt-5-high	4 /4			2037.6s	29.0	622,991	61,624	684,615
gemini-3-pro-high	2 /4			1480.8s	86.5	4,361,283	28,412	4,389,694
gemini-3-pro-low	2 /4			1034.5s	63.8	1,941,404	20,018	1,961,421
grok-4	2 /4			1248.0s	73.8	3,687,213	26,809	3,714,022
gpt-5-low	1 /4			983.4s	35.3	968,881	48,334	1,017,215
gpt-5-medium	1 /4			1241.7s	34.8	855,834	45,870	901,704
gpt-5-mini-high	1 /4			166.7s	24.8	417,340	13,709	431,050
o3	1 /4			373.2s	34.8	781,361	16,506	797,867
claude-opus-4-1	0 /1	-	-	565.0s	25.0	548,500	7,703	556,203
claude-opus-4-5-high	0 /4			272.1s	32.0	2,973,988	5,247	2,979,235
claude-sonnet-4	0 /4			317.5s	30.8	1,072,987	6,500	1,079,487
claude-sonnet-4-5	0 /4			355.4s	40.8	1,902,679	8,927	1,911,605
claude-sonnet-4-high	0 /4			265.9s	28.3	1,402,638	6,423	1,409,062
claude-sonnet-4-low	0 /4			247.2s	27.3	760,955	6,199	767,153
deepseek-chat	0 /4			508.3s	34.8	888,423	6,492	894,915
deepseek-v3-1-terminus	0 /4			350.6s	22.8	335,588	5,243	340,831
deepseek-v3-1-terminus-thinking	0 /4			477.4s	28.5	565,553	7,571	573,124
deepseek-v3-2-chat	0 /4			489.5s	48.5	1,696,006	10,545	1,706,551
deepseek-v3-2-thinking	0 /4			699.9s	62.3	2,275,429	14,325	2,289,754
gemini-2-5-flash	0 /4			105.9s	18.3	165,880	5,174	171,054
gemini-2-5-pro	0 /4			59.0s	6.5	21,491	1,431	22,922
glm-4-5	0 /4			295.6s	23.0	318,472	4,296	322,768
gpt-4-1	0 /4			16.9s	8.0	30,227	293	30,520
gpt-4-1-mini	0 /4			50.3s	13.3	73,052	763	73,815
gpt-4-1-nano	0 /4			20.8s	9.3	40,765	437	41,202
gpt-5-mini-low	0 /4			22.3s	5.3	16,065	1,214	17,280
gpt-5-mini-medium	0 /4			101.3s	19.8	259,993	6,119	266,111
gpt-5-nano-high	0 /4			133.6s	6.8	24,177	25,863	50,040
gpt-5-nano-low	0 /4			67.2s	7.8	30,440	1,965	32,404
gpt-5-nano-medium	0 /4			71.5s	7.8	31,168	12,507	43,676
gpt-oss-120b	0 /4			22.3s	5.5	22,446	1,021	23,467
grok-4-fast	0 /4			153.8s	33.5	1,734,258	6,738	1,740,996
grok-code-fast-1	0 /4			314.4s	35.0	1,600,973	8,935	1,609,908
kimi-k2-0711	0 /4			350.5s	27.3	411,861	4,807	416,668
kimi-k2-0905	0 /4			347.8s	22.3	286,860	4,064	290,924
o4-mini	0 /4			264.6s	22.0	234,865	16,500	251,365
qwen-3-coder-plus	0 /4			120.1s	25.8	425,973	3,087	429,059
qwen-3-max	0 /4			362.5s	68.5	2,056,075	3,151	2,059,226

Task State

mcp-eval-website.vercel.app

view this website to see the original task state

Instruction

Cloudflare Turnstile Authentication Challenge Task

Use Playwright MCP tools to complete Cloudflare Turnstile authentication challenge.

Requirements:

Navigate to https://eval-web.mcpmark.ai/auth/turnstile
Fill in the authentication form with provided test credentials:
- Username: "testuser"
- Password: "password123"
Wait for the Cloudflare Turnstile challenge widget to load completely
Interact with the Turnstile challenge widget to complete the authentication (if needed)
Wait for successful challenge completion (widget shows success state with checkmark)
Submit the form by clicking the "Sign In" button
Wait for and capture any success message or confirmation that appears
Output the success capture in step 7

Notes:

Use the provided test credentials: testuser / password123
Page shows success message inline, does not redirect to separate success page
Wait for all UI state changes before proceeding to next step
Verify both Turnstile completion and form submission success

Verify

Python

#!/usr/bin/env python3
"""
Simplified verification script for Playwright Cloudflare Turnstile authentication task.

This script only verifies that the model successfully reported capturing the expected
success message by checking the last assistant message in messages.json.
"""

import sys
import json
import os

# Expected success message that agent should capture
EXPECTED_SUCCESS_MESSAGE = "Authentication successful! Security challenge verified."


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None
    
    try:
        with open(messages_path, 'r') as f:
            messages = json.load(f)
        
        # Find the last assistant message with status completed
        for message in reversed(messages):
            if (message.get('role') == 'assistant' and 
                message.get('status') == 'completed' and 
                message.get('type') == 'message'):
                content = message.get('content', [])
                # Extract text from content
                if isinstance(content, list):
                    for item in content:
                        if isinstance(item, dict) and item.get('type') in ['text', 'output_text']:
                            return item.get('text', '')
                elif isinstance(content, str):
                    return content
        
        print("Warning: No completed assistant message found", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None


def verify():
    """
    Verifies that the model's last response contains the expected success message.
    """
    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    
    if not model_response:
        print("No model response found", file=sys.stderr)
        return False
    
    print(f"\nModel response (first 500 chars): {model_response[:500]}...", file=sys.stderr)
    
    # Check if the expected success message is in the model's response
    if EXPECTED_SUCCESS_MESSAGE in model_response:
        print(f"\n✓ Success message found: '{EXPECTED_SUCCESS_MESSAGE}'", file=sys.stderr)
        return True
    else:
        print(f"\n✗ Success message NOT found: '{EXPECTED_SUCCESS_MESSAGE}'", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = verify()
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()