Fitness Promotion Strategy

PlaywrightShopping Admin

Develop fitness product promotion campaigns by analyzing sales data, creating targeted offers, configuring promotional rules, and implementing cross-selling strategies in admin dashboard.

Created by Fanqing Meng

2025-08-17

Data ExtractionComparative AnalysisInventory ManagementContent Submission

Model Ranking

Click on the dots to view the trajectory of each task run

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
claude-opus-4-1	0 /1	-	-	235.5s	12.0	675,712	1,525	677,237
claude-opus-4-5-high	0 /4			109.2s	11.5	679,492	1,486	680,978
claude-sonnet-4	0 /4			318.2s	12.5	1,073,041	1,738	1,074,779
claude-sonnet-4-5	0 /4			100.1s	11.5	566,874	1,487	568,361
claude-sonnet-4-high	0 /4			347.9s	19.0	3,386,880	3,128	3,390,008
claude-sonnet-4-low	0 /4			371.0s	22.0	4,189,150	3,643	4,192,792
deepseek-chat	0 /4			230.4s	13.5	309,883	1,558	311,440
deepseek-v3-1-terminus	0 /4			170.7s	10.3	219,299	985	220,284
deepseek-v3-1-terminus-thinking	0 /4			805.3s	11.0	235,609	17,416	253,025
deepseek-v3-2-chat	0 /4			378.5s	16.8	776,784	2,432	779,216
deepseek-v3-2-thinking	0 /4			104.0s	8.8	83,825	1,860	85,685
gemini-2-5-flash	0 /4			639.3s	59.3	19,518,781	54,671	19,573,452
gemini-2-5-pro	0 /4			137.5s	13.0	1,515,600	4,963	1,520,563
gemini-3-pro-high	0 /4			365.0s	30.0	8,120,821	9,218	8,130,038
gemini-3-pro-low	0 /4			211.2s	27.0	3,853,992	7,448	3,861,440
glm-4-5	0 /4			64.5s	9.0	99,527	1,374	100,901
gpt-4-1	0 /4			110.9s	14.3	1,192,623	897	1,193,520
gpt-4-1-mini	0 /4			240.8s	30.5	7,661,978	2,904	7,664,882
gpt-4-1-nano	0 /4			29.3s	7.3	63,976	688	64,664
gpt-5-2-high	0 /4			378.1s	12.8	1,333,799	9,941	1,343,740
gpt-5-high	0 /4			462.2s	11.3	858,847	12,003	870,850
gpt-5-low	0 /4			214.9s	12.8	753,453	7,350	760,803
gpt-5-medium	0 /4			410.1s	15.3	924,520	15,833	940,354
gpt-5-mini-high	0 /4			792.5s	38.0	4,167,528	57,969	4,225,497
gpt-5-mini-low	0 /4			61.2s	8.5	429,899	1,401	431,300
gpt-5-mini-medium	0 /4			228.2s	22.8	1,249,342	11,221	1,260,563
gpt-5-nano-high	0 /4			324.2s	26.3	2,389,582	32,922	2,422,504
gpt-5-nano-low	0 /4			146.9s	22.5	780,913	15,128	796,041
gpt-5-nano-medium	0 /4			165.1s	22.5	709,862	19,647	729,509
gpt-oss-120b	0 /4			33.4s	6.3	39,260	974	40,233
grok-4	0 /4			189.9s	10.0	507,030	3,437	510,468
grok-4-fast	0 /4			189.3s	18.3	4,626,764	7,333	4,634,097
grok-code-fast-1	0 /4			77.4s	11.8	503,655	3,345	507,000
kimi-k2-0711	0 /4			139.1s	11.3	155,821	681	156,502
kimi-k2-0905	0 /4			806.3s	20.8	1,485,868	1,675	1,487,543
o3	0 /4			138.6s	12.5	440,609	3,209	443,818
o4-mini	0 /4			532.2s	14.0	1,240,266	10,522	1,250,788
qwen-3-coder-plus	0 /4			3162.4s	32.0	11,916,092	2,680	11,918,772
qwen-3-max	0 /4			413.3s	18.0	1,291,081	589	1,291,670

Task State

WebArena

view WebArena environment setup for this task

Instruction

Our marketing team is planning a new promotion for our bestselling fitness products. We need to analyze the current performance of our top-selling items and their related promotions to optimize our strategy.

Task Requirements:

If need to login, login with username 'admin' and password 'admin1234'
Start by checking our current bestsellers:
- Identify the top 3 bestselling products based on their Price and Quantity - record their names, prices, and quantities sold
- Note the total Revenue amount displayed
- Check if any of these bestsellers appear in the Top Search Terms table - if yes, record the search term and its usage count, else output 'No:0'
Investigate these bestselling products in detail:
- For each of the top 3 bestsellers identified, search for them by name and record:
  - Their SKU
  - Current inventory quantity
  - Whether they are 'Enabled' or 'Disabled'
Check if we have existing promotions for these products:
- Look for any active rules that might apply to fitness/yoga products
- Find if there's a rule offering percentage discount - record the rule name and discount percentage
- Count total number of active rules
Analyze customer purchasing patterns:
- Count total number of orders in the system
- Note the ID of the most recent order
Review our top customers who might be interested:
- Find the customer who appears in the Last Orders section of the dashboard with the highest total
- Look up this customer in the All Customers list and record his email and customer group
- Count how many other customers are in the same group
Compile your findings and output them in the following exact format:

Plaintext


Bestseller1|name:price:quantity:sku:inventory:status
Bestseller2|name:price:quantity:sku:inventory:status
Bestseller3|name:price:quantity:sku:inventory:status
TotalRevenue|amount
BestsellerInSearch|term:count
PercentageDiscountRule|name:percentage
ActiveRulesCount|count
TotalOrders|count
MostRecentOrderID|id
TopCustomer|name:email:group
SameGroupCustomers|count

Example Output:

Plaintext


Bestseller1|Product Name:$XX.XX:X:XXX(SKU):X:Enabled/Disabled
Bestseller2|Product Name:$XX.XX:X:XXX(SKU):X:Enabled/Disabled
Bestseller3|Product Name:$XX.XX:X:XXX(SKU):X:Enabled/Disabled
TotalRevenue|$XX.XX
BestsellerInSearch|Term:X or None:0
PercentageDiscountRule|Rule Name:XX%
ActiveRulesCount|X
TotalOrders|X
MostRecentOrderID|X or None
TopCustomer|Customer Name:email@example.com:Group Name
SameGroupCustomers|X

Verify

Python

import asyncio
import sys
import re
import os
import json
from pathlib import Path

def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None
    
    try:
        with open(messages_path, 'r') as f:
            messages = json.load(f)
        
        # Find the last assistant message
        for message in reversed(messages):
            if message.get('role') == 'assistant' and message.get('status') == 'completed':
                content = message.get('content', [])
                for item in content:
                    if item.get('type') == 'output_text':
                        return item.get('text', '')
        
        print("Warning: No assistant response found in messages", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None

def parse_answer_format(text):
    """
    Parse the ... format from the agent's output.
    Returns a dictionary with the parsed values.
    """
    if not text:
        return None
    
    # Look for ... pattern
    match = re.search(r'(.*?)', text, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    
    answer_content = match.group(1).strip()
    
    # Parse each line
    result = {}
    lines = answer_content.split('\n')
    
    # Skip the check for exact number of lines - just parse what we have
    # if len(lines) != 13:
    #     print(f"Error: Expected 13 lines in answer, got {len(lines)}", file=sys.stderr)
    #     return None
    
    for line in lines:
        if '|' in line:
            key, value = line.split('|', 1)
            result[key.strip()] = value.strip()
    
    return result

def load_expected_answer(label_path):
    """
    Load the expected answer from label.txt file.
    Returns a dictionary with the expected values.
    """
    try:
        with open(label_path, 'r') as f:
            lines = f.read().strip().split('\n')
        
        expected = {}
        for line in lines:
            if '|' in line:
                key, value = line.split('|', 1)
                expected[key.strip()] = value.strip()
        
        return expected
    except Exception as e:
        print(f"Error reading label file: {str(e)}", file=sys.stderr)
        return None

def compare_answers(model_answer, expected_answer):
    """
    Compare the model's answer with the expected answer.
    Returns True if all key information matches, False otherwise.
    """
    if not model_answer or not expected_answer:
        return False
    
    # Check each expected key
    mismatches = []
    for key, expected_value in expected_answer.items():
        model_value = model_answer.get(key, '')
        
        # Special handling for different types of values
        if key in ['Bestseller1', 'Bestseller2', 'Bestseller3']:
            # Check if all parts match (name:price:quantity:sku:inventory:status)
            if ':' in expected_value and ':' in model_value:
                expected_parts = expected_value.split(':')
                model_parts = model_value.split(':')
                if len(expected_parts) == 6 and len(model_parts) == 6:
                    # Compare each part
                    for i, (exp, mod) in enumerate(zip(expected_parts, model_parts)):
                        if i == 1:  # Price field
                            exp_clean = exp.replace('$', '').replace(',', '')
                            mod_clean = mod.replace('$', '').replace(',', '')
                            if exp_clean != mod_clean:
                                mismatches.append(f"{key} price: expected '{exp}', got '{mod}'")
                        elif i == 4:  # Inventory field (may have decimal places)
                            exp_float = float(exp.replace(',', ''))
                            mod_float = float(mod.replace(',', ''))
                            if abs(exp_float - mod_float) > 0.0001:
                                mismatches.append(f"{key} inventory: expected '{exp}', got '{mod}'")
                        else:
                            if exp.lower() != mod.lower():
                                mismatches.append(f"{key} part {i}: expected '{exp}', got '{mod}'")
                else:
                    mismatches.append(f"{key}: format mismatch - expected '{expected_value}', got '{model_value}'")
            else:
                if expected_value != model_value:
                    mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key == 'LowestInventoryProduct':
            # Check product name and inventory
            if ':' in expected_value and ':' in model_value:
                expected_name, expected_inv = expected_value.rsplit(':', 1)
                model_name, model_inv = model_value.rsplit(':', 1)
                if expected_name.lower() != model_name.lower():
                    mismatches.append(f"{key} name: expected '{expected_name}', got '{model_name}'")
                exp_float = float(expected_inv.replace(',', ''))
                mod_float = float(model_inv.replace(',', ''))
                if abs(exp_float - mod_float) > 0.0001:
                    mismatches.append(f"{key} inventory: expected '{expected_inv}', got '{model_inv}'")
            else:
                if expected_value != model_value:
                    mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key in ['TotalRevenue', 'MinimumPurchaseRule']:
            # For price/amount fields, normalize format
            expected_clean = expected_value.replace('$', '').replace(',', '')
            model_clean = model_value.replace('$', '').replace(',', '')
            if expected_clean != model_clean:
                mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key == 'BestsellerInSearch':
            # Check search term and count
            if expected_value.lower() != model_value.lower():
                mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key == 'PercentageDiscountRule':
            # Check rule name and percentage
            if ':' in expected_value and ':' in model_value:
                expected_name, expected_pct = expected_value.rsplit(':', 1)
                model_name, model_pct = model_value.rsplit(':', 1)
                if expected_name != model_name:
                    mismatches.append(f"{key} name: expected '{expected_name}', got '{model_name}'")
                # Normalize percentage (20% vs 20 vs 0.20)
                exp_pct_clean = expected_pct.replace('%', '').strip()
                mod_pct_clean = model_pct.replace('%', '').strip()
                if exp_pct_clean != mod_pct_clean:
                    mismatches.append(f"{key} percentage: expected '{expected_pct}', got '{model_pct}'")
            else:
                if expected_value != model_value:
                    mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key == 'TopCustomer':
            # Check name:email:group
            if ':' in expected_value and ':' in model_value:
                expected_parts = expected_value.split(':')
                model_parts = model_value.split(':')
                if len(expected_parts) == 3 and len(model_parts) == 3:
                    exp_name, exp_email, exp_group = expected_parts
                    mod_name, mod_email, mod_group = model_parts
                    if exp_name != mod_name:
                        mismatches.append(f"{key} name: expected '{exp_name}', got '{mod_name}'")
                    if exp_email.lower() != mod_email.lower():
                        mismatches.append(f"{key} email: expected '{exp_email}', got '{mod_email}'")
                    if exp_group.lower() != mod_group.lower():
                        mismatches.append(f"{key} group: expected '{exp_group}', got '{mod_group}'")
                else:
                    mismatches.append(f"{key}: format mismatch - expected '{expected_value}', got '{model_value}'")
            else:
                if expected_value != model_value:
                    mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        elif key == 'MostRecentOrderDate':
            # Date format may vary, do flexible comparison
            if expected_value.lower() == 'none' and model_value.lower() == 'none':
                continue
            elif expected_value != model_value:
                # Could add more flexible date parsing here if needed
                mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
        
        else:
            # Exact match for other fields (counts, etc.)
            if str(model_value) != str(expected_value):
                mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")
    
    if mismatches:
        print("\n=== Answer Comparison Mismatches ===", file=sys.stderr)
        for mismatch in mismatches:
            print(f"✗ {mismatch}", file=sys.stderr)
        return False
    
    print("\n=== Answer Comparison ===", file=sys.stderr)
    print("✓ All key information matches the expected answer", file=sys.stderr)
    return True

async def verify() -> bool:
    """
    Verifies that the bestseller analysis and promotion task has been completed correctly.
    First checks the model's answer against the expected label,
    then optionally verifies the actual state in the Magento Admin.
    """
    # Get the label file path
    label_path = Path(__file__).parent / "label.txt"
    
    # Load expected answer
    expected_answer = load_expected_answer(label_path)
    if not expected_answer:
        print("Error: Could not load expected answer from label.txt", file=sys.stderr)
        return False
    
    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    if model_response:
        print("Found model response, parsing answer format...", file=sys.stderr)
        model_answer = parse_answer_format(model_response)
        
        if model_answer:
            print("\n=== Model Answer Parsed ===", file=sys.stderr)
            for key, value in model_answer.items():
                print(f"{key}: {value}", file=sys.stderr)
            
            # Compare answers
            answer_match = compare_answers(model_answer, expected_answer)
            if not answer_match:
                print("\nModel answer does not match expected answer", file=sys.stderr)
                return False
            print("\n✓ Model answer matches expected answer", file=sys.stderr)
            return True
        else:
            print("Warning: Could not parse answer format from model response", file=sys.stderr)
            return False
    else:
        print("No model response found", file=sys.stderr)
        return False

def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)

if __name__ == "__main__":
    main()