Multi Category Budget Analysis

L3
Model Context Protocol · Playwright · Shopping

Analyze spending patterns across multiple product categories, optimize budget allocation, identify cost-saving opportunities, and generate a comprehensive financial planning report with purchase recommendations.

Created by Yaoqi Ye
2025-08-17
Data Extraction · Search Aggregation · Content Submission · Comparative Analysis · Inventory Management

Model Ranking

| Provider | Model | Run Results | Pass@4 | Pass^4 | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|---|---|
| Claude | claude-opus-4-1 | 0/1 | -- | -- | 690.5s | 27.0 | 2,680,142 | 3,645 | 2,683,787 |
| Claude | claude-sonnet-4 | 0/4 | -- | -- | 392.6s | 26.3 | 2,653,743 | 4,038 | 2,657,781 |
| Claude | claude-sonnet-4-high | 0/4 | -- | -- | 291.8s | 26.3 | 2,383,422 | 4,473 | 2,387,895 |
| Claude | claude-sonnet-4-low | 0/4 | -- | -- | 301.3s | 26.3 | 2,504,114 | 4,467 | 2,508,581 |
| DeepSeek | deepseek-chat | 0/4 | -- | -- | 323.5s | 20.0 | 1,419,454 | 1,955 | 1,421,409 |
| Gemini | gemini-2-5-flash | 0/4 | -- | -- | 200.6s | 29.3 | 7,700,368 | 5,160 | 7,705,528 |
| Gemini | gemini-2-5-pro | 0/4 | -- | -- | 258.3s | 25.0 | 3,127,209 | 10,933 | 3,138,141 |
| Z.ai | glm-4-5 | 0/4 | -- | -- | 150.2s | 15.5 | 876,920 | 1,912 | 878,833 |
| OpenAI | gpt-4-1 | 0/4 | -- | -- | 104.5s | 16.5 | 1,007,985 | 1,233 | 1,009,217 |
| OpenAI | gpt-4-1-mini | 0/4 | -- | -- | 439.0s | 62.0 | 17,299,015 | 10,149 | 17,309,163 |
| OpenAI | gpt-4-1-nano | 0/4 | -- | -- | 40.7s | 9.8 | 417,765 | 347 | 418,112 |
| OpenAI | gpt-5-high | 0/4 | -- | -- | 2591.8s | 40.5 | 5,407,023 | 71,747 | 5,478,770 |
| OpenAI | gpt-5-low | 0/4 | -- | -- | 1282.9s | 39.3 | 5,108,526 | 44,150 | 5,152,677 |
| OpenAI | gpt-5-medium | 0/4 | -- | -- | 1558.9s | 39.3 | 5,029,881 | 56,080 | 5,085,961 |
| OpenAI | gpt-5-mini-high | 0/4 | -- | -- | 702.5s | 42.0 | 5,521,676 | 41,997 | 5,563,673 |
| OpenAI | gpt-5-mini-low | 0/4 | -- | -- | 176.1s | 15.0 | 1,124,320 | 3,510 | 1,127,830 |
| OpenAI | gpt-5-mini-medium | 0/4 | -- | -- | 571.6s | 40.3 | 5,379,705 | 25,185 | 5,404,890 |
| OpenAI | gpt-5-nano-high | 0/4 | -- | -- | 319.9s | 25.5 | 1,641,986 | 44,685 | 1,686,671 |
| OpenAI | gpt-5-nano-low | 0/4 | -- | -- | 90.1s | 5.5 | 140,844 | 15,325 | 156,169 |
| OpenAI | gpt-5-nano-medium | 0/4 | -- | -- | 115.6s | 20.0 | 946,299 | 11,386 | 957,684 |
| OpenAI | gpt-oss-120b | 0/4 | -- | -- | 57.8s | 8.5 | 459,562 | 1,514 | 461,076 |
| Grok | grok-4 | 0/4 | -- | -- | 455.2s | 36.5 | 4,557,181 | 9,211 | 4,566,392 |
| Grok | grok-code-fast-1 | 0/4 | -- | -- | 224.4s | 39.8 | 4,852,290 | 9,117 | 4,861,407 |
| MoonshotAI | kimi-k2-0711 | 0/4 | -- | -- | 342.0s | 21.8 | 1,542,527 | 1,608 | 1,544,135 |
| MoonshotAI | kimi-k2-0905 | 0/4 | -- | -- | 1601.5s | 42.8 | 5,761,826 | 4,260 | 5,766,085 |
| OpenAI | o3 | 0/4 | -- | -- | 73.9s | 9.5 | 334,663 | 1,609 | 336,273 |
| OpenAI | o4-mini | 0/4 | -- | -- | 1015.1s | 22.0 | 2,077,932 | 26,589 | 2,104,520 |
| Qwen | qwen-3-coder-plus | 0/4 | -- | -- | 944.1s | 25.3 | 3,947,212 | 2,061 | 3,949,272 |
| Qwen | qwen-3-max | 0/4 | -- | -- | 721.0s | 33.0 | 4,056,337 | 1,026 | 4,057,362 |

Task State

WebArena
view WebArena environment setup for this task

Instruction

Complete the following tasks on One Stop Market (http://localhost:7770):

Task Requirements:

  1. In Chocolate subcategory, sort by price (ascending):

    • Record price and SKU of first 3 products
  2. Search for 'tabletop' with price range $100.00-$200.00:

    • Find the cheapest tabletop that has the highest review rating with at least 3 reviews.
    • Record search results count
    • Record price of required tabletop
  3. In "Computers & Accessories" subcategory with price filter $0.00-$9,999.99:

    • Sort by price (ascending)
    • Record price of cheapest item
  4. Add these products to comparison:

    • "Little Secrets Chocolate Pieces, Peanut Butter Flavor"
    • "Multi Accessory Hub Adapter By JOBY"
    • "SanDisk Cruzer Glide 32GB (5 Pack) USB 2.0 Flash Drive"
    • Count total items on comparison page
  5. In cart:

    • Add the cheapest chocolate product (from step 1) with "Peanut flavor" if available
    • Add cheapest computer accessory (from step 3)
    • Record cart subtotal and item count
  6. Calculate:

    • Sum of 3 chocolate product prices
    • Price difference: cheapest tabletop minus cheapest computer accessory
    • Whether sum of 3 comparison items < $60
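
The step-6 arithmetic is simple once the earlier values have been recorded. A minimal sketch, using made-up placeholder prices rather than real catalog data, of how the three derived figures could be computed and rendered in the $XX.XX form the verifier expects:

Python
# Placeholder prices for illustration only; real values come from steps 1-5.
chocolate_prices = [4.99, 5.49, 6.25]       # step 1: three cheapest chocolates
cheapest_tabletop = 129.99                  # step 2: cheapest qualifying tabletop
cheapest_accessory = 7.50                   # step 3: cheapest computer accessory
comparison_prices = [5.49, 24.95, 18.99]    # step 4: the three comparison items

chocolate_sum = sum(chocolate_prices)
price_difference = cheapest_tabletop - cheapest_accessory
under_60_budget = "YES" if sum(comparison_prices) < 60 else "NO"

print(f"chocolate_sum|${chocolate_sum:.2f}")        # e.g. $16.73
print(f"price_difference|${price_difference:.2f}")  # e.g. $122.49
print(f"under_60_budget|{under_60_budget}")         # e.g. YES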

Output Format:

Plaintext
<answer>
chocolate_products|Price1:SKU1;Price2:SKU2;Price3:SKU3
chocolate_sum|Total
tabletop_search_count|Count
tabletop_product|Price:SKU
tabletop_reviews|NumberOfReviews:Rating
cheapest_computer_accessory|Price
price_difference|Amount
comparison_count|Count
cart_subtotal|Amount
cart_item_count|Count
under_60_budget|YES/NO
</answer>
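
Before the final message is sent, the drafted block can be self-checked against the same structural constraints the verifier below enforces: an <answer>...</answer> wrapper, exactly 11 key|value lines, and $-prefixed prices. A minimal sketch, with placeholder values only:

Python
import re

# Placeholder draft; the real values come from the recorded task results.
draft = """<answer>
chocolate_products|$4.99:SKU-A;$5.49:SKU-B;$6.25:SKU-C
chocolate_sum|$16.73
tabletop_search_count|12
tabletop_product|$129.99:SKU-D
tabletop_reviews|5:80%
cheapest_computer_accessory|$7.50
price_difference|$122.49
comparison_count|3
cart_subtotal|$12.49
cart_item_count|2
under_60_budget|YES
</answer>"""

# Same extraction and line count the verifier applies.
body = re.search(r"<answer>(.*?)</answer>", draft, re.DOTALL).group(1).strip()
lines = body.split("\n")
assert len(lines) == 11, f"expected 11 lines, got {len(lines)}"
assert all("|" in line for line in lines), "every line must be key|value"
print("draft passes the structural checks")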


Verify

*.py
Python
import asyncio
import sys
import re
import os
import json
from pathlib import Path


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None

    try:
        with open(messages_path, "r") as f:
            messages = json.load(f)

        # Find the last assistant message
        for message in reversed(messages):
            if (
                message.get("role") == "assistant"
                and message.get("status") == "completed"
                and message.get("type") == "message"
            ):
                content = message.get("content", [])
                for item in content:
                    if item.get("type") == "output_text":
                        return item.get("text", "")

        print("Warning: No assistant response found in messages", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None


def parse_answer_format(text):
    """
    Parse the <answer>...</answer> format from the agent's output.
    Returns a dictionary with the parsed values.
    """
    if not text:
        return None

    # Look for <answer>...</answer> pattern
    match = re.search(r"<answer>(.*?)</answer>", text, re.IGNORECASE | re.DOTALL)
    if not match:
        return None

    answer_content = match.group(1).strip()

    # Parse each line
    result = {}
    lines = answer_content.split("\n")

    if len(lines) != 11:
        print(f"Error: Expected 11 lines in answer, got {len(lines)}", file=sys.stderr)
        return None

    for line in lines:
        if "|" in line:
            key, value = line.split("|", 1)
            result[key.strip()] = value.strip()

    return result


def load_expected_answer(label_path):
    """
    Load the expected answer from label.txt file.
    Returns a dictionary with the expected values.
    """
    try:
        with open(label_path, "r") as f:
            lines = f.read().strip().split("\n")

        expected = {}
        for line in lines:
            if "|" in line:
                key, value = line.split("|", 1)
                expected[key.strip()] = value.strip()

        return expected
    except Exception as e:
        print(f"Error reading label file: {str(e)}", file=sys.stderr)
        return None


def compare_answers(model_answer, expected_answer):
    """
    Compare the model's answer with the expected answer.
    Returns True if all key information matches, False otherwise.
    """
    if not model_answer or not expected_answer:
        return False

    # Check each expected key
    mismatches = []
    for key, expected_value in expected_answer.items():
        model_value = model_answer.get(key, "")

        # Special handling for different types of values
        if key == "chocolate_products":
            # Parse and compare chocolate products with price:SKU format
            expected_products = expected_value.split(";")
            model_products = model_value.split(";")
            
            if len(expected_products) != len(model_products):
                mismatches.append(f"{key}: expected {len(expected_products)} products, got {len(model_products)}")
            else:
                for i, (exp, mod) in enumerate(zip(expected_products, model_products)):
                    exp_parts = exp.strip().split(":")
                    mod_parts = mod.strip().split(":")
                    if len(exp_parts) != 2 or len(mod_parts) != 2:
                        mismatches.append(f"{key}: product {i+1} format error - expected 'price:SKU'")
                    else:
                        # Check price format (should start with $)
                        if not mod_parts[0].startswith("$"):
                            mismatches.append(f"{key}: product {i+1} price format error - expected '$XX.XX' format, got '{mod_parts[0]}'")
                        elif exp_parts[0] != mod_parts[0] or exp_parts[1] != mod_parts[1]:
                            mismatches.append(f"{key}: product {i+1} mismatch - expected '{exp}', got '{mod}'")

        elif key == "tabletop_product":
            # Parse and compare tabletop product with price:SKU format
            exp_parts = expected_value.strip().split(":")
            mod_parts = model_value.strip().split(":")
            if len(exp_parts) != 2 or len(mod_parts) != 2:
                mismatches.append(f"{key}: format error - expected 'price:SKU', got '{model_value}'")
            else:
                # Check price format (should start with $)
                if not mod_parts[0].startswith("$"):
                    mismatches.append(f"{key}: price format error - expected '$XX.XX' format, got '{mod_parts[0]}'")
                elif exp_parts[0] != mod_parts[0] or exp_parts[1] != mod_parts[1]:
                    mismatches.append(f"{key}: mismatch - expected '{expected_value}', got '{model_value}'")
        
        elif key == "tabletop_reviews":
            # Parse and compare tabletop reviews with NumberOfReviews:Rating format
            exp_parts = expected_value.strip().split(":")
            mod_parts = model_value.strip().split(":")
            if len(exp_parts) != 2 or len(mod_parts) != 2:
                mismatches.append(f"{key}: format error - expected 'NumberOfReviews:Rating', got '{model_value}'")
            else:
                # Check if both parts match
                if exp_parts[0] != mod_parts[0] or exp_parts[1] != mod_parts[1]:
                    mismatches.append(f"{key}: mismatch - expected '{expected_value}', got '{model_value}'")

        elif key in ["chocolate_sum", "price_difference", "cart_subtotal", "cheapest_computer_accessory"]:
            # For price fields, only support $XX.XX format
            # Check if model value has correct format
            if not model_value.startswith("$"):
                mismatches.append(
                    f"{key}: incorrect format - expected '$XX.XX' format, got '{model_value}'"
                )
            else:
                # Normalize and compare values
                expected_clean = expected_value.replace("$", "").replace(",", "")
                model_clean = model_value.replace("$", "").replace(",", "")
                if expected_clean != model_clean:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )

        elif key == "under_60_budget":
            # Compare YES/NO value (case-insensitive)
            if expected_value.upper() != model_value.upper():
                mismatches.append(f"{key}: expected '{expected_value}', got '{model_value}'")

        elif key in ["tabletop_search_count", "comparison_count", "cart_item_count"]:
            # Numeric fields - exact match
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

        else:
            # Exact match for other fields
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

    if mismatches:
        print("\n=== Answer Comparison Mismatches ===", file=sys.stderr)
        for mismatch in mismatches:
            print(f"✗ {mismatch}", file=sys.stderr)
        return False

    print("\n=== Answer Comparison ===", file=sys.stderr)
    print("✓ All key information matches the expected answer", file=sys.stderr)
    return True


async def verify() -> bool:
    """
    Verifies that the multi-category budget analysis task has been completed correctly.
    """
    # Get the label file path
    label_path = Path(__file__).parent / "label.txt"
    
    # Load expected answer
    expected_answer = load_expected_answer(label_path)
    if not expected_answer:
        print("Error: Could not load expected answer from label.txt", file=sys.stderr)
        return False

    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    if model_response:
        print("Found model response, parsing answer format...", file=sys.stderr)
        model_answer = parse_answer_format(model_response)
        
        if model_answer:
            print("\n=== Model Answer Parsed ===", file=sys.stderr)
            for key, value in model_answer.items():
                print(f"{key}: {value}", file=sys.stderr)
            
            # Compare answers
            answer_match = compare_answers(model_answer, expected_answer)
            if not answer_match:
                print("\nModel answer does not match expected answer", file=sys.stderr)
                return False
            print("\n✓ Model answer matches expected answer", file=sys.stderr)
            return True
        else:
            print("Warning: Could not parse answer format from model response", file=sys.stderr)
            return False
    else:
        print("No model response found", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()
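
For local testing, the transcript that the verifier reads through MCP_MESSAGES is a JSON list of messages in which the last completed assistant message carries the <answer> block as output_text. A minimal sketch of that structure, assuming the script above is saved as verify.py; the file name and answer text below are placeholders:

Python
import json

# Placeholder transcript mirroring the shape get_model_response() expects.
fake_transcript = [
    {
        "role": "assistant",
        "status": "completed",
        "type": "message",
        "content": [
            {
                "type": "output_text",
                # In a real run this string holds the full 11-line <answer> block.
                "text": "<answer>\nchocolate_products|$4.99:SKU-A;$5.49:SKU-B;$6.25:SKU-C\n...\n</answer>",
            }
        ],
    }
]

with open("messages.json", "w") as f:
    json.dump(fake_transcript, f, indent=2)

# With a label.txt of expected key|value lines next to the verifier, a run
# would look like:
#   MCP_MESSAGES=messages.json python verify.py
# The process exits 0 when every field matches the label and 1 otherwise.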