Holiday Baking Competition

L3
Model Context Protocol · Playwright · Shopping

Research baking supplies in preparation for a competition, covering ingredient quality analysis, equipment comparisons, recipe optimization, and a comprehensive shopping list with budget recommendations.

Created by Yaoqi Ye
2025-08-17
Search Aggregation · Comparative Analysis · Inventory Management

Model Ranking

| Model | Run Results | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI gpt-5-medium | 4/4 | 851.0s | 37.0 | 4,145,903 | 26,409 | 4,172,312 |
| Grok grok-4 | 4/4 | 412.3s | 32.5 | 3,518,762 | 10,074 | 3,528,836 |
| OpenAI gpt-5-low | 3/4 | 943.4s | 39.8 | 4,529,020 | 27,018 | 4,556,037 |
| OpenAI gpt-5-high | 2/4 | 1624.9s | 35.5 | 3,751,234 | 44,824 | 3,796,058 |
| Claude claude-opus-4-1 | 0/1 | 647.7s | 26.0 | 2,520,129 | 3,457 | 2,523,586 |
| Claude claude-sonnet-4 | 0/4 | 446.8s | 29.8 | 2,987,007 | 5,005 | 2,992,012 |
| Claude claude-sonnet-4-high | 0/4 | 797.8s | 42.3 | 6,609,572 | 8,650 | 6,618,222 |
| Claude claude-sonnet-4-low | 0/4 | 548.9s | 38.8 | 5,376,418 | 6,870 | 5,383,288 |
| DeepSeek deepseek-chat | 0/4 | 353.8s | 24.3 | 1,647,381 | 1,496 | 1,648,876 |
| Gemini gemini-2-5-flash | 0/4 | 370.7s | 35.3 | 5,781,196 | 7,861 | 5,789,056 |
| Gemini gemini-2-5-pro | 0/4 | 382.3s | 37.0 | 5,102,165 | 13,453 | 5,115,617 |
| Z.ai glm-4-5 | 0/4 | 523.2s | 19.5 | 1,167,731 | 4,154 | 1,171,885 |
| OpenAI gpt-4-1 | 0/4 | 196.2s | 31.3 | 3,387,730 | 1,348 | 3,389,078 |
| OpenAI gpt-4-1-mini | 0/4 | 355.2s | 56.5 | 10,656,981 | 11,326 | 10,668,307 |
| OpenAI gpt-4-1-nano | 0/4 | 55.5s | 26.3 | 621,519 | 826 | 622,345 |
| OpenAI gpt-5-mini-high | 0/4 | 556.3s | 44.3 | 5,629,157 | 26,639 | 5,655,796 |
| OpenAI gpt-5-mini-low | 0/4 | 34.8s | 5.5 | 101,987 | 996 | 102,983 |
| OpenAI gpt-5-mini-medium | 0/4 | 451.9s | 38.8 | 5,059,139 | 17,257 | 5,076,396 |
| OpenAI gpt-5-nano-high | 0/4 | 529.6s | 46.3 | 3,794,676 | 74,410 | 3,869,085 |
| OpenAI gpt-5-nano-low | 0/4 | 83.3s | 6.0 | 100,520 | 14,598 | 115,119 |
| OpenAI gpt-5-nano-medium | 0/4 | 110.4s | 11.3 | 637,813 | 16,798 | 654,611 |
| OpenAI gpt-oss-120b | 0/4 | 30.5s | 6.0 | 176,912 | 1,230 | 178,142 |
| Grok grok-code-fast-1 | 0/4 | 155.3s | 29.3 | 2,908,832 | 8,393 | 2,917,224 |
| MoonshotAI kimi-k2-0711 | 0/4 | 353.3s | 26.0 | 1,761,215 | 1,706 | 1,762,920 |
| MoonshotAI kimi-k2-0905 | 0/4 | 724.9s | 43.0 | 4,958,349 | 3,279 | 4,961,628 |
| OpenAI o3 | 0/4 | 113.4s | 15.3 | 447,214 | 2,716 | 449,930 |
| OpenAI o4-mini | 0/4 | 999.3s | 21.8 | 1,364,244 | 37,124 | 1,401,367 |
| Qwen qwen-3-coder-plus | 0/4 | 917.9s | 17.8 | 2,947,785 | 1,307 | 2,949,091 |
| Qwen qwen-3-max | 0/4 | 804.5s | 35.8 | 3,966,722 | 970 | 3,967,691 |

Task State

WebArena
View the WebArena environment setup for this task.

Instruction

Task Requirements (a hypothetical browser-automation sketch of step 1 follows this list):

  1. Search 'gingerbread', sort by price (high to low):

    • Add most expensive product to comparison list
    • Record SKU of second most expensive product
  2. Search 'cookie' with price range $20.00-$40.00:

    • Find product with highest rating % and at least 5 reviews in the first 2 pages (if tied, choose lowest price)
    • Record SKU and rating %
    • Select "Cookies: Oatmeal Chocolate Chunk" flavor if required
    • Add to cart with quantity 2
  3. Search 'chocolate', sort by price (low to high):

    • Find cheapest product with at least 1 review
    • Record price and review count
    • Select "Peanut Butter Flavor" if required
    • Add to cart with quantity 3
  4. In cart:

    • Update cookie quantity from 2 to 5
    • Record cart subtotal and total items count
  5. Search 'gingerbread', go to page 2:

    • Find third product on page 2
    • Record SKU, price, and manufacturer
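
For orientation only, here is a minimal sketch of step 1 written with plain Playwright for Python rather than the Playwright MCP tool calls an agent would actually issue. The base URL, the Magento-style query parameters, and every CSS selector are assumptions about the WebArena shopping site, not details confirmed by this task.

Python
# Hypothetical sketch of step 1: search 'gingerbread', sort by price (high to low),
# and read the top two results. BASE_URL, query parameters, and selectors are assumed.
from playwright.sync_api import sync_playwright

BASE_URL = "http://localhost:7770"  # assumed address of the WebArena shopping site

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Magento-style search URL with descending price sort (assumed parameters).
    page.goto(
        f"{BASE_URL}/catalogsearch/result/"
        "?q=gingerbread&product_list_order=price&product_list_dir=desc"
    )

    # First result = most expensive product; second result = runner-up whose SKU
    # would be recorded from its detail page.
    items = page.locator(".product-item")
    for i in range(2):
        name = items.nth(i).locator(".product-item-link").inner_text().strip()
        price = items.nth(i).locator(".price").first.inner_text().strip()
        print(i + 1, name, price)

    browser.close()

The same navigate-locate-read pattern extends to the remaining steps; only the search terms, filters, and cart actions change.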

Output Format:

Plaintext
<answer>
SecondGingerbreadSKU|sku
HighestRatedCookieSKURating|sku:rating%
CheapestChocolatePriceReviews|$price:reviews
CartSubtotalAfterUpdate|$amount
TotalCartItems|count
Page2ThirdProductSKUPrice|sku:$price
ProductManufacturer|manufacturer
</answer>
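
For illustration only, a completed answer might look like the block below. Every value is a placeholder invented for this example, not the expected result for this task; note the seven pipe-delimited lines, the leading '$' on price fields, and the colon-separated compound fields that the verifier below checks.

Plaintext
<answer>
SecondGingerbreadSKU|B0EXAMPLE01
HighestRatedCookieSKURating|B0EXAMPLE02:87%
CheapestChocolatePriceReviews|$3.49:12
CartSubtotalAfterUpdate|$58.95
TotalCartItems|8
Page2ThirdProductSKUPrice|B0EXAMPLE03:$14.99
ProductManufacturer|ExampleBrand
</answer>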


Verify

*.py
Python
import asyncio
import sys
import re
import os
import json
from pathlib import Path


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None

    try:
        with open(messages_path, "r") as f:
            messages = json.load(f)

        # Find the last assistant message
        for message in reversed(messages):
            if (
                message.get("role") == "assistant"
                and message.get("status") == "completed"
                and message.get("type") == "message"
            ):
                content = message.get("content", [])
                for item in content:
                    if item.get("type") == "output_text":
                        return item.get("text", "")

        print("Warning: No assistant response found in messages", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None


def parse_answer_format(text):
    """
    Parse the <answer>...</answer> format from the agent's output.
    Returns a dictionary with the parsed values.
    """
    if not text:
        return None

    # Look for <answer>...</answer> pattern
    match = re.search(r"<answer>(.*?)</answer>", text, re.IGNORECASE | re.DOTALL)
    if not match:
        return None

    answer_content = match.group(1).strip()

    # Parse each line
    result = {}
    lines = answer_content.split("\n")

    if len(lines) != 7:
        print(f"Error: Expected 7 lines in answer, got {len(lines)}", file=sys.stderr)
        return None

    for line in lines:
        if "|" in line:
            key, value = line.split("|", 1)
            result[key.strip()] = value.strip()

    return result


def load_expected_answer(label_path):
    """
    Load the expected answer from label.txt file.
    Returns a dictionary with the expected values.
    """
    try:
        with open(label_path, "r") as f:
            lines = f.read().strip().split("\n")

        expected = {}
        for line in lines:
            if "|" in line:
                key, value = line.split("|", 1)
                expected[key.strip()] = value.strip()

        return expected
    except Exception as e:
        print(f"Error reading label file: {str(e)}", file=sys.stderr)
        return None


def compare_answers(model_answer, expected_answer):
    """
    Compare the model's answer with the expected answer.
    Returns True if all key information matches, False otherwise.
    """
    if not model_answer or not expected_answer:
        return False

    # Check each expected key
    mismatches = []
    for key, expected_value in expected_answer.items():
        model_value = model_answer.get(key, "")

        # Special handling for different types of values
        if key == "SecondGingerbreadSKU":
            # SKU should match exactly (case-insensitive)
            if model_value.upper() != expected_value.upper():
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )
                
        elif key in ["CartSubtotalAfterUpdate"]:
            # For price fields, only support $XX.XX format
            # Check if model value has correct format
            if not model_value.startswith("$"):
                mismatches.append(
                    f"{key}: incorrect format - expected '$XX.XX' format, got '{model_value}'"
                )
            else:
                # Normalize and compare values
                expected_clean = expected_value.replace("$", "").replace(",", "")
                model_clean = model_value.replace("$", "").replace(",", "")
                # Allow some tolerance for price calculations (within $0.01)
                try:
                    expected_float = float(expected_clean)
                    model_float = float(model_clean)
                    if abs(expected_float - model_float) > 0.01:
                        mismatches.append(
                            f"{key}: expected '{expected_value}', got '{model_value}'"
                        )
                except ValueError:
                    if expected_value != model_value:
                        mismatches.append(
                            f"{key}: expected '{expected_value}', got '{model_value}'"
                        )
                    
        elif key in ["TotalCartItems"]:
            # Should be a number
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )
                
        elif key in ["HighestRatedCookieSKURating", "CheapestChocolatePriceReviews", "Page2ThirdProductSKUPrice"]:
            # Colon-separated fields (sku:rating, price:reviews, sku:price)
            if ":" in expected_value and ":" in model_value:
                expected_parts = expected_value.split(":", 1)
                model_parts = model_value.split(":", 1)
                if len(expected_parts) == 2 and len(model_parts) == 2:
                    # For price fields, normalize the price part
                    if key == "CheapestChocolatePriceReviews":
                        # Check if price part has correct format ($XX.XX)
                        if not model_parts[0].startswith("$"):
                            mismatches.append(
                                f"{key}: incorrect format - price part should start with '$', got '{model_value}'"
                            )
                        else:
                            expected_price = expected_parts[0].replace("$", "").replace(",", "")
                            model_price = model_parts[0].replace("$", "").replace(",", "")
                            try:
                                if abs(float(expected_price) - float(model_price)) > 0.01 or expected_parts[1] != model_parts[1]:
                                    mismatches.append(
                                        f"{key}: expected '{expected_value}', got '{model_value}'"
                                    )
                            except ValueError:
                                if expected_value != model_value:
                                    mismatches.append(
                                        f"{key}: expected '{expected_value}', got '{model_value}'"
                                    )
                    elif key == "Page2ThirdProductSKUPrice":
                        # Check if price part has correct format ($XX.XX)
                        if not model_parts[1].startswith("$"):
                            mismatches.append(
                                f"{key}: incorrect format - price part should start with '$', got '{model_value}'"
                            )
                        else:
                            expected_price = expected_parts[1].replace("$", "").replace(",", "")
                            model_price = model_parts[1].replace("$", "").replace(",", "")
                            try:
                                if expected_parts[0] != model_parts[0] or abs(float(expected_price) - float(model_price)) > 0.01:
                                    mismatches.append(
                                        f"{key}: expected '{expected_value}', got '{model_value}'"
                                    )
                            except ValueError:
                                if expected_value != model_value:
                                    mismatches.append(
                                        f"{key}: expected '{expected_value}', got '{model_value}'"
                                    )
                    else:
                        # For rating fields, exact match
                        if expected_value != model_value:
                            mismatches.append(
                                f"{key}: expected '{expected_value}', got '{model_value}'"
                            )
                else:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )
            else:
                if expected_value != model_value:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )
        else:
            # Exact match for other fields (like ProductManufacturer)
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

    if mismatches:
        print("\n=== Answer Comparison Mismatches ===", file=sys.stderr)
        for mismatch in mismatches:
            print(f"✗ {mismatch}", file=sys.stderr)
        return False

    print("\n=== Answer Comparison ===", file=sys.stderr)
    print("✓ All key information matches the expected answer", file=sys.stderr)
    return True


async def verify() -> bool:
    """
    Verifies that the holiday baking competition task has been completed correctly.
    Checks the model's answer against the expected label.
    """
    # Get the label file path
    label_path = Path(__file__).parent / "label.txt"

    # Load expected answer
    expected_answer = load_expected_answer(label_path)
    if not expected_answer:
        print("Error: Could not load expected answer from label.txt", file=sys.stderr)
        return False

    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    if model_response:
        print("Found model response, parsing answer format...", file=sys.stderr)
        model_answer = parse_answer_format(model_response)

        if model_answer:
            print("\n=== Model Answer Parsed ===", file=sys.stderr)
            for key, value in model_answer.items():
                print(f"{key}: {value}", file=sys.stderr)

            # Compare answers
            answer_match = compare_answers(model_answer, expected_answer)
            if not answer_match:
                print("\nModel answer does not match expected answer", file=sys.stderr)
                return False
            print("\n✓ Model answer matches expected answer", file=sys.stderr)
            return True
        else:
            print(
                "Warning: Could not parse answer format from model response",
                file=sys.stderr,
            )
            return False
    else:
        print("No model response found", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()
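
To exercise the verifier locally, a small harness can point MCP_MESSAGES at a hand-made messages file. This is a sketch under stated assumptions: the script above is saved as verify.py with a label.txt (the expected pipe-delimited lines) in the same directory, and every answer value below is a placeholder.

Python
# Hypothetical local harness for the verifier above. File names and all answer
# values are placeholders, not real task data.
import json
import os
import subprocess
import tempfile
from pathlib import Path

answer = """<answer>
SecondGingerbreadSKU|B0EXAMPLE01
HighestRatedCookieSKURating|B0EXAMPLE02:87%
CheapestChocolatePriceReviews|$3.49:12
CartSubtotalAfterUpdate|$58.95
TotalCartItems|8
Page2ThirdProductSKUPrice|B0EXAMPLE03:$14.99
ProductManufacturer|ExampleBrand
</answer>"""

# Minimal MCP_MESSAGES structure that get_model_response() looks for: the last
# completed assistant message containing an output_text item.
messages = [
    {
        "role": "assistant",
        "status": "completed",
        "type": "message",
        "content": [{"type": "output_text", "text": answer}],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    messages_path = Path(tmp) / "messages.json"
    messages_path.write_text(json.dumps(messages))

    env = dict(os.environ, MCP_MESSAGES=str(messages_path))
    # verify.py exits 0 when the parsed answer matches label.txt, 1 otherwise.
    result = subprocess.run(["python", "verify.py"], env=env)
    print("verified:", result.returncode == 0)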