Health Routine Optimization

PlaywrightShopping

Optimize health and wellness product selections by analyzing nutritional supplements, fitness equipment, creating personalized routines, and tracking health metrics for lifestyle improvements.

Created by Yaoqi Ye

2025-08-17

Data ExtractionComparative AnalysisContent Submission

Model Ranking

Click on the dots to view the trajectory of each task run

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
claude-opus-4-5-high	4 /4			174.8s	19.5	1,367,685	3,304	1,370,989
claude-sonnet-4-5	4 /4			229.8s	23.5	1,956,262	3,973	1,960,235
gemini-3-pro-low	4 /4			179.3s	20.0	1,466,621	4,739	1,471,361
gpt-5-2-high	4 /4			319.4s	21.3	1,437,460	8,034	1,445,494
gpt-5-high	4 /4			777.9s	21.3	1,456,867	20,355	1,477,221
gpt-5-low	4 /4			372.9s	25.3	1,862,063	9,498	1,871,561
gpt-5-medium	4 /4			318.3s	21.3	1,405,667	11,085	1,416,752
grok-4-fast	4 /4			91.6s	17.8	1,139,252	3,883	1,143,135
gemini-2-5-pro	3 /4			186.7s	23.8	1,925,449	4,501	1,929,951
gemini-3-pro-high	3 /4			157.0s	19.5	1,418,477	4,842	1,423,319
grok-4	3 /4			249.9s	17.0	1,360,921	4,663	1,365,584
claude-sonnet-4	2 /4			340.1s	24.3	1,838,640	4,050	1,842,690
gpt-5-mini-high	2 /4			225.4s	33.3	2,793,779	9,102	2,802,881
grok-code-fast-1	2 /4			109.7s	24.3	1,812,979	4,656	1,817,635
claude-opus-4-1	1 /1	-	-	516.2s	23.0	1,719,919	3,654	1,723,573
gpt-5-mini-medium	1 /4			140.3s	24.3	1,523,695	4,137	1,527,832
claude-sonnet-4-high	0 /4			251.6s	21.0	1,728,302	4,142	1,732,444
claude-sonnet-4-low	0 /4			241.0s	21.3	1,806,300	4,228	1,810,528
deepseek-chat	0 /4			300.3s	22.8	1,479,410	990	1,480,400
deepseek-v3-1-terminus	0 /4			692.0s	20.8	1,372,163	1,675	1,373,838
deepseek-v3-1-terminus-thinking	0 /4			1394.9s	16.8	928,214	21,722	949,936
deepseek-v3-2-chat	0 /4			383.6s	24.0	1,659,543	3,367	1,662,910
deepseek-v3-2-thinking	0 /4			480.1s	46.3	3,907,997	6,259	3,914,257
gemini-2-5-flash	0 /4			369.0s	47.0	12,007,812	11,888	12,019,700
glm-4-5	0 /4			157.0s	15.5	813,494	2,402	815,896
gpt-4-1	0 /4			158.1s	24.8	2,200,114	1,029	2,201,143
gpt-4-1-mini	0 /4			252.0s	50.3	7,164,098	4,578	7,168,675
gpt-4-1-nano	0 /4			49.9s	12.3	469,410	522	469,931
gpt-5-mini-low	0 /4			92.0s	16.5	942,263	1,139	943,402
gpt-5-nano-high	0 /4			428.6s	32.8	2,260,798	65,185	2,325,983
gpt-5-nano-low	0 /4			99.9s	13.8	536,553	10,957	547,510
gpt-5-nano-medium	0 /4			144.2s	26.3	1,457,133	10,052	1,467,184
gpt-oss-120b	0 /4			30.4s	5.8	102,544	928	103,472
kimi-k2-0711	0 /4			287.4s	23.0	1,440,334	1,287	1,441,622
kimi-k2-0905	0 /4			298.1s	26.0	1,927,525	2,273	1,929,798
o3	0 /4			132.5s	14.8	587,685	2,011	589,696
o4-mini	0 /4			519.9s	17.0	799,134	20,550	819,684
qwen-3-coder-plus	0 /4			371.1s	26.3	2,281,215	2,689	2,283,903
qwen-3-max	0 /4			353.9s	26.0	2,245,401	1,016	2,246,416

Task State

WebArena

view WebArena environment setup for this task

Instruction

Task Requirements

Search for products with vitamin in Description and price range $0.00 to $99.99. Record total search results count.
In "Health & Household" category with price filter $0.00 - $99.99:
- Add "LOOPACELL AG13 LR44 L1154 357 76A A76 Button Cell Battery 10 Pack" to comparison
- Add "Energizer MAX C Batteries, Premium Alkaline C Cell Batteries (8 Battery Count)" to comparison
- Record each battery's price
- Verify comparison list has 2 items
Search Elmwood Inn Fine Teas, find "Elmwood Inn Fine Teas, Orange Vanilla Caffeine-free Fruit Infusion, 16-Ounce Pouch":
- Record SKU, rating percentage, and review count
- Add to cart with quantity 2
Search energy, sort by Relevance (descending):
- Find "V8 +Energy, Healthy Energy Drink, Steady Energy from Black and Green Tea, Pomegranate Blueberry, 8 Ounce Can ,Pack of 24"
- Record its position (1st, 2nd, 3rd, etc.)
- Add to cart with quantity 1
In cart:
- Record unique products count, total quantity, and subtotal
- Then update Elmwood tea quantity to 3
- Record new subtotal

Output Format

Plaintext


AdvancedSearchResults|XXXX
Battery1Name|LOOPACELL AG13 LR44
Battery1Price|$X.XX
Battery2Name|Energizer MAX C
Battery2Price|$XX.XX
ComparisonCount|X
TeaSKU|XXXXXXXXXX
TeaRating|XXX%
TeaReviews|X
V8Position|Xth
CartUniqueProducts|X
CartTotalQuantity|X
InitialSubtotal|$XX.XX
FinalSubtotal|$XX.XX

Verify

Python

import asyncio
import sys
import os
import json
import re
from pathlib import Path


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None

    try:
        with open(messages_path, "r") as f:
            messages = json.load(f)

        # Find the last assistant message
        for message in reversed(messages):
            if (
                message.get("role") == "assistant"
                and message.get("status") == "completed"
                and message.get("type") == "message"
            ):
                content = message.get("content", [])
                for item in content:
                    if item.get("type") == "output_text":
                        return item.get("text", "")

        print("Warning: No assistant response found in messages", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None

def parse_answer_format(text):
    """
    Parse the ... format from the agent's output.
    Returns a dictionary with the parsed values.
    """
    if not text:
        return None

    # Look for ... pattern
    match = re.search(r"(.*?)", text, re.IGNORECASE | re.DOTALL)
    if not match:
        return None

    answer_content = match.group(1).strip()

    # Parse each line
    result = {}
    lines = answer_content.split("\n")

    if len(lines) != 14:
        print(f"Error: Expected 14 lines in answer, got {len(lines)}", file=sys.stderr)
        return None

    for line in lines:
        if "|" in line:
            key, value = line.split("|", 1)
            result[key.strip()] = value.strip()

    return result

def load_expected_answer(label_path):
    """
    Load the expected answer from label.txt file.
    Returns a dictionary with the expected values.
    """
    try:
        with open(label_path, "r") as f:
            content = f.read().strip()

        # Parse the answer from the label file
        # The label file contains ... tags
        match = re.search(r"(.*?)", content, re.IGNORECASE | re.DOTALL)
        if match:
            answer_content = match.group(1).strip()
            lines = answer_content.split("\n")
        else:
            # Fallback: treat the whole file as answer content
            lines = content.split("\n")

        expected = {}
        for line in lines:
            if "|" in line:
                key, value = line.split("|", 1)
                expected[key.strip()] = value.strip()

        return expected
    except Exception as e:
        print(f"Error reading label file: {str(e)}", file=sys.stderr)
        return None

def compare_answers(model_answer, expected_answer):
    """
    Compare the model's answer with the expected answer.
    Returns True if all key information matches, False otherwise.
    """
    if not model_answer or not expected_answer:
        return False

    # Check each expected key
    mismatches = []
    for key, expected_value in expected_answer.items():
        model_value = model_answer.get(key, "")

        # Special handling for different types of values
        if key in ["Battery1Price", "Battery2Price", "InitialSubtotal", "FinalSubtotal"]:
            # For price fields, only support $XX.XX format
            # Check if model value has correct format
            if not model_value.startswith("$"):
                mismatches.append(
                    f"{key}: incorrect format - expected '$XX.XX' format, got '{model_value}'"
                )
            else:
                # Normalize and compare values
                expected_clean = expected_value.replace("$", "").replace(",", "")
                model_clean = model_value.replace("$", "").replace(",", "")
                if expected_clean != model_clean:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )

        else:
            # Exact match for other fields
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

    if mismatches:
        print("\n=== Answer Comparison Mismatches ===", file=sys.stderr)
        for mismatch in mismatches:
            print(f"✗ {mismatch}", file=sys.stderr)
        return False

    print("\n=== Answer Comparison ===", file=sys.stderr)
    print("✓ All key information matches the expected answer", file=sys.stderr)
    return True

async def verify() -> bool:
    """
    Verifies that the health routine optimization task has been completed correctly.
    Checks the model's answer against the expected label.
    """
    # Get the label file path
    label_path = Path(__file__).parent / "label.txt"

    # Load expected answer
    expected_answer = load_expected_answer(label_path)
    if not expected_answer:
        print("Error: Could not load expected answer from label.txt", file=sys.stderr)
        return False

    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    if model_response:
        print("Found model response, parsing answer format...", file=sys.stderr)
        model_answer = parse_answer_format(model_response)

        if model_answer:
            print("\n=== Model Answer Parsed ===", file=sys.stderr)
            for key, value in model_answer.items():
                print(f"{key}: {value}", file=sys.stderr)

            # Compare answers
            answer_match = compare_answers(model_answer, expected_answer)
            if not answer_match:
                print("\nModel answer does not match expected answer", file=sys.stderr)
                return False
            print("\n✓ Model answer matches expected answer", file=sys.stderr)
            return True
        else:
            print(
                "Warning: Could not parse answer format from model response",
                file=sys.stderr,
            )
            return False
    else:
        print("No model response found", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()