Products Sales Analysis

PlaywrightShopping Admin

Generate comprehensive sales performance reports by extracting product metrics, analyzing revenue trends, identifying top performers, evaluating inventory turnover, and creating actionable insights.

Created by Fanqing Meng

2025-08-17

Data ExtractionComparative AnalysisContent Submission

Model Ranking

Click on the dots to view the trajectory of each task run

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
claude-opus-4-1	0 /1	-	-	115.0s	8.0	71,720	1,140	72,860
claude-opus-4-5-high	0 /4			77.2s	8.0	341,623	953	342,575
claude-sonnet-4	0 /4			138.7s	8.5	291,568	1,153	292,721
claude-sonnet-4-5	0 /4			81.7s	9.0	369,703	1,205	370,908
claude-sonnet-4-high	0 /4			677.2s	21.8	8,778,912	3,396	8,782,308
claude-sonnet-4-low	0 /4			553.0s	22.5	9,855,795	3,406	9,859,201
deepseek-chat	0 /4			175.9s	13.3	152,628	1,218	153,846
deepseek-v3-1-terminus	0 /4			94.4s	5.5	48,907	412	49,319
deepseek-v3-1-terminus-thinking	0 /4			401.2s	7.0	64,398	7,698	72,096
deepseek-v3-2-chat	0 /4			132.7s	7.5	90,871	837	91,707
deepseek-v3-2-thinking	0 /4			73.9s	7.5	59,851	1,208	61,059
gemini-2-5-flash	0 /4			284.5s	41.8	4,713,866	8,108	4,721,974
gemini-2-5-pro	0 /4			258.6s	8.5	2,247,785	2,030	2,249,815
gemini-3-pro-high	0 /4			415.7s	21.5	9,503,916	10,624	9,514,540
gemini-3-pro-low	0 /4			388.8s	28.3	14,616,068	9,820	14,625,888
glm-4-5	0 /4			55.1s	7.3	60,797	973	61,771
gpt-4-1	0 /4			71.3s	11.0	1,568,380	492	1,568,872
gpt-4-1-mini	0 /4			395.5s	37.8	15,972,048	3,769	15,975,817
gpt-4-1-nano	0 /4			30.2s	10.5	107,344	397	107,742
gpt-5-2-high	0 /4			274.6s	10.0	608,276	8,119	616,395
gpt-5-high	0 /4			315.9s	9.0	515,308	8,368	523,675
gpt-5-low	0 /4			134.3s	9.0	367,260	3,804	371,064
gpt-5-medium	0 /4			233.7s	9.0	538,320	6,369	544,690
gpt-5-mini-high	0 /4			253.5s	11.0	685,056	15,507	700,563
gpt-5-mini-low	0 /4			62.6s	7.8	558,119	1,123	559,242
gpt-5-mini-medium	0 /4			129.6s	10.0	614,307	2,530	616,837
gpt-5-nano-high	0 /4			201.3s	17.3	529,131	29,396	558,527
gpt-5-nano-low	0 /4			113.5s	13.3	334,014	14,806	348,820
gpt-5-nano-medium	0 /4			211.4s	22.3	907,963	18,958	926,921
gpt-oss-120b	0 /4			27.6s	5.8	55,678	585	56,262
grok-4	0 /4			144.6s	5.8	143,237	4,315	147,552
grok-4-fast	0 /4			120.8s	12.0	2,838,644	5,020	2,843,664
grok-code-fast-1	0 /4			62.8s	8.0	233,392	2,894	236,286
kimi-k2-0711	0 /4			130.2s	8.5	72,917	493	73,410
kimi-k2-0905	0 /4			363.5s	10.0	717,656	657	718,313
o3	0 /4			104.1s	9.0	293,430	1,997	295,427
o4-mini	0 /4			455.4s	9.0	415,024	8,132	423,157
qwen-3-coder-plus	0 /4			1533.8s	18.5	7,303,668	2,123	7,305,791
qwen-3-max	0 /4			198.0s	8.8	236,301	292	236,593

Task State

WebArena

view WebArena environment setup for this task

Instruction

Perform a comprehensive products and sales analysis in the Magento Admin panel to identify inventory status and sales performance metrics.

Task Requirements:

if need to login, login with username 'admin' and password 'admin1234'
Analyze product inventory and catalog details, perform the following:
- Search for all products containing 'Yoga' in their name - count the exact number of results
- Clear the search and find the product with SKU 'WH11' - record its exact price
- Apply a filter to show only products with Quantity = 0.0000 - count how many products match
To identify top-selling products and revenue metrics, navigate to the Dashboard and from the Bestsellers table:
- Identify the product with lowest price and lowest quantity - record the product name and quantity sold
- Find the second cheapest product in the table - record its exact quantity sold
- Note the total Revenue amount displayed in the dashboard
Father all customers' information and demographics:
- Find customer 'Sarah Miller' - record her exact email address
- Count the total number of customers shown in the grid
Review order status and customer purchase history, go to orders of sales:
- Count the total number of orders with 'Pending' status
- Find the order ID of Grace Nguyen's order with the completed status and the most expensive price (starting with "000")
To provide a comprehensive report of all gathered data, compile all your findings and output them in the following exact format:

Plaintext


YogaProducts|count
WH11Price|price
ZeroQuantityProducts|count
LowestProduct|name:quantity
QuestLumaflexQuantity|quantity
DashboardRevenue|amount
SarahMillerEmail|email
TotalCustomers|count
PendingOrders|count
GraceNguyenOrderID|orderid

Example Output:

Plaintext


YogaProducts|XX
WH11Price|$XX.XX
ZeroQuantityProducts|XX
LowestProduct|Product Name Here:XX
QuestLumaflexQuantity|XX
DashboardRevenue|$XX.XX
SarahMillerEmail|email@example.com
TotalCustomers|XX
PendingOrders|X
GraceNguyenOrderID|00000XXXX

Verify

Python

import asyncio
import sys
import re
import os
import json
from pathlib import Path


def get_model_response():
    """
    Get the model's response from the MCP_MESSAGES environment variable.
    Returns the last assistant message text.
    """
    messages_path = os.getenv("MCP_MESSAGES")
    print(f"MCP_MESSAGES: {messages_path}")
    if not messages_path:
        print("Warning: MCP_MESSAGES environment variable not set", file=sys.stderr)
        return None

    try:
        with open(messages_path, "r") as f:
            messages = json.load(f)

        # Find the last assistant message
        for message in reversed(messages):
            if (
                message.get("role") == "assistant"
                and message.get("status") == "completed"
            ):
                content = message.get("content", [])
                for item in content:
                    if item.get("type") == "output_text":
                        return item.get("text", "")

        print("Warning: No assistant response found in messages", file=sys.stderr)
        return None
    except Exception as e:
        print(f"Error reading messages file: {str(e)}", file=sys.stderr)
        return None


def parse_answer_format(text):
    """
    Parse the ... format from the agent's output.
    Returns a dictionary with the parsed values.
    """
    if not text:
        print("Error: No text provided to parse", file=sys.stderr)
        return None

    # Look for ... pattern
    match = re.search(r"(.*?)", text, re.IGNORECASE | re.DOTALL)
    if not match:
        print("Error: No ... tags found in response", file=sys.stderr)
        return None

    answer_content = match.group(1).strip()
    if not answer_content:
        print("Error: Empty answer content", file=sys.stderr)
        return None

    # Parse each line
    result = {}
    lines = [line.strip() for line in answer_content.split("\n") if line.strip()]

    if len(lines) != 10:
        print(f"Error: Expected 10 lines in answer, got {len(lines)}", file=sys.stderr)
        print(f"Lines found: {lines}", file=sys.stderr)
        return None

    # Expected keys for validation
    expected_keys = [
        "YogaProducts", "WH11Price", "ZeroQuantityProducts", "LowestProduct",
        "QuestLumaflexQuantity", "DashboardRevenue", "SarahMillerEmail",
        "TotalCustomers", "PendingOrders", "GraceNguyenOrderID"
    ]

    for line in lines:
        if "|" not in line:
            print(f"Error: Line missing '|' separator: {line}", file=sys.stderr)
            return None
        
        parts = line.split("|", 1)
        if len(parts) != 2:
            print(f"Error: Invalid line format: {line}", file=sys.stderr)
            return None
            
        key, value = parts[0].strip(), parts[1].strip()
        
        if not key or not value:
            print(f"Error: Empty key or value in line: {line}", file=sys.stderr)
            return None
            
        result[key] = value

    # Validate all expected keys are present
    missing_keys = set(expected_keys) - set(result.keys())
    if missing_keys:
        print(f"Error: Missing required keys: {missing_keys}", file=sys.stderr)
        return None

    return result


def load_expected_answer(label_path):
    """
    Load the expected answer from label.txt file.
    Returns a dictionary with the expected values.
    """
    try:
        with open(label_path, "r") as f:
            lines = f.read().strip().split("\n")

        expected = {}
        for line in lines:
            if "|" in line:
                key, value = line.split("|", 1)
                expected[key.strip()] = value.strip()

        return expected
    except Exception as e:
        print(f"Error reading label file: {str(e)}", file=sys.stderr)
        return None


def compare_answers(model_answer, expected_answer):
    """
    Compare the model's answer with the expected answer.
    Returns True if all key information matches, False otherwise.
    """
    if not model_answer or not expected_answer:
        return False

    # Check each expected key
    mismatches = []
    for key, expected_value in expected_answer.items():
        model_value = model_answer.get(key, "")

        # Special handling for different types of values
        if key == "LowestProduct":
            # Check if product name and quantity match (format: "Product Name:quantity")
            if ":" in expected_value and ":" in model_value:
                expected_name, expected_qty = expected_value.rsplit(":", 1)
                model_name, model_qty = model_value.rsplit(":", 1)
                if expected_name != model_name or expected_qty != model_qty:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )
            else:
                if expected_value != model_value:
                    mismatches.append(
                        f"{key}: expected '{expected_value}', got '{model_value}'"
                    )

        elif key in ["WH11Price", "DashboardRevenue"]:
            # For price/amount fields, normalize format
            expected_clean = expected_value.replace("$", "").replace(",", "")
            model_clean = model_value.replace("$", "").replace(",", "")
            if expected_clean != model_clean:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

        elif key == "SarahMillerEmail":
            # Email should match exactly
            if model_value.lower() != expected_value.lower():
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

        else:
            # Exact match for other fields
            if model_value != expected_value:
                mismatches.append(
                    f"{key}: expected '{expected_value}', got '{model_value}'"
                )

    if mismatches:
        print("\n=== Answer Comparison Mismatches ===", file=sys.stderr)
        for mismatch in mismatches:
            print(f"✗ {mismatch}", file=sys.stderr)
        return False

    print("\n=== Answer Comparison ===", file=sys.stderr)
    print("✓ All key information matches the expected answer", file=sys.stderr)
    return True


async def verify() -> bool:
    """
    Verifies that the products and sales analysis task has been completed correctly.
    First checks the model's answer against the expected label,
    then optionally verifies the actual state in the Magento Admin.
    """
    # Get the label file path
    label_path = Path(__file__).parent / "label.txt"

    # Load expected answer
    expected_answer = load_expected_answer(label_path)
    if not expected_answer:
        print("Error: Could not load expected answer from label.txt", file=sys.stderr)
        return False

    # Get model's response from MCP_MESSAGES
    model_response = get_model_response()
    if model_response:
        print("Found model response, parsing answer format...", file=sys.stderr)
        model_answer = parse_answer_format(model_response)

        if model_answer:
            print("\n=== Model Answer Parsed ===", file=sys.stderr)
            for key, value in model_answer.items():
                print(f"{key}: {value}", file=sys.stderr)

            # Compare answers
            answer_match = compare_answers(model_answer, expected_answer)
            if not answer_match:
                print("\nModel answer does not match expected answer", file=sys.stderr)
                return False
            print("\n✓ Model answer matches expected answer", file=sys.stderr)
            return True
        else:
            print(
                "Warning: Could not parse answer format from model response",
                file=sys.stderr,
            )
            return False
    else:
        print("No model response found", file=sys.stderr)
        return False


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()