LLM Research Summary

L3
Model Context Protocol · Playwright · Reddit

Aggregate and analyze LLM research discussions across multiple forums, collect trending topics, compile technical insights, and create a comprehensive summary post with community engagement.

Created by Fanqing Meng
2025-08-12
Data Extraction · Search Aggregation · Content Submission · User Interaction

Model Ranking

| Provider | Model | Run Results | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|
| MoonshotAI | kimi-k2-0905 | 3/4 | 306.9s | 19.0 | 1,016,705 | 3,225 | 1,019,930 |
| Claude | claude-sonnet-4 | 2/4 | 324.8s | 20.0 | 1,066,641 | 4,538 | 1,071,180 |
| Claude | claude-sonnet-4-high | 2/4 | 182.2s | 17.3 | 809,609 | 3,997 | 813,606 |
| Claude | claude-sonnet-4-low | 1/4 | 160.5s | 17.0 | 795,998 | 4,071 | 800,069 |
| Qwen | qwen-3-coder-plus | 1/4 | 394.1s | 22.3 | 2,070,142 | 3,663 | 2,073,805 |
| Claude | claude-opus-4-1 | 0/1 | 403.9s | 20.0 | 984,735 | 3,636 | 988,371 |
| DeepSeek | deepseek-chat | 0/4 | 312.5s | 19.0 | 857,672 | 2,588 | 860,260 |
| Gemini | gemini-2-5-flash | 0/4 | 165.7s | 23.5 | 3,102,177 | 10,433 | 3,112,610 |
| Gemini | gemini-2-5-pro | 0/4 | 350.8s | 21.8 | 1,336,379 | 9,754 | 1,346,133 |
| Z.ai | glm-4-5 | 0/4 | 178.0s | 15.0 | 541,335 | 4,089 | 545,424 |
| OpenAI | gpt-4-1 | 0/4 | 79.7s | 12.8 | 455,783 | 1,363 | 457,147 |
| OpenAI | gpt-4-1-mini | 0/4 | 316.5s | 28.8 | 3,059,008 | 3,334 | 3,062,341 |
| OpenAI | gpt-4-1-nano | 0/4 | 152.7s | 15.0 | 94,271 | 525 | 94,796 |
| OpenAI | gpt-5-high | 0/4 | 902.5s | 13.0 | 349,845 | 28,060 | 377,905 |
| OpenAI | gpt-5-low | 0/4 | 524.1s | 17.3 | 592,551 | 22,608 | 615,160 |
| OpenAI | gpt-5-medium | 0/4 | 463.2s | 16.5 | 648,333 | 23,031 | 671,363 |
| OpenAI | gpt-5-mini-high | 0/4 | 381.3s | 24.3 | 2,035,513 | 26,227 | 2,061,740 |
| OpenAI | gpt-5-mini-low | 0/4 | 101.3s | 15.3 | 616,035 | 4,454 | 620,489 |
| OpenAI | gpt-5-mini-medium | 0/4 | 175.5s | 16.3 | 778,758 | 12,478 | 791,236 |
| OpenAI | gpt-5-nano-high | 0/4 | 442.4s | 26.3 | 773,850 | 31,239 | 805,089 |
| OpenAI | gpt-5-nano-low | 0/4 | 366.8s | 13.3 | 94,059 | 13,661 | 107,720 |
| OpenAI | gpt-5-nano-medium | 0/4 | 309.3s | 20.3 | 351,957 | 19,224 | 371,181 |
| OpenAI | gpt-oss-120b | 0/4 | 21.7s | 5.5 | 32,913 | 1,216 | 34,129 |
| Grok | grok-4 | 0/4 | 150.9s | 13.3 | 447,141 | 5,068 | 452,210 |
| Grok | grok-code-fast-1 | 0/4 | 60.9s | 14.0 | 490,528 | 4,922 | 495,450 |
| MoonshotAI | kimi-k2-0711 | 0/4 | 235.5s | 20.5 | 859,018 | 2,252 | 861,270 |
| OpenAI | o3 | 0/4 | 202.1s | 15.8 | 478,381 | 8,293 | 486,674 |
| OpenAI | o4-mini | 0/4 | 382.3s | 12.8 | 356,164 | 19,936 | 376,100 |
| Qwen | qwen-3-max | 0/4 | 97.5s | 14.0 | 493,530 | 1,173 | 494,703 |

Task State

WebArena (see the WebArena environment setup for this task)

Instruction

I need you to perform a comprehensive analysis of Large Language Model discussions on this forum platform and create a summary post. This analysis will help me understand the community's engagement with LLM topics.

Task Requirements:

  1. Register a new account with username 'llm_analyst_2024' and password 'LLMExpert2024!'

  2. After successful registration, navigate to the MachineLearning forum and analyze posts about Large Language Models:

    • Count all posts on the first page that contain 'GPT', 'ChatGPT', or 'LLM' in their titles (a Playwright sketch of this counting step appears after the template below)
  3. Identify the top 3 posts with the highest upvote counts from your LLM-related posts list and record their complete details

  4. Navigate to the deeplearning forum and go to page 2:

    • Find the post with the most comments on page 2
    • Record its exact title and comment count
  5. Create a new submission in the MachineLearning forum with:

    • Title: "LLM Research Summary: GPT Discussions Analysis [2024]"
    • Body text must contain EXACTLY these lines and nothing else (keep the keys as-is, replace only the values after each pipe, and follow the markdown format):
Plaintext
- Total_LLM_Posts|FILL_VALUE
- Top1_Title|FILL_VALUE
- Top1_Upvotes|FILL_VALUE
- Top1_Date|FILL_VALUE
- Top2_Title|FILL_VALUE
- Top2_Upvotes|FILL_VALUE
- Top2_Date|FILL_VALUE
- Top3_Title|FILL_VALUE
- Top3_Upvotes|FILL_VALUE
- Top3_Date|FILL_VALUE
- Deeplearning_MostDiscussed|FILL_VALUE
- Deeplearning_Comments|FILL_VALUE

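For step 2 above, a minimal Playwright sketch of the title counting is shown below. It reuses the WEBARENA_BASE_URL convention from the verification script; the '.submission__title a' selector is an assumption about the forum's markup and may need adjusting, so treat this as an illustration rather than a reference implementation.

Python
import asyncio
import os

from playwright.async_api import async_playwright

BASE_URL = os.getenv("WEBARENA_BASE_URL", "http://localhost:9999").rstrip("/")
KEYWORDS = ("GPT", "ChatGPT", "LLM")  # step-2 title keywords


async def count_llm_posts() -> int:
    """Count first-page MachineLearning posts whose titles contain a keyword."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(f"{BASE_URL}/f/MachineLearning", wait_until="networkidle")
        # Hypothetical selector for post title links; adjust to the real markup.
        titles = await page.locator(".submission__title a").all_inner_texts()
        await browser.close()
        return sum(1 for title in titles if any(k in title for k in KEYWORDS))


if __name__ == "__main__":
    print(asyncio.run(count_llm_posts()))

A count of 0 from this sketch more likely indicates a wrong selector assumption than an absence of matching posts.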

Verify

*.py
Python
import asyncio
import sys
import re
import os
from pathlib import Path
from playwright.async_api import (
    async_playwright,
    TimeoutError as PlaywrightTimeoutError,
)

# Read the base URL from the environment variable, defaulting to the local instance
BASE_URL = os.getenv("WEBARENA_BASE_URL", "http://localhost:9999").rstrip("/")


def parse_key_value_format(text):
    """
    Parse the Key|Value format from the submission body.
    Handles both pipe (|) and colon (:) separators for compatibility.
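
    Example (illustrative): the body line "- Total_LLM_Posts|7"
    yields {"Total_LLM_Posts": "7"}.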
    """
    data = {}
    
    # Try to parse with pipe separator first (expected format)
    lines = text.strip().split('\n')
    for line in lines:
        line = line.strip()
        if not line:
            continue
        
        # Remove markdown list prefix if present
        if line.startswith('- '):
            line = line[2:]
        elif line.startswith('* '):
            line = line[2:]
        
        # Try pipe separator first
        if '|' in line:
            parts = line.split('|', 1)
            if len(parts) == 2:
                key = parts[0].strip()
                value = parts[1].strip()
                data[key] = value
        # Fallback to colon separator for label.txt compatibility
        elif ':' in line:
            parts = line.split(':', 1)
            if len(parts) == 2:
                key = parts[0].strip()
                value = parts[1].strip()
                data[key] = value
    
    return data


def normalize_text(text):
    """
    Normalize text for comparison by handling different quote styles and whitespace.
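
    Example (illustrative): normalize_text("  Hello   world ") returns "Hello world".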
    """
    if not isinstance(text, str):
        return str(text)

    # Replace various quote styles with standard quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')

    # Normalize whitespace
    text = " ".join(text.split())

    return text.strip()


async def verify() -> bool:
    """
    Verifies that the LLM analysis task has been completed correctly.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        try:
            # Navigate to the main page
            print("Navigating to forum...", file=sys.stderr)
            await page.goto(f"{BASE_URL}/", wait_until="networkidle")

            # Check if logged in as llm_analyst_2024
            user_button = page.locator('button:has-text("llm_analyst_2024")')
            if not await user_button.count():
                # Try to login
                print("Not logged in, attempting to login...", file=sys.stderr)

                await page.click('a:has-text("Log in")')
                await page.wait_for_load_state("networkidle")

                await page.fill('input[name="_username"]', "llm_analyst_2024")
                await page.fill('input[name="_password"]', "LLMExpert2024!")

                await page.click('button:has-text("Log in")')
                await page.wait_for_load_state("networkidle")

                user_button = page.locator('button:has-text("llm_analyst_2024")')
                if not await user_button.count():
                    print("Error: Login failed for llm_analyst_2024", file=sys.stderr)
                    return False

                print("Successfully logged in as llm_analyst_2024", file=sys.stderr)
            else:
                print("Already logged in as llm_analyst_2024", file=sys.stderr)

            # Navigate to MachineLearning forum
            print("Navigating to MachineLearning forum...", file=sys.stderr)
            await page.goto(
                f"{BASE_URL}/f/MachineLearning", wait_until="networkidle"
            )

            # Look for the submission with our specific title
            print(
                "Looking for submission 'LLM Research Summary: GPT Discussions Analysis [2024]'...",
                file=sys.stderr,
            )
            post_link = page.locator(
                'a:has-text("LLM Research Summary: GPT Discussions Analysis [2024]")'
            )

            if not await post_link.count():
                print(
                    "Error: Could not find submission with required title",
                    file=sys.stderr,
                )
                return False

            # Click on the submission to view its content
            await post_link.first.click()
            await page.wait_for_load_state("networkidle")

            # Extract the submission body content
            # Try multiple possible selectors for the post body
            post_content = None
            selectors = [
                ".submission__body",
                ".post-body",
                ".RichText",
                '[class*="RichText"]',
                'div:has(> p:has-text("Total_LLM_Posts"))',
                'div:has-text("Total_LLM_Posts"):has-text("Deeplearning_Comments")',
            ]

            for selector in selectors:
                content_element = page.locator(selector)
                if await content_element.count():
                    post_content = await content_element.first.inner_text()
                    if "Total_LLM_Posts" in post_content:
                        print(
                            f"Found submission content using selector: {selector}",
                            file=sys.stderr,
                        )
                        break

            if not post_content or "Total_LLM_Posts" not in post_content:
                print(
                    "Error: Could not find submission body with required format",
                    file=sys.stderr,
                )
                return False

            print("Submission content found, parsing data...", file=sys.stderr)
            print(f"Raw content: {post_content[:200]}...", file=sys.stderr)

            # Parse the Key|Value format
            extracted_data = parse_key_value_format(post_content)
            print(f"Extracted data: {extracted_data}", file=sys.stderr)

            # Load expected values from label.txt
            label_path = Path(__file__).parent / "label.txt"
            if label_path.exists():
                with open(label_path, "r") as f:
                    expected_text = f.read().strip()
                expected_data = parse_key_value_format(expected_text)
                print("Loaded expected values from label.txt", file=sys.stderr)

            # Verify all required keys are present
            required_keys = [
                "Total_LLM_Posts",
                "Top1_Title",
                "Top1_Upvotes",
                "Top1_Date",
                "Top2_Title",
                "Top2_Upvotes",
                "Top2_Date",
                "Top3_Title",
                "Top3_Upvotes",
                "Top3_Date",
                "Deeplearning_MostDiscussed",
                "Deeplearning_Comments",
            ]

            missing_keys = []
            for key in required_keys:
                if key not in extracted_data:
                    missing_keys.append(key)

            if missing_keys:
                print(
                    f"Error: Missing required keys: {', '.join(missing_keys)}",
                    file=sys.stderr,
                )
                return False

            # Validate data format and content
            errors = []

            # Check Total_LLM_Posts is a number and matches expected
            try:
                total_posts = int(extracted_data["Total_LLM_Posts"])
                if "expected_data" in locals() and "Total_LLM_Posts" in expected_data:
                    expected_total = int(expected_data["Total_LLM_Posts"])
                    if total_posts != expected_total:
                        errors.append(
                            f"Total_LLM_Posts mismatch: got {total_posts}, expected {expected_total}"
                        )
                elif total_posts < 5:  # Based on exploration, should be at least 5
                    errors.append(f"Total_LLM_Posts seems too low: {total_posts}")
            except ValueError:
                errors.append(
                    f"Total_LLM_Posts must be a number, got: {extracted_data['Total_LLM_Posts']}"
                )

            # If we have expected data, compare against it
            if "expected_data" in locals():
                # Compare each field
                for key in required_keys:
                    if key in expected_data and key in extracted_data:
                        expected_val = normalize_text(expected_data[key])
                        actual_val = normalize_text(extracted_data[key])

                        # For numeric fields, compare as integers
                        if (
                            "Upvotes" in key
                            or "Comments" in key
                            or key == "Total_LLM_Posts"
                        ):
                            try:
                                expected_int = int(expected_val)
                                actual_int = int(actual_val)
                                if expected_int != actual_int:
                                    errors.append(
                                        f"{key} mismatch: got {actual_int}, expected {expected_int}"
                                    )
                            except ValueError:
                                errors.append(
                                    f"{key} should be numeric: got '{actual_val}'"
                                )
                        else:
                            # For text fields, compare normalized text
                            if expected_val != actual_val:
                                errors.append(
                                    f"{key} mismatch: got '{actual_val}', expected '{expected_val}'"
                                )

            else:
                # If no expected data, just do basic validation
                for key in required_keys:
                    if key not in extracted_data:
                        errors.append(f"Missing required key: {key}")
                    elif (
                        not extracted_data[key] or extracted_data[key] == "[FILL_VALUE]"
                    ):
                        errors.append(f"{key} was not filled in")

            # Verify upvotes are in descending order for top 3
            try:
                top1_votes = int(extracted_data["Top1_Upvotes"])
                top2_votes = int(extracted_data["Top2_Upvotes"])
                top3_votes = int(extracted_data["Top3_Upvotes"])

                if not (top1_votes >= top2_votes >= top3_votes):
                    errors.append(
                        f"Top posts should be ordered by upvotes: {top1_votes} >= {top2_votes} >= {top3_votes}"
                    )
            except (ValueError, KeyError):
                pass  # Already reported above

            if errors:
                print(
                    "Error: Validation failed with the following issues:",
                    file=sys.stderr,
                )
                for error in errors:
                    print(f"  - {error}", file=sys.stderr)
                return False

            # All checks passed
            print("Success: LLM analysis task completed successfully.")
            print("- Account llm_analyst_2024 verified")
            print(
                "- Submission 'LLM Research Summary: GPT Discussions Analysis [2024]' found"
            )
            print(
                f"- Total LLM-related posts analyzed: {extracted_data['Total_LLM_Posts']}"
            )
            print("- Top 3 posts by upvotes identified and documented")
            print(
                f"- Deeplearning forum page 2 most discussed post: {extracted_data['Deeplearning_MostDiscussed']}"
            )
            print("- All data in correct Key: Value format with 12 lines")
            return True

        except PlaywrightTimeoutError as e:
            print(f"Error: Timeout occurred - {str(e)}", file=sys.stderr)
            return False
        except Exception as e:
            print(f"Error: Unexpected error - {str(e)}", file=sys.stderr)
            return False
        finally:
            await browser.close()


def main():
    """
    Executes the verification process and exits with a status code.
    """
    result = asyncio.run(verify())
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()
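
The verifier reads the target instance from the WEBARENA_BASE_URL environment variable (defaulting to http://localhost:9999) and exits with status 0 when every check passes and 1 otherwise, so it can be wired directly into a grading harness or CI step.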