Config Parameter Audit

L3
Model Context Protocol · GitHub · EasyR1

Investigate configuration changes causing training instability by analyzing commits and identifying related memory issues.

Created by Xiangyan Liu
2025-08-15
Repository Analysis · Issue Management

Model Ranking

| Provider | Model | Pass@4 | Pass^4 | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini | gemini-3-pro-low | 2/4 | | 191.6s | 13.3 | 472,769 | 9,601 | 482,370 |
| Gemini | gemini-3-pro-high | 1/4 | | 175.7s | 11.8 | 426,164 | 9,584 | 435,749 |
| OpenAI | gpt-5-low | 1/4 | | 532.2s | 17.5 | 3,546,439 | 14,507 | 3,560,946 |
| OpenAI | gpt-5-medium | 1/4 | | 1304.5s | 30.3 | 1,650,140 | 50,042 | 1,700,182 |
| Claude | claude-opus-4-1 | 0/1 | -- | 577.8s | 18.0 | 3,295,742 | 2,949 | 3,298,691 |
| Claude | claude-opus-4-5-high | 0/4 | | 188.0s | 14.5 | 1,462,286 | 5,774 | 1,468,061 |
| Claude | claude-sonnet-4 | 0/4 | | 153.0s | 8.5 | 1,076,750 | 1,480 | 1,078,230 |
| Claude | claude-sonnet-4-5 | 0/4 | | 170.3s | 14.3 | 1,163,325 | 3,296 | 1,166,621 |
| Claude | claude-sonnet-4-high | 0/4 | | 190.8s | 19.3 | 1,354,087 | 3,704 | 1,357,791 |
| Claude | claude-sonnet-4-low | 0/4 | | 193.4s | 17.8 | 1,294,921 | 3,423 | 1,298,343 |
| DeepSeek | deepseek-chat | 0/4 | | 122.1s | 5.0 | 334,815 | 371 | 335,186 |
| DeepSeek | deepseek-v3-1-terminus | 0/4 | | 399.8s | 13.5 | 870,861 | 1,426 | 872,287 |
| DeepSeek | deepseek-v3-1-terminus-thinking | 0/4 | | 700.7s | 10.0 | 492,168 | 15,533 | 507,701 |
| DeepSeek | deepseek-v3-2-chat | 0/4 | | 275.2s | 22.3 | 1,194,371 | 4,703 | 1,199,075 |
| DeepSeek | deepseek-v3-2-thinking | 0/4 | | 463.5s | 29.5 | 1,418,336 | 10,265 | 1,428,601 |
| Gemini | gemini-2-5-flash | 0/4 | | 144.6s | 5.8 | 1,430,463 | 7,815 | 1,438,278 |
| Gemini | gemini-2-5-pro | 0/4 | | 54.0s | 1.8 | 11,348 | 5,022 | 16,371 |
| Z.ai | glm-4-5 | 0/4 | | 83.9s | 8.8 | 518,349 | 1,215 | 519,564 |
| OpenAI | gpt-4-1 | 0/4 | | 127.5s | 20.5 | 807,751 | 1,365 | 809,117 |
| OpenAI | gpt-4-1-mini | 0/4 | | 107.6s | 24.0 | 1,020,076 | 1,208 | 1,021,283 |
| OpenAI | gpt-4-1-nano | 0/4 | | 35.4s | 6.0 | 419,674 | 779 | 420,453 |
| OpenAI | gpt-5-high | 0/4 | | 2683.7s | 42.8 | 3,049,952 | 88,551 | 3,138,503 |
| OpenAI | gpt-5-mini-high | 0/4 | | 1394.0s | 50.3 | 4,151,043 | 109,925 | 4,260,968 |
| OpenAI | gpt-5-mini-low | 0/4 | | 98.8s | 8.3 | 1,210,856 | 2,342 | 1,213,197 |
| OpenAI | gpt-5-mini-medium | 0/4 | | 428.7s | 32.8 | 2,791,586 | 25,107 | 2,816,692 |
| OpenAI | gpt-5-nano-high | 0/4 | | 743.9s | 38.3 | 4,242,469 | 101,802 | 4,344,271 |
| OpenAI | gpt-5-nano-low | 0/4 | | 54.3s | 8.3 | 260,366 | 3,127 | 263,493 |
| OpenAI | gpt-5-nano-medium | 0/4 | | 337.9s | 24.8 | 2,023,591 | 47,788 | 2,071,379 |
| OpenAI | gpt-oss-120b | 0/4 | | 15.6s | 3.5 | 48,985 | 690 | 49,674 |
| Grok | grok-4 | 0/4 | | 309.3s | 10.5 | 1,292,106 | 2,200 | 1,298,657 |
| Grok | grok-4-fast | 0/4 | | 375.7s | 16.0 | 1,223,880 | 32,178 | 1,256,057 |
| Grok | grok-code-fast-1 | 0/4 | | 414.0s | 24.3 | 1,178,199 | 10,093 | 1,188,292 |
| MoonshotAI | kimi-k2-0711 | 0/4 | | 325.4s | 16.0 | 1,614,813 | 1,466 | 1,616,279 |
| MoonshotAI | kimi-k2-0905 | 0/4 | | 414.4s | 25.0 | 1,093,852 | 2,304 | 1,096,156 |
| OpenAI | o3 | 0/4 | | 171.6s | 11.0 | 1,053,736 | 3,837 | 1,057,572 |
| OpenAI | o4-mini | 0/4 | | 169.3s | 4.3 | 598,373 | 2,922 | 601,295 |
| Qwen | qwen-3-coder-plus | 0/4 | | 443.2s | 17.0 | 2,812,762 | 1,930 | 2,814,692 |
| Qwen | qwen-3-max | 0/4 | | 113.0s | 19.0 | 844,927 | 1,202 | 846,129 |

Task State


Instruction

I need you to perform a deep investigation into recent configuration changes in our EasyR1 repository that may be causing training instability issues.

Task Requirements

1. Deep Commit Analysis

Find the exact commit SHA where the micro_batch_size_per_device_for_update parameter was changed from 4 to 1 in the examples/config.yaml file. Use the GitHub API to do the following (a sketch of the relevant calls appears after this list):

  • Examine recent commits that modified examples/config.yaml
  • Get the specific commit diff showing this parameter change
  • Identify the commit author and timestamp
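
For illustration, here is a minimal sketch of the two REST calls involved, using requests directly rather than the MCP GitHub tools (which expose equivalent operations). The OWNER placeholder and the GITHUB_TOKEN environment variable are assumptions for the example, not part of the task setup:

Python
import os
import requests

# Assumed placeholders: substitute the actual org and token source.
API = "https://api.github.com/repos/OWNER/EasyR1"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.v3+json",
}

# List commits that touched examples/config.yaml (newest first).
commits = requests.get(
    f"{API}/commits",
    headers=HEADERS,
    params={"path": "examples/config.yaml", "per_page": 100},
    timeout=30,
).json()

for c in commits:
    # Fetch each commit individually: the detail view includes per-file
    # unified diffs in the "patch" field, plus author and timestamp.
    detail = requests.get(f"{API}/commits/{c['sha']}", headers=HEADERS, timeout=30).json()
    for f in detail.get("files", []):
        if f["filename"] == "examples/config.yaml" and (
            "micro_batch_size_per_device_for_update" in f.get("patch", "")
        ):
            print(
                detail["sha"],
                (detail.get("author") or {}).get("login"),
                detail["commit"]["author"]["date"],
            )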

2. Related Parameter Investigation

In the same commit you found above, identify what value the micro_batch_size_per_device_for_experience parameter was changed to. Document the following (a diff-parsing sketch appears after this list):

  • The before value for this parameter
  • The after value for this parameter
  • The specific line numbers in the diff where these changes occurred
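
The line numbers requested here come from the diff itself. As an illustration only, the helper below shows one way to walk the "patch" field returned by the commits API and recover the before/after values plus the new-file line number of the added line. It is a simplified sketch that assumes the parameter appears at most once per patch:

Python
import re

def find_param_change(patch: str, param: str):
    """Scan a unified diff (the "patch" field from the commits API) for a
    '- param: old' / '+ param: new' pair and report the new-file line
    number of the '+' line. Values are returned as strings."""
    before = after = line_no = None
    new_line = 0
    for line in patch.splitlines():
        m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
        if m:
            # Hunk header: reset the new-file line counter.
            new_line = int(m.group(1)) - 1
            continue
        if line.startswith("-"):
            if param in line:
                before = line.split(":")[-1].strip()
            continue  # removed lines do not advance new-file numbering
        new_line += 1  # context and added lines both advance it
        if line.startswith("+") and param in line:
            after = line.split(":")[-1].strip()
            line_no = new_line
    return before, after, line_no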

3. Issue Search and Verification

Search through all GitHub issues (both open and closed) to find issues that contain specific keywords. Identify all issue numbers where the issue title or body text contains any of these exact terms:

  • "OOM" (case insensitive)
  • "memory" (case insensitive)
  • "batch" (case insensitive)
  • "显存" (GPU memory in Chinese)

You must find and list ALL issues that contain any of these keywords in their titles or bodies, regardless of whether you think they're related to the parameter changes.
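
A note on approach: the GitHub Search API (GET /search/issues) handles the English keywords, but its tokenized matching is unreliable for substrings and for CJK terms such as 显存, so the safer route is to paginate the plain issues listing and filter locally. A sketch, assuming the same OWNER placeholder and headers as above:

Python
import requests

KEYWORDS = ("oom", "memory", "batch", "显存")

def issues_with_keywords(api: str, headers: dict) -> set:
    """Paginate all issues (open and closed) and filter locally.
    Note: the /issues listing also returns pull requests; anything
    matching a keyword is counted here, uniformly with true issues."""
    found, page = set(), 1
    while True:
        batch = requests.get(
            f"{api}/issues",
            headers=headers,
            params={"state": "all", "per_page": 100, "page": page},
            timeout=30,
        ).json()
        if not batch:
            break
        for issue in batch:
            text = (issue["title"] + " " + (issue.get("body") or "")).lower()
            if any(k in text for k in KEYWORDS):
                found.add(issue["number"])
        if len(batch) < 100:  # short page: no more results
            break
        page += 1
    return found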

4. File Creation and Results

Create a file named exactly ANALYSIS_RESULTS.json in the repository root with this exact structure:

JSON
{
  "target_commit_sha": "full-40-character-commit-sha",
  "commit_author": "author-username", 
  "commit_date": "YYYY-MM-DD",
  "parameter_changes": {
    "micro_batch_size_per_device_for_update": {
      "before": 4,
      "after": 1,
      "line_number": 123
    },
    "micro_batch_size_per_device_for_experience": {
      "before": 16,
      "after": 2, 
      "line_number": 124
    }
  },
  "related_issue_number_list": [9, 46]
}
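
To write the file back to the repository, one option is the REST contents endpoint (the MCP GitHub file-creation tool is equivalent). A minimal sketch, again assuming the OWNER placeholder and a GITHUB_TOKEN variable; every placeholder value must be replaced with the verified findings:

Python
import base64
import json
import os
import requests

# Skeleton only: fill in values verified against the actual commit and issues.
results = {
    "target_commit_sha": "<full-40-character-sha>",
    "commit_author": "<author-username>",
    "commit_date": "<YYYY-MM-DD>",
    "parameter_changes": {
        "micro_batch_size_per_device_for_update": {
            "before": 4, "after": 1, "line_number": 0,
        },
        "micro_batch_size_per_device_for_experience": {
            "before": 16, "after": 2, "line_number": 0,
        },
    },
    "related_issue_number_list": [],
}

# PUT /repos/{owner}/{repo}/contents/{path} creates the file on the
# default branch; the body must be base64-encoded.
resp = requests.put(
    "https://api.github.com/repos/OWNER/EasyR1/contents/ANALYSIS_RESULTS.json",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    json={
        "message": "Add ANALYSIS_RESULTS.json with config audit findings",
        "content": base64.b64encode(
            json.dumps(results, indent=2).encode()
        ).decode(),
    },
    timeout=30,
)
resp.raise_for_status()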

5. Verification Requirements

  • The commit SHA must be exactly 40 hexadecimal characters
  • The parameter values must match the actual repository changes
  • The issue numbers must reference real issues in the repository
  • All data must be obtained through GitHub API analysis, not guesswork


Verify

Python
import base64
import json
import os
import re
import sys
from typing import Dict, Optional, Tuple

import requests
from dotenv import load_dotenv

load_dotenv(".mcp_env")

# Keywords that flag memory/batch-related issues, matched against
# lowercased issue text (case-insensitive).
REQUIRED_KEYWORDS = ["oom", "memory", "batch", "显存"]


def _get_github_api(
    endpoint: str, headers: Dict[str, str]
) -> Tuple[bool, Optional[Dict]]:
    """Make a GET request to GitHub API and return (success, response)."""
    github_org = os.environ.get("GITHUB_EVAL_ORG")
    url = f"https://api.github.com/repos/{github_org}/EasyR1/{endpoint}"
    try:
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return True, response.json()
        elif response.status_code == 404:
            return False, None
        else:
            print(f"API error for {endpoint}: {response.status_code}", file=sys.stderr)
            return False, None
    except Exception as e:
        print(f"Exception for {endpoint}: {e}", file=sys.stderr)
        return False, None


def _get_analysis_results(headers: Dict[str, str]) -> Optional[Dict]:
    """Get ANALYSIS_RESULTS.json file content."""
    success, file_data = _get_github_api("contents/ANALYSIS_RESULTS.json", headers)
    if not success:
        return None

    # The contents API returns the file body base64-encoded
    content = file_data.get("content", "")
    if content:
        try:
            decoded_content = base64.b64decode(content).decode("utf-8")
            return json.loads(decoded_content)
        except Exception as e:
            print(f"Error parsing JSON: {e}", file=sys.stderr)
            return None
    return None


def _verify_commit_data(results: Dict, headers: Dict[str, str]) -> bool:
    """Verify the commit data is accurate."""
    commit_sha = results.get("target_commit_sha") or ""

    # Validate SHA format (40 hexadecimal characters)
    if not re.match(r"^[a-f0-9]{40}$", commit_sha, re.IGNORECASE):
        print(f"Error: Invalid commit SHA format: {commit_sha}", file=sys.stderr)
        return False

    # Get commit details
    success, commit_data = _get_github_api(f"commits/{commit_sha}", headers)
    if not success:
        print(f"Error: Commit {commit_sha} not found in repository", file=sys.stderr)
        return False

    # Verify author (the commit "author" field is null for unlinked accounts)
    expected_author = results.get("commit_author")
    actual_author = (commit_data.get("author") or {}).get("login")
    if expected_author != actual_author:
        print(
            f"Error: Commit author mismatch. Expected: {expected_author}, Actual: {actual_author}",
            file=sys.stderr,
        )
        return False

    # Verify date format
    commit_date = results.get("commit_date") or ""
    if not re.match(r"^\d{4}-\d{2}-\d{2}$", commit_date):
        print(
            f"Error: Invalid date format: {commit_date}. Expected YYYY-MM-DD",
            file=sys.stderr,
        )
        return False

    return True


def _verify_parameter_changes(results: Dict, headers: Dict[str, str]) -> bool:
    """Verify the parameter changes are accurate."""
    param_changes = results.get("parameter_changes", {})

    # Check required parameters exist
    required_params = [
        "micro_batch_size_per_device_for_update",
        "micro_batch_size_per_device_for_experience",
    ]
    for param in required_params:
        if param not in param_changes:
            print(f"Error: Missing parameter change data for: {param}", file=sys.stderr)
            return False

        change_data = param_changes[param]
        if not all(key in change_data for key in ["before", "after", "line_number"]):
            print(
                f"Error: Incomplete change data for parameter: {param}", file=sys.stderr
            )
            return False

    # Verify specific expected values based on known repository state
    update_param = param_changes.get("micro_batch_size_per_device_for_update", {})
    if update_param.get("before") != 4 or update_param.get("after") != 1:
        print(
            "Error: Incorrect values for micro_batch_size_per_device_for_update",
            file=sys.stderr,
        )
        return False

    experience_param = param_changes.get(
        "micro_batch_size_per_device_for_experience", {}
    )
    if experience_param.get("before") != 16 or experience_param.get("after") != 2:
        print(
            "Error: Incorrect values for micro_batch_size_per_device_for_experience",
            file=sys.stderr,
        )
        return False

    return True


def _get_all_issues_with_keywords(headers: Dict[str, str]) -> set:
    """Find all issues in the repository that contain the required keywords."""
    keyword_issues = set()

    # Walk all issues in the repository (both open and closed). Note: the
    # /issues listing also returns pull requests, which are scanned as well.
    page = 1
    while True:
        success, issues = _get_github_api(
            f"issues?state=all&per_page=100&page={page}", headers
        )
        if not success or not issues:
            break

        for issue in issues:
            issue_number = issue.get("number")
            title = (issue.get("title") or "").lower()
            body = (issue.get("body") or "").lower()
            issue_text = title + " " + body

            # Record the issue if any keyword appears in the title or body
            if any(keyword in issue_text for keyword in REQUIRED_KEYWORDS):
                keyword_issues.add(issue_number)

        # If we got fewer than a full page of issues, we're done
        if len(issues) < 100:
            break
        page += 1

    return keyword_issues


def _verify_issue_references(results: Dict, headers: Dict[str, str]) -> bool:
    """Verify the issue references contain the required keywords."""
    issue_number_list = results.get("related_issue_number_list")

    if not isinstance(issue_number_list, list) or len(issue_number_list) == 0:
        print(
            "Error: related_issue_number_list must be a non-empty list",
            file=sys.stderr,
        )
        return False

    # First, dynamically find all issues that contain the required keywords
    expected_issues = _get_all_issues_with_keywords(headers)
    print(f"Issues found via keyword scan: {sorted(expected_issues)}")
    provided_issues = set(issue_number_list)

    # Verify each provided issue contains at least one of the required keywords
    for issue_number in issue_number_list:
        if not isinstance(issue_number, int) or issue_number <= 0:
            print(
                f"Error: Invalid issue number format: {issue_number}", file=sys.stderr
            )
            return False

        # Get issue details
        success, issue_data = _get_github_api(f"issues/{issue_number}", headers)
        if not success:
            print(
                f"Error: Issue #{issue_number} not found in repository", file=sys.stderr
            )
            return False

        # Check whether the issue title or body contains any required keyword
        title = (issue_data.get("title") or "").lower()
        body = (issue_data.get("body") or "").lower()
        issue_text = title + " " + body

        if not any(keyword in issue_text for keyword in REQUIRED_KEYWORDS):
            print(
                f"Error: Issue #{issue_number} does not contain any required keywords: {REQUIRED_KEYWORDS}",
                file=sys.stderr,
            )
            return False

    # Verify agent found exactly the same issues as our dynamic search
    if provided_issues != expected_issues:
        missing = expected_issues - provided_issues
        extra = provided_issues - expected_issues
        if missing:
            print(
                f"Error: Missing issues that contain required keywords: {missing}",
                file=sys.stderr,
            )
        if extra:
            print(
                f"Error: Extra issues that don't contain required keywords: {extra}",
                file=sys.stderr,
            )
        return False

    print(
        f"✓ Found all {len(issue_number_list)} issues containing required keywords: {issue_number_list}"
    )
    return True


def verify() -> bool:
    """
    Programmatically verify that the deep commit analysis meets the requirements.
    """
    # Get GitHub token
    github_token = os.environ.get("MCP_GITHUB_TOKEN")
    if not github_token:
        print("Error: MCP_GITHUB_TOKEN environment variable not set", file=sys.stderr)
        return False

    headers = {
        "Authorization": f"token {github_token}",
        "Accept": "application/vnd.github.v3+json",
    }

    print("Verifying deep commit analysis completion...")

    # 1. Check ANALYSIS_RESULTS.json exists and is valid JSON
    print("1. Checking ANALYSIS_RESULTS.json exists and is valid...")
    results = _get_analysis_results(headers)
    if not results:
        print("Error: ANALYSIS_RESULTS.json not found or invalid JSON", file=sys.stderr)
        return False

    print("✓ Found valid ANALYSIS_RESULTS.json")

    # 2. Verify commit data accuracy
    print("2. Verifying commit data accuracy...")
    if not _verify_commit_data(results, headers):
        return False

    print("✓ Commit SHA, author, and date verified")

    # 3. Verify parameter changes accuracy
    print("3. Verifying parameter changes accuracy...")
    if not _verify_parameter_changes(results, headers):
        return False

    print("✓ Parameter changes verified with correct before/after values")

    # 4. Verify issue references
    print("4. Verifying issue references...")
    if not _verify_issue_references(results, headers):
        return False

    print("\n✓ Task completed successfully!")
    print("Deep commit analysis results verified:")
    print(f"- Found target commit: {results.get('target_commit_sha')}")
    print(
        "- Verified parameter changes: micro_batch_size_per_device_for_update (4→1), micro_batch_size_per_device_for_experience (16→2)"
    )
    print(
        f"- Verified memory/performance issue correlations: {results.get('related_issue_number_list')}"
    )
    print("- All data obtained through accurate GitHub API analysis")

    return True


if __name__ == "__main__":
    success = verify()
    sys.exit(0 if success else 1)