Find Math Paper

L3
ModelContextProtocol / Filesystem / Papers

Search through academic papers to identify and locate mathematics-related content that satisfies specific mathematical criteria and research requirements.

Created by Xiangyan Liu
2025-08-12
Pattern Analysis, Data Extraction

Model Ranking

| Provider | Model | Run Results (passed/total) | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI | gpt-5-low | 4/4 | 117.4s | 6.5 | 403,187 | 4,638 | 407,825 |
| Grok | grok-4 | 4/4 | 109.4s | 6.0 | 88,527 | 3,934 | 92,461 |
| Grok | grok-code-fast-1 | 4/4 | 26.3s | 7.3 | 132,219 | 404 | 135,313 |
| OpenAI | o4-mini | 4/4 | 128.5s | 19.3 | 178,984 | 6,980 | 185,964 |
| Claude | claude-sonnet-4-high | 2/4 | 200.8s | 13.8 | 2,868,483 | 2,797 | 2,871,279 |
| DeepSeek | deepseek-chat | 2/4 | 271.8s | 26.5 | 624,257 | 2,528 | 626,785 |
| OpenAI | gpt-5-high | 2/4 | 183.1s | 4.8 | 46,304 | 5,601 | 51,905 |
| OpenAI | gpt-5-medium | 2/4 | 152.7s | 4.8 | 188,349 | 4,675 | 193,024 |
| OpenAI | gpt-5-mini-medium | 2/4 | 128.2s | 12.5 | 258,274 | 5,225 | 263,499 |
| OpenAI | gpt-5-nano-low | 2/4 | 84.0s | 11.8 | 426,567 | 10,538 | 437,105 |
| Claude | claude-opus-4-1 | 1/1 | 140.4s | 6.0 | 421,671 | 1,378 | 423,049 |
| Claude | claude-sonnet-4 | 1/4 | 66.9s | 6.3 | 296,938 | 1,339 | 298,277 |
| Claude | claude-sonnet-4-low | 1/4 | 189.4s | 10.0 | 1,799,702 | 2,190 | 1,801,892 |
| Gemini | gemini-2-5-flash | 1/4 | 44.1s | 6.8 | 821,116 | 3,987 | 825,103 |
| MoonshotAI | kimi-k2-0711 | 1/4 | 195.7s | 13.8 | 581,037 | 1,257 | 582,294 |
| Qwen | qwen-3-max | 1/4 | 110.2s | 6.5 | 301,460 | 489 | 301,948 |
| Gemini | gemini-2-5-pro | 0/4 | 123.5s | 10.3 | 1,413,366 | 6,884 | 1,420,249 |
| Z.ai | glm-4-5 | 0/4 | 84.8s | 7.3 | 287,712 | 1,595 | 289,306 |
| OpenAI | gpt-4-1 | 0/4 | 47.9s | 10.8 | 366,162 | 1,067 | 367,228 |
| OpenAI | gpt-4-1-mini | 0/4 | 68.1s | 23.5 | 906,603 | 1,545 | 908,148 |
| OpenAI | gpt-4-1-nano | 0/4 | 39.4s | 7.3 | 49,126 | 4,503 | 53,629 |
| OpenAI | gpt-5-mini-high | 0/4 | 165.2s | 5.0 | 13,575 | 5,243 | 18,818 |
| OpenAI | gpt-5-mini-low | 0/4 | 37.1s | 4.8 | 27,440 | 1,411 | 28,851 |
| OpenAI | gpt-5-nano-high | 0/4 | 115.9s | 10.3 | 709,345 | 13,742 | 723,087 |
| OpenAI | gpt-5-nano-medium | 0/4 | 82.5s | 12.0 | 539,198 | 8,708 | 547,906 |
| OpenAI | gpt-oss-120b | 0/4 | 3.8s | 1.8 | 1,984 | 147 | 2,130 |
| MoonshotAI | kimi-k2-0905 | 0/4 | 205.5s | 20.8 | 874,309 | 1,762 | 876,072 |
| OpenAI | o3 | 0/4 | 538.1s | 78.3 | 3,198,157 | 28,183 | 3,226,339 |
| Qwen | qwen-3-coder-plus | 0/4 | 1130.7s | 31.8 | 9,160,133 | 3,595 | 9,163,728 |

Task State

Task Initial State Files
papers/
├── 1707.06347.html
├── 2105.04165.html
├── 2201.11903.html
├── 2303.08774.html
├── 2306.08640.html
├── 2310.02255.html
├── 2310.08446.html
├── 2312.00849.html
├── 2312.07533.html
├── 2312.11805.html
├── 2402.00253.html
├── 2402.03300.html
├── 2403.05530.html
├── 2404.13046.html
├── 2404.14367.html
├── 2404.14396.html
├── 2405.09818.html
├── 2405.13911.html
├── 2405.16473.html
├── 2405.16640.html
├── 2406.08478.html
├── 2406.16852.html
├── 2406.17294.html
├── 2407.01284.html
├── 2407.01509.html
├── 2407.21783.html
├── 2408.03326.html
├── 2408.12528.html
├── 2409.19256.html
├── 2410.05993.html
├── 2410.06166.html
├── 2410.10563.html
├── 2410.13848.html
├── 2410.17885.html
├── 2410.21276.html
├── 2411.07975.html
├── 2411.10442.html
├── 2411.11930.html
├── 2411.14432.html
├── 2412.05271.html
├── 2412.08443.html
├── 2412.10302.html
├── 2412.15115.html
├── 2412.16720.html
├── 2412.17256.html
├── 2412.18319.html
├── 2412.20631.html
├── 2501.04686.html
├── 2501.06186.html
├── 2501.12599.html
├── 2501.12948.html
├── 2501.17811.html
├── 2502.01456.html
├── 2502.09621.html
├── 2502.10391.html
├── 2502.13923.html
├── 2503.01785.html
├── 2503.06520.html
├── 2503.06749.html
├── 2503.07065.html
├── 2503.07365.html
├── 2503.07536.html
├── 2503.10291.html
├── 2503.10615.html
├── 2503.12937.html
├── 2503.13939.html
├── 2503.14476.html
├── 2503.17352.html
├── 2503.18892.html
├── 2503.19786.html
├── 2503.20783.html
├── 2503.21620.html
├── 2503.21776.html
├── 2503.22679.html
├── 2504.02587.html
├── 2504.05599.html
├── 2504.07491.html
├── 2504.07934.html
├── 2504.07954.html
├── 2504.11455.html
├── 2504.14945.html
├── 2504.16656.html
├── 2505.00703.html
└── arxiv_2025.bib
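
The HTML file names in this listing look like new-style arXiv identifiers, whose first four digits encode the submission year and month (YYMM). A minimal sketch of reading that prefix is shown here; the helper names `submission_month` and `count_by_year` are hypothetical, and the code assumes every `.html` name is an arXiv ID from the 2000s.

```python
from collections import Counter


def submission_month(filename: str) -> str:
    """Map '2407.01284.html' to '2024-07' using the YYMM prefix of the arXiv ID.

    Assumption: the name starts with a new-style arXiv ID (YYMM.NNNNN).
    """
    yymm = filename.split(".", 1)[0]
    return f"20{yymm[:2]}-{yymm[2:]}"


def count_by_year(filenames: list[str]) -> Counter:
    """Count papers per submission year, ignoring non-HTML files such as the .bib."""
    return Counter(
        submission_month(name)[:4] for name in filenames if name.endswith(".html")
    )
```

For example, `submission_month("2407.01284.html")` yields `"2024-07"`. The identifier only dates a paper; it says nothing about its topic, so it cannot replace reading the files.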

Instruction

Please use FileSystem tools to finish the following task:

You are given a directory containing multiple paper files. Please help me find a math-related benchmark paper. I don’t remember its name, but I remember it not only checks whether the answer is correct, but also analyzes whether the model suffers from insufficient knowledge, lacks generalization ability, or relies on rote memorization. After finding this paper, rename its corresponding HTML file to answer.html.
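
One way to approach this instruction is to score each HTML file by how many phrases from the description it contains, inspect the top candidates, and then rename the chosen file. The sketch below is illustrative only: the `KEYWORDS` list is an assumption paraphrased from the instruction, the function names are hypothetical, and nothing here reflects how any ranked model actually solved the task.

```python
import re
from pathlib import Path

# Assumed keywords, paraphrased from the task instruction (not part of the task itself).
KEYWORDS = [
    "benchmark",
    "math",
    "insufficient knowledge",
    "generalization",
    "rote memorization",
]


def strip_tags(html: str) -> str:
    """Crudely drop tags and collapse whitespace; enough for keyword matching."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text)


def score(text: str) -> int:
    """Count how many keywords occur at least once (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for keyword in KEYWORDS if keyword in lowered)


def rank_papers(papers_dir: Path) -> list[tuple[int, str]]:
    """Return (score, filename) pairs, highest score first, ties broken by name."""
    ranked = []
    for path in sorted(papers_dir.glob("*.html")):
        text = strip_tags(path.read_text(encoding="utf-8", errors="ignore"))
        ranked.append((score(text), path.name))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked


def rename_to_answer(papers_dir: Path, filename: str) -> Path:
    """Rename the chosen paper to answer.html inside the same directory."""
    target = papers_dir / "answer.html"
    (papers_dir / filename).rename(target)
    return target
```

Keyword counts only narrow the field. Several papers in the directory may mention math benchmarks, so the top-ranked files would still need to be read before calling `rename_to_answer`.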



Verify

#!/usr/bin/env python3
"""
Verification script for Find Math Paper Task
"""

import sys
from pathlib import Path
import os

def get_test_directory() -> Path:
    """Get the test directory from FILESYSTEM_TEST_DIR env var."""
    test_root = os.environ.get("FILESYSTEM_TEST_DIR")
    if not test_root:
        raise ValueError("FILESYSTEM_TEST_DIR environment variable is required")
    return Path(test_root)

def verify_answer_file_exists(test_dir: Path) -> bool:
    """Verify that answer.html exists in the papers directory."""
    answer_file = test_dir / "answer.html"
    
    if not answer_file.exists():
        print("❌ File 'answer.html' not found")
        return False
    
    print("✅ answer.html found")
    return True

def verify_original_file_removed(test_dir: Path) -> bool:
    """Verify that the original file (2407.01284.html) no longer exists."""
    original_file = test_dir / "2407.01284.html"
    
    if original_file.exists():
        print("❌ Original file 2407.01284.html still exists")
        return False
    
    print("✅ Original file has been renamed")
    return True

def main():
    """Main verification function."""
    test_dir = get_test_directory()
    print("🔍 Verifying Find Math Paper Task...")
    
    # Define verification steps
    verification_steps = [
        ("Answer File Exists", verify_answer_file_exists),
        ("Original File Renamed", verify_original_file_removed),
    ]
    
    # Run all verification steps
    all_passed = True
    for step_name, verify_func in verification_steps:
        print(f"\n--- {step_name} ---")
        if not verify_func(test_dir):
            all_passed = False
    
    # Final result
    print("\n" + "="*50)
    if all_passed:
        print("✅ Paper correctly renamed to answer.html!")
        print("🎉 Task verification: PASS")
        sys.exit(0)
    else:
        print("❌ Task verification: FAIL")
        sys.exit(1)

if __name__ == "__main__":
    main()
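
The script's two checks can be exercised without the real paper set. The sketch below mirrors them in a single hypothetical helper, `check_rename`, and runs it against a throwaway directory before and after a correct rename; `simulate` and the placeholder file contents are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def check_rename(test_dir: Path, original_name: str = "2407.01284.html") -> bool:
    """Mirror the verifier: answer.html must exist and the original must be gone."""
    answer_exists = (test_dir / "answer.html").exists()
    original_gone = not (test_dir / original_name).exists()
    return answer_exists and original_gone


def simulate() -> tuple[bool, bool]:
    """Return the check result before and after a correct rename."""
    with tempfile.TemporaryDirectory() as tmp:
        test_dir = Path(tmp)
        original = test_dir / "2407.01284.html"
        original.write_text("<html></html>", encoding="utf-8")  # placeholder content
        before = check_rename(test_dir)
        original.rename(test_dir / "answer.html")
        after = check_rename(test_dir)
    return before, after
```

Both checks look only at file names. Copying the paper to `answer.html` fails the second check because the original still exists, and renaming a different paper fails for the same reason.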