Find Math Paper

Filesystem / Papers

Search through academic papers to identify and locate mathematics-related content that satisfies specific mathematical criteria and research requirements.

Created by Xiangyan Liu

2025-08-12

Pattern Analysis, Data Extraction

Model Ranking

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
gemini-3-pro-high	4 /4			119.0s	8.8	364,433	2,429	366,862
gemini-3-pro-low	4 /4			74.1s	9.3	399,539	2,784	402,323
gpt-5-2-high	4 /4			91.5s	10.0	153,266	2,701	155,967
gpt-5-low	4 /4			117.4s	6.5	403,187	4,638	407,825
grok-4	4 /4			109.4s	6.0	88,527	3,934	92,461
grok-4-fast	4 /4			71.0s	8.0	96,805	3,315	100,120
grok-code-fast-1	4 /4			26.3s	7.3	132,219	404	135,313
o4-mini	4 /4			128.5s	19.3	178,984	6,980	185,964
claude-sonnet-4-high	2 /4			200.8s	13.8	2,868,483	2,797	2,871,279
deepseek-chat	2 /4			271.8s	26.5	624,257	2,528	626,785
gpt-5-high	2 /4			183.1s	4.8	46,304	5,601	51,905
gpt-5-medium	2 /4			152.7s	4.8	188,349	4,675	193,024
gpt-5-mini-medium	2 /4			128.2s	12.5	258,274	5,225	263,499
gpt-5-nano-low	2 /4			84.0s	11.8	426,567	10,538	437,105
claude-opus-4-1	1 /1	-	-	140.4s	6.0	421,671	1,378	423,049
claude-sonnet-4	1 /4			66.9s	6.3	296,938	1,339	298,277
claude-sonnet-4-low	1 /4			189.4s	10.0	1,799,702	2,190	1,801,892
deepseek-v3-2-chat	1 /4			131.8s	5.0	178,181	962	179,142
gemini-2-5-flash	1 /4			44.1s	6.8	821,116	3,987	825,103
kimi-k2-0711	1 /4			195.7s	13.8	581,037	1,257	582,294
qwen-3-max	1 /4			110.2s	6.5	301,460	489	301,948
claude-opus-4-5-high	0 /4			35.0s	3.0	10,098	704	10,802
claude-sonnet-4-5	0 /4			96.0s	13.5	867,975	2,734	870,708
deepseek-v3-1-terminus	0 /4			245.6s	6.3	255,045	758	255,803
deepseek-v3-1-terminus-thinking	0 /4			1095.8s	4.3	35,666	9,502	45,168
deepseek-v3-2-thinking	0 /4			60.4s	5.0	30,254	1,472	31,726
gemini-2-5-pro	0 /4			123.5s	10.3	1,413,366	6,884	1,420,249
glm-4-5	0 /4			84.8s	7.3	287,712	1,595	289,306
gpt-4-1	0 /4			47.9s	10.8	366,162	1,067	367,228
gpt-4-1-mini	0 /4			68.1s	23.5	906,603	1,545	908,148
gpt-4-1-nano	0 /4			39.4s	7.3	49,126	4,503	53,629
gpt-5-mini-high	0 /4			165.2s	5.0	13,575	5,243	18,818
gpt-5-mini-low	0 /4			37.1s	4.8	27,440	1,411	28,851
gpt-5-nano-high	0 /4			115.9s	10.3	709,345	13,742	723,087
gpt-5-nano-medium	0 /4			82.5s	12.0	539,198	8,708	547,906
gpt-oss-120b	0 /4			3.8s	1.8	1,984	147	2,130
kimi-k2-0905	0 /4			205.5s	20.8	874,309	1,762	876,072
o3	0 /4			538.1s	78.3	3,198,157	28,183	3,226,339
qwen-3-coder-plus	0 /4			1130.7s	31.8	9,160,133	3,595	9,163,728
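
The page does not define the Pass@4 and Pass^4 columns, and their cells are empty in this extract. A minimal sketch, assuming the common convention that Pass@4 means at least one of four runs passed and Pass^4 means all four runs passed. The function names and the per-run boolean lists are illustrative, not taken from the benchmark's code.

```python
from typing import Sequence


def pass_at_k(runs: Sequence[bool]) -> bool:
    """Assumed Pass@k: the task counts as solved if any run passed."""
    return any(runs)


def pass_hat_k(runs: Sequence[bool]) -> bool:
    """Assumed Pass^k: the task counts as solved only if every run passed."""
    return len(runs) > 0 and all(runs)


# Illustrative run outcomes matching the "Run Results" shapes in the table.
four_of_four = [True, True, True, True]      # e.g. a "4 /4" row
two_of_four = [True, False, True, False]     # e.g. a "2 /4" row
zero_of_four = [False, False, False, False]  # e.g. a "0 /4" row

for label, runs in [("4/4", four_of_four), ("2/4", two_of_four), ("0/4", zero_of_four)]:
    print(label, pass_at_k(runs), pass_hat_k(runs))
```

Under this assumption, a "2 /4" row would pass on Pass@4 but fail on Pass^4, which is why the two columns can disagree for the same model.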

Task State

Task Initial State Files

papers/
├── 1707.06347.html
├── 2105.04165.html
├── 2201.11903.html
├── 2303.08774.html
├── 2306.08640.html
├── 2310.02255.html
├── 2310.08446.html
├── 2312.00849.html
├── 2312.07533.html
├── 2312.11805.html
├── 2402.00253.html
├── 2402.03300.html
├── 2403.05530.html
├── 2404.13046.html
├── 2404.14367.html
├── 2404.14396.html
├── 2405.09818.html
├── 2405.13911.html
├── 2405.16473.html
├── 2405.16640.html
├── 2406.08478.html
├── 2406.16852.html
├── 2406.17294.html
├── 2407.01284.html
├── 2407.01509.html
├── 2407.21783.html
├── 2408.03326.html
├── 2408.12528.html
├── 2409.19256.html
├── 2410.05993.html
├── 2410.06166.html
├── 2410.10563.html
├── 2410.13848.html
├── 2410.17885.html
├── 2410.21276.html
├── 2411.07975.html
├── 2411.10442.html
├── 2411.11930.html
├── 2411.14432.html
├── 2412.05271.html
├── 2412.08443.html
├── 2412.10302.html
├── 2412.15115.html
├── 2412.16720.html
├── 2412.17256.html
├── 2412.18319.html
├── 2412.20631.html
├── 2501.04686.html
├── 2501.06186.html
├── 2501.12599.html
├── 2501.12948.html
├── 2501.17811.html
├── 2502.01456.html
├── 2502.09621.html
├── 2502.10391.html
├── 2502.13923.html
├── 2503.01785.html
├── 2503.06520.html
├── 2503.06749.html
├── 2503.07065.html
├── 2503.07365.html
├── 2503.07536.html
├── 2503.10291.html
├── 2503.10615.html
├── 2503.12937.html
├── 2503.13939.html
├── 2503.14476.html
├── 2503.17352.html
├── 2503.18892.html
├── 2503.19786.html
├── 2503.20783.html
├── 2503.21620.html
├── 2503.21776.html
├── 2503.22679.html
├── 2504.02587.html
├── 2504.05599.html
├── 2504.07491.html
├── 2504.07934.html
├── 2504.07954.html
├── 2504.11455.html
├── 2504.14945.html
├── 2504.16656.html
├── 2505.00703.html
└── arxiv_2025.bib

Instruction

Please use FileSystem tools to finish the following task:

You are given a directory containing multiple paper files. Please help me find a math-related benchmark paper. I don’t remember its name, but I remember it not only checks whether the answer is correct, but also analyzes whether the model suffers from insufficient knowledge, lacks generalization ability, or relies on rote memorization. After finding this paper, rename its corresponding HTML file to answer.html.
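
One way to narrow down the candidates is a keyword scan over the HTML files before reading the top hits in full. The following is a minimal sketch, not the benchmark's reference solution: the phrase list is an assumption derived from the task wording, the function names are hypothetical, and a keyword count is only a heuristic for shortlisting.

```python
import re
from pathlib import Path

# Hypothetical search phrases taken from the task wording; not part of the benchmark.
KEYWORDS = ["insufficient knowledge", "generalization", "rote memorization"]


def score_paper(html_path: Path, keywords=KEYWORDS) -> int:
    """Count how many of the phrases occur in the tag-stripped, lowercased text."""
    raw = html_path.read_text(encoding="utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", raw).lower()  # crude tag removal
    text = re.sub(r"\s+", " ", text)             # collapse whitespace
    return sum(1 for kw in keywords if kw in text)


def rank_papers(papers_dir: Path, keywords=KEYWORDS) -> list[tuple[int, str]]:
    """Return (score, filename) pairs, best score first, ties broken by name."""
    scored = [(score_paper(p, keywords), p.name) for p in papers_dir.glob("*.html")]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


def rename_to_answer(papers_dir: Path, filename: str) -> Path:
    """Rename the chosen file to answer.html inside the same directory."""
    target = papers_dir / "answer.html"
    (papers_dir / filename).rename(target)
    return target
```

Calling `rank_papers(Path("papers"))` would give a shortlist to inspect by hand. The rename is kept as a separate step so that it only happens after the top candidate has been confirmed as a math benchmark paper.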

Verify

Python

#!/usr/bin/env python3
"""
Verification script for Find Math Paper Task
"""

import sys
from pathlib import Path
import os

def get_test_directory() -> Path:
    """Get the test directory from FILESYSTEM_TEST_DIR env var."""
    test_root = os.environ.get("FILESYSTEM_TEST_DIR")
    if not test_root:
        raise ValueError("FILESYSTEM_TEST_DIR environment variable is required")
    return Path(test_root)

def verify_answer_file_exists(test_dir: Path) -> bool:
    """Verify that answer.html exists in the test directory."""
    answer_file = test_dir / "answer.html"
    
    if not answer_file.exists():
        print("❌ File 'answer.html' not found")
        return False
    
    print("✅ answer.html found")
    return True

def verify_original_file_removed(test_dir: Path) -> bool:
    """Verify that the original file (2407.01284.html) no longer exists."""
    original_file = test_dir / "2407.01284.html"
    
    if original_file.exists():
        print("❌ Original file 2407.01284.html still exists")
        return False
    
    print("✅ Original file has been renamed")
    return True

def main():
    """Main verification function."""
    test_dir = get_test_directory()
    print("🔍 Verifying Find Math Paper Task...")
    
    # Define verification steps
    verification_steps = [
        ("Answer File Exists", verify_answer_file_exists),
        ("Original File Renamed", verify_original_file_removed),
    ]
    
    # Run all verification steps
    all_passed = True
    for step_name, verify_func in verification_steps:
        print(f"\n--- {step_name} ---")
        if not verify_func(test_dir):
            all_passed = False
    
    # Final result
    print("\n" + "="*50)
    if all_passed:
        print("✅ Paper correctly renamed to answer.html!")
        print("🎉 Task verification: PASS")
        sys.exit(0)
    else:
        print("❌ Task verification: FAIL")
        sys.exit(1)

if __name__ == "__main__":
    main()
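
The script above inspects file names only: it passes when `answer.html` exists and `2407.01284.html` no longer does. A minimal sketch of what that implies, using a temporary directory with two placeholder files. The `passes` and `simulate` helpers are hypothetical and only mirror the two checks; they are not part of the verifier.

```python
import tempfile
from pathlib import Path


def passes(test_dir: Path) -> bool:
    """Mirror the two checks: answer.html exists and 2407.01284.html is gone."""
    return (test_dir / "answer.html").exists() and not (test_dir / "2407.01284.html").exists()


def simulate(renamed: str) -> bool:
    """Create two placeholder papers, rename one to answer.html, and run the checks."""
    with tempfile.TemporaryDirectory() as tmp:
        test_dir = Path(tmp)
        for name in ("2407.01284.html", "2402.03300.html"):
            (test_dir / name).write_text("<html></html>", encoding="utf-8")
        (test_dir / renamed).rename(test_dir / "answer.html")
        return passes(test_dir)


print(simulate("2407.01284.html"))  # True: the expected paper was renamed
print(simulate("2402.03300.html"))  # False: answer.html exists, but the original remains
```

Renaming any other paper still creates `answer.html`, so the second check is what ties the result to the intended file.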