Author Folders

Filesystem / Papers

Analyze academic papers to identify and organize by author, creating separate folders for frequent authors (≥4 papers) and prolific 2025 authors (≥3 papers).

Created by Xiangyan Liu

2025-08-12

Data Extraction, File Organization, Pattern Analysis

Model Ranking

Model	Run Results	Pass@4	Pass^4	Avg Time	Avg Turns	Input Tokens	Output Tokens	Total Tokens
claude-opus-4-1	0 /1	-	-	656.0s	18.0	1,366,936	11,172	1,378,108
claude-sonnet-4	0 /4			273.4s	18.8	1,496,283	10,393	1,506,676
claude-sonnet-4-high	0 /4			968.1s	45.8	12,745,690	33,595	12,779,285
claude-sonnet-4-low	0 /4			1740.7s	56.0	19,518,892	31,055	19,549,947
deepseek-chat	0 /4			921.6s	45.0	2,482,504	13,584	2,496,088
gemini-2-5-flash	0 /4			25.3s	4.5	16,054	2,694	18,748
gemini-2-5-pro	0 /4			173.3s	20.3	339,714	11,967	351,681
glm-4-5	0 /4			67.2s	5.5	97,444	2,273	99,718
gpt-4-1	0 /4			195.5s	6.5	1,961,823	2,853	1,964,676
gpt-4-1-mini	0 /4			67.2s	7.5	483,243	1,923	485,166
gpt-4-1-nano	0 /4			49.5s	19.3	922,344	1,367	923,711
gpt-5-high	0 /4			494.8s	6.0	47,444	10,977	58,420
gpt-5-low	0 /4			401.5s	7.0	402,735	17,091	419,825
gpt-5-medium	0 /4			230.2s	4.8	27,036	7,516	34,553
gpt-5-mini-high	0 /4			163.1s	3.8	19,606	5,436	25,041
gpt-5-mini-low	0 /4			66.5s	3.3	7,325	1,816	9,141
gpt-5-mini-medium	0 /4			159.5s	3.3	9,294	3,651	12,945
gpt-5-nano-high	0 /4			124.8s	7.0	441,194	10,639	451,833
gpt-5-nano-low	0 /4			147.8s	13.8	1,482,097	20,362	1,502,459
gpt-5-nano-medium	0 /4			118.8s	8.5	582,051	14,474	596,525
gpt-oss-120b	0 /4			13.0s	4.8	21,004	752	21,756
grok-4	0 /4			599.2s	11.0	471,811	18,585	490,395
grok-code-fast-1	0 /4			39.2s	6.0	95,126	2,947	100,110
kimi-k2-0711	0 /4			409.5s	14.5	536,471	6,654	543,125
kimi-k2-0905	0 /4			1594.7s	40.5	2,908,584	21,853	2,930,437
o3	0 /4			137.2s	12.5	326,245	7,637	333,882
o4-mini	0 /4			61.5s	6.8	35,665	5,182	40,847
qwen-3-coder-plus	0 /4			614.7s	57.8	4,246,168	22,976	4,269,144
qwen-3-max	0 /4			375.3s	3.0	9,646	3,652	13,298

Task State

Task Initial State Files

papers/
├── 1707.06347.html
├── 2105.04165.html
├── 2201.11903.html
├── 2303.08774.html
├── 2306.08640.html
├── 2310.02255.html
├── 2310.08446.html
├── 2312.00849.html
├── 2312.07533.html
├── 2312.11805.html
├── 2402.00253.html
├── 2402.03300.html
├── 2403.05530.html
├── 2404.13046.html
├── 2404.14367.html
├── 2404.14396.html
├── 2405.09818.html
├── 2405.13911.html
├── 2405.16473.html
├── 2405.16640.html
├── 2406.08478.html
├── 2406.16852.html
├── 2406.17294.html
├── 2407.01284.html
├── 2407.01509.html
├── 2407.21783.html
├── 2408.03326.html
├── 2408.12528.html
├── 2409.19256.html
├── 2410.05993.html
├── 2410.06166.html
├── 2410.10563.html
├── 2410.13848.html
├── 2410.17885.html
├── 2410.21276.html
├── 2411.07975.html
├── 2411.10442.html
├── 2411.11930.html
├── 2411.14432.html
├── 2412.05271.html
├── 2412.08443.html
├── 2412.10302.html
├── 2412.15115.html
├── 2412.16720.html
├── 2412.17256.html
├── 2412.18319.html
├── 2412.20631.html
├── 2501.04686.html
├── 2501.06186.html
├── 2501.12599.html
├── 2501.12948.html
├── 2501.17811.html
├── 2502.01456.html
├── 2502.09621.html
├── 2502.10391.html
├── 2502.13923.html
├── 2503.01785.html
├── 2503.06520.html
├── 2503.06749.html
├── 2503.07065.html
├── 2503.07365.html
├── 2503.07536.html
├── 2503.10291.html
├── 2503.10615.html
├── 2503.12937.html
├── 2503.13939.html
├── 2503.14476.html
├── 2503.17352.html
├── 2503.18892.html
├── 2503.19786.html
├── 2503.20783.html
├── 2503.21620.html
├── 2503.21776.html
├── 2503.22679.html
├── 2504.02587.html
├── 2504.05599.html
├── 2504.07491.html
├── 2504.07934.html
├── 2504.07954.html
├── 2504.11455.html
├── 2504.14945.html
├── 2504.16656.html
├── 2505.00703.html
└── arxiv_2025.bib

Instruction

Please use FileSystem tools to finish the following task:

Task Description

You are given a directory containing a collection of academic papers from arXiv in HTML format. Your task is to analyze these papers, identify authors who have published multiple papers, and organize the papers into author-specific folders based on the criteria below.

Task Objectives

Part 1: Frequent Authors (≥4 papers)

Extract author information from all HTML papers in the given directory
Identify authors who appear in 4 or more papers
Create a directory frequent_authors
Create individual folders within this directory for each frequent author (lowercase names with underscores)
Copy their papers to their respective folders
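
The Part 1 steps above can be sketched roughly as follows. This is a minimal illustration rather than a reference solution: it assumes each paper lists its authors in `<meta name="citation_author">` tags (the field the verification script reads), and `AuthorMetaParser`, `papers_by_author`, and `frequent_authors` are hypothetical helper names introduced here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collect the content of <meta name="citation_author"> tags."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == "citation_author" and attr.get("content"):
                self.authors.append(attr["content"])


def papers_by_author(html_by_file):
    """Map each raw author string to the sorted files that list that author."""
    index = defaultdict(set)
    for filename, html in html_by_file.items():
        parser = AuthorMetaParser()
        parser.feed(html)
        for author in parser.authors:
            index[author].add(filename)
    return {author: sorted(files) for author, files in index.items()}


def frequent_authors(index, min_papers=4):
    """Keep only authors who appear in at least min_papers files."""
    return {a: files for a, files in index.items() if len(files) >= min_papers}
```

In practice `html_by_file` would be built by reading every `*.html` file in the task folder; the index keys are still raw author strings, so folder names have to be derived separately.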

Part 2: Prolific 2025 Authors (≥3 papers)

Extract publication dates along with author information
Identify authors who published 3 or more papers in 2025
Create a directory 2025_authors for 2025 authors
Create individual folders within this directory for each prolific 2025 author (lowercase names with underscores)
Copy their 2025 papers to their respective folders
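
Part 2 adds two ingredients to the counting step: reading a publication year and copying files into per-author folders. A minimal sketch, assuming the date comes from a metadata string such as `2025/03/13` (the verification script looks at `citation_date` and `citation_online_date`); `publication_year` and `copy_papers` are hypothetical helpers, not part of the task.

```python
import re
import shutil
from pathlib import Path


def publication_year(date_text):
    """Return the first four-digit run in a date string, or None."""
    match = re.search(r"\d{4}", date_text or "")
    return match.group(0) if match else None


def copy_papers(root, group_dir, files_by_folder):
    """Copy each listed file from root into root/group_dir/<folder>/.

    shutil.copy2 leaves the source file in place, so originals stay untouched.
    """
    for folder, filenames in files_by_folder.items():
        target = Path(root) / group_dir / folder
        target.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            shutil.copy2(Path(root) / name, target / name)
```

A call such as `copy_papers(task_dir, "2025_authors", {"jane_doe": ["a.html", "b.html", "c.html"]})` would then populate one author folder; the author and file names here are made up for illustration.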

Expected Output

Directory Structure:

Plaintext

[given_task_folder]/
├── [original HTML files remain untouched]
├── frequent_authors/              # Authors with ≥4 papers total
│   ├── smith_john/
│   │   └── [copied papers]
│   ├── johnson_sarah/
│   │   └── [copied papers]
│   └── ...
└── 2025_authors/                  # Authors with ≥3 papers in 2025
    ├── williams_david/
    │   └── [copied 2025 papers]
    ├── brown_emily/
    │   └── [copied 2025 papers]
    └── ...

Requirements:

Author folder names should be lowercase with underscores replacing spaces/commas (e.g., smith_john, williams_david)
Papers should be copied (not moved) to preserve originals
Author extraction should handle various name formats correctly
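
To make the naming requirement concrete, the sketch below mirrors the rule used by `normalize_author_name` in the verification script: a "Last, First Middle" string becomes `first_last`, punctuation is dropped, and spaces or hyphens become underscores. The function name `folder_name` and the sample names are illustrative assumptions.

```python
import re


def folder_name(author):
    """Turn "Last, First Middle" into "first_last" (verifier-style rule)."""
    last, sep, rest = author.strip().partition(",")
    given = rest.split()
    if sep and given:
        name = f"{given[0]}_{last.strip()}"  # first given name + last name
    elif sep:
        name = last.strip()                  # comma but no given name
    else:
        name = author.strip()                # no comma: use the name as given
    name = re.sub(r"[^\w\s-]", "", name)     # drop punctuation
    name = re.sub(r"[\s-]+", "_", name)      # spaces and hyphens -> underscores
    return name.lower()


for raw in ["Smith, John A.", "Garcia-Lopez, Maria", "Jane Doe"]:
    print(raw, "->", folder_name(raw))
# Smith, John A. -> john_smith
# Garcia-Lopez, Maria -> maria_garcia_lopez
# Jane Doe -> jane_doe
```

Under this rule the given name comes first, so "Smith, John" maps to `john_smith`.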

Verify

Python

#!/usr/bin/env python3
"""
Verification script for Paper Organization Task: Author-Based Paper Categorization
"""

import sys
from pathlib import Path
import os
import re
from typing import Dict, List
from html.parser import HTMLParser

def get_test_directory() -> Path:
    """Get the test directory from FILESYSTEM_TEST_DIR env var."""
    test_root = os.environ.get("FILESYSTEM_TEST_DIR")
    if not test_root:
        raise ValueError("FILESYSTEM_TEST_DIR environment variable is required")
    return Path(test_root)

class ArxivHTMLParser(HTMLParser):
    """Parser to extract author and date information from arXiv HTML papers."""
    
    def __init__(self):
        super().__init__()
        self.authors = []
        self.publication_date = None
        
    def handle_starttag(self, tag, attrs):
        # Look for author metadata tags
        if tag == 'meta':
            attr_dict = dict(attrs)
            if attr_dict.get('name') == 'citation_author':
                content = attr_dict.get('content', '')
                if content:
                    self.authors.append(content)
            elif attr_dict.get('name') in ['citation_date', 'citation_online_date']:
                content = attr_dict.get('content', '')
                if content and not self.publication_date:
                    self.publication_date = content

def extract_paper_info(html_file: Path) -> tuple[List[str], str]:
    """Extract authors and publication year (None if no date is found) from an HTML paper."""
    try:
        with open(html_file, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()
            
        parser = ArxivHTMLParser()
        parser.feed(content)
        
        # Extract year from date if available
        year = None
        if parser.publication_date:
            # Parse year from date string (e.g., "2025/03/13")
            year_match = re.search(r'(\d{4})', parser.publication_date)
            if year_match:
                year = year_match.group(1)
        
        return parser.authors, year
        
    except Exception as e:
        print(f"Warning: Could not parse {html_file.name}: {e}")
        return [], None

def normalize_author_name(author: str) -> str:
    """Normalize author name to lowercase with underscores."""
    # Author names are expected in "Last, First Middle" format
    # and are converted to "first_last" format
    
    # Trim surrounding whitespace
    author = author.strip()
    
    # Split by comma to separate last and first names
    parts = author.split(',', 1)
    if len(parts) == 2:
        last_name = parts[0].strip()
        first_names = parts[1].strip()
        # Take only the first name (not middle names)
        first_name_parts = first_names.split()
        if first_name_parts:
            first_name = first_name_parts[0]
            # Format as "first_last"
            normalized = f"{first_name}_{last_name}"
        else:
            normalized = last_name
    else:
        # If no comma, use as is
        normalized = author
    
    # Drop punctuation, turn whitespace and hyphens into underscores, then lowercase
    normalized = re.sub(r'[^\w\s-]', '', normalized)
    normalized = re.sub(r'[\s-]+', '_', normalized)
    return normalized.lower()

def verify_directories_exist(test_dir: Path) -> bool:
    """Verify that required directories exist."""
    frequent_authors_dir = test_dir / "frequent_authors"
    authors_2025_dir = test_dir / "2025_authors"
    
    if not frequent_authors_dir.exists():
        print("❌ 'frequent_authors' directory not found")
        return False
    
    if not authors_2025_dir.exists():
        print("❌ '2025_authors' directory not found")
        return False
    
    if not frequent_authors_dir.is_dir():
        print("❌ 'frequent_authors' exists but is not a directory")
        return False
        
    if not authors_2025_dir.is_dir():
        print("❌ '2025_authors' exists but is not a directory")
        return False
    
    print("✅ Both required directories exist")
    return True

def analyze_papers(test_dir: Path) -> tuple[Dict[str, List[Path]], Dict[str, List[Path]]]:
    """Analyze all HTML papers and return author-paper mappings."""
    author_papers = {}  # author -> list of papers
    author_2025_papers = {}  # author -> list of 2025 papers
    
    # Find all HTML files
    html_files = list(test_dir.glob("*.html"))
    
    for html_file in html_files:
        authors, year = extract_paper_info(html_file)
        
        for author in authors:
            if not author:
                continue
                
            normalized_name = normalize_author_name(author)
            if not normalized_name:
                continue
            
            # Track all papers by author
            if normalized_name not in author_papers:
                author_papers[normalized_name] = []
            author_papers[normalized_name].append(html_file)
            
            # Track 2025 papers
            if year == '2025':
                if normalized_name not in author_2025_papers:
                    author_2025_papers[normalized_name] = []
                author_2025_papers[normalized_name].append(html_file)
    
    return author_papers, author_2025_papers

def verify_frequent_authors(test_dir: Path, author_papers: Dict[str, List[Path]]) -> bool:
    """Verify that authors with ≥4 papers have their folders and papers."""
    frequent_authors_dir = test_dir / "frequent_authors"
    
    # Find authors with 4 or more papers
    frequent_authors = {author: papers for author, papers in author_papers.items() 
                        if len(papers) >= 4}
    
    if not frequent_authors:
        print("⚠️  No authors found with 4 or more papers")
        # This might be expected depending on the test data
        return True
    
    all_correct = True
    
    for author, expected_papers in frequent_authors.items():
        author_dir = frequent_authors_dir / author
        
        # Check if author directory exists
        if not author_dir.exists():
            print(f"❌ Missing directory for frequent author: {author}")
            all_correct = False
            continue
        
        # Check if all expected papers are present
        for paper in expected_papers:
            paper_copy = author_dir / paper.name
            if not paper_copy.exists():
                print(f"❌ Missing paper {paper.name} in {author} directory")
                all_correct = False
    
    # Flag folders for known authors who fall below the 4-paper threshold
    for item in frequent_authors_dir.iterdir():
        if item.is_dir():
            dir_name = item.name
            if dir_name not in frequent_authors:
                # Check if this author has less than 4 papers
                if dir_name in author_papers and len(author_papers[dir_name]) < 4:
                    print(f"❌ Author {dir_name} has only {len(author_papers[dir_name])} papers but has a folder in frequent_authors")
                    all_correct = False
    
    if all_correct:
        print(f"✅ Frequent authors correctly organized ({len(frequent_authors)} authors)")
    
    return all_correct

def verify_2025_authors(test_dir: Path, author_2025_papers: Dict[str, List[Path]]) -> bool:
    """Verify that authors with ≥3 papers in 2025 have their folders and papers."""
    authors_2025_dir = test_dir / "2025_authors"
    
    # Find authors with 3 or more papers in 2025
    prolific_2025_authors = {author: papers for author, papers in author_2025_papers.items() 
                             if len(papers) >= 3}
    
    if not prolific_2025_authors:
        print("⚠️  No authors found with 3 or more papers in 2025")
        # This might be expected depending on the test data
        return True
    
    all_correct = True
    
    for author, expected_papers in prolific_2025_authors.items():
        author_dir = authors_2025_dir / author
        
        # Check if author directory exists
        if not author_dir.exists():
            print(f"❌ Missing directory for 2025 author: {author}")
            all_correct = False
            continue
        
        # Check if all expected 2025 papers are present
        for paper in expected_papers:
            paper_copy = author_dir / paper.name
            if not paper_copy.exists():
                print(f"❌ Missing 2025 paper {paper.name} in {author} directory")
                all_correct = False
    
    # Flag folders for known authors who fall below the 3-paper threshold for 2025
    for item in authors_2025_dir.iterdir():
        if item.is_dir():
            dir_name = item.name
            if dir_name not in prolific_2025_authors:
                # Check if this author has less than 3 papers in 2025
                if dir_name in author_2025_papers and len(author_2025_papers[dir_name]) < 3:
                    print(f"❌ Author {dir_name} has only {len(author_2025_papers[dir_name])} papers in 2025 but has a folder in 2025_authors")
                    all_correct = False
    
    if all_correct:
        print(f"✅ 2025 authors correctly organized ({len(prolific_2025_authors)} authors)")
    
    return all_correct

def verify_original_files_intact(test_dir: Path) -> bool:
    """Verify that at least one original HTML file remains in the root directory (copied, not moved)."""
    html_files = list(test_dir.glob("*.html"))
    
    if not html_files:
        print("❌ No original HTML files found in root directory")
        return False
    
    print(f"✅ Original HTML files remain intact ({len(html_files)} files)")
    return True

def verify_naming_convention(test_dir: Path) -> bool:
    """Verify that author folder names follow the correct naming convention."""
    frequent_authors_dir = test_dir / "frequent_authors"
    authors_2025_dir = test_dir / "2025_authors"
    
    all_correct = True
    
    # Check frequent_authors subdirectories
    for author_dir in frequent_authors_dir.iterdir():
        if author_dir.is_dir():
            name = author_dir.name
            # Allow only lowercase letters, digits, and underscores
            if not re.match(r'^[a-z0-9_]+$', name):
                print(f"❌ Invalid folder name in frequent_authors: {name} (should be lowercase with underscores)")
                all_correct = False
    
    # Check 2025_authors subdirectories
    for author_dir in authors_2025_dir.iterdir():
        if author_dir.is_dir():
            name = author_dir.name
            # Allow only lowercase letters, digits, and underscores
            if not re.match(r'^[a-z0-9_]+$', name):
                print(f"❌ Invalid folder name in 2025_authors: {name} (should be lowercase with underscores)")
                all_correct = False
    
    if all_correct:
        print("✅ All author folder names follow correct naming convention")
    
    return all_correct

def main():
    """Main verification function."""
    try:
        test_dir = get_test_directory()
        print(f"🔍 Verifying paper organization in: {test_dir}")
        
        # Analyze papers first
        print("\n📊 Analyzing papers...")
        author_papers, author_2025_papers = analyze_papers(test_dir)
        
        # Run verification checks
        checks = [
            ("Directory existence", lambda: verify_directories_exist(test_dir)),
            ("Original files intact", lambda: verify_original_files_intact(test_dir)),
            ("Frequent authors organization", lambda: verify_frequent_authors(test_dir, author_papers)),
            ("2025 authors organization", lambda: verify_2025_authors(test_dir, author_2025_papers)),
            ("Naming conventions", lambda: verify_naming_convention(test_dir))
        ]
        
        all_passed = True
        for check_name, check_func in checks:
            print(f"\n📋 Checking: {check_name}")
            if not check_func():
                all_passed = False
        
        if all_passed:
            print("\n🎉 All verification checks passed!")
            sys.exit(0)
        else:
            print("\n❌ Some verification checks failed!")
            sys.exit(1)
            
    except Exception as e:
        print(f"❌ Verification failed with error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()
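
The script above reads the task folder from the `FILESYSTEM_TEST_DIR` environment variable and exits with status 0 when every check passes and 1 otherwise. One way to drive it from Python is sketched below; the script path `verify.py` and the helpers `verifier_invocation` and `run_verifier` are assumptions made for illustration.

```python
import os
import subprocess
import sys


def verifier_invocation(script_path, test_dir):
    """Build the command and environment for one verifier run."""
    env = dict(os.environ, FILESYSTEM_TEST_DIR=str(test_dir))
    return [sys.executable, str(script_path)], env


def run_verifier(script_path, test_dir):
    """Return True when the verification script exits with status 0."""
    cmd, env = verifier_invocation(script_path, test_dir)
    return subprocess.run(cmd, env=env).returncode == 0
```

For example, `run_verifier("verify.py", "/path/to/papers")` would report whether the organized folder passes, assuming the script has been saved under that name.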