MCPMark

MCPMark is a comprehensive evaluation suite for assessing the agentic abilities of frontier models.

MCPMark includes Model Context Protocol (MCP) services in the following environments:

  • Notion
  • GitHub
  • Filesystem
  • Postgres
  • Playwright
  • Playwright-WebArena

General Procedure

MCPMark is designed to run agentic tasks safely in complex environments. Specifically, it sets up an isolated environment for each experiment, completes the task within it, and then destroys the environment without affecting existing user profiles or information.

How to Use MCPMark

  1. Install MCPMark.
  2. Authorize services (for GitHub and Notion).
  3. Configure the environment variables in .mcp_env (see the sketch after this list).
  4. Run an MCPMark experiment.
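
As an illustrative sketch only (the variable names below are hypothetical placeholders, not necessarily the exact keys MCPMark expects; see the Installation page for the authoritative list), a .mcp_env file typically holds the model-provider and service credentials:

Shell
# Hypothetical .mcp_env sketch -- variable names are placeholders for illustration
OPENAI_API_KEY="sk-..."      # model provider credential (e.g. for o3 / gpt-4.1)
NOTION_API_KEY="ntn_..."     # Notion authorization token
GITHUB_TOKEN="ghp_..."       # GitHub personal access token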

Please refer to Quick Start for details on how to properly start a sample filesystem experiment, and to the Task Page for task details. Please visit Installation and Docker Usage for information on the full MCPMark setup.

Running MCPMark

MCPMark supports the following modes for running experiments (suppose the experiment is named new_exp, the models used are o3 and gpt-4.1, and the environment is notion), with Pass@K as the evaluation metric.

MCPMark in Pip Installation

Shell
# Evaluate ALL tasks
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3 --k K

# Evaluate a single task group (online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume --models o3 --k K

# Evaluate one specific task (task_1 in online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume/task_1 --models o3 --k K

# Evaluate multiple models
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3,gpt-4.1 --k K

MCPMark in Docker Installation

Shell
# Run all tasks for one service
./run-task.sh --mcp notion --models o3 --exp-name new_exp --tasks all

# Run comprehensive benchmark across all services
./run-benchmark.sh --models o3,gpt-4.1 --exp-name new_exp --docker

Experiment Auto-Resume

When an experiment is re-run, only unfinished tasks will be executed. Tasks that previously failed due to pipeline errors (such as a State Duplication Error or an MCP Network Error) will also be retried automatically.
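
For example, re-running the same pipeline command with the same experiment name resumes it (a minimal sketch reusing the pip installation example above):

Shell
# Re-running the identical command resumes new_exp: finished tasks are skipped,
# and tasks that previously hit pipeline errors are retried
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3 --k K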

Results

The experiment results are written to ./results/ (JSON + CSV).

Result Aggregation (for K > 1)

MCPMark supports the aggregated metrics pass@1, pass@K, pass^K, and avg@K.

Shell
python -m src.aggregators.aggregate_results --exp-name new_exp
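
For reference, here is a sketch of the standard definitions of these metrics, assuming n runs per task (n >= K) of which c succeed; MCPMark's exact computation may differ in detail:

LaTeX
% Standard definitions, assuming n runs per task with c successes (n >= K)
\text{pass@}K = \mathbb{E}_{\text{tasks}}\!\left[1 - \binom{n-c}{K} \Big/ \binom{n}{K}\right]  % at least one of K runs passes
\text{pass}^{K} = \mathbb{E}_{\text{tasks}}\!\left[\binom{c}{K} \Big/ \binom{n}{K}\right]      % all K runs pass
\text{avg@}K = \mathbb{E}_{\text{tasks}}\!\left[c / n\right]                                   % mean success rate across runs
% pass@1 is pass@K with K = 1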

Want to contribute?

Visit the Contributing Page to learn how to contribute to MCPMark.