# MCPMark
MCPMark is a comprehensive suite for evaluating the agentic abilities of frontier models. It provides Model Context Protocol (MCP) services in the following environments:
- Notion
- GitHub
- Filesystem
- Postgres
- Playwright
- Playwright-WebArena
## General Procedure
MCPMark is designed to run agentic tasks in complex environments safely. Specifically, it sets up an isolated environment for each experiment, completes the task, and then destroys the environment without affecting existing user profiles or data.
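The setup / execute / teardown cycle can be sketched as follows. This is a minimal illustration, not MCPMark's actual code; the `run_task_isolated` helper and the sandbox layout are assumptions:

```python
import shutil
import tempfile
from pathlib import Path


def run_task_isolated(task):
    """Run a task inside a throwaway sandbox directory.

    A fresh sandbox is created, the task runs against it, and the sandbox
    is destroyed afterwards even if the task raises, so no pre-existing
    user state is ever touched.
    """
    sandbox = Path(tempfile.mkdtemp(prefix="mcpmark_"))
    try:
        return task(sandbox)  # complete the task in isolation
    finally:
        # Destroy the environment unconditionally.
        shutil.rmtree(sandbox, ignore_errors=True)
```

The `try`/`finally` guarantees teardown even when the task fails partway through, which is what makes repeated experiments safe.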
## How to Use MCPMark
- Install MCPMark.
- Authorize services (for GitHub and Notion).
- Configure the environment variables in `.mcp_env`.
- Run the MCPMark experiment.
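A `.mcp_env` file might look like the fragment below. This is an example shape only; the variable names are placeholders, not necessarily the keys MCPMark expects:

```
# .mcp_env — illustrative placeholders, not MCPMark's actual key names
OPENAI_API_KEY=your-model-provider-key
NOTION_API_KEY=your-notion-integration-token
GITHUB_TOKEN=your-github-personal-access-token
```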
Please refer to the Quick Start for details on how to run a sample filesystem experiment properly, and to the Task page for task details. Please visit the Installation and Docker Usage pages for full MCPMark setup information.
## Running MCPMark
MCPMark supports the following modes for running experiments (suppose the experiment is named new_exp, the models used are o3 and gpt-4.1, and the environment is notion), with pass@k as the evaluation metric.
### MCPMark in Pip Installation
### MCPMark in Docker Installation
### Experiment Auto-Resume
When an experiment is re-run, only unfinished tasks are executed. Tasks that previously failed due to pipeline errors (such as a State Duplication Error or an MCP Network Error) are also retried automatically.
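The resume filter can be sketched as follows. This is illustrative only; the status names and record layout are assumptions, not MCPMark's actual schema:

```python
# Pipeline errors that warrant an automatic retry (assumed labels).
RETRYABLE = {"state_duplication_error", "mcp_network_error"}


def tasks_to_run(tasks, previous_results):
    """Return the tasks a re-run should execute.

    A task is selected if it never ran (no previous record) or if its
    previous run ended in a retryable pipeline error. Tasks that already
    finished — whether they passed or genuinely failed — are skipped.
    """
    todo = []
    for task in tasks:
        status = previous_results.get(task)  # None => never ran
        if status is None or status in RETRYABLE:
            todo.append(task)
    return todo
```

Note that a task whose agent simply failed is *not* retried; only infrastructure-level errors are.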
## Results
The experiment results are written to `./results/` (JSON + CSV).
### Result Aggregation (for k > 1)
MCPMark supports the aggregated metrics pass@1, pass@k, pass^k, and avg@k.
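These metrics have commonly used definitions over n runs of a task with c successes; the sketch below shows the standard estimators, though MCPMark's exact implementation may differ:

```python
from math import comb


def pass_at_k(n, c, k):
    """pass@k: probability that at least one of k runs sampled (without
    replacement) from n total runs with c successes is a success.
    Computed via the unbiased estimator 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """pass^k: probability that ALL k sampled runs are successes."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)


def avg_at_k(n, c):
    """avg@k: mean success rate across the n runs."""
    return c / n
```

For example, with n = 4 runs and c = 2 successes, pass@2 = 5/6 while pass^2 = 1/6, illustrating how the two metrics reward breadth versus consistency.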
## Want to Contribute?
Visit the Contributing page to learn how to contribute to MCPMark.