MCP benchmark Leaderboard 2025-09-10 Update
Explore the latest MCPMark leaderboard update featuring top MCP benchmark models like Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905. Discover their tool-use capabilities, success rates, and cost efficiency for real-world MCP applications.
MCPMark is a comprehensive, stress-testing MCP benchmark and a collection of diverse, verifiable tasks designed to evaluate model and agent capabilities in real-world MCP use.
MCP benchmark Leaderboard Update ⚡
We’ve added three newly released models to the leaderboard for MCP tool-use capabilities: Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905.
Key highlights:
- Qwen-3-Coder takes the #1 spot among leading open-source models, with an impressive per-run cost of just $36.46.
- Grok-Code-Fast-1 offers the lowest per-run cost ($36.46) and the fastest average agent time (156.6s) across the top 10 models.
- Kimi-K2-0905 outperforms Kimi2 in success rate, though at nearly double the per-run cost and average agent time.
Notably, Qwen-3-Coder delivers a success rate remarkably close to O3 at roughly one-third of the per-run cost, giving the community a cost-effective option for MCP tool-use applications.
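For readers who want to run this kind of cost-versus-success comparison on their own results, here is a minimal sketch. The `RunSummary` type, the helper names, and every number in it are hypothetical placeholders chosen so the arithmetic is easy to follow. They are not leaderboard data and not part of MCPMark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one model's benchmark run."""
    name: str
    success_rate: float  # fraction of tasks solved, between 0 and 1
    per_run_cost: float  # USD spent on one full benchmark run


def cost_ratio(a: RunSummary, b: RunSummary) -> float:
    """Per-run cost of `a` expressed as a fraction of `b`'s cost."""
    return a.per_run_cost / b.per_run_cost


def cost_per_success_point(run: RunSummary) -> float:
    """USD spent per percentage point of success rate."""
    if run.success_rate <= 0:
        raise ValueError("success_rate must be positive")
    return run.per_run_cost / (run.success_rate * 100)


# Placeholder values only: NOT taken from the leaderboard.
model_a = RunSummary("model-a", success_rate=0.24, per_run_cost=30.0)
model_b = RunSummary("model-b", success_rate=0.25, per_run_cost=90.0)

ratio = cost_ratio(model_a, model_b)
print(f"{model_a.name} costs {ratio:.2f}x as much per run as {model_b.name}")
for run in (model_a, model_b):
    print(f"{run.name}: ${cost_per_success_point(run):.2f} per success point")
```

With these placeholder inputs, a model that is slightly behind on success rate but costs a third as much per run comes out well ahead on dollars per success point, which is the shape of the trade-off described above.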