MCP Benchmark Leaderboard 2025-09-10 Update

Explore the latest MCPMark leaderboard update featuring the newly added Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905. Discover their tool-use capabilities, success rates, and cost efficiency for real-world MCP applications.

MCPMark is a comprehensive, stress-testing MCP benchmark: a collection of diverse, verifiable tasks designed to evaluate model and agent capabilities in real-world MCP use.

MCP Benchmark Leaderboard 2025-09-10 Update

MCP Benchmark Leaderboard Update ⚡

We’ve added three newly released models to the MCP tool-use leaderboard: Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905.

Key highlights:

  • Qwen-3-Coder takes the #1 spot among leading open-source models at a highly competitive per-run cost.

  • Grok-Code-Fast-1 offers the lowest per-run cost ($36.46) and the fastest average agent time (156.6s) across the top 10 models.

  • Kimi-K2-0905 outperforms the earlier Kimi-K2 in success rate, though at nearly double the per-run cost and average agent time.

MCP Benchmark Leaderboard update - Open-Source Models | Qwen-3-Coder-Plus ranks #1

Notably, Qwen-3-Coder delivers a success rate remarkably close to O3, but at roughly one-third of the per-run cost — giving the community a cost-effective option for MCP tool-use applications.
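
For context on how such figures are typically derived, here is a minimal sketch of aggregating per-run records into the leaderboard's three headline metrics (success rate, average per-run cost, average agent time). The record schema, field names, and sample task IDs below are assumptions for illustration only, not MCPMark's actual data format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One evaluation run of a model on a single task (illustrative schema)."""
    task_id: str
    success: bool        # did the run pass the task's programmatic verifier?
    cost_usd: float      # total API cost of the run
    agent_time_s: float  # wall-clock time the agent spent on the run


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into leaderboard-style metrics."""
    return {
        "success_rate_pct": 100.0 * mean(r.success for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
        "avg_agent_time_s": mean(r.agent_time_s for r in runs),
    }


if __name__ == "__main__":
    # Hypothetical run records, for illustration only.
    sample = [
        RunRecord("notion_task_01", True, 0.42, 131.0),
        RunRecord("github_task_07", False, 0.65, 204.5),
        RunRecord("filesystem_task_03", True, 0.28, 98.2),
    ]
    print(summarize(sample))
```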

MCP Benchmark Leaderboard update - all models

For more details, see our X post: