Introducing MCPMark Verified

MCPMark Verified is a stabilized, version-pinned subset of MCPMark's standard tasks for more reliable and reproducible MCP evaluation.

The standard tasks themselves are unchanged. What changed is the rigor around them: every environment is pinned to a fixed server version, and every verification script has been reviewed and tightened so that a correct solution passes and an incorrect one fails, consistently across runs and over time.

MCPMark Verified was built with the support of the Kimi team, whose evaluation work surfaced the reliability gaps in the original set — ambiguous verifiers and unpinned server behavior — that this pass set out to fix.

Pinned environments

MCP servers change, and a verifier that depends on unpinned server behavior produces scores that drift between runs. In Verified, each server is pinned to an exact version.

| Environment | Pinned server |
| --- | --- |
| Filesystem | @modelcontextprotocol/server-filesystem@2025.12.18 |
| GitHub | ghcr.io/github/github-mcp-server:v0.15.0 |
| Notion | @notionhq/notion-mcp-server@1.9.1 |
| Playwright | @playwright/mcp@0.0.68 |
| Postgres | postgres-mcp==0.3.0 |
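
To make "exact version" concrete, here is a minimal sketch of a check that each pin names one specific version rather than a range or a floating tag such as latest. The PINS mapping copies the specs listed above. The is_exact_pin helper and its regular expressions are illustrative assumptions, not part of MCPMark.

```python
import re

# Pinned server specs, copied from the list above.
PINS = {
    "Filesystem": "@modelcontextprotocol/server-filesystem@2025.12.18",
    "GitHub": "ghcr.io/github/github-mcp-server:v0.15.0",
    "Notion": "@notionhq/notion-mcp-server@1.9.1",
    "Playwright": "@playwright/mcp@0.0.68",
    "Postgres": "postgres-mcp==0.3.0",
}

# Three dotted numeric components, optionally prefixed with "v".
_VERSION = r"v?\d+\.\d+\.\d+"

# Hypothetical patterns for the three spec styles that appear above.
_EXACT_PATTERNS = [
    re.compile(rf"^(@[\w.-]+/)?[\w.-]+@{_VERSION}$"),  # npm: name@version
    re.compile(rf"^[\w.-]+(/[\w.-]+)+:{_VERSION}$"),   # container: image:tag
    re.compile(rf"^[\w.-]+=={_VERSION}$"),             # pip: name==version
]


def is_exact_pin(spec: str) -> bool:
    """Return True if the spec names a single concrete version."""
    return any(pattern.match(spec) for pattern in _EXACT_PATTERNS)


# Environments whose spec would float; empty when every pin is exact.
unpinned = [name for name, spec in PINS.items() if not is_exact_pin(spec)]
```

A check along these lines could run before an evaluation starts, so that a spec like @playwright/mcp@latest or postgres-mcp>=0.3.0 is rejected instead of silently resolving to a newer server.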

The evaluation harness is pinned as well, including model call parameters, reasoning-effort handling, and the agent loop, so the model under test is the only variable across runs.
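
One way to confirm that two runs really differ only in the model is to fingerprint everything else. The sketch below is an assumption about how that could be done, not a description of the MCPMark harness: the HarnessConfig fields, their values, and the fingerprint helper are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessConfig:
    # Hypothetical harness settings; the model name is deliberately excluded
    # because it is the one thing allowed to vary between runs.
    temperature: float
    reasoning_effort_policy: str
    max_steps: int
    agent_loop_version: str


def fingerprint(config: HarnessConfig) -> str:
    """Hash the config in canonical JSON form so any change is detectable."""
    canonical = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
run_a = HarnessConfig(1.0, "pass-through", 100, "loop-v1")
run_b = HarnessConfig(1.0, "pass-through", 100, "loop-v1")
drifted = HarnessConfig(1.0, "pass-through", 50, "loop-v1")

comparable = fingerprint(run_a) == fingerprint(run_b)
```

Storing the fingerprint next to each result would let a reader see at a glance whether two scores came from the same harness settings.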

Verifier changes

The tasks are identical to standard MCPMark; only the verification changed. We reviewed all 127 standard tasks and grouped the fixes into two categories.

Major changes rebuilt a verifier or its fixture:

  • Postgres dba_vector_analysis: inlined the 500-line vector fixture into setup and rewrote the verifier for deterministic state.
  • Playwright extraction_table: regenerated the reference data and rewrote the extraction checks.
  • WebArena search_filtering_operations and the shopping-admin analytics tasks (fitness_promotion_strategy, marketing_customer_analysis, sales_inventory_analysis, customer_segmentation_setup): reworked logic and clarified the descriptions.
  • Notion work_history_addition, hyperfocus_analysis_report, and quarterly_review_dashboard: overhauled verifiers and descriptions on the most error-prone pages.
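
The property these rewrites aim for is that a verifier accepts a solved state and rejects an unsolved one on every run. A minimal sketch of a test for that property follows. The Verifier signature, the check_verifier helper, and the toy verify_rows function are hypothetical and do not come from the MCPMark codebase.

```python
from typing import Callable

# Hypothetical verifier signature: takes a snapshot of the environment state
# and returns True when the task counts as solved.
Verifier = Callable[[dict], bool]


def check_verifier(verify: Verifier, solved: dict, unsolved: dict, runs: int = 5) -> bool:
    """Return True if `verify` accepts `solved` and rejects `unsolved` on every run."""
    return all(verify(solved) and not verify(unsolved) for _ in range(runs))


def verify_rows(state: dict) -> bool:
    """Toy verifier: compares rows as a set so row order cannot flip the verdict."""
    expected = {("alice", 3), ("bob", 5)}
    return {tuple(row) for row in state["rows"]} == expected


solved = {"rows": [["bob", 5], ["alice", 3]]}   # correct rows, different order
unsolved = {"rows": [["alice", 3]]}             # one row missing

ok = check_verifier(verify_rows, solved, unsolved)
```

A verifier that always returns True fails this check because it also accepts the unsolved state, which is the kind of ambiguity the review was meant to remove.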

Minor changes were targeted robustness fixes: GitHub (PR-title-aware squash detection, case-insensitive matching, pinned ESLint v8), Postgres (tighter acceptance conditions and role cleanup), Filesystem (clarified descriptions), and smart-quote normalization across WebArena.
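
Two of these fixes, case-insensitive matching and smart-quote normalization, come down to normalizing both sides of a comparison before testing equality. The following is a sketch of that idea under assumptions: the normalize and matches helpers are illustrative names, and the actual verifiers may normalize differently.

```python
# Map typographic quotes to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize(text: str) -> str:
    """Replace smart quotes with ASCII quotes, then fold case."""
    return text.translate(_QUOTE_MAP).casefold()


def matches(actual: str, expected: str) -> bool:
    """Compare two strings after normalization."""
    return normalize(actual) == normalize(expected)
```

With this, an answer containing a curly apostrophe, such as "Men\u2019s Yoga Top", compares equal to the expected "men's yoga top", while genuinely different strings still fail.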

Results

Scores on the Verified set, by environment (points / available):

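| Model | Filesystem | Notion | GitHub | Playwright | Postgres | Total |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 (xhigh) | 26 / 30 | 25 / 28 | 22 / 23 | 24 / 25 | 21 / 21 | 92.9% |
| Kimi K2.7 | 25 / 30 | 22 / 28 | 17 / 23 | 21 / 25 | 18 / 21 | 81.1% |
| Claude Opus 4.8 (max) | 26 / 30 | 22 / 28 | 13 / 23 | 17 / 25 | 19 / 21 | 76.4% |
| Kimi K2.6 | 21 / 30 | 20.5 / 28 | 14 / 23 | 19 / 25 | 18 / 21 | 72.8% |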

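GPT-5.5 (xhigh) scores highest at 92.9%, with a full 21 / 21 on Postgres and 22 / 23 on GitHub. Kimi K2.7 reaches 81.1%, up 8.3 points from Kimi K2.6 and ahead of Claude Opus 4.8 (max) at 76.4%. Pooled over all four models, GitHub (66 / 92) and Notion (89.5 / 112) remain the lowest-scoring environments, although the weakest environment differs by model: for GPT-5.5 (xhigh) it is Filesystem.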
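
The Total column is points scored divided by the 127 points available. The short sketch below recomputes it from the per-environment scores and also pools each environment over the four models. The variable and function names are illustrative.

```python
# Points available per environment, in the order used in the results above.
AVAILABLE = {"Filesystem": 30, "Notion": 28, "GitHub": 23, "Playwright": 25, "Postgres": 21}

# Points scored per model, copied from the results above (same column order).
SCORES = {
    "GPT-5.5 (xhigh)": [26, 25, 22, 24, 21],
    "Kimi K2.7": [25, 22, 17, 21, 18],
    "Claude Opus 4.8 (max)": [26, 22, 13, 17, 19],
    "Kimi K2.6": [21, 20.5, 14, 19, 18],
}

total_available = sum(AVAILABLE.values())


def total_percent(points: list) -> float:
    """Overall score as a percentage of all available points, to one decimal."""
    return round(100 * sum(points) / total_available, 1)


totals = {model: total_percent(points) for model, points in SCORES.items()}

# Score rate per environment, pooled over all models.
pooled = {
    env: round(100 * sum(points[i] for points in SCORES.values()) / (len(SCORES) * cap), 1)
    for i, (env, cap) in enumerate(AVAILABLE.items())
}

# The two environments with the lowest pooled rate.
lowest_two = sorted(pooled, key=pooled.get)[:2]
```

Because the total is computed over points rather than averaged over environments, an environment with more tasks, such as Filesystem, weighs more heavily in the final percentage.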

Cost and efficiency

We also record steps and token usage, averaged per task:

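| Model | Avg steps | Avg input tokens | Avg output tokens |
| --- | --- | --- | --- |
| Kimi K2.7 | 14.8 | 659.1k | 7.2k |
| GPT-5.5 (xhigh) | 13.2 | 518.7k | 8.5k |
| Claude Opus 4.8 (max) | 15.1 | 1,241.5k | 6.0k |
| Kimi K2.6 | 20.8 | 660.3k | 10.7k |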

Kimi K2.7 averages 14.8 steps, about 29% fewer than Kimi K2.6 (20.8), at a comparable input size (659.1k vs. 660.3k tokens). Claude Opus 4.8 (max) uses the most input tokens at 1.24M on average, consistent with its maximum reasoning-effort setting.
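
The step comparison between the two Kimi models is a relative reduction, which can be reproduced from the averages above. The helper name in this sketch is illustrative.

```python
def percent_fewer(new: float, old: float) -> float:
    """Relative reduction from `old` to `new`, as a percentage to one decimal."""
    return round(100 * (old - new) / old, 1)


# Average steps per task: Kimi K2.7 vs. Kimi K2.6.
steps_reduction = percent_fewer(14.8, 20.8)

# Ratio of average input tokens (in thousands): Kimi K2.7 vs. Kimi K2.6.
input_ratio = round(659.1 / 660.3, 3)
```

An input ratio this close to 1 alongside a sizable drop in steps means each step carries more input on average, so the saving shows up in step count rather than in total input tokens.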

Availability

The Verified tasks, pinned environments, and verification scripts are available on GitHub. The full set of changes is in this release PR.

Acknowledgments

Thanks to the Kimi team, whose push for a stricter, more reproducible evaluation set motivated this work.