Introducing MCPMark Verified
MCPMark Verified is a stabilized, version-pinned release of MCPMark's standard tasks for more reliable and reproducible MCP evaluation.
The standard tasks themselves are unchanged. What changed is the rigor around them: every environment is pinned to a fixed server version, and every verification script has been reviewed and tightened so that a correct solution passes and an incorrect one fails, consistently across runs and over time.
MCPMark Verified was built with the support of the Kimi team, whose evaluation work surfaced the reliability gaps in the original set — ambiguous verifiers and unpinned server behavior — that this pass set out to fix.
Pinned environments
MCP servers change, and a verifier that depends on unpinned server behavior produces scores that drift between runs. In Verified, each server is pinned to an exact version.
| Environment | Pinned server |
|---|---|
| Filesystem | @modelcontextprotocol/server-filesystem@2025.12.18 |
| GitHub | ghcr.io/github/github-mcp-server:v0.15.0 |
| Notion | @notionhq/notion-mcp-server@1.9.1 |
| Playwright | @playwright/mcp@0.0.68 |
| Postgres | postgres-mcp==0.3.0 |
The evaluation harness is pinned as well, including model call parameters, reasoning-effort handling, and the agent loop, so the model under test is the only variable across runs.
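As an illustration of what "pinned to an exact version" means in practice, the sketch below checks that each spec in the table ends in an explicit version rather than a floating tag or range. The regular expression and the helper names `pinned_version` and `unpinned` are assumptions made for this example; they are not part of the MCPMark harness.

```python
import re
from typing import Optional

# Pinned server specs copied from the table above.
PINNED_SERVERS = {
    "Filesystem": "@modelcontextprotocol/server-filesystem@2025.12.18",
    "GitHub": "ghcr.io/github/github-mcp-server:v0.15.0",
    "Notion": "@notionhq/notion-mcp-server@1.9.1",
    "Playwright": "@playwright/mcp@0.0.68",
    "Postgres": "postgres-mcp==0.3.0",
}

# Hypothetical rule: a spec counts as pinned when it ends in a dotted numeric
# version introduced by "@" (npm), ":" or ":v" (container tag), or "==" (pip).
_EXACT_VERSION = re.compile(r"(?:@|:v?|==)(\d+(?:\.\d+)+)$")


def pinned_version(spec: str) -> Optional[str]:
    """Return the exact version in a spec, or None if the spec floats."""
    match = _EXACT_VERSION.search(spec)
    return match.group(1) if match else None


def unpinned(servers: dict) -> list:
    """List the environments whose server spec has no exact version."""
    return [name for name, spec in servers.items() if pinned_version(spec) is None]


if __name__ == "__main__":
    for name, spec in PINNED_SERVERS.items():
        print(f"{name}: {pinned_version(spec)}")
    print("unpinned:", unpinned(PINNED_SERVERS))
```

Under this rule, a spec such as `@playwright/mcp@latest` or `postgres-mcp>=0.3.0` would be reported as unpinned, which is the kind of drift the pinning is meant to rule out.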
Verifier changes
The tasks are identical to standard MCPMark; only the verification changed. We reviewed all 127 standard tasks and grouped the fixes into two categories.
Major changes rebuilt a verifier or its fixture:
- Postgres dba_vector_analysis: the 500-line vector fixture is inlined into setup and the verifier rewritten for deterministic state.
- Playwright extraction_table: regenerated the reference data and rewrote the extraction checks.
- WebArena search_filtering_operations and the shopping-admin analytics tasks (fitness_promotion_strategy, marketing_customer_analysis, sales_inventory_analysis, customer_segmentation_setup): reworked logic and clarified the descriptions.
- Notion work_history_addition, hyperfocus_analysis_report, and quarterly_review_dashboard: overhauled verifiers and descriptions on the most error-prone pages.
Minor changes were targeted robustness fixes:
- GitHub: PR-title-aware squash detection, case-insensitive matching, and pinned ESLint v8.
- Postgres: tighter acceptance conditions and role cleanup.
- Filesystem: clarified descriptions.
- WebArena: smart-quote normalization across tasks.
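Two of the minor fixes, smart-quote normalization and case-insensitive matching, come down to canonicalizing text before comparing it. The sketch below shows one way such a comparison could look. The helpers `normalize` and `answers_match` are hypothetical, and combining both fixes in a single function is a simplification for illustration; the actual verifiers apply them per environment and may differ in detail.

```python
# Map typographic (curly) quotes to their ASCII equivalents.
_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize(text: str) -> str:
    """Replace curly quotes with straight ones and fold case."""
    return text.translate(_QUOTES).casefold()


def answers_match(expected: str, actual: str) -> bool:
    """Compare two strings after quote and case normalization."""
    return normalize(expected) == normalize(actual)


if __name__ == "__main__":
    # A page that renders curly quotes should still match a plain-ASCII answer.
    print(answers_match("Men\u2019s \u201cSport\u201d Watch", "men's \"sport\" watch"))
```

Normalizing both sides keeps a correct answer from failing only because a page rendered a curly apostrophe or a title was capitalized differently, while answers that differ in content still fail.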
Results
Scores on the Verified set, by environment (points / available):
| Model | Filesystem | Notion | GitHub | Playwright | Postgres | Total |
|---|---|---|---|---|---|---|
| GPT-5.5 (xhigh) | 26 / 30 | 25 / 28 | 22 / 23 | 24 / 25 | 21 / 21 | 92.9% |
| Kimi K2.7 | 25 / 30 | 22 / 28 | 17 / 23 | 21 / 25 | 18 / 21 | 81.1% |
| Claude Opus 4.8 (max) | 26 / 30 | 22 / 28 | 13 / 23 | 17 / 25 | 19 / 21 | 76.4% |
| Kimi K2.6 | 21 / 30 | 20.5 / 28 | 14 / 23 | 19 / 25 | 18 / 21 | 72.8% |
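GPT-5.5 (xhigh) scores highest at 92.9%, with a perfect 21 / 21 on Postgres and 22 / 23 on GitHub. Kimi K2.7 reaches 81.1%, up 8.3 points from Kimi K2.6 and above Claude Opus 4.8 (max). Summed over the four models, GitHub (66 / 92) and Notion (89.5 / 112) are the lowest-scoring environments, although the weakest environment differs by model: GPT-5.5 (xhigh), for example, loses the most points on Filesystem.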
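The totals can be reproduced from the per-environment scores. The short sketch below assumes, as the table suggests, that the total is simply points earned divided by the 127 points available; the variable and function names are chosen for this example.

```python
# Points available per environment, from the results table.
AVAILABLE = {"Filesystem": 30, "Notion": 28, "GitHub": 23, "Playwright": 25, "Postgres": 21}

# Points earned per model, in the same environment order as AVAILABLE.
SCORES = {
    "GPT-5.5 (xhigh)": [26, 25, 22, 24, 21],
    "Kimi K2.7": [25, 22, 17, 21, 18],
    "Claude Opus 4.8 (max)": [26, 22, 13, 17, 19],
    "Kimi K2.6": [21, 20.5, 14, 19, 18],
}


def total_percent(points: list) -> float:
    """Overall score: points earned over points available, as a percentage."""
    return round(100 * sum(points) / sum(AVAILABLE.values()), 1)


def environment_totals() -> dict:
    """Points earned per environment, summed over all models."""
    return {
        env: sum(points[i] for points in SCORES.values())
        for i, env in enumerate(AVAILABLE)
    }


if __name__ == "__main__":
    for model, points in SCORES.items():
        print(f"{model}: {total_percent(points)}%")
    print(environment_totals())
```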
Cost and efficiency
We also record steps and token usage, averaged per task:
| Model | Avg steps | Avg input tokens | Avg output tokens |
|---|---|---|---|
| Kimi K2.7 | 14.8 | 659.1k | 7.2k |
| GPT-5.5 (xhigh) | 13.2 | 518.7k | 8.5k |
| Claude Opus 4.8 (max) | 15.1 | 1,241.5k | 6.0k |
| Kimi K2.6 | 20.8 | 660.3k | 10.7k |
Kimi K2.7 averages 14.8 steps, about 29% fewer than Kimi K2.6's 20.8, at a comparable input size (659.1k vs. 660.3k tokens). Claude Opus 4.8 (max) uses the most input tokens at 1.24M on average, nearly twice the next-highest figure, consistent with its maximum reasoning-effort setting.
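For readers who want to check the comparisons, the sketch below derives the relative step reduction and a rough input-tokens-per-step figure from the table. The per-step figure is a derived quantity introduced here for illustration, not a metric the benchmark reports.

```python
# Per-task averages from the table: (steps, input tokens in thousands).
EFFICIENCY = {
    "Kimi K2.7": (14.8, 659.1),
    "GPT-5.5 (xhigh)": (13.2, 518.7),
    "Claude Opus 4.8 (max)": (15.1, 1241.5),
    "Kimi K2.6": (20.8, 660.3),
}


def relative_reduction(new: float, old: float) -> float:
    """Fractional decrease going from old to new."""
    return (old - new) / old


def input_per_step(model: str) -> float:
    """Average input tokens (thousands) consumed per step."""
    steps, input_k = EFFICIENCY[model]
    return input_k / steps


if __name__ == "__main__":
    k27_steps = EFFICIENCY["Kimi K2.7"][0]
    k26_steps = EFFICIENCY["Kimi K2.6"][0]
    print(f"Step reduction, K2.7 vs K2.6: {relative_reduction(k27_steps, k26_steps):.1%}")
    for model in EFFICIENCY:
        print(f"{model}: {input_per_step(model):.1f}k input tokens per step")
```

The step reduction works out to roughly 28.8%, and the per-step view makes the gap in input usage easier to see than the raw averages alone.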
Availability
The Verified tasks, pinned environments, and verification scripts are available on GitHub. The full set of changes is in this release PR.
Acknowledgments
Thanks to the Kimi team, whose push for a stricter and reproducible evaluation set motivated this work.