BluMind Benchmark
The public benchmark of AI reasoning applied to water-treatment-plant operations — in Spanish.
BluMind evaluates AI models on real diagnostic and reasoning tasks drawn from the operation of water-treatment plants. Every response is scored by the BluMind Technical Committee, a panel of senior practitioners and researchers in the water sector, against a private gold standard.
The benchmark is public, reproducible, and human-scored. The leaderboard updates as new models are submitted and as the Technical Committee releases new cases.
Leaderboard · v1.0
v1.0 covers the 5 core failure families (FOUL, SCAL, OXID, MECH, NOWE) of reverse-osmosis desalination plants — 31 cases, 12 models evaluated, scored by the BluMind Technical Committee.
| # | Subject | Provider | Mode | Mean (/12) | Q ↑ | Status |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 | Anthropic | reasoning | 11.03 | 0.91 | Eligible |
| 2 | gpt-5-5 | OpenAI | reasoning | 10.97 | 0.91 | Eligible |
| 3 | gpt-5 | OpenAI | classic | 10.87 | 0.89 | Eligible |
| 4 | claude-haiku-4-5 | Anthropic | classic | 10.48 | 0.84 | Disqualified |
| 5 | claude-opus-4-6 | Anthropic | classic | 10.58 | 0.83 | Disqualified |
Top 5 by composite quality score Q. Disqualified models triggered the safety gate on at least one case.
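As an illustration only (the actual composite Q formula and tie-break rule are not published here), the ranking shown in the table, sorted by Q with ties broken by mean score and safety-gate hits flagged but still visible, can be sketched as:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    mean_score: float      # mean rubric score out of 12
    q: float               # composite quality score Q (higher is better)
    safety_gate_hit: bool  # True if any case triggered the safety gate

def leaderboard(entries):
    """Rank all entries by Q descending (assumed tie-break: mean score).
    Safety-gate hits stay on the board but are marked Disqualified."""
    ranked = sorted(entries, key=lambda e: (-e.q, -e.mean_score))
    return [(e.model, e.q, "Disqualified" if e.safety_gate_hit else "Eligible")
            for e in ranked]

# Sample data mirroring three rows of the v1.0 table above.
rows = leaderboard([
    Entry("claude-haiku-4-5", 10.48, 0.84, True),
    Entry("claude-opus-4-7", 11.03, 0.91, False),
    Entry("claude-opus-4-6", 10.58, 0.83, True),
])
```

With this assumed ordering, the sample reproduces the relative ranks of those three rows in the published table.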
See the full leaderboard, operational costs, and safety-gate details on GitHub →
What makes BluMind different
Independent human scoring. Every response is independently scored by two members of the BluMind Technical Committee, senior practitioners and researchers in the water sector. The committee is the institutional authority behind every score.
Safety gate. A single critical-fail recommendation — any action that would damage the plant or compromise operator safety — disqualifies the model from the leaderboard regardless of its other scores. The triggering action is cited literally and made public.
Reproducible. Cases, rubric, prompts, evaluation scripts, and aggregated metrics are all public. The private gold answers and reviewer mappings stay private to preserve the integrity of future evaluations.
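A minimal sketch of how two-reviewer scoring and the safety gate could compose. This assumes (it is not the published aggregation) that a model's mean is the average over cases of the two reviewers' mean, and that a single critical-fail suffices to disqualify:

```python
def score_model(cases):
    """cases: list of (reviewer_a, reviewer_b, critical_fail) tuples,
    each reviewer score on a 0-12 rubric scale.
    Returns (mean_score, status). Assumed aggregation, for illustration."""
    per_case = [(a + b) / 2 for a, b, _ in cases]
    mean_score = sum(per_case) / len(per_case)
    # Safety gate: one critical-fail disqualifies, regardless of scores.
    status = "Disqualified" if any(cf for _, _, cf in cases) else "Eligible"
    return round(mean_score, 2), status
```

For example, `score_model([(11, 12, False), (10, 11, False)])` yields `(11.0, "Eligible")`, while a perfect-scoring model with one critical-fail case still comes back `"Disqualified"`.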
Read the full methodology on GitHub →
Submit a model
During the foundational phase (until 31 December 2026), valid submissions are evaluated free of charge. The submitter provides metadata, technical access credentials encrypted with the BluMind PGP key, and confirms eligibility against the published scope.
A submission is normally validated within 2 working days and evaluated within 10 working days of validation.
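The stated timelines can be sketched with a simple working-day counter. This skips weekends only; public holidays are ignored, which is an assumption:

```python
from datetime import date, timedelta

def add_working_days(start: date, n: int) -> date:
    """Advance n working days (Mon-Fri) past start, skipping weekends."""
    d = start
    while n > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # 0=Mon .. 4=Fri
            n -= 1
    return d

submitted = date(2026, 3, 2)                       # a Monday
validated_by = add_working_days(submitted, 2)      # validation: 2 working days
evaluated_by = add_working_days(validated_by, 10)  # evaluation: 10 more working days
```

Here `validated_by` lands on Wednesday 2026-03-04 and `evaluated_by` on Wednesday 2026-03-18.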
Read the submission guide on GitHub →
The committee
The BluMind Technical Committee is the body of senior practitioners and researchers responsible for the integrity of the benchmark. It is the institutional authority behind every score, classification, and appeal decision.
Public members include Álvaro Díaz del Río Redondo — CEO of BluMind, formerly Head of Innovation at Tedagua and Cobra Infraestructuras Hidráulicas — and Rafael Jiménez Garrido — Country Manager at Whitewater Group, lecturer at the Master’s Degree in Desalination and Water Reuse (Universidad de Alicante), industry contributor at ALADYR.
Three additional senior international figures in the water sector sit on the committee; their names are pending public disclosure.
Meet the committee on GitHub →
Contact
- Submissions: submissions@blumind.es · PGP public key
- Technical Committee: committee@blumind.es
- General inquiries: info@blumind.es
- Repository: github.com/blumind/benchmark
BluMind Benchmark is operated by BluMind. The benchmark is released under the license terms in LICENSE.