BluMind Benchmark
The public benchmark of AI reasoning applied to water-treatment-plant operations.
BluMind evaluates AI models on real diagnostic and reasoning tasks drawn from the operation of water-treatment plants. Every response is scored by the BluMind Technical Committee — senior practitioners and researchers of the water sector — against a private gold standard.
The benchmark is public, reproducible, and human-scored. The leaderboard updates as new models are submitted and as the Technical Committee releases new cases.
🏆 Ranking · v1.0
v1.0 covers the 5 core failure families (FOUL, SCAL, OXID, MECH, NOWE) of reverse-osmosis desalination plants — 31 cases, 26 model invocations evaluated (13 distinct models, several at multiple reasoning-effort levels), scored by the BluMind Technical Committee.
| # | Subject | Provider | Mode | Pass | Cond | Fail | Crit | Mean (/12) | Brier ↓ | ECE ↓ | Q ↑ | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5-5-none | | 🧠 reasoning | 29 | 2 | 0 | 0 | 11.10 | 0.023 | 0.143 | 0.93 | ✅ Eligible |
| 2 | gpt-5-5-xhigh | | 🧠 reasoning | 29 | 2 | 0 | 0 | 11.03 | 0.021 | 0.135 | 0.93 | ✅ Eligible |
| 3 | gpt-5-5-high | | 🧠 reasoning | 29 | 2 | 0 | 0 | 10.97 | 0.022 | 0.136 | 0.92 | ✅ Eligible |
| 4 | claude-opus-4-7-medium | | 🧠 reasoning | 28 | 3 | 0 | 0 | 11.03 | 0.036 | 0.170 | 0.91 | ✅ Eligible |
| 5 | gpt-5-5-low | | 🧠 reasoning | 28 | 3 | 0 | 0 | 11.00 | 0.025 | 0.148 | 0.91 | ✅ Eligible |
| 6 | gpt-5-5-medium | | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.97 | 0.024 | 0.141 | 0.91 | ✅ Eligible |
| 7 | claude-opus-4-7-high | | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.94 | 0.041 | 0.189 | 0.91 | ✅ Eligible |
| 8 | claude-opus-4-7-off | | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.84 | 0.038 | 0.173 | 0.90 | ✅ Eligible |
| 9 | claude-opus-4-7-xhigh | | 🧠 reasoning | 28 | 2 | 1 | 1 | 10.81 | 0.057 | 0.166 | 0.90 | ⛔ Disqualified |
| 10 | claude-opus-4-7-max | | 🧠 reasoning | 28 | 2 | 1 | 1 | 10.77 | 0.041 | 0.178 | 0.90 | ⛔ Disqualified |
| 11 | gpt-5-5-minimal | | 🧠 reasoning | 27 | 4 | 0 | 0 | 10.87 | 0.025 | 0.145 | 0.89 | ✅ Eligible |
| 12 | gpt-5-medium | | classic | 27 | 4 | 0 | 0 | 10.87 | 0.034 | 0.158 | 0.89 | ✅ Eligible |
| 13 | claude-haiku-4-5-off | | classic | 25 | 5 | 1 | 1 | 10.48 | 0.037 | 0.173 | 0.84 | ⛔ Disqualified |
| 14 | claude-opus-4-6-off | | classic | 24 | 6 | 1 | 1 | 10.58 | 0.035 | 0.100 | 0.83 | ⛔ Disqualified |
| 15 | deepseek-v4-flash-high | | classic | 22 | 8 | 1 | 1 | 10.16 | 0.040 | 0.137 | 0.78 | ⛔ Disqualified |
| 16 | claude-opus-4-7-low | | 🧠 reasoning | 20 | 11 | 0 | 0 | 10.00 | 0.034 | 0.155 | 0.74 | ✅ Eligible |
| 17 | gemini-3-5-flash-high | | 🧠 reasoning | 19 | 12 | 0 | 0 | 9.74 | 0.023 | 0.027 | 0.71 | ✅ Eligible |
| 18 | mistral-small-3 | | classic | 18 | 12 | 1 | 1 | 9.74 | 0.039 | 0.037 | 0.70 | ⛔ Disqualified |
| 19 | deepseek-v4-flash-max | | 🧠 reasoning | 17 | 13 | 1 | 0 | 9.55 | 0.029 | 0.143 | 0.67 | ✅ Eligible |
| 20 | gemini-2-5-pro | | classic | 14 | 16 | 1 | 0 | 9.48 | 0.009 | 0.035 | 0.62 | ✅ Eligible |
| 21 | gemini-3-5-flash-medium | | 🧠 reasoning | 13 | 16 | 2 | 0 | 9.00 | 0.030 | 0.033 | 0.58 | ✅ Eligible |
| 22 | gemini-3-5-flash-low | | 🧠 reasoning | 9 | 21 | 1 | 0 | 9.10 | 0.025 | 0.038 | 0.52 | ✅ Eligible |
| 23 | gemini-3-1-flash-lite-minimal | | classic | 5 | 25 | 1 | 1 | 8.32 | 0.018 | 0.067 | 0.43 | ⛔ Disqualified |
| 24 | mistral-medium-3 | | classic | 0 | 27 | 4 | 0 | 7.84 | 0.035 | 0.076 | 0.33 | ✅ Eligible |
| 25 | gemini-2-5-flash-lite-off | | classic | 0 | 24 | 7 | 3 | 7.35 | 0.050 | 0.039 | 0.31 | ⛔ Disqualified |
| 26 | gpt-3-5-turbo | | classic | 0 | 9 | 22 | 2 | 5.48 | 0.142 | 0.268 | 0.23 | ⛔ Disqualified |
Reading the suffix.
`-low`, `-medium`, `-high`, `-xhigh` and `-max` denote the reasoning effort sent to the model. `-off` denotes the model invoked with its thinking path disabled. `-none` is OpenAI GPT-5.5’s explicit “reasoning OFF” tier. `-minimal` is the lowest non-zero effort on providers that expose it (e.g. Gemini 3.x Flash-Lite). Same `subject_version` ↔ same model snapshot: only the effort knob differs.
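As a rough illustration of the suffix convention, the sketch below splits a subject name from the table into a base model name and an effort level. The function name `split_subject` and the rule "the effort is the last hyphen-separated token" are assumptions inferred from the names in the table, not part of the published BluMind tooling.

```python
from typing import Optional, Tuple

# Effort suffixes listed in the note above.
EFFORT_SUFFIXES = {"low", "medium", "high", "xhigh", "max", "off", "none", "minimal"}


def split_subject(subject: str) -> Tuple[str, Optional[str]]:
    """Split a subject such as 'gpt-5-5-xhigh' into (base model, effort).

    Hypothetical helper: assumes the effort, when present, is the final
    hyphen-separated token. Subjects without a known suffix are returned
    unchanged with an effort of None.
    """
    base, sep, last = subject.rpartition("-")
    if sep and last in EFFORT_SUFFIXES:
        return base, last
    return subject, None


print(split_subject("claude-opus-4-7-xhigh"))  # ('claude-opus-4-7', 'xhigh')
print(split_subject("gemini-2-5-pro"))         # ('gemini-2-5-pro', None)
```

Only the trailing token is inspected, so a name like `mistral-medium-3` is not mistaken for a `-medium` effort level.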
📚 How to read this table
- Pass / Cond / Fail — Per-case classification by the Technical Committee. Pass = a response an experienced operator would accept on its own. Conditional = a response with gaps but salvageable. Fail = a response that would mislead a real operator.
- Crit — Critical automatic fails. Cases where the response recommended an action that would damage the plant or compromise operator safety (for example, recommending an oxidant on polyamide membranes). A single critical fail disqualifies the model from the leaderboard, regardless of all other scores. The triggering action is cited literally in the full leaderboard on GitHub.
- Mean (/12) — Average expert-scored quality per case, on the 0–12 rubric. 12 = “indistinguishable from the expert gold answer”; 0 = “completely wrong”.
- Brier ↓ — A measure of overconfidence, that is, whether the model believes it knows more than it actually does.
  Read it like this: if the model says “I’m very sure”, Brier measures how heavily it is penalised when that confidence was not justified. The lower the better: it means the model not only gets things right, it also declares a reasonable level of confidence.
  In one sentence: Brier measures whether the model’s confidence is prudent on each response.
  Range: 0–1. Lower is better. In v1.0 the observed spread is 0.009 (best) to 0.142 (worst).
  Technically, on every case BluMind computes the squared gap between the model’s stated confidence (for example, “90% sure” = 0.9) and the actual correctness of the response, which takes a value in {0, 0.5, 1}. A response with “100% confidence” that turns out to be wrong contributes a very large error.
- ECE ↓ (Expected Calibration Error) — How trustworthy the model’s confidence is in aggregate.
  Read it like this: if the model says “I’m 70% sure” many times, ECE measures whether it actually gets close to 70% of those right. The lower the better: it means its confidence percentages look more like real probabilities.
  Technically, BluMind groups predictions into 10 confidence bands (0–10%, 10–20%, …, 90–100%) and in each band compares the average stated confidence against the average actual correctness. The per-band gaps are then combined, weighted by the number of cases in each band.
  For example, if the model says “70% sure” on 20 cases but only gets 50% right, that band is miscalibrated by 20 percentage points and increases the ECE.
  Range: 0–1. Lower is better. In v1.0 the observed spread is 0.027 (best) to 0.268 (worst). With only N = 31 cases, ECE values are labelled indicative.
  In one sentence: ECE measures whether the model’s confidence percentages can be trusted as probabilities.
- Why both matter. Imagine an operator who says: “I’m 90% sure this is biofouling.”
  A good Brier means that, on each individual prediction, the operator (or model) is rarely very confident and wrong.
  A good ECE means that, when they say “90% sure” many times, they really are right about 90 times out of every 100.
  The simple difference is:
  - Brier looks at the quality of confidence case by case.
  - ECE looks at whether confidence is well-calibrated in aggregate.
  Both metrics matter because many downstream systems (alarms, operational assistants, automated workflows, recommendation systems) may take the confidence number literally. If the model says “95% sure” but is right far less often, the downstream system can over-react.
- Q ↑ — Composite quality score combining Pass-rate and mean per-case score. This is the ranking column.
- Mode — `classic` means the model was queried at `temperature = 0`. 🧠 `reasoning` means the model was queried using its native deep-thinking mode (Claude reasoning, GPT reasoning, etc.).
- Status — ✅ Eligible if the model has zero critical fails. ⛔ Disqualified otherwise. Disqualified models are still listed for transparency, but they cannot win the leaderboard.
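The two calibration metrics described in the list above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not BluMind's evaluation script: the function names are hypothetical, the per-case squared gaps are averaged (the usual Brier aggregation, assumed here), and a confidence of exactly 1.0 is placed in the 90–100% band.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[float]) -> float:
    """Mean squared gap between stated confidence (0..1) and correctness.

    `outcomes` holds the per-case correctness values in {0, 0.5, 1}.
    Averaging over cases is an assumption of this sketch.
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(gaps) / len(gaps)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[float], n_bins: int = 10
) -> float:
    """Case-weighted average of |mean confidence - mean correctness| per band."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bands = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        # Assumption: a confidence of exactly 1.0 falls into the top band.
        index = min(int(c * n_bins), n_bins - 1)
        bands[index].append((c, o))
    total = len(confidences)
    ece = 0.0
    for band in bands:
        if not band:
            continue  # empty bands carry no weight
        mean_conf = sum(c for c, _ in band) / len(band)
        mean_correct = sum(o for _, o in band) / len(band)
        ece += (len(band) / total) * abs(mean_conf - mean_correct)
    return ece


# The worked example from the text: "70% sure" on 20 cases, half of them right.
conf = [0.7] * 20
truth = [1.0] * 10 + [0.0] * 10
print(round(expected_calibration_error(conf, truth), 3))  # 0.2
print(round(brier_score(conf, truth), 3))                 # 0.29
```

A single response given with confidence 1.0 and correctness 0 contributes (1 − 0)² = 1 to the Brier sum, which is why confident mistakes dominate the score.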
See operational metrics (cost, latency, tokens) and safety-gate citations on GitHub →
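The ranking column and the safety gate can be sketched together. The exact Q formula is not stated on this page; the equal weighting below is an assumption inferred from the published numbers (an equal-weight average of pass rate and mean score divided by 12 appears to reproduce the Q column to two decimals). The `Row` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    subject: str
    passes: int
    cond: int
    fail: int
    crit: int
    mean_score: float  # expert-scored mean on the 0-12 rubric


def q_score(row: Row, max_score: float = 12.0) -> float:
    """Composite of pass rate and normalised mean score.

    Assumption: equal 0.5 / 0.5 weights, inferred from the table rather
    than taken from the published methodology.
    """
    n_cases = row.passes + row.cond + row.fail
    pass_rate = row.passes / n_cases
    return 0.5 * pass_rate + 0.5 * (row.mean_score / max_score)


def status(row: Row) -> str:
    """Safety gate: a single critical fail disqualifies the model."""
    return "Eligible" if row.crit == 0 else "Disqualified"


# Three rows copied from the v1.0 table.
rows = [
    Row("mistral-medium-3", 0, 27, 4, 0, 7.84),
    Row("claude-opus-4-7-xhigh", 28, 2, 1, 1, 10.81),
    Row("gpt-5-5-none", 29, 2, 0, 0, 11.10),
]

# Rank by Q, highest first; disqualified rows stay listed, as in the table.
ranked = sorted(rows, key=q_score, reverse=True)
for row in ranked:
    print(f"{row.subject} {q_score(row):.2f} {status(row)}")
# gpt-5-5-none 0.93 Eligible
# claude-opus-4-7-xhigh 0.90 Disqualified
# mistral-medium-3 0.33 Eligible
```

Keeping the gate separate from the score mirrors the table, where a disqualified model keeps its position by Q but cannot win the leaderboard.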
What makes BluMind different
Independent human scoring. Every response is scored by two members of the BluMind Technical Committee, drawn from senior practitioners and researchers of the water sector. The committee is the institutional authority behind every score.
Safety gate. A single critical-fail recommendation — any action that would damage the plant or compromise operator safety — disqualifies the model from the leaderboard regardless of its other scores. The triggering action is cited literally and made public.
Reproducible. Cases, rubric, prompts, evaluation scripts and aggregated metrics are public. The private gold answers and reviewer mappings stay private — exactly as one would expect from a benchmark that can be trusted.
Read the full methodology on GitHub →
Submit a model
During the foundational phase (until 31 December 2026), valid submissions are evaluated free of charge. The submitter provides metadata and technical access credentials encrypted with the BluMind PGP key, and confirms eligibility against the published scope.
A submission is normally validated within 2 working days and evaluated within 10 working days of validation.
Read the submission guide on GitHub →
The committee
The BluMind Technical Committee is the body of senior practitioners and researchers responsible for the integrity of the benchmark. It is the institutional authority behind every score, classification, and appeal decision.
Public members include Álvaro Díaz del Río Redondo (CEO of BluMind, formerly Head of Innovation at Tedagua and Cobra Infraestructuras Hidráulicas) and Rafael Jiménez Garrido (Country Manager at Whitewater Group, lecturer on the Master’s Degree in Desalination and Water Reuse at the Universidad de Alicante, and industry contributor at ALADYR).
Three further senior international figures from the water sector also sit on the committee; their names are pending public disclosure.
Meet the committee on GitHub →
Contact
- Submissions: submissions@blumind.es · PGP public key
- Technical Committee: committee@blumind.es
- General inquiries: info@blumind.es
- Repository: github.com/blumind/benchmark
BluMind Benchmark is operated by BluMind. The benchmark is released under the license terms in LICENSE.