BluMind Benchmark

The public benchmark of AI reasoning applied to water-treatment-plant operations.

BluMind evaluates AI models on real diagnostic and reasoning tasks drawn from the operation of water-treatment plants. Every response is scored against a private gold standard by the BluMind Technical Committee, a body of senior practitioners and researchers from the water sector.

The benchmark is public, reproducible, and human-scored. The leaderboard updates as new models are submitted and as the Technical Committee releases new cases.


🏆 Ranking · v1.0

v1.0 covers the five core failure families (FOUL, SCAL, OXID, MECH, NOWE) of reverse-osmosis desalination plants: 31 cases and 26 evaluated model invocations (13 distinct models, several at multiple reasoning-effort levels), all scored by the BluMind Technical Committee.

| # | Subject | Provider | Mode | Pass | Cond | Fail | Crit | Mean (/12) | Brier ↓ | ECE ↓ | Q ↑ | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5-5-none | OpenAI | 🧠 reasoning | 29 | 2 | 0 | 0 | 11.10 | 0.023 | 0.143 | 0.93 | ✅ Eligible |
| 2 | gpt-5-5-xhigh | OpenAI | 🧠 reasoning | 29 | 2 | 0 | 0 | 11.03 | 0.021 | 0.135 | 0.93 | ✅ Eligible |
| 3 | gpt-5-5-high | OpenAI | 🧠 reasoning | 29 | 2 | 0 | 0 | 10.97 | 0.022 | 0.136 | 0.92 | ✅ Eligible |
| 4 | claude-opus-4-7-medium | Anthropic | 🧠 reasoning | 28 | 3 | 0 | 0 | 11.03 | 0.036 | 0.170 | 0.91 | ✅ Eligible |
| 5 | gpt-5-5-low | OpenAI | 🧠 reasoning | 28 | 3 | 0 | 0 | 11.00 | 0.025 | 0.148 | 0.91 | ✅ Eligible |
| 6 | gpt-5-5-medium | OpenAI | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.97 | 0.024 | 0.141 | 0.91 | ✅ Eligible |
| 7 | claude-opus-4-7-high | Anthropic | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.94 | 0.041 | 0.189 | 0.91 | ✅ Eligible |
| 8 | claude-opus-4-7-off | Anthropic | 🧠 reasoning | 28 | 3 | 0 | 0 | 10.84 | 0.038 | 0.173 | 0.90 | ✅ Eligible |
| 9 | claude-opus-4-7-xhigh | Anthropic | 🧠 reasoning | 28 | 2 | 1 | 1 | 10.81 | 0.057 | 0.166 | 0.90 | ⛔ Disqualified |
| 10 | claude-opus-4-7-max | Anthropic | 🧠 reasoning | 28 | 2 | 1 | 1 | 10.77 | 0.041 | 0.178 | 0.90 | ⛔ Disqualified |
| 11 | gpt-5-5-minimal | OpenAI | 🧠 reasoning | 27 | 4 | 0 | 0 | 10.87 | 0.025 | 0.145 | 0.89 | ✅ Eligible |
| 12 | gpt-5-medium | OpenAI | classic | 27 | 4 | 0 | 0 | 10.87 | 0.034 | 0.158 | 0.89 | ✅ Eligible |
| 13 | claude-haiku-4-5-off | Anthropic | classic | 25 | 5 | 1 | 1 | 10.48 | 0.037 | 0.173 | 0.84 | ⛔ Disqualified |
| 14 | claude-opus-4-6-off | Anthropic | classic | 24 | 6 | 1 | 1 | 10.58 | 0.035 | 0.100 | 0.83 | ⛔ Disqualified |
| 15 | deepseek-v4-flash-high | DeepSeek | classic | 22 | 8 | 1 | 1 | 10.16 | 0.040 | 0.137 | 0.78 | ⛔ Disqualified |
| 16 | claude-opus-4-7-low | Anthropic | 🧠 reasoning | 20 | 11 | 0 | 0 | 10.00 | 0.034 | 0.155 | 0.74 | ✅ Eligible |
| 17 | gemini-3-5-flash-high | Google | 🧠 reasoning | 19 | 12 | 0 | 0 | 9.74 | 0.023 | 0.027 | 0.71 | ✅ Eligible |
| 18 | mistral-small-3 | Mistral | classic | 18 | 12 | 1 | 1 | 9.74 | 0.039 | 0.037 | 0.70 | ⛔ Disqualified |
| 19 | deepseek-v4-flash-max | DeepSeek | 🧠 reasoning | 17 | 13 | 1 | 0 | 9.55 | 0.029 | 0.143 | 0.67 | ✅ Eligible |
| 20 | gemini-2-5-pro | Google | classic | 14 | 16 | 1 | 0 | 9.48 | 0.009 | 0.035 | 0.62 | ✅ Eligible |
| 21 | gemini-3-5-flash-medium | Google | 🧠 reasoning | 13 | 16 | 2 | 0 | 9.00 | 0.030 | 0.033 | 0.58 | ✅ Eligible |
| 22 | gemini-3-5-flash-low | Google | 🧠 reasoning | 9 | 21 | 1 | 0 | 9.10 | 0.025 | 0.038 | 0.52 | ✅ Eligible |
| 23 | gemini-3-1-flash-lite-minimal | Google | classic | 5 | 25 | 1 | 1 | 8.32 | 0.018 | 0.067 | 0.43 | ⛔ Disqualified |
| 24 | mistral-medium-3 | Mistral | classic | 0 | 27 | 4 | 0 | 7.84 | 0.035 | 0.076 | 0.33 | ✅ Eligible |
| 25 | gemini-2-5-flash-lite-off | Google | classic | 0 | 24 | 7 | 3 | 7.35 | 0.050 | 0.039 | 0.31 | ⛔ Disqualified |
| 26 | gpt-3-5-turbo | OpenAI | classic | 0 | 9 | 22 | 2 | 5.48 | 0.142 | 0.268 | 0.23 | ⛔ Disqualified |

Reading the suffix. -low, -medium, -high, -xhigh, -max denote the reasoning effort sent to the model. -off denotes the model invoked with its thinking path disabled. -none is OpenAI GPT-5.5’s explicit “reasoning OFF” tier. -minimal is the lowest non-zero effort on providers that expose it (e.g. Gemini 3.x Flash-Lite). Rows that share a subject_version run the same model snapshot; only the effort setting differs.
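As an illustration of the naming convention, the sketch below splits a Subject value into a base model name and an effort suffix. The function name `split_subject` and the tuple `EFFORT_SUFFIXES` are hypothetical helpers written for this example, not part of the published evaluation scripts. Subjects with no recognised suffix (for example mistral-medium-3) are returned unchanged.

```python
from typing import Optional, Tuple

# Effort suffixes listed in the paragraph above (hypothetical constant name).
EFFORT_SUFFIXES = ("minimal", "medium", "xhigh", "high", "low", "max", "off", "none")


def split_subject(subject: str) -> Tuple[str, Optional[str]]:
    """Split a leaderboard subject into (base model, effort suffix).

    Only a suffix at the very end of the name counts, so
    'mistral-medium-3' is not mistaken for a '-medium' effort level.
    """
    for suffix in EFFORT_SUFFIXES:
        tail = "-" + suffix
        if subject.endswith(tail):
            return subject[: -len(tail)], suffix
    return subject, None


if __name__ == "__main__":
    for name in ("gpt-5-5-xhigh", "claude-opus-4-7-off", "mistral-medium-3"):
        print(name, "->", split_subject(name))
```

Grouping rows by the first element of the returned pair is one way to compare effort levels of the same model side by side.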

📚 How to read this table
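
The Brier and ECE columns are calibration metrics, and lower is better for both. The exact computation BluMind uses is not stated on this page, so the sketch below is only an assumption: it reads Brier as the standard binary Brier score and ECE as a binned expected calibration error over stated confidences. The function names, the 10-bin default, and the sample numbers are illustrative and do not come from the benchmark.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean gap between average confidence and
    observed frequency inside each equal-width probability bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_freq = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - avg_freq)
    return ece


if __name__ == "__main__":
    # Made-up confidences and outcomes, for illustration only.
    confidences = [0.9, 0.8, 0.3, 0.6]
    correct = [1, 1, 0, 1]
    print(brier_score(confidences, correct))
    print(expected_calibration_error(confidences, correct))
```

Under these definitions a model that states 0.9 confidence and is right nine times out of ten scores well on both metrics, while a model that is confidently wrong is penalised heavily by the Brier score.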

See operational metrics (cost, latency, tokens) and safety-gate citations on GitHub →

Read the v1.0 Findings Report — generational uplift, hypothesis quality, calibration, and limitations →


What makes BluMind different

Independent human scoring. Every response is scored by two members of the BluMind Technical Committee, drawn from senior practitioners and researchers from the water sector. The committee is the institutional authority behind every score.

Safety gate. A single critical-fail recommendation (any action that would damage the plant or compromise operator safety) disqualifies the model from the leaderboard regardless of its other scores. The triggering action is quoted verbatim and made public.
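
The gate can be read as a simple rule over the Crit column of the ranking: one or more critical fails means Disqualified, whatever the other counts are. The sketch below is a minimal illustration of that rule. The `SubjectResult` class and its field names are hypothetical, and the two sample rows are copied from the v1.0 ranking.

```python
from dataclasses import dataclass


@dataclass
class SubjectResult:
    """One leaderboard row (hypothetical structure for this example)."""
    subject: str
    passed: int
    cond: int
    failed: int
    crit: int  # number of critical-fail recommendations


def status(result: SubjectResult) -> str:
    """Apply the safety gate: any critical fail disqualifies the subject."""
    return "Disqualified" if result.crit >= 1 else "Eligible"


if __name__ == "__main__":
    rows = [
        SubjectResult("claude-opus-4-7-xhigh", 28, 2, 1, 1),
        SubjectResult("deepseek-v4-flash-max", 17, 13, 1, 0),
    ]
    for row in rows:
        print(row.subject, "->", status(row))
```

Note that an ordinary fail does not trigger the gate: deepseek-v4-flash-max has one failed case but no critical fail, so it remains eligible.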

Reproducible. Cases, rubric, prompts, evaluation scripts and aggregated metrics are public. The gold answers and reviewer mappings stay private, as one would expect from a benchmark that can be trusted.

Read the full methodology on GitHub →


Submit a model

During the foundational phase (until 31 December 2026), valid submissions are evaluated free of charge. The submitter provides metadata and technical access credentials encrypted with the BluMind PGP key, and confirms eligibility against the published scope.

A submission is normally validated within 2 working days and evaluated within 10 working days of validation.
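
To estimate the latest expected dates for a given submission, the two windows can be chained. The sketch below assumes that a working day means Monday to Friday and ignores public holidays. That definition is an assumption of this example, not something stated above, and `add_working_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    submitted = date(2026, 3, 2)  # a Monday, chosen only as an example
    validated_by = add_working_days(submitted, 2)
    evaluated_by = add_working_days(validated_by, 10)
    print("validated by:", validated_by)
    print("evaluated by:", evaluated_by)
```

Because the evaluation window starts at validation, an earlier validation brings the evaluation date forward accordingly.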

Read the submission guide on GitHub →


The committee

The BluMind Technical Committee is the body of senior practitioners and researchers responsible for the integrity of the benchmark. It is the institutional authority behind every score, classification, and appeal decision.

Public members include Álvaro Díaz del Río Redondo (CEO of BluMind, formerly Head of Innovation at Tedagua and Cobra Infraestructuras Hidráulicas) and Rafael Jiménez Garrido (Country Manager at Whitewater Group, lecturer in the Master’s Degree in Desalination and Water Reuse at Universidad de Alicante, and industry contributor at ALADYR).

Three additional senior international figures from the water sector also sit on the committee; their names are pending public disclosure.

Meet the committee on GitHub →


Contact


BluMind Benchmark is operated by BluMind. The benchmark is released under the license terms in LICENSE.