Clinical decisions used to mean five browser tabs and a gut feeling. Now they leave an audit trail.

100+

AI Models Working in Sync

Referee AI is a platform that turns the "ask five AI tools and pick the best answer" workflow into something structured and auditable. The user submits one prompt. Multiple AI models run in parallel against it. A designated referee model synthesises a single consolidated answer. Every session is saved, including the individual model responses, the verdict, the timestamp, the token counts, and the cost. The client owns their own API keys, so the data never passes through Hephon's infrastructure once the platform is deployed. We built it for a Lebanese client working in regulated environments where decisions need to be defensible.

The problem we were asked to solve

Knowledge workers, including clinical staff making consequential decisions, do this all the time: open ChatGPT, open Claude, open Gemini, open Perplexity, paste the same question into all four, read the answers, pick the one that feels most right, and move on. No record of what was asked. No record of what was considered. No way to reconstruct the reasoning later.

For most contexts that's inefficient. For a regulated healthcare environment where a decision needs to be defensible, it's a problem. The clinician knows what they did, more or less. They can't prove it. If anyone asks why a particular protocol was followed, the audit trail ends at "I asked some AI tools and used my judgment."

That gap was the brief. Not "build a smarter AI." Build something that wraps the existing workflow with structure, so the reasoning is preserved and the decision can be defended.

What we built

Referee AI is a panel-of-experts platform with three core pieces.

The first is the Model Set, a named configuration that specifies which AI models run in parallel against a given prompt. Each model has its own system prompt and temperature settings, tuned for the specific use case. Clinical triage gets a different panel than treatment protocol review. The user builds the configuration once, names it, and reuses it consistently for that category of decision.

The second is the parallel execution layer. The user submits a single prompt. Every model in the set responds simultaneously, streamed side-by-side so the user can read them as they come in. Nothing happens sequentially. The wait is the slowest model in the panel, not the sum of all of them.

The third is the Referee model. Once all panel responses arrive, the designated referee synthesises a single consolidated answer using a configurable strategy: summarise across the panel, rank by confidence, reconcile differences, or pick the strongest response. The strategy is part of the Model Set, so the same kind of decision always gets resolved the same way.

Every session is saved. Every model's response is stored alongside the referee's verdict, the timestamps, the token counts, and the per-call cost. A clinician can pull up any past decision and show exactly which models were consulted, what each one said, what the referee chose, and why.

Two design decisions that shaped everything

The first was BYOK, bring your own key. The client supplies their own API keys for each AI provider, encrypted at rest on their tenant. That means they control their data residency and their spend. Nothing routes through Hephon's infrastructure once the platform is deployed. We built the platform; the client owns the keys, the bills, and the data.

This was non-negotiable for the use case. Healthcare workflows can't send patient context through a third party's infrastructure, even ours. Designing for BYOK from day one made the platform deployable in environments that would have rejected anything else.

The second was OpenRouter as the integration layer. Rather than building separate integrations for OpenAI, Anthropic, Google, Mistral, Meta, and the rest, the platform routes through a single proxy that gives access to over 100 models by ID. When a new model is released, it's available in the platform the same day, no engineering required. That decision also future-proofs the panel: as new models come out, the client can swap them into existing Model Sets without us touching the code.

The harder problems we didn't expect

Two things we underestimated.

The first was that models disagree on things you'd expect them to agree on. Run the same simple clinical question through GPT-5, Claude Opus, and Gemini Pro and you can get three different recommendations, with different reasoning, and different levels of confidence. The interesting design question wasn't "which model is right." It was "how do you present disagreement to a clinician in a way that helps them decide, rather than overwhelming them."

The answer ended up being the Referee model itself, with a strategy specifically tuned to surface disagreement as a feature, not hide it. When models agree, the referee gives a confident consolidated answer. When they disagree, the referee says so explicitly: "Three of four models recommended X; one recommended Y, citing this specific consideration." The clinician sees the divergence and decides.

The second was that latency varies wildly between models, and the slowest model in a panel sets the user's experience. Streaming the responses side-by-side helped, because the user can start reading the fast models while the slow ones finish. But we still had to design the UI carefully so a single slow model didn't make the whole tool feel sluggish.

Where it stands

Referee AI is in active build. The client approved the design in the kickoff workshop and signed a long-term agreement before a single line of backend code was written. That level of buy-in before delivery is rare. It happened because the platform solved a problem the client already lived with every day, and the demo was the conversation, not a deck.

Hard adoption metrics will be published once the pilot deployment is complete. We're not going to put up vanity numbers in the meantime.

What we'd tell anyone building something similar

Three things.

If you're going into a regulated environment, design for BYOK from day one. Retrofitting it later is painful and the client will lose patience before you finish.

Don't try to hide model disagreement. Surface it. The reason a panel-of-experts pattern works is that disagreement is information. The Referee model's job isn't to paper over the differences. It's to translate them into something a human can act on.

The user interface for "multiple AI models running in parallel" needs more design attention than the AI part. The model integration is straightforward. Making the experience feel like one tool, not five tools awkwardly stitched together, is the actual product work.

The hard parts of multi-model orchestration are not the models. They're the configuration, the audit trail, the disagreement handling, and the integration discipline. Get those right and the AI part takes care of itself.

Ready to kickstart your project?

Speak with us