The Maker-Checker Pattern: Why Your AI Pipeline Needs a Second Opinion

The Accuracy Problem

A senior data engineer at a multinational bank watched a nightly batch settle $12 million in foreign-exchange fees. The numbers looked right, the logs showed no exceptions, and the downstream reporting system displayed the expected totals. The next morning the compliance team flagged a discrepancy: the fee rate applied to a subset of trades was three percent higher than the current regulator-published rate for the jurisdiction. The error traced back to a single line of deterministic code that read the rate from a configuration table that had not been refreshed in six months. The code executed flawlessly; the mistake was hidden in the data it trusted.

Contrast that with a recent LLM-driven customer-support assistant that confidently suggested a loan product that violated the bank's internal risk policy. The model produced a well-structured answer, cited no source, and the support agent accepted it without question. The error was not a syntax bug but a hallucination: a fluent, confidently stated falsehood that deterministic testing would never have caught. In high-stakes pipelines, relying only on code leaves you exposed to stale data; relying only on an LLM leaves you exposed to confident hallucinations. The tension is clear: you need a mechanism that can catch the blind spots of both.

How the Maker-Checker Pattern Works

The maker-checker pattern introduces a second, independent opinion on every critical result. The "maker" is the traditional deterministic component -- SQL, Spark, or a compiled library -- that produces a concrete output from the input data. The "checker" is an LLM that receives the same raw input, the maker's output, and a textual representation of the governing policy (regulatory text, internal SOPs, or a live knowledge base). The checker recomputes the answer using natural-language reasoning rather than the code path taken by the maker.

When the two answers match within an acceptable tolerance, the system records a high-confidence flag and lets the record flow downstream. When they diverge, the pipeline does not attempt to resolve the conflict automatically. Instead it raises a disagreement flag that routes the record to a human reviewer. The reviewer sees the original input, the maker's result, the checker's result, and the policy source, and can approve, correct, or reject the record before it proceeds.
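The compare-and-route step described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the names `run_maker_checker`, `CheckResult`, `stale_maker`, and `stub_checker` are hypothetical, the tolerance is an arbitrary placeholder, and the checker here is a stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    maker_value: Decimal
    checker_value: Decimal
    agreed: bool   # True when the two values are within tolerance
    route: str     # "downstream" or "human_review"


def run_maker_checker(
    record: dict,
    maker: Callable[[dict], Decimal],
    checker: Callable[[dict, Decimal], Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> CheckResult:
    """Run both components and route on agreement; conflicts are never auto-resolved."""
    maker_value = maker(record)
    checker_value = checker(record, maker_value)
    agreed = abs(maker_value - checker_value) <= tolerance
    route = "downstream" if agreed else "human_review"
    return CheckResult(maker_value, checker_value, agreed, route)


# Stand-ins for illustration only. A real checker would call an LLM
# with the record, the maker's value, and the policy text.
def stale_maker(record: dict) -> Decimal:
    return record["taxable_base"] * Decimal("0.08")   # rate from an outdated config


def stub_checker(record: dict, maker_value: Decimal) -> Decimal:
    return record["taxable_base"] * Decimal("0.07")   # rate derived from current policy


result = run_maker_checker(
    {"taxable_base": Decimal("1000.00")}, stale_maker, stub_checker
)
print(result.route, result.maker_value, result.checker_value)
```

The function only decides where the record goes. Choosing which value is right is left to the reviewer, as the pattern requires.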

This works because the maker and the checker have fundamentally different failure modes. Deterministic code fails when data sources are stale, when edge-case logic is missing, or when a bug slips through testing. An LLM fails when it hallucinates, misinterprets a prompt, or lacks the precise numeric rigor required for a calculation. When both agree, the probability that both have erred in the same way is far lower than the probability that either one alone is wrong, provided their errors are largely independent. The pattern therefore raises validation confidence without demanding that the LLM replace the code.
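The intuition can be written down under a simplifying assumption: the two components fail independently. The symbols and the error rates in the example are illustrative placeholders, not measurements from any deployment.

```latex
% p_m: probability the maker is wrong on a record (illustrative)
% p_c: probability the checker is wrong on a record (illustrative)
% q:   probability that, given both are wrong, they give the same wrong answer
P(\text{agree and both wrong}) = p_m \, p_c \, q \;\le\; p_m \, p_c

% Example with made-up rates p_m = 0.01 and p_c = 0.05:
% p_m \, p_c = 0.01 \times 0.05 = 0.0005, versus 0.01 for the maker alone.
```

The bound weakens when failures are correlated, for example if the checker is anchored by the maker's figure instead of deriving its own.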

A Concrete Financial Data Example

In a recent engagement, Labyrinth helped a regional tax authority modernize its quarterly filing pipeline. The maker component was a Python routine that read a jurisdiction-specific tax rate from a configuration table, multiplied it by the taxable base, and emitted the liability. The checker was an LLM prompted with the same transaction data, the maker's liability figure, and the latest excerpt from the jurisdiction's tax code published on the authority's website.

For a batch of 3,200 filings, the maker applied an 8% rate for Jurisdiction X. The checker derived a 7% rate from the regulatory text and flagged the difference. The one-percentage-point gap amounted to $45,000 in over-collected tax across the batch. A human reviewer inspected the config table, discovered that the rate had been updated six weeks earlier but the table had not been refreshed, and corrected the entry. The maker's output was wrong; the LLM's independent derivation was correct.
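The figures above can be cross-checked with a few lines of arithmetic. The sketch assumes the whole $45,000 discrepancy comes from the one-point rate gap applied to a single aggregate taxable base. That base is not stated in the engagement summary, so it is derived here. Variable names are illustrative.

```python
from decimal import Decimal

maker_rate = Decimal("0.08")        # stale rate from the config table
checker_rate = Decimal("0.07")      # rate the checker derived from the tax code
over_collected = Decimal("45000")   # discrepancy reported for the batch

# over_collected = taxable_base * (maker_rate - checker_rate)
implied_base = over_collected / (maker_rate - checker_rate)

maker_liability = implied_base * maker_rate      # what the pipeline would have billed
correct_liability = implied_base * checker_rate  # what the current policy requires

print(f"implied taxable base: {implied_base:,.2f}")
print(f"maker liability:      {maker_liability:,.2f}")
print(f"correct liability:    {correct_liability:,.2f}")
```

Under that assumption the batch's aggregate taxable base works out to $4.5 million.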

If the pipeline had relied solely on the maker, the over-collection would have gone unnoticed until an audit triggered a costly penalty. If it had relied solely on the LLM, the system would have needed to trust a model for every numeric computation, exposing the process to rounding errors and occasional hallucinations. The maker-checker pair caught the error early, required only a brief human intervention, and prevented a regulatory breach.

Implementation Patterns

Choosing how to embed maker-checker into a production pipeline depends on latency tolerance, error cost, and scale. Synchronous checking runs the LLM verification as part of the same transaction that produces the maker's result. This adds latency per record -- often anywhere from a few hundred milliseconds to several seconds, depending on the model and the length of the prompt -- but guarantees that downstream steps receive a fully validated output. Financial institutions often use synchronous checking for high-value records such as large settlements, tax remittances, or risk-critical trades, because the cost of a missed error outweighs the added latency.

Asynchronous checking decouples verification from the primary flow. The maker writes its result to a queue, and a separate worker batch-processes records through the LLM checker. Disagreements are written to a review table that downstream consumers can poll. This pattern reduces per-record latency and allows the system to scale LLM inference horizontally. It is suitable for bulk reconciliation jobs where downstream processes can tolerate a pending flag and where the business impact of a delayed correction is low.
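A minimal in-process sketch of the queue-and-worker arrangement follows. It uses Python's standard `queue` and `threading` modules as stand-ins for a real message broker and worker pool. `review_table` is a plain list standing in for a database table, and `stub_checker` is a hypothetical placeholder for the LLM call.

```python
import queue
import threading
from decimal import Decimal

pending: queue.Queue = queue.Queue()   # maker results awaiting verification
review_table: list = []                # stand-in for a table that consumers poll


def stub_checker(record: dict) -> Decimal:
    """Placeholder for the LLM call; returns the checker's own derived value."""
    return record["taxable_base"] * Decimal("0.07")


def checker_worker(tolerance: Decimal = Decimal("0.01")) -> None:
    while True:
        item = pending.get()
        if item is None:               # sentinel: no more records
            pending.task_done()
            break
        checker_value = stub_checker(item)
        if abs(item["maker_value"] - checker_value) > tolerance:
            # Record the disagreement for human review; do not resolve it here.
            review_table.append({**item, "checker_value": checker_value})
        pending.task_done()


worker = threading.Thread(target=checker_worker)
worker.start()

# The maker publishes results and moves on without waiting for verification.
pending.put({"id": 1, "taxable_base": Decimal("1000"), "maker_value": Decimal("80")})
pending.put({"id": 2, "taxable_base": Decimal("1000"), "maker_value": Decimal("70")})
pending.put(None)
worker.join()

print([row["id"] for row in review_table])
```

Only the record whose maker value disagrees with the checker lands in the review table. The agreeing record passes through without any reviewer involvement.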

Threshold tuning is another critical lever. Not every numeric drift signals a problem. A sub-basis-point variance in a floating-point conversion is expected; a 5% variance in a regulatory calculation is not. Teams should define a tiered threshold matrix that maps the magnitude of disagreement to an action: auto-accept, auto-reject, or route to human review. The matrix balances the cost of false positives -- unnecessary human reviews -- against the cost of false negatives: missed compliance breaches.
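One way to encode such a matrix is an ordered list of tiers keyed on relative disagreement. The tier boundaries and the assignment of actions below are hypothetical placeholders chosen to echo the examples in the paragraph above. Real values would come from each team's own cost analysis.

```python
from decimal import Decimal

# Hypothetical tiers: (exclusive upper bound on relative disagreement, action).
# Evaluated in order; the first tier whose bound is not reached wins.
THRESHOLD_TIERS = [
    (Decimal("0.0001"), "auto_accept"),   # below 1 basis point: numeric noise
    (Decimal("0.05"), "human_review"),    # below 5%: plausible, needs a reviewer
]
DEFAULT_ACTION = "auto_reject"            # 5% or more


def classify_disagreement(maker_value: Decimal, checker_value: Decimal) -> str:
    """Map the size of a maker/checker disagreement to an action."""
    if maker_value == checker_value:
        return "auto_accept"
    # Values differ, so at least one is non-zero and the baseline is positive.
    baseline = max(abs(maker_value), abs(checker_value))
    relative_gap = abs(maker_value - checker_value) / baseline
    for upper_bound, action in THRESHOLD_TIERS:
        if relative_gap < upper_bound:
            return action
    return DEFAULT_ACTION


print(classify_disagreement(Decimal("100.000"), Decimal("100.001")))
print(classify_disagreement(Decimal("100"), Decimal("98")))
print(classify_disagreement(Decimal("80"), Decimal("70")))
```

Keeping the tiers in data, separate from the logic, makes it easier to tune thresholds per record type without touching the classifier.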

Cost management matters because LLM inference is billed per token. Two strategies help. Sample-based checking validates a statistically representative subset of records, providing a confidence interval for the whole batch. Risk-based gating applies full checking only to records that exceed a risk score, such as high-value transactions or those involving jurisdictions with recent regulatory changes. By combining sampling with risk gating, pipelines keep per-record costs reasonable while preserving high confidence where it matters most.
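Combining the two strategies might look like the sketch below: every high-risk record is checked, and the remainder is sampled at a fixed rate. The field names (`amount`, `jurisdiction`), the function name, and the parameters are assumptions made for illustration.

```python
import random
from decimal import Decimal


def select_for_checking(
    records: list,
    value_threshold: Decimal,
    watchlist: set,
    sample_rate: float,
    seed: int = 0,
) -> list:
    """Return the records to send to the checker.

    All high-risk records are selected; the rest are sampled at sample_rate.
    """
    rng = random.Random(seed)   # seeded so a batch selection is reproducible
    selected = []
    for record in records:
        high_risk = (
            record["amount"] >= value_threshold
            or record["jurisdiction"] in watchlist
        )
        if high_risk or rng.random() < sample_rate:
            selected.append(record)
    return selected


batch = [
    {"id": 1, "amount": Decimal("50"), "jurisdiction": "A"},
    {"id": 2, "amount": Decimal("5000"), "jurisdiction": "A"},   # high value
    {"id": 3, "amount": Decimal("10"), "jurisdiction": "X"},     # watched jurisdiction
]
risk_only = select_for_checking(batch, Decimal("1000"), {"X"}, sample_rate=0.0)
print([r["id"] for r in risk_only])
```

With `sample_rate=0.0` only the risk-gated records are checked. Raising the rate adds randomly sampled low-risk records on top of them.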

Prompt design for the checker role directly influences audit quality. The prompt should contain three sections: the raw input record, the maker's computed result, and the relevant policy text. The prompt must omit the maker's internal reasoning and code snippets so the LLM derives its answer from first principles using the policy text. A well-structured prompt reads: "Given transaction data X, Y, Z, compute the tax liability using the policy excerpt below. The existing system reports an 8% rate. Is this consistent with the policy?" Asking for the computation before the comparison pushes the LLM to act as an independent auditor rather than a confirmer.
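A prompt builder along these lines keeps the three sections explicit and makes it structurally hard to leak maker internals into the prompt. The template wording, section labels, and function name are illustrative assumptions, not the exact prompt used in any deployment.

```python
import json

CHECKER_PROMPT_TEMPLATE = """\
You are auditing a computed result against written policy.

Input record:
{record}

Policy excerpt:
{policy}

Result reported by the existing system:
{maker_result}

First derive the correct value from the policy excerpt and the input record
alone, stating the rate or rule you applied. Then state whether the reported
result is consistent with the policy."""


def build_checker_prompt(record: dict, maker_result: str, policy_text: str) -> str:
    """Assemble the three sections; maker code and reasoning are never passed in."""
    return CHECKER_PROMPT_TEMPLATE.format(
        record=json.dumps(record, indent=2, sort_keys=True),
        policy=policy_text.strip(),
        maker_result=maker_result,
    )


prompt = build_checker_prompt(
    {"jurisdiction": "X", "taxable_base": "1000.00"},
    "liability 80.00 (8% rate)",
    "  The applicable rate is 7%.  ",
)
print(prompt)
```

Because the function signature accepts only the record, the result, and the policy text, there is no parameter through which the maker's code path could reach the model.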

Why Regulated Industries Need This

Regulated sectors share a common pain point: the rulebook evolves faster than the codebase. A medication dosage algorithm written in 2022 may be unsafe under a 2024 clinical guideline that lowers the maximum daily dose. A tax rate embedded in a config file may become obsolete when a jurisdiction revises its legislation. Updating code every time a rule changes is costly, error-prone, and often lags behind compliance deadlines.

The maker-checker pattern turns the rulebook itself into a live source of truth for the checker. By feeding the current regulatory text, clinical guideline, or legal clause into the LLM, the pipeline automatically validates each output against the latest policy. When a rule changes, only the textual source needs updating; the deterministic maker continues to run unchanged, and the checker flags any outputs that no longer comply. This decouples policy maintenance from code maintenance, reduces the risk of outdated logic persisting in production, and provides a clear audit trail for regulators.

How Labyrinth Implements It

In Labyrinth's production environments, the maker and checker run as separate nodes within a LangGraph workflow. The maker node executes the deterministic transformation -- SQL joins, Spark aggregations, or compiled libraries -- and emits a structured result. The checker node receives the same input, the maker's result, and a policy document retrieved from a version-controlled knowledge store. It runs an LLM inference call and returns a confidence score along with its computed value.

Conditional edges in the graph evaluate the agreement. If the maker's value falls within the pre-defined tolerance of the checker's value, the edge routes the record to the downstream enrichment node. If the disagreement exceeds the threshold, the edge routes the record to a human-review node. This node presents a unified interface that shows the input payload, the maker's output, the checker's output, and the source policy excerpt. Reviewers can approve, edit the maker's configuration, or reject the record. Once a decision is recorded, the graph resumes.
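The routing decision behind such a conditional edge reduces to a small pure function over the graph state. The sketch below shows only that function. In LangGraph, a callable of this shape is what gets passed to `StateGraph.add_conditional_edges`; the graph wiring itself is omitted. The state fields and node names (`enrichment`, `human_review`) are hypothetical and not taken from Labyrinth's actual workflow.

```python
from decimal import Decimal
from typing import TypedDict


class PipelineState(TypedDict):
    payload: dict            # original input record
    maker_value: Decimal     # output of the deterministic maker node
    checker_value: Decimal   # value the checker node derived
    policy_excerpt: str      # policy text shown to the reviewer
    tolerance: Decimal       # pre-defined agreement tolerance


def route_after_check(state: PipelineState) -> str:
    """Return the name of the next node based on maker/checker agreement."""
    gap = abs(state["maker_value"] - state["checker_value"])
    return "enrichment" if gap <= state["tolerance"] else "human_review"


state: PipelineState = {
    "payload": {"id": 42},
    "maker_value": Decimal("80.00"),
    "checker_value": Decimal("70.00"),
    "policy_excerpt": "The applicable rate is 7%.",
    "tolerance": Decimal("0.01"),
}
print(route_after_check(state))
```

Keeping the router free of side effects makes it straightforward to unit-test the agreement logic separately from the graph.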

A recent deployment involved a 19-node financial pipeline that ingested data from seven disparate sources, including transaction logs, customer master data, market rates, and regulatory feeds. The maker-checker subgraph handled tax calculation for each transaction, catching three out-of-date rate entries and two policy-interpretation mismatches before they reached the settlement system. The architecture and approach are described in our case studies at /work.

Getting Started

If you are building or evaluating a pipeline that processes financial data, healthcare records, or any domain where rules change faster than code, the maker-checker pattern offers a pragmatic path to higher validation confidence. It pairs deterministic reliability with the flexibility of LLM-driven policy checking, without demanding that the model replace the code. To explore whether this pattern fits your specific architecture, reach out at /contact.