Concepts

The yellow flag

Why designing AI uncertainty as a first-class product feature is the correct architectural choice — not a limitation to hide.

Published 2026-05-11

Most AI products hide uncertainty.

The model produces a confident output. Qualifications are in the fine print, if anywhere. The product is designed to look capable, and looking capable means looking certain.

In many domains this is an acceptable tradeoff. The user prefers a clean answer over a hedged one. A wrong answer that looks right will be caught eventually. The cost of the mistake is recoverable.

Finance is not this domain.

The cost of hidden uncertainty in accounting

A wrong answer that looks right in a journal entry or a close package does not surface during the review if the reviewer cannot tell it is wrong.

That is the risk. Not the wrong answer — wrong answers are caught by reviewers who are looking for them. The risk is the wrong answer dressed as a right answer: the accrual populated with a last-period figure when the contract actually changed, the IC pair reconciled by netting a real difference, the variance explained by a template phrase that no longer applies.

A finance reviewer scanning a draft at 11pm is not performing a full independent recomputation. They are performing a reasonableness check. A draft that looks reasonable passes. An incorrect draft that was built to look reasonable fails the close, the audit, or both — weeks later, when the cost is much higher.

The worst output an agent can produce is a wrong answer presented with unearned confidence.

What the yellow flag does

The yellow flag is a design pattern that inverts this.

When Yig encounters a judgment call it cannot resolve from the available data — a variance outside the line’s historical range, a forward-looking assumption with no source document, a policy choice that requires a human decision — it does not fill in the answer.

It marks the section. It states what it does not know and why. It waits.

Yig flagged this line because the variance exceeds the line’s 12-month volatility band. Human judgement required: is this a one-time item or a recurring adjustment?

The reviewer answers. Yig regenerates the surrounding content with the reviewer’s input woven in. The output reflects the reviewer, not a guess.

The reviewer’s answer goes into the audit log. The yellow flag becomes a resolved item with a named decision-maker and a timestamp. The draft is now more defensible than any draft an agent could have produced by guessing.

Why this is a product decision, not a limitation

There is a version of this feature that reads as a failure: the agent could not figure it out, so it gave up. That reading is wrong, and it is the reading that causes teams to turn off uncertainty flags.

The correct reading: the agent knows what it knows and knows what it does not know, and it distinguishes between them.

This is a capability, not a gap. It is harder to build than a system that always produces an answer. It requires the agent to maintain a calibrated model of its own confidence and to apply a threshold below which it asks rather than guesses.

The flag is also a teaching tool. A reviewer who sees the agent flag a line, reads the reason, and makes the call learns something about the pattern the agent recognised. Over a year of close cycles, the team that reviewed flagged lines understands the data better. The team that accepted confident outputs learns nothing.

Instruments teach the operators who use them. Black boxes do not.

The threshold question

The flag rate is a tunable parameter. Too many flags and the reviewer ignores them. Too few and the agent is hiding uncertainty.

The right calibration depends on the line, the period, the team, and the history. A first-run close on a new entity warrants more flags than a tenth run on a stable entity. A line with a volatile history warrants a wider tolerance band than a fixed-cost line.

Getting this calibration right is part of what Yig learns over successive cycles with a team — not by training on their data, but by accumulating context about what this controller treats as a material variance.

What the yellow flag rules out

It rules out an agent that produces a completed close package without any flags. A zero-flag run on a real close is either a sign that the period was genuinely clean, or a sign that the agent’s confidence thresholds are wrong. The reviewer should know which.

It rules out the design choice — common in early-generation AI finance tools — of surfacing a number and burying the caveat. If there is a caveat, the caveat is the headline, not the footnote.

It rules out the agent filling in judgment calls with the most-likely answer from training data. A close is not a classification problem. The correct answer for this period depends on this entity’s specific situation, and no training data can provide that.

The trust posture

When a finance team first encounters the yellow flag, the reaction is sometimes frustration. The agent doesn’t know? What are we paying for?

After the first close, the reaction changes. The flags were the right calls. The lines the agent flagged were the lines that needed human decisions. The lines it did not flag were clean.

After the third close, the yellow flag is the feature. It is the thing that lets the reviewer trust the rest of the draft — because they know the uncertain parts are marked.

That is the trust posture we are building toward: not “the agent is always right,” but “the agent is reliable about what it does not know.”

The second posture is more useful in finance. It is also more honest. And in the long run, honesty compounds.