Security

The shape of an audit trail

An audit trail is not a log file. It is a reconstruction contract: from the recorded state, a stranger can rebuild the decision that put the system here. Most observability is not auditable.

Published 2026-05-16

There are two questions that look similar and are not the same.

The first is what happened. A run started at this time, this tool was called with this scope, this draft was produced, this reviewer accepted it, the output shipped to this destination. Most observability stacks answer this well — when a controller asks “did the April close run finish,” any half-decent log will tell them.

The second is could this have happened any other way. Given everything recorded about the run, can a person who was not in the room reconstruct the decision the system actually made — including the decisions it did not announce? Was the reviewer’s accept the only reviewer-action of the cycle, or was there a prior rejection that vanished? Was the data the agent read the data that was in the source system at that time, or some later version? Did the draft the reviewer signed match the draft the agent produced, or was an intermediate version edited and lost?

Telemetry answers the first. An audit trail answers the second. They look like the same data laid out on the page. They are not.

This article is for the internal audit lead deciding whether an agent-produced record can survive an external auditor’s questioning, the external auditor learning what they can demand from a Yig customer in a walk-through, the security reviewer evaluating data-handling posture, and the compliance officer reading over their shoulders. The argument is one line: most observability is not auditable, and Yig made specific structural choices so that its record is an audit trail rather than telemetry.

What an audit trail is for

The names tell you the lineage. Telemetry comes from the Greek tele and metron: measurement at a distance. Audit comes from the Latin audire: to hear. The first is for the operator watching the system run. The second is for the witness reconstructing what the system did, often years after the original operator has left the company.

Telemetry assumes the watcher trusts the system. The operator shares a vocabulary with the developers, shares an interest in the system working, shares an interpretive frame. A dropped span is a debugging breadcrumb, not evidence of tampering. The dashboard is designed for people who want the system to work.

An audit trail assumes the witness does not trust the system. The external auditor is not on the operator’s side of the boundary. A dropped record is not a breadcrumb to that reader; it is a hole the trail cannot fill. The trail is designed for people entitled to doubt that the system worked the way the operator says it did.

This is why the two records cannot be the same data with a different label. Telemetry can afford to be sampled — losing one in a hundred spans makes the dashboard cheaper without making it lie. An audit trail cannot. Telemetry can afford to be lossy under load, or enriched after the fact. An audit trail cannot, because every after-the-fact derivation is a place a future auditor will ask who derived it, on what authority, and how do we know the derivation did not change what the record shows.

Telemetry is a report. An audit trail is a reconstruction contract.

A reconstruction contract has a stricter shape. From the recorded entries alone, a person who was not in the room can rebuild the decision the system made. Not approximate it. Rebuild it. If reading the trail requires also believing what the operator says about it, the trail has not done its job.

Most logging frameworks were not built for that promise. They were built for the dashboard. The difference does not surface in normal operation. It surfaces in the audit response, three years after the run, when the original engineer has left, the original operator has left, and the external auditor is asking why a journal entry was approved.

The reconstruction invariant

Every entry in an audit trail has to satisfy one property, and a trail that fails it is decorative.

The property: from the entry alone, plus the chain of entries before it, a reader who does not trust the operator, the developer, or the agent can determine what was done, by whom, when, against what data of record, and on whose authority. Not “can be told”. Can determine. The reader does not call anyone. The reader does not consult the dashboard. The reader reads.

Each entry has to carry enough structural identity that a stranger can place it in a chain and verify the chain has not been cut. Each entry identifies the actor with an authority that exists outside the agent’s runtime — a workspace identity, an IT-provisioned account, a signed approval — so that “the agent did it” is not a sentence the agent gets to write about itself. Each entry references the data of record at the moment of the action, not as it stood when the trail was reviewed, so that subsequent edits to the source system cannot quietly rewrite what the agent saw.

Most logging frameworks fail this by default. The frameworks were designed for a trust frame the audit context does not share. A standard application log enriches events from in-memory state at the time of writing — a feature when the operator and developer share the dashboard, a contamination when an auditor reads the entry and cannot tell whether the enrichment reflected the run or the post-hoc explanation. A standard log can be re-shipped, re-indexed, re-correlated. Each of those operations is a moment where a determined party could rewrite the trail. The framework does not prevent it. It was not asked to.

The audit trail has to prevent it, and the way it prevents it is structural. Entries are append-only. The chain is forward-linked, so a deletion is visible as a gap. Identities are external to the agent — the reviewer’s signature is provisioned by IT, not minted by the runtime. Timestamps come from a source the agent does not control. The data the agent acted on is referenced by content shape, not by a snapshot the agent stored — the auditor traces back to the customer’s general ledger to verify what the agent saw.
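
The structural properties above can be made concrete with a small sketch. It assumes a hash chain in which each entry carries the digest of its predecessor, which is one common way to make a deletion show up as a gap; nothing here says this is how Yig links its entries. `Entry`, `Trail`, and `first_break` are hypothetical names, and the actor and timestamp are plain strings standing in for an externally provisioned identity and a clock the agent does not control.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    seq: int         # position in the chain, starting at 0
    prev_hash: str   # digest of the previous entry ("" for the first)
    actor: str       # stand-in for an identity provisioned outside the runtime
    timestamp: str   # stand-in for a UTC timestamp from an external clock
    payload: dict    # structural shape of the event, not business values

    def digest(self) -> str:
        # Canonical JSON so the same entry always hashes to the same value.
        body = json.dumps(
            {"seq": self.seq, "prev": self.prev_hash, "actor": self.actor,
             "ts": self.timestamp, "payload": self.payload},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Trail:
    """Append-only list of entries; there is deliberately no update or delete."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def append(self, actor: str, timestamp: str, payload: dict) -> Entry:
        prev = self._entries[-1].digest() if self._entries else ""
        entry = Entry(len(self._entries), prev, actor, timestamp, payload)
        self._entries.append(entry)
        return entry

    def entries(self) -> list[Entry]:
        return list(self._entries)  # a copy, so callers cannot edit the trail


def first_break(entries: list[Entry]) -> int | None:
    """Index of the first entry that does not link to its predecessor, else None."""
    prev = ""
    for i, entry in enumerate(entries):
        if entry.seq != i or entry.prev_hash != prev:
            return i
        prev = entry.digest()
    return None
```

Removing or altering any entry other than the last makes `first_break` return an index instead of `None`. Changes to the final entry, including truncating the tail, are not caught by this sketch; catching them would need the latest digest anchored somewhere the operator cannot rewrite.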

Every entry in an audit trail must let a stranger rebuild the decision without consulting anyone in the room.

If an entry fails that test — if reading it requires also believing the agent about something the entry does not itself prove — the entry is not an audit record. It is a comfortable narrative. The narrative will not survive an external auditor’s question, which is usually some version of: how do you know this is what actually happened, and not what someone wrote down after.

What we record per agent step

The trail is built from a small set of event classes. Each is a row of structured data with a fixed shape. The shape is not negotiable across runs — if it varied, the chain would not verify as a chain.

The deliberate count is six. More would mean the boundary between audit and telemetry is blurring. Fewer would mean some part of the decision is left to inference by the reader, which fails the invariant.

Event class	What is captured	Why the fields exist	What would break without it
Run invocation	Workflow identity and version, scope (entity, period), invoking identity, invoking surface, UTC timestamp.	Anchors every downstream entry to a named workflow contract. Names the human who pressed the button. The run happened against a known version, not an unspecified template.	The chain has no head. The auditor cannot answer “who started this run, and what were they running” — the first question of any walk-through.
Tool call	Tool identifier, ordered step number, structural shape of input and output, duration, status.	Lets the auditor reconstruct the order of work and the scope of each data read. Shape, not content, keeps the log from becoming a copy of the data while still proving the right slice was read.	The agent’s actions become untraceable. Verifying that the read was scoped to the right entity and period becomes impossible.
Draft production	Draft identifier, version number, output shape, count and identity of flags raised, presented-to identity, UTC timestamp.	Records that the agent produced a reviewable artefact at a specific time, presented to a named reviewer, with uncertainty surfaced as flags rather than buried.	The hand-off has no record. The auditor cannot point to where the agent’s part ended and the reviewer’s began.
Reviewer action	Action type (accept, reject, accept-with-edits, comment, flag-resolve), reviewer identity, target, UTC timestamp, structural diff of any edits, free-text comment.	Records the controller’s, FP&A lead’s, or audit lead’s judgement step by step. Each rejection and each edit is preserved, not collapsed into the final decision.	A reviewer who rejected the first draft and approved the third looks identical to one who rubber-stamped the first. The gate becomes performative.
Approval	Approved draft identifier and version, approving identity, UTC timestamp, write-back destination, model provider and identifier used.	The entry the auditor cites when asked “who signed this”. Binds a specific draft version to a human signature and a destination in the customer’s stack. Model identity answers future regulator questions on provenance.	The agent’s output ships without provable authority. There is a draft and reviewer actions, but no record of the gate being satisfied.
Surface delivery	Delivery target (Slack thread, Excel sidebar cell range, Word footnote anchor, CLI session, write-back path), outcome (delivered, queued, failed), reason on failure, UTC timestamp.	The trail does not end at approval. The output has to land where the operator was promised, and a failed delivery is itself an audit-meaningful event.	The audit cannot prove the operator received the output. The question “did this number actually land in the customer’s ledger” cannot be answered without leaving the trail.

Six rows. Each was a deliberate yes. Each field on each row had to argue for its place against the alternative of being inferred from context, and being inferred from context is exactly what an audit trail is built to prevent.
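
One way to picture a fixed shape is a closed set of event classes, each with an exact field list. The sketch below is illustrative only: the field names are hypothetical paraphrases of the "What is captured" column, not Yig's schema, and `shape_errors` is an assumed helper name.

```python
from __future__ import annotations

from enum import Enum


class EventClass(Enum):
    RUN_INVOCATION = "run_invocation"
    TOOL_CALL = "tool_call"
    DRAFT_PRODUCTION = "draft_production"
    REVIEWER_ACTION = "reviewer_action"
    APPROVAL = "approval"
    SURFACE_DELIVERY = "surface_delivery"


# Hypothetical field names, paraphrasing the table's "What is captured" column.
REQUIRED_FIELDS = {
    EventClass.RUN_INVOCATION: {
        "workflow_id", "workflow_version", "scope",
        "invoking_identity", "invoking_surface", "timestamp_utc",
    },
    EventClass.TOOL_CALL: {
        "tool_id", "step_number", "input_shape", "output_shape",
        "duration", "status",
    },
    EventClass.DRAFT_PRODUCTION: {
        "draft_id", "draft_version", "output_shape", "flags",
        "presented_to", "timestamp_utc",
    },
    EventClass.REVIEWER_ACTION: {
        "action_type", "reviewer_identity", "target", "timestamp_utc",
        "edit_diff", "comment",
    },
    EventClass.APPROVAL: {
        "draft_id", "draft_version", "approving_identity", "timestamp_utc",
        "write_back_destination", "model_provider", "model_id",
    },
    EventClass.SURFACE_DELIVERY: {
        "delivery_target", "outcome", "failure_reason", "timestamp_utc",
    },
}


def shape_errors(event_class: EventClass, record: dict) -> list[str]:
    """List every way a record deviates from the fixed shape of its class."""
    required = REQUIRED_FIELDS[event_class]
    present = set(record)
    errors = [f"missing field: {name}" for name in sorted(required - present)]
    errors += [f"unexpected field: {name}" for name in sorted(present - required)]
    return errors
```

Rejecting unexpected fields matters as much as requiring the expected ones: an extra field is the place where after-the-fact enrichment would otherwise slip in.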

The shape of the trail is six event classes, not five and not seven, because the chain only verifies as a chain when the joints are explicit.

A trail that adds a seventh row — “agent internal reasoning step” — has crossed from audit back into telemetry. The seventh row is for the dashboard. A trail that drops the surface-delivery row has cut the end off the chain. The output ships and the trail does not say where. Either direction breaks the invariant.

What we don’t record, and why each absence is a hard call

A trail is also what it does not contain. Each omission is a position the architecture is taking on what the trail is and is not for.

There are four deliberate omissions. Each omitted item has an advocate inside the company who wanted it recorded, and each was left out anyway.

Chain-of-thought from the model. The agent’s internal reasoning — the intermediate steps between a tool result and the next call — is not in the trail. Recording it feels like it should produce a more honest record. It does the opposite. Chain-of-thought is fluent, and fluent text invites the reader to believe it as a description of the decision. The actual decision is the tool calls and the draft, not the prose around them. A trail that contains a thousand words of model reasoning before each tool call teaches the reviewer to read the wrong thing.
Retrieved-but-unused context. A workflow may pull three months of trial balance and use only one. The unused two months are part of the read but not part of the decision. Logging all of it would invite a future auditor to ask why information visible to the agent was not used, and now the agent is being audited for material it correctly ignored. The trail records the scope of the read and the shape of what was returned. The fact that some of it did not influence the draft is the agent doing its job, not evidence of negligence. The unused context is in the customer’s general ledger if anyone wants to look.
Raw business content. The trail records the shape of inputs and outputs — that a trial balance with seventeen rows and four FX-converted balances was read, that the draft contained a variance schedule with twelve line items and three flags. It does not record the values. The numbers live in the customer’s stack. Mirroring them into the trail would create a second copy that has to be defended, retained, encrypted, exported on request, and disclosed under breach — and a copy that may diverge silently from the source. Two records of the same number is the failure mode the audit trail is designed against.
Token-level model metadata. Probability distributions, attention patterns, sampling parameters, alternative tokens the model considered — none are in the trail. With token-level data, a reviewer could in principle detect anomalous generation or build internal benchmarks. The reason this fails for an audit trail: token-level metadata is not interpretable to the reader the trail is for. An external auditor asking “did this control work” is not equipped to evaluate a probability distribution, and a trail that asks them to is asking them to take the agent’s word for what the distribution means.
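
The third omission, recording shape rather than values, can be sketched in a few lines. This assumes a read comes back as a list of row dictionaries and that rows carry a `currency` field; both are assumptions for illustration, and `table_shape` is a hypothetical helper, not part of any Yig interface.

```python
from __future__ import annotations


def table_shape(rows: list[dict], currency_field: str = "currency") -> dict:
    """Summarise a tabular read by structure only; no cell value is returned."""
    # Column names describe the schema of the read, not its contents.
    columns = sorted({name for row in rows for name in row})
    # Currencies are counted, then discarded; only the count is kept.
    currencies = {row[currency_field] for row in rows if currency_field in row}
    return {
        "row_count": len(rows),
        "columns": columns,
        "currency_count": len(currencies),
    }
```

The result is enough to state that a read returned a given number of rows across a given number of currencies, while the balances themselves stay in the customer's ledger.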

Each absence is a position. The agent could argue that recording these things would make it look more transparent. The audit trail’s job is not to look transparent. It is to be reconstructable by a stranger who does not trust the system. The four omissions all fail that test in different ways, and they are out.

A trail’s omissions are part of the trail. They are what the trail is refusing to be.

Reading an audit trail in practice

Imagine the audit response three quarters from now. Acme’s German subsidiary DE-001 ran an intercompany clearing reconciliation on day five of the April 2026 close. The reviewer accepted with one edit. The output was written back to the entity’s working trial balance. The external auditor is now sampling that reconciliation as part of fieldwork.

  Run trail · workflow ic-clearing-recon v3.2 · entity DE-001 · period 2026-04

  [t=0]    Run invocation · Jordan Kim (FP&A lead, AD-provisioned)
           Surface: Slack thread #close-de-001
           Scope: DE-001 · 2026-04 · counterparties [US-001, UK-002, SG-001]

  [t+8s]   Tool call · ledger-read DE-001 IC clearing · 42 rows · 2 currencies
  [t+11s]  Tool call · ledger-read US-001 IC mirror   · 38 rows · 2 currencies
  [t+22s]  Tool call · fx-rate-read USD/EUR closing   · source: customer fx table

  [t+34s]  Draft production · d4f1 v1
           Output: variance schedule · 12 line items · 1 flag
           Flag: line 7 · unmatched item · timing-difference candidate

  [t+9m]   Reviewer action · flag-resolve (line 7)
           Comment: "Cleared 2026-05-02 post-period accrual; not adjusting."

  [t+11m]  Reviewer action · accept-with-edits
           Diff: line 11 commentary expanded (12 words → 18 words)
           Resulting version: d4f1 v2

  [t+11m]  Approval · d4f1 v2 · approved by Jordan Kim
           Destination: DE-001 working TB · cell range C14:F26

  [t+11m]  Surface delivery · delivered to customer's working TB
           Anchor recorded: Slack thread #close-de-001

The trail stands on its own. None of those events requires the operator to corroborate. None requires the vendor to corroborate. The auditor reads the trail and walks each event back to its source in the customer’s own stack — the IT directory for Jordan’s authority, the workflow definition store for version 3.2, the general ledger as it stood that day for the row counts, the working trial balance for the cell range that received the output. End-to-end chain from invocation to write-back, intact, no gap.
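
Part of that walk-through is mechanical and can be sketched as a check over the events of a run that claims to have completed. The event dictionaries and their keys (`kind`, `version`, `resulting_version`) are hypothetical, as is the function name `walk_through`; the checks mirror the example above rather than any published Yig format.

```python
from __future__ import annotations


def walk_through(events: list[dict]) -> list[str]:
    """Return what an auditor could not establish from these events alone."""
    findings = []
    kinds = [event["kind"] for event in events]

    if not kinds or kinds[0] != "run_invocation":
        findings.append("chain has no head: first event is not a run invocation")
    if not kinds or kinds[-1] != "surface_delivery":
        findings.append("chain has no tail: last event is not a surface delivery")

    # Track the latest draft version the trail itself accounts for.
    latest_version = None
    for event in events:
        if event["kind"] == "draft_production":
            latest_version = event["version"]
        elif (event["kind"] == "reviewer_action"
              and event.get("resulting_version") is not None):
            latest_version = event["resulting_version"]
        elif event["kind"] == "approval" and event["version"] != latest_version:
            findings.append(
                f"approval cites v{event['version']} but the latest "
                f"recorded draft is v{latest_version}"
            )

    if "approval" not in kinds:
        findings.append("no approval recorded: the gate was never satisfied")
    return findings
```

An empty result means only that the joints are present and the approved version is accounted for. Tracing each event back to the directory, the workflow store, and the ledger is still the auditor's work.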

The walk-through above is the easy case — a run that completed, a reviewer who acted, a draft that landed. The harder case is the run that failed, the reviewer who walked away mid-cycle, the draft that was rejected and never re-attempted, the surface delivery that queued for an hour and was overtaken by a fresh run with the same scope. The architecture has answers — incomplete runs are themselves audit-meaningful, failed deliveries are events, the chain continues with the gaps named rather than hidden. Whether those answers hold under questioning is what the next year of audits will test. If they do not hold, we will have shipped a trail that looked like a contract and was not.

A run trail that holds under questioning is the agent’s standing in the room. Without one, the agent has no standing at all.

The bet behind the architecture is simple. An agent in finance has to produce work a stranger can defend. Telemetry will not do it. Most logging will not do it. A trail built to the reconstruction invariant might. If it does not, the agent has no place in the close.