Designing for Trust: A UX Case Study on Autonomous Systems

What does a human's screen look like when software starts acting on their behalf? I spent a concept project finding out.


By Manish Todkari · Senior Product Designer


A financial controller named Dana sits down on a Monday morning. Before her first coffee, 44 of her company's supplier invoices have already been paid. She didn't pay them. She didn't approve them. She didn't even see them. An AI agent did: software with standing authority to move the company's real money.

Here's the question I couldn't stop thinking about, the one that became this entire project: what is on Dana's screen?

It can't be the accounts-payable dashboard she used last year. That interface was built on the assumption that Dana processes invoices — queues to work through, fields to fill, tasks to clear. The agent just absorbed all of that. An interface organized around doing is dead weight when the doing has been delegated.

And it can't be a chat window — even though that's what most of our industry is currently bolting onto every agent it ships.

This is a case study about what the screen should be. I built a concept called Kestrel to work it out in detail. No real product, no real payments — a high-fidelity prototype whose only job is to make a way of thinking concrete. This is that thinking.


The shift nobody redesigned for

For forty years, interface design has optimized a single variable: the efficiency of human action. Fewer clicks. Clearer affordances. Shorter paths from intention to done. Every heuristic most of us learned — Fitts's Law, progressive disclosure, recognition over recall — quietly assumes the human is the one performing the work.

Autonomous agents break that assumption at the root. When software can take a goal, plan the steps, and execute them on its own, the human's role changes from doing to delegating and verifying.

But here's the part that makes it a genuine design problem rather than a convenience: accountability doesn't transfer with the labor. When the auditors arrive, they come to Dana. When a fraudulent vendor slips through, it's her signature on the controls documentation. The agent took over the work; Dana kept the responsibility.

That gap — autonomous action, human accountability — is, I'd argue, the defining interface-design problem of the next decade. And it reframes the central question of the whole field. Not "how do we make agents more capable?" (the AI labs are racing on that). Instead: what does a person need to see, and when, to responsibly trust a system acting on their behalf?


Why chat is the wrong instrument

I want to be precise here, because chat is genuinely good at one thing: capturing intent. Natural language is the most expressive interface ever built for telling a system what you want. "Pay our suppliers, optimize for cash flow, capture early-payment discounts, and flag anything unusual" — no form or settings page will ever express that better.

But intent is maybe five percent of Dana's relationship with this agent. The other ninety-five percent is oversight — and conversation is one of the worst oversight instruments ever devised.

You cannot audit a chat thread. You cannot filter one. You cannot hand a message log to your CFO and call it a report. "Everything looks fine!" from an eager assistant is not a control any auditor will accept. Conversation is linear, ephemeral, and ordered by time — the precise opposite of what verification demands, which is structure, persistence, and ordering by importance.

We have actually seen this story before. The command line was powerful, expressive, and beloved by experts — and it kept computing locked away from everyone else for a generation, because querying a system's state one line at a time is brutal compared to seeing it. The graphical interface didn't win by being prettier. It won by making state visible.

Bolting a chat box onto an autonomous agent is handing every user a command line and calling it the future. The real design work is everything around the conversation — the layer that lets an accountable human verify what the agent did. I started calling it the verification layer, and designing it is what Kestrel is about.


Borrowing from systems that solved delegation first

Before sketching a single screen, I went looking for precedent — because humans have delegated consequential work to semi-autonomous systems for decades, and the mature versions all converge on a strikingly similar shape.

Aviation automation. Autopilot never removed the pilot; it transformed the cockpit. Three lessons mapped directly onto Kestrel. Mode awareness: mode confusion, where a pilot doesn't know what mode the system is in, is one of the most-studied contributors to cockpit automation accidents, which is why Kestrel's autonomy level is permanently visible and never buried. Tiered alerting: alerts are ranked by the response they require, not by how excited the system is. Manage by exception: the pilot monitors deviations, not every routine input.

Fund management. A portfolio manager acts autonomously inside an investment mandate — a formal document defining exactly what they may do with someone else's money. The mandate is the interface between trust and action. It became the direct ancestor of Kestrel's scope contract.

Executive assistants. Effective human delegation runs on standing instructions ("book anything under $500 without asking"), briefings organized around decisions needed rather than activity performed, and autonomy that's earned through track record.

The synthesis was clear: every mature delegation system makes authority explicit, organizes attention around exceptions, explains its actions, and expands trust incrementally. None of them runs on chat. Those four properties — plus a fifth I'll come to — became the design primitives.


The concept: Kestrel

Kestrel is the oversight interface for Mira, an AI agent running accounts payable end to end for a fictional mid-sized distributor, Northgate Supply Co. Mira ingests invoices from email and EDI, matches them against purchase orders and contracts, validates pricing and terms, schedules payments to optimize cash position, and executes them — autonomously, with real money, inside an authority that Dana defines and can withdraw.

I chose money deliberately: it's the domain where the stakes are most legible and least forgiving. A wrong payment isn't a bad autocomplete; it's money gone. And I named the system for the bird on purpose — a kestrel hovers in place, watching, and strikes only when something warrants it. Constant vigilance, rare intervention. That's the posture the entire interface is built to support.

Five patterns came out of the work. None of them is exotic — and that's the point. They're the cards, modals, and tabs of the agentic era: buildable today, with components every design system already has.


Pattern 1 — The Scope Contract: authority as a first-class object

[Figure: Scope Contract]

Every agentic product has authorization logic somewhere. Almost all of them bury it in a settings page, expressed as toggles and number fields that nobody reads as policy. Kestrel promotes authority to a primary surface — a living, plain-language document modeled on the investment mandate it descends from.

It states permissions (which invoice categories Mira may process, which payment rails it may use), limits (no single payment over $25,000; a weekly aggregate ceiling; first payment to any new vendor always requires approval), standing instructions ("always capture early-payment discounts above 1.5%; never pay before due date for vendors on the dispute list"), and escalation rules (what triggers a hold, who's notified, and what happens by default if no one responds).

Two design decisions matter most here. First, the contract reads as policy with the enforcement logic attached — Dana reads rules, not configuration. Second, every single action in the system links back to the clause that authorized it. Tap a payment and one line reads "Authorized under Scope §2.3 — verified PO match under $25,000." Authority isn't ambient or implied. It's citable.
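To make "policy with the enforcement logic attached" concrete, here is a minimal sketch in Python. It is not taken from the Kestrel prototype: the `Clause`, `Payment`, and `authorize` names, the field layout, and the section numbering are all assumptions for illustration. The idea it shows is that each clause carries both the sentence Dana reads and the check the system runs, and that an authorized action comes back with a citation rather than a bare yes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Clause:
    section: str             # e.g. "2.3" (numbering is illustrative)
    text: str                # the plain-language rule Dana reads
    max_amount: float        # enforcement attached to the rule
    requires_po_match: bool


@dataclass(frozen=True)
class Payment:
    vendor: str
    amount: float
    po_matched: bool
    first_payment_to_vendor: bool


def authorize(payment: Payment, clauses: List[Clause]) -> Optional[str]:
    """Return a citation for the first clause covering the payment, else None."""
    if payment.first_payment_to_vendor:
        # First payment to any new vendor always requires approval.
        return None
    for clause in clauses:
        if payment.amount > clause.max_amount:
            continue
        if clause.requires_po_match and not payment.po_matched:
            continue
        return f"Authorized under Scope §{clause.section}: {clause.text}"
    return None  # nothing in the contract covers this; escalate to a human
```

Returning `None` instead of raising keeps "not authorized" an ordinary outcome that the interface can route to an approval card.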

How this differs from traditional thinking: conventional enterprise software treats permissions as admin plumbing — invisible to the everyday user, owned by IT. Kestrel treats authority as the most important thing on the controller's desk, because when a system acts on your behalf, the boundary of what it's allowed to do is the product.


Pattern 2 — Graduated Intervention: a dial, not a switch

[Figure: Autonomy]

The naive model of agent control is binary — on or off, trusted or not. Real delegation is never binary, so Kestrel's core control is a three-position dial, settable globally and per category of work:

The dial's current position is permanently visible in the header on every screen — aviation's hardest-won lesson, applied literally. And critically, the dial is cheap to move. Dropping a vendor category from Autonomous back to Approve-first is one interaction, not a settings expedition. This is a real design principle, not a detail: if retracting trust is expensive, people either never extend it or never pull it back. Both are failure modes. A trust control that only ratchets one way isn't a control.
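A rough sketch of the dial as a data structure, assuming a per-category setting simply overrides the global one (the text does not say how the two interact). `AutonomyDial` and its methods are hypothetical names. The text names only the Autonomous and Approve-first positions, so `SUGGEST_ONLY` below is a placeholder for the third.

```python
from enum import Enum
from typing import Dict, Optional


class Autonomy(Enum):
    SUGGEST_ONLY = 0    # placeholder name; the third position is not named in the text
    APPROVE_FIRST = 1
    AUTONOMOUS = 2


class AutonomyDial:
    def __init__(self, global_level: Autonomy) -> None:
        self.global_level = global_level
        self.overrides: Dict[str, Autonomy] = {}

    def set_level(self, level: Autonomy, category: Optional[str] = None) -> None:
        """One call moves the dial, up or down, globally or for one category."""
        if category is None:
            self.global_level = level
        else:
            self.overrides[category] = level

    def level_for(self, category: str) -> Autonomy:
        # Assumption: a category override wins over the global setting.
        return self.overrides.get(category, self.global_level)
```

Retracting trust uses the same single call as extending it, which is the "cheap to move" property in miniature.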

How this differs from traditional thinking: most "automation settings" are set-and-forget configuration. Kestrel treats the autonomy level as a living, frequently-adjusted instrument that the human is expected to move as their confidence shifts — closer to a thermostat you nudge daily than a preference you set once.


Pattern 3 — Confidence and Provenance: every action carries its receipts

[Figure: Action Detail]

When Mira executes a payment autonomously, the action arrives with a complete, structured explanation — rendered as a provenance trail that connects each step:

The match (invoice → purchase order → goods receipt, shown, not asserted). The checks (pricing vs. contract, vendor bank details vs. records, duplicate screen — each with a pass/fail and the underlying data). The confidence, displayed honestly as a band — High, Medium, or Review — never a theatrical "98.7%." False precision is its own dark pattern: a two-decimal score implies a rigor the model doesn't have, and erodes exactly the trust it's trying to manufacture. And the reversal path: whether the action can be undone, how, and for how long ("ACH recall available for 36 more hours").

This panel is the entire difference between an agent that is trusted and one that is merely unsupervised.
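Two pieces of that panel are easy to sketch: collapsing a raw model score into an honest band, and stating how long a reversal remains possible. This is only an illustration. The thresholds, the function names, and the 48-hour window in the usage note are assumptions, not values from the prototype or from any payment network's rules.

```python
import math
from datetime import datetime, timedelta


def confidence_band(score: float) -> str:
    """Collapse a raw score into the three bands the panel shows."""
    # Thresholds are placeholders; real ones would come from calibration data.
    if score >= 0.9:
        return "High"
    if score >= 0.6:
        return "Medium"
    return "Review"


def reversal_note(executed_at: datetime, window_hours: int, now: datetime) -> str:
    """Describe how much of the reversal window is left."""
    remaining = executed_at + timedelta(hours=window_hours) - now
    if remaining <= timedelta(0):
        return "Reversal window closed"
    hours_left = math.ceil(remaining.total_seconds() / 3600)
    return f"ACH recall available for {hours_left} more hours"
```

With a hypothetical 48-hour window, a payment executed at 08:00 and viewed at 20:00 the same day reads "ACH recall available for 36 more hours".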

How this differs from traditional thinking: traditional systems either hide their internal logic entirely or, worse, dress up uncertainty as false precision. Kestrel's stance is that an agent acting on your behalf owes you its reasoning in a form you can audit — and owes you honesty about how sure it actually is.


Pattern 4 — Exception-First Attention: the interface foregrounds what needs you

[Figure: Dashboard]

Dana's home screen inverts the dashboard tradition. Most dashboards lead with activity — totals, charts, a feed of everything that happened. Kestrel leads with what needs Dana and pushes what-happened into the background.

On this particular morning, three invoices are held for review: a price variance beyond tolerance (Hollis Packaging billed 11.8% over the PO), a first-time vendor awaiting its required first-payment approval (Brightline Freight), and a suspected duplicate (Corex Office Supply). Each is a card stating what Mira found, what it recommends, what happens if Dana does nothing, and by when — because an unbounded "pending" state is how oversight quietly rots. Below them sits the agent's posture (cash position, payments scheduled, how close it is to its weekly ceiling). And at the bottom, the routine compresses into a single quiet, expandable row: "44 invoices matched and paid · 100% within scope."
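The card structure described above might be modeled like this. It is a sketch, not the prototype's code: `ExceptionCard` and `home_screen` are hypothetical names, and sorting by deadline is only a stand-in for whatever importance ranking a real system would use. Note that the deadline is a required field, so an unbounded "pending" card cannot be constructed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ExceptionCard:
    vendor: str
    finding: str             # what Mira found
    recommendation: str      # what it recommends
    default_if_ignored: str  # what happens if Dana does nothing
    respond_by: datetime     # and by when (required, never open-ended)


def home_screen(cards: List[ExceptionCard], paid: int, paid_in_scope: int) -> List[str]:
    """Exceptions first, nearest deadline on top; routine work is one summary row."""
    ordered = sorted(cards, key=lambda card: card.respond_by)
    rows = [f"{card.vendor}: {card.finding}" for card in ordered]
    share = round(100 * paid_in_scope / paid) if paid else 100
    rows.append(f"{paid} invoices matched and paid · {share}% within scope")
    return rows
```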

The success metric for this screen is deliberately strange: on a good day, Dana should be able to close it in ninety seconds. This interface isn't trying to maximize engagement. It's trying to spend Dana's attention only where it's warranted: quiet when things are fine, loud only when they aren't.

How this differs from traditional thinking: engagement-optimized design wants you to stay. Oversight-optimized design wants to give you back your morning. Time-on-screen, the north-star metric of consumer UX, is here a cost to be driven down.


Pattern 5 — Trust Calibration: is your trust correctly placed?

[Figure: Analytics]

This is the pattern I haven't seen any shipping agent product attempt, and it was the late discovery that reshaped the whole project. Trust in an agent isn't a setting you configure once — it's a position you hold and continuously re-evaluate, like a portfolio. So the interface has to support that re-evaluation with evidence.

Kestrel's analytics treat trust itself as the managed object. They show the agent's track record (auto-resolution rate over time, exception rates by type, error and reversal counts) and — most importantly — the accuracy of the agent's own confidence: when Mira said "high confidence," how often was it actually right? That's plotted against the history of autonomy changes, and the panel makes evidence-based recommendations in both directions: "Office-supplies vendors: 340 consecutive approvals, zero corrections in 90 days — consider moving to Autonomous," or equally, a flag to pull a category back if errors cluster.
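The confidence-accuracy question ("when Mira said high confidence, how often was it right?") reduces to a small calculation, sketched below alongside a two-way recommendation rule. The function names and both thresholds (`min_streak`, `max_corrections`) are assumptions for illustration. The text gives only the example of 340 consecutive approvals with zero corrections in 90 days.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def confidence_accuracy(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """For each stated band, the share of actions that turned out correct."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for band, was_correct in outcomes:
        totals[band] += 1
        correct[band] += int(was_correct)
    return {band: correct[band] / totals[band] for band in totals}


def recommend(category: str, consecutive_approvals: int, corrections_90d: int,
              min_streak: int = 300, max_corrections: int = 3) -> Optional[str]:
    """Suggest a dial change in either direction, or nothing."""
    if corrections_90d >= max_corrections:
        return f"{category}: consider pulling back to Approve-first"
    if corrections_90d == 0 and consecutive_approvals >= min_streak:
        return f"{category}: consider moving to Autonomous"
    return None
```

The rule only suggests. Moving the dial stays with the human.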

How this differs from traditional thinking: traditional dashboards report what a system did. Kestrel reports whether you should change your relationship with the system. It closes the loop between the agent's performance and the human's delegation decisions — turning trust from a gut feeling into a calibrated, revisable judgment.


Designing the failure paths

Concept work earns its credibility in the edge cases, so I designed four in full:

The fraudulent bank-detail change. A vendor "emails" new banking details — the classic accounts-payable fraud. Kestrel treats payment-detail changes as a hard exception regardless of autonomy mode, freezes the vendor's queue, and shows a side-by-side of old vs. new with the source of the change. No amount of accumulated trust makes this auto-executable. Some exceptions are constitutional, not configurable.

The confident mistake. Mira pays a duplicate it scored as high-confidence. The reversal matters less than what follows: the error stays permanently visible in the confidence-accuracy analytics, the vendor category is flagged for autonomy review, and the register entry is marked reversed-with-cause. An interface that quietly buries its agent's mistakes is training the user to distrust everything else it says.

The deadline collision. An exception sits unresolved as a payment deadline approaches. The card escalates visibly, states its default ("will hold past due date — late fee $120, discount window closes"), and notifies per the contract's escalation rules. The agent never silently picks between two bad options on your behalf.

The scope boundary. An invoice arrives for $25,400 against a $25,000 ceiling. Mira doesn't pay it and doesn't reject it — it packages it: validation complete, one-tap approve, with a note that this vendor has hit the ceiling four times this quarter and the contract may warrant amendment. The agent is allowed to suggest changes to its own authority. It is never allowed to make them.
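Two of these cases, the bank-detail change and the scope boundary, come down to the order in which checks run. A minimal sketch, with hypothetical names (`Route`, `Invoice`, `route`) and a simplified boolean for the autonomy mode:

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    EXECUTE = "execute"
    PACKAGE_FOR_APPROVAL = "package for one-tap approval"
    HARD_HOLD = "freeze vendor queue"


@dataclass(frozen=True)
class Invoice:
    vendor: str
    amount: float
    bank_details_changed: bool


def route(invoice: Invoice, autonomous: bool, ceiling: float = 25_000) -> Route:
    # Constitutional exception: checked before the autonomy mode is consulted,
    # so no amount of accumulated trust can make it auto-executable.
    if invoice.bank_details_changed:
        return Route.HARD_HOLD
    # Over the ceiling: neither pay nor reject; hand a validated package to a human.
    if invoice.amount > ceiling or not autonomous:
        return Route.PACKAGE_FOR_APPROVAL
    return Route.EXECUTE
```

There is deliberately no code path here that changes `ceiling`. Suggesting an amendment would be a separate message to Dana, never a write.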


What this project argues

Three claims, and I think all three travel well beyond finance:

1. The verification layer is the product. As capable models commoditize execution, the durable design work is the oversight architecture — the thing that makes autonomy adoptable by an accountable human. The agent doing the task is becoming table stakes; the interface that lets you trust it is the differentiator.

2. The primitives generalize. Scope contracts, graduated intervention, confidence-with-provenance, exception-first attention, and trust calibration apply to any consequential agent — code deployment, clinical administration, marketing spend, logistics. Accounts payable was the test case, not the point.

3. Chat is the intent layer, not the interface. Conversation remains the best way to tell an agent what you want. It's among the worst ways to verify what it did. The products that win this era will understand those are two different design problems and build for both.

And the implication I'd put to every designer reading this: the craft of interface design isn't shrinking as agents rise. It's relocating — from choreographing human actions to architecting machine trust. The products that win won't have the smartest agents. They'll be the ones an accountable human can confidently leave alone.


Honest limitations

This is a concept. No usability testing, no real transaction data, a fictional client and a fictional agent. The ninety-second-dashboard claim, the confidence-band model, and the trust-calibration recommendations are all design hypotheses that deserve research. The natural next step is exactly that: put the oversight dashboard in front of working controllers and find out where "manage by exception" collides with how finance teams actually behave under audit pressure.

I built Kestrel to think clearly about a question the whole industry is about to face. The patterns are free to learn from and build on.


Kestrel is a self-initiated design exploration. Northgate Supply Co., Dana Mercer, the agent Mira, and all figures are fictional. The interactive prototype and full source are linked below.

Live prototype: https://tmanish.github.io/kestrel/docs
Source: https://github.com/tmanish/kestrel

If you're working on agentic products and wrestling with the oversight problem, I'd genuinely like to compare notes. Find me on LinkedIn.

Try it yourself. The prototype is clickable end to end.