Standard Bank IVR — Rohan Saraf

The constraint that defined everything

Standard Bank operated one of the largest telephone banking IVR systems in South Africa. For a significant portion of the bank's customer base, the IVR was the primary banking interface. Many of these users lived in rural and peri-urban areas, many were semi-literate, and many used feature phones without data connectivity.

These were not power users navigating a system they understood. These were people for whom telephone banking was the only banking. When the system failed them, they didn't switch to the mobile app or the web portal. They stopped using the bank.

The challenge of an IVR is radical: no visual affordance at all. No confirmation that you've pressed the right button until you hear what happens next. No way to scan ahead. No ability to recognise a control — you have to remember it. And unlike any other interface, the cost of making a wrong turn is time, anxiety, and eroding trust in a system the user can't see.

What we found

The initial research mapped how users actually thought about their banking tasks, and compared that mental model to how the IVR was organised. The gap was significant. The IVR had been architected around the bank's internal product categories. Users organised their mental model around their own goals: pay something, check something, move money.

Card sorting revealed the organisation that made sense to users. Mental model mapping revealed the gaps between their expectations and what the IVR actually did. Together, they gave us enough to redesign the information architecture from the user's perspective rather than the bank's.

We also discovered that audio-only interfaces have a specific failure mode that visual interfaces almost never encounter: word salience. In a spoken prompt, users don't process the full sentence in sequence and extract meaning. They latch onto the most prominent or familiar word in the prompt — and respond to that word, not to the sentence's actual instruction.

"Please enter your telephone banking PIN" led users to enter their phone number, because they heard the word "telephone" and assumed the prompt was asking for it. One word in the wrong position. The word "telephone" had higher salience than "PIN."

The Wizard of Oz test

To test the redesigned IVR before any technical implementation, we ran a Wizard of Oz study. The methodology: a Word document with the full script of the new IVR, two rooms separated by a partition, a researcher with the script acting as the IVR, and a camera recording the user's responses and behaviours.

The user heard a voice through a telephone handset. They believed they were speaking to the real system. The researcher was manually navigating the script based on what the user pressed. The camera captured every hesitation, every mismatch between expectation and outcome, every moment of confusion or confidence.

Wizard of Oz testing is expensive in time and effort. It is also the most accurate way to test a conversational system before it exists. You learn things from watching a real person navigate a real script that you cannot learn from any static prototype or heuristic evaluation.

Round 1 failure

The first round failed. The redesigned architecture was sound, but individual prompts were still producing the salience problem in different places. Users were navigating to the wrong branches not because the architecture was wrong but because the phrasing was wrong.

This was the finding: architecture matters, but in an audio-only interface, phrasing is architecture. The order of words in a prompt is a design decision with measurable consequences.

The word-order discovery

The specific case that clarified this: the PIN prompt. We tested two versions:

Version A — failed

"Please enter your telephone banking PIN."

Users entered their phone number. High salience: "telephone."

Version B — succeeded

"Please enter the PIN for your telephone banking."

Users entered their PIN. High salience: "PIN" (now early in sentence).

Nearly the same words. Different order. Dramatically different user behaviour. Moving "PIN" earlier in the sentence meant it was the word users processed first and anchored to. "Telephone banking" became the qualifier, the context for the action, rather than the word the user responded to.
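The word-order finding can be turned into a rough pre-test check on draft prompts. The sketch below is an illustration only, not something the project used: the function names and the idea of a hand-picked list of "distractor" words are assumptions, and a check like this can only flag candidates for testing with real users, not replace that testing.

```python
import re


def first_position(prompt: str, word: str) -> int:
    """Return the index of `word` among the prompt's words, or -1 if absent."""
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    target = word.lower()
    return words.index(target) if target in words else -1


def distractors_before_target(prompt: str, target: str, distractors: list[str]) -> list[str]:
    """List the distractor words that are spoken before the target word.

    `distractors` is a hypothetical, hand-maintained list of familiar words
    (such as "telephone") that users might anchor to instead of the target.
    """
    target_pos = first_position(prompt, target)
    if target_pos == -1:
        raise ValueError(f"target word {target!r} is not in the prompt")
    flagged = []
    for word in distractors:
        pos = first_position(prompt, word)
        if pos != -1 and pos < target_pos:
            flagged.append(word)
    return flagged


version_a = "Please enter your telephone banking PIN."
version_b = "Please enter the PIN for your telephone banking."

print(distractors_before_target(version_a, "PIN", ["telephone"]))  # ['telephone']
print(distractors_before_target(version_b, "PIN", ["telephone"]))  # []
```

Version A is flagged because "telephone" is heard before "PIN"; Version B is not.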

I wrote WHEN IN DOUBT, TEST IT OUT on a whiteboard and left it there. It was the most direct summary of what the project had demonstrated.

Earcons and graduated recovery

Two other significant design decisions came out of the research: earcons and graduated error recovery.

Earcons — brief, distinctive audio tones used as auditory icons — were introduced to provide consistent positional cues throughout the system. The same earcon at the start of every main menu level gave users an acoustic landmark. This was the audio equivalent of visual hierarchy.

Graduated error recovery was designed so that repeated errors escalated predictably — first a clearer prompt, then a simplified prompt, then a transfer to a human agent. The transfer to a human wasn't a failure state — it was a designed handoff. We made it explicit: "Let me connect you to someone who can help." This is the IVR equivalent of an Accountable Handback. The system acknowledges it can't complete the task and explicitly passes control to a human.
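As a minimal sketch of that escalation, assuming a simple counter of consecutive errors on one prompt: the class name, the two example prompt texts, and the exact thresholds are hypothetical. Only the three-step order and the handoff wording come from the project.

```python
from dataclasses import dataclass

# Handoff wording taken from the case study.
HANDOFF_MESSAGE = "Let me connect you to someone who can help."


@dataclass
class RecoveryPolicy:
    """Escalation for repeated errors on one prompt (hypothetical structure)."""

    clearer_prompt: str
    simplified_prompt: str
    handoff_message: str = HANDOFF_MESSAGE

    def respond(self, error_count: int) -> tuple[str, bool]:
        """Return (text to speak, transfer to agent?) for the nth consecutive error."""
        if error_count < 1:
            raise ValueError("error_count starts at 1")
        if error_count == 1:
            return self.clearer_prompt, False
        if error_count == 2:
            return self.simplified_prompt, False
        # Third and later errors: explicit, designed handoff to a human.
        return self.handoff_message, True


# Example prompt texts are assumptions for illustration.
policy = RecoveryPolicy(
    clearer_prompt="Please enter the PIN for your telephone banking, then press hash.",
    simplified_prompt="Enter your PIN now.",
)

for n in (1, 2, 3):
    print(n, policy.respond(n))
```

Keeping the handoff as an ordinary return value, rather than an exception, mirrors the point that the transfer is a designed step and not a failure state.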

The consistency of the * and # keys across every menu level was also formalised — * always goes back, # always confirms. This is the audio equivalent of a consistent affordance. Users could learn it once and rely on it everywhere.
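One way to guarantee that consistency is to handle the two keys in a single place, before any menu-specific logic runs. The sketch below assumes a menu tree stored as nested dictionaries; the menu names, the digit mappings, and the `handle_key` function are hypothetical.

```python
BACK_KEY = "*"     # always goes back
CONFIRM_KEY = "#"  # always confirms

# Hypothetical menu tree: menu name -> {digit: child menu name}.
MENUS = {
    "main": {"1": "pay", "2": "check", "3": "move"},
    "pay": {},
    "check": {},
    "move": {},
}


def handle_key(menu_path: list[str], key: str,
               menus: dict[str, dict[str, str]]) -> tuple[list[str], str]:
    """Apply one key press and return (new menu path, action).

    The two global keys are checked first, so no individual menu
    can redefine what * and # mean.
    """
    if key == BACK_KEY:
        # Go up one level, but never above the root menu.
        new_path = menu_path[:-1] if len(menu_path) > 1 else menu_path
        return new_path, "back"
    if key == CONFIRM_KEY:
        return menu_path, "confirm"
    options = menus.get(menu_path[-1], {})
    if key in options:
        return menu_path + [options[key]], "enter"
    return menu_path, "invalid"


path = ["main"]
path, action = handle_key(path, "1", MENUS)
print(path, action)  # ['main', 'pay'] enter
path, action = handle_key(path, "*", MENUS)
print(path, action)  # ['main'] back
```

Because the global keys are resolved before the per-menu lookup, a menu author cannot accidentally assign * or # a different meaning at one level.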

Why this connects forward

An IVR is a proto-agent interface. It navigates on behalf of a user. It maintains state across a conversation. It makes decisions about what to present next based on what the user has done. It has no visual layer — it has to be understood purely through structure and language.

The design challenges of an IVR — word salience, graduated trust, legible handoffs, consistent affordances across an invisible interface — are the same design challenges that arise when designing for LLM-based conversational AI systems, just with different affordances and a much higher capability ceiling.

When I think about how AI agents should communicate intent, declare state, and hand back control to humans, I'm drawing on what I learned in two rooms with a Word document and a telephone handset in South Africa.

Design decisions

Architecture mapped to user mental model (not bank product structure)

Word order as design: PIN before telephone

Earcons as acoustic landmarks

Consistent * / # across all levels

Graduated recovery → intentional human handoff

Connected thinking

Accountable Handback pattern

Bae: conversational extraction

"Sometimes the solution is a simple change in design. When in doubt, test it out."