
Phone Self-Service for Government: Touch-Tone, Speech Recognition, and Conversational AI

  • pbahar9

GOVERNMENT TECHNOLOGY BRIEF


A factual comparison to support informed decision-making by state and local government leaders

State and local government agencies - courts, municipalities, counties, and other public bodies - have operated phone self-service systems for decades. The underlying technology has evolved through three distinct generations: touch-tone IVR, speech recognition IVR, and conversational AI. Each carries different capabilities, costs, risks, and operational requirements. This brief describes each generation factually and side by side, so agency leaders can evaluate which approach, or combination of approaches, fits their environment.


Government agencies are not private businesses. Budget constraints are real, staffing is limited, and public trust is a foundational obligation. These realities shape how phone technology should be evaluated regardless of the agency type.


The Three Generations of Government Phone Self-Service


Generation 1 — Touch-Tone IVR (DTMF)

Touch-tone IVR, based on Dual-Tone Multi-Frequency (DTMF) signaling, has been the backbone of government phone systems since the 1980s and 1990s. Callers listen to a recorded menu and press a number on their keypad to select an option. The system routes calls based on which key is pressed.


How It Works

A caller dials in and hears a recorded prompt: “For permit status, press 1. For payment information, press 2.” The caller presses a key; the system routes accordingly. There is no language processing and no interpretation of intent. The system responds only to the specific key pressed.

The call flow is authored entirely in advance by agency staff or a vendor. Every branch, every prompt, and every response is pre-recorded or pre-scripted. The system does exactly what it is configured to do—nothing more.
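The fully pre-authored, deterministic nature of a DTMF call flow can be sketched in a few lines. The route names and menu options below are illustrative, not drawn from any specific vendor platform.

```python
# Minimal sketch of a touch-tone (DTMF) call flow: every branch is
# authored in advance, and each keypress maps to exactly one route.
MENU = {
    "1": "permit_status_queue",
    "2": "payment_information_queue",
    "0": "live_operator",  # the "zero-out" escape hatch
}

def route_call(key_pressed: str) -> str:
    """Return the destination for a keypress; unknown keys replay the menu."""
    return MENU.get(key_pressed, "replay_main_menu")

print(route_call("1"))  # → permit_status_queue
print(route_call("7"))  # → replay_main_menu
```

Because every input maps to exactly one pre-defined output, the same table doubles as the audit record: a key log is a complete account of what each caller selected.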


What Touch-Tone Does Well

  • Handles high-volume routine inquiries reliably (case status, payment information, permit status, jury duty, utility billing, license renewals)

  • Fully deterministic: every caller receives exactly the same response for the same input

  • Little-to-no ongoing tuning, training, or monitoring required once deployed

  • Compatible with all phone types, including landlines with no internet connection

  • Straightforward to audit: call logs map directly to menu selections

  • Accessible to callers with limited English proficiency when paired with pre-recorded multilingual prompts

  • Vendor-neutral: runs on most telephony platforms

 

Where Touch-Tone Falls Short

  • Menu depth: depending on complexity, callers may be required to navigate 3–5 levels to reach relevant information

  • Does not understand intent—callers must conform to the system’s structure

  • Callers unfamiliar with the menu structure may struggle to locate the right option

  • Adding new inquiry types requires reprogramming call flows and re-recording prompts

  • High “zero-out” rates when callers cannot find their option or become frustrated


Generation 2 — Speech Recognition IVR (ASR / NLU)

Speech recognition IVR emerged broadly in the 2000s. Instead of pressing a key, callers speak their response. The system uses Automated Speech Recognition (ASR) to convert spoken words to text, and Natural Language Understanding (NLU) to map that text to a pre-defined intent.

This is not the same technology as conversational AI. Speech recognition IVR is still menu-driven and rule-based—it accepts voice instead of keypad input, but the system still operates within a defined call flow authored by humans.


How It Works

A caller hears: “What can I help you with today?” The caller says “permit renewal.” The ASR engine transcribes the audio. The NLU engine matches “permit renewal” to the appropriate intent. The system routes accordingly—the same destination as if the caller had pressed the corresponding key in a touch-tone system.

More advanced speech IVR systems can capture spoken data—account numbers, dates of birth, zip codes—and validate them against a database lookup. These remain rule-based transactions, not open-ended conversations.
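The rule-based character of speech IVR can be made concrete with a sketch of the intent-matching step, assuming the ASR engine has already transcribed the caller's words. The phrase lists (the "grammar") are hypothetical; production systems use vendor-specific grammar formats.

```python
# Rule-based intent matching: the transcript either matches a phrase in
# the authored grammar or it does not. There is no generation involved.
GRAMMAR = {
    "permit_renewal": ["permit renewal", "renew my permit", "renew permit"],
    "payment": ["make a payment", "pay my bill", "payment"],
}

def match_intent(transcript: str) -> str:
    text = transcript.lower().strip()
    for intent, phrases in GRAMMAR.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "no_match"  # falls back to a re-prompt or a live operator

print(match_intent("I need to renew my permit"))  # → permit_renewal
print(match_intent("something unexpected"))       # → no_match
```

A caller whose phrasing falls outside the authored phrase lists lands in `no_match` - which is exactly the failure mode described below for callers who speak outside the defined grammar.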


What Speech Recognition IVR Does Well

  • More natural entry point for callers who do not know which menu option matches their need

  • Supports spoken capture of structured data (account numbers, dates, reference numbers)

  • Can reduce menu depth for callers who speak their intent clearly

  • Still deterministic within defined intents - responses remain controlled and auditable

  • Established technology with a long track record in government environments

 

Where Speech Recognition IVR Falls Short

  • Recognition accuracy varies by accent, background noise, and call quality—typical accuracy rates range from 85–95%, meaning 5–15% of utterances may be misrecognized

  • Requires initial grammar or intent design and ongoing tuning as recognition errors surface

  • Callers who speak outside the defined grammar may receive incorrect routing

  • Does not handle open-ended questions or multi-step reasoning

  • More complex to implement and maintain than touch-tone; vendor dependency increases

  • Callers with strong accents, speech impairments, or poor audio connections experience higher failure rates

  • Grammar maintenance—updating recognized phrases and intents—requires ongoing staff time or vendor support


Generation 3 — Conversational AI (Large Language Models)

Conversational AI represents a fundamentally different architecture. Rather than mapping spoken input to pre-defined intents, these systems use Large Language Models (LLMs) to generate responses dynamically based on the caller’s input and a set of governing instructions.

The caller does not need to match a grammar or select a menu option. They can describe their situation in plain language, and the system generates a response. This is the technology behind tools like ChatGPT, and it is increasingly being applied to phone and chat self-service in government settings.


How It Works

A caller says: “I got a summons to report for jury duty next Tuesday but I’m going to be out of town—what do I do?” The system uses ASR to transcribe, then sends the transcription to an LLM. The LLM generates a response based on the agency’s policies and procedures, which have been loaded into the system as reference material.

Unlike speech IVR, the response is generated—not retrieved from a pre-written script. This gives conversational AI its flexibility, and also its risk.


What Conversational AI Does Well

  • Handles a wider range of caller phrasings without requiring grammar updates

  • Can engage in multi-turn exchanges to clarify caller needs

  • Supports robust multilingual capability without separate per-language programming

  • Can reduce zero-out rates for callers with complex or unusual inquiries

  • Can be configured to handle procedural questions that fall outside structured IVR menus

 

Where Conversational AI Falls Short

  • Responses are generated, not scripted - accuracy depends on the quality of governing instructions and ongoing monitoring

  • Can produce incorrect, incomplete, or misleading responses (“hallucination”) if not properly constrained

  • Requires ongoing performance monitoring, prompt refinement, and staff oversight

  • Usage costs are consumption-based and can be unpredictable—see the cost section below

  • Vendor dependency is high; LLM providers set pricing, availability, and model behavior

  • Audit trails are more complex than rule-based systems

  • Establishing guardrails to prevent the AI from providing guidance outside its intended scope requires deliberate design and testing


How Conversational AI Works Under the Hood


Agency leaders evaluating conversational AI do not need to become technologists; however, understanding the three core components of a conversational AI pipeline helps clarify where the system can be controlled - and where it can fail. These three components are ASR (input), RAG (retrieval), and NLG (output).


ASR: Automated Speech Recognition (The Input Layer)

ASR is the component that converts a caller’s spoken words into text. It is present in both speech recognition IVR and conversational AI systems. The quality of ASR directly affects everything downstream—if the caller’s words are transcribed incorrectly, the system works from flawed input.

ASR accuracy varies by vendor, audio quality, accent, and background noise. In a conversational AI system, ASR errors can be partially compensated for by the LLM’s ability to infer meaning from imperfect text. In a speech IVR system, a misrecognized word may cause immediate misrouting.

Government agencies should ask vendors to provide ASR accuracy benchmarks specific to the caller population they serve - particularly for languages, dialects, and demographic groups that are common in their jurisdiction.

 

RAG: Retrieval-Augmented Generation (The Knowledge Layer)

RAG is the mechanism by which a conversational AI system is grounded in agency-approved content. Rather than relying on the LLM’s general training data - which may be outdated, not specific to the jurisdiction, or simply incorrect - RAG directs the system to retrieve relevant information from a defined document set before generating a response.

In a government context, that document set might include fee schedules, procedural rules, hours and location information, eligibility criteria, or frequently asked questions. When a caller asks a question, the RAG system searches the document set for relevant content and passes it to the LLM as context. The LLM then generates a response based on that retrieved content rather than from general knowledge.
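The retrieval step described above can be sketched as follows. Production RAG systems use vector embeddings and semantic search; simple keyword overlap is used here only to make the mechanism concrete, and the document snippets are invented examples.

```python
# Illustrative RAG retrieval: score agency-approved documents against the
# caller's question and pass the best match to the LLM as context.
DOCUMENTS = {
    "jury_duty_reschedule": "To reschedule jury service, submit the deferral "
                            "form at least five days before your report date.",
    "permit_fees": "Building permit fees are based on project valuation.",
}

def retrieve(question: str, top_k: int = 1) -> list:
    """Rank documents by word overlap with the question (toy scoring)."""
    q_words = set(question.lower().split())
    scored = sorted(
        DOCUMENTS.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

context = retrieve("How do I reschedule my jury duty date?")
# `context` is then inserted into the LLM prompt, and the model is
# instructed to answer only from this retrieved, agency-approved content.
print(context[0])
```

Note that retrieval only selects the source material; the LLM still generates the final wording, which is why RAG and NLG remain two separate failure points, as discussed below.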


Why RAG Matters for Government Agencies

  • RAG is the primary mechanism for constraining AI responses to agency-approved information

  • Without RAG, a conversational AI system may draw on its general training data to answer questions - potentially producing responses that are inaccurate for the specific jurisdiction, outdated, or outside the agency’s intended scope

  • With RAG, responses can be traced back to source documents, which supports auditability

  • RAG helps prevent the system from providing information the agency has not reviewed and approved

 

What RAG Does Not Solve

  • RAG reduces but does not eliminate the risk of incorrect responses. The LLM still generates the final answer, and generation can introduce errors even when the retrieved content is accurate

  • The knowledge base must be accurate, current, and complete. If agency documents contain errors or outdated information, RAG will ground responses in that incorrect content

  • Knowledge bases require ongoing maintenance. Policy changes, fee updates, procedural revisions, and new service offerings must be reflected promptly in the source documents

  • RAG does not prevent the system from generating a plausible-sounding response when the knowledge base does not contain a clear answer - this remains a hallucination risk

 

The Operational Burden of RAG

Maintaining a RAG knowledge base is an ongoing operational commitment, not a one-time setup task. Someone within the agency - or a contracted vendor - must own the document set, review it for accuracy, update it when policies change, and test the system’s responses after updates. This is a staffing cost that is often underestimated in initial procurement discussions.


NLG: Natural Language Generation (The Output Layer)

NLG is the component that produces the words a caller hears or reads. In a conversational AI system, the LLM is the NLG engine. It takes the caller’s transcribed input and the retrieved RAG content and generates a response in natural language.

NLG is what makes conversational AI feel different from traditional IVR - responses are fluid, contextual, and can adapt to the specifics of the caller’s question. It is also the source of the technology’s most significant risk for government agencies.


Why NLG Is a Distinct Risk Factor

  • NLG generates text dynamically. Unlike scripted IVR responses, no human has reviewed or approved the specific words a caller receives

  • Even when RAG supplies accurate source content, NLG can phrase a response in a way that is incomplete, ambiguous, or contextually misleading

  • RAG and NLG are two separate failure points. A system can retrieve the correct policy document (RAG working correctly) and still generate a response that misrepresents that policy (NLG introducing error)

  • This distinction - between what is retrieved and what is said - is important for agencies to understand when evaluating vendor claims about system accuracy

 

What This Means for Oversight

Because NLG produces generated output rather than scripted responses, government agencies cannot review and pre-approve every possible response the system might produce. This makes ongoing monitoring a structural requirement rather than an optional quality check. Agencies should establish a regular process for reviewing samples of actual caller interactions, testing the system against known policy questions, and updating governing instructions when responses are found to be inaccurate.
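One way to structure the "testing against known policy questions" step is a golden-question regression check. The sketch below is a hypothetical harness: `ask_system` is a stand-in for the real conversational AI endpoint, and the questions and required phrases are invented examples an agency would replace with its own.

```python
# Ongoing-oversight sketch: replay known policy questions through the
# system and flag answers that are missing required facts.
GOLDEN_QUESTIONS = [
    # (question, phrases an accurate answer must contain)
    ("What are your office hours?", ["8", "5"]),
    ("How do I reschedule jury duty?", ["deferral form"]),
]

def ask_system(question: str) -> str:
    # Hypothetical stand-in for the deployed AI; returns canned text here.
    canned = {
        "What are your office hours?": "We are open 8 a.m. to 5 p.m.",
        "How do I reschedule jury duty?": "Submit the deferral form online.",
    }
    return canned.get(question, "")

def review() -> list:
    """Return the questions whose answers failed the required-phrase check."""
    failures = []
    for question, required in GOLDEN_QUESTIONS:
        answer = ask_system(question)
        if not all(phrase in answer for phrase in required):
            failures.append(question)
    return failures  # a non-empty list means staff follow-up is needed

print(review())  # → [] when all sampled answers contain the required facts
```

Run on a schedule (and after every knowledge base update), a check like this converts "ongoing monitoring" from an aspiration into a repeatable process with a concrete pass/fail output.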


Key Takeaway for Agency Leaders

The three-component pipeline - ASR (input), RAG (retrieval), NLG (output) - represents the architecture of most enterprise conversational AI deployments. Vendors may use different terminology, but the underlying components are consistent.

 

When evaluating a conversational AI product, agencies should ask vendors to explain how each layer works, how failures in each layer are detected, and what controls exist at each stage. A vendor that cannot explain these components clearly may not have the governance architecture a government agency requires.


Side-by-Side Comparison


Capability Comparison


| Capability | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| Call handling mode | Keypress selection | Spoken intent matching | Natural language generation |
| Caller input type | Keypad digits | Spoken words (defined grammar) | Open-ended speech or text |
| Response type | Pre-recorded / scripted | Pre-scripted (routed by intent) | Dynamically generated (NLG) |
| Intent matching | None — key = route | Grammar-based matching | LLM inference |
| Knowledge source | Pre-recorded scripts | Pre-scripted intents | RAG knowledge base + LLM |
| Handles ambiguous input | No | Limited | Better (but not guaranteed) |
| Multi-turn conversation | No | Limited (structured capture) | Yes |
| Language support | Pre-recorded per language | Separate grammar per language | Multi-language via model |
| Reliability / uptime | Very high | High | Dependent on vendor API uptime |
| Deterministic output | Yes | Yes (within grammar) | No — responses are generated |
| Auditability | Simple (key logs) | Moderate (intent logs) | More complex (generated logs + RAG traces) |
| Escalation to staff | Configurable | Configurable | Configurable, but harder to predict trigger |


Operational Comparison

| Operational Factor | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| Initial setup complexity | Low to moderate | Moderate | High |
| Ongoing maintenance | Low (prompt updates only) | Moderate (grammar tuning) | High (RAG knowledge base, prompt, model, monitoring) |
| Staff expertise required | Low | Moderate | Moderate to high |
| Vendor dependency | Low to moderate | Moderate | High |
| Time to deploy changes | Hours to days | Days to weeks | Days (but testing required) |
| Risk of incorrect output | Very low (scripted) | Low (bounded by grammar) | Moderate to high without oversight |
| Monitoring requirements | Low | Moderate | Ongoing / continuous |
| Knowledge maintenance | None (scripts are static) | Grammar updates per intent | RAG knowledge base must be kept current |
| Performance degradation over time | None (static) | Possible if call patterns shift | Yes — requires active management |


Cost Considerations: Hard Costs and Soft Costs

 

Cost comparisons between these three technologies are often incomplete when they focus only on vendor licensing fees. Agencies should evaluate both hard costs (direct expenditures) and soft costs (staff time, risk exposure, and operational burden) across the full lifecycle of a system.

 

Hard Costs by Technology

| Cost Category | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| Implementation / setup | Typically the lowest of the three; varies by call flow complexity and vendor; agencies should request itemized quotes | Higher than touch-tone due to grammar and intent development; agencies should request quotes based on the number of intents and languages required | Typically the highest of the three; scope includes integration, prompt engineering, RAG knowledge base build, and testing; agencies should request fully itemized quotes |
| Platform / hosting (annual) | Typically the lowest; varies by vendor, call volume, and telephony platform; request multi-year pricing | Generally higher than touch-tone due to ASR/NLU platform licensing; varies by vendor and call volume | Typically the highest and least predictable; includes both platform licensing and variable LLM API usage charges; total annual cost depends on call volume and average conversation length |
| LLM usage / API costs | None | None | Variable — charged per token or per interaction; fluctuates with call volume and conversation length; see callout below |
| RAG knowledge base build | Not applicable | Not applicable | One-time build cost varies with document volume and complexity; ongoing maintenance is a separate cost item |
| Ongoing tuning / optimization | Low — prompt re-recording only | Moderate — grammar updates, intent review | High — continuous prompt refinement, RAG maintenance, model evaluation |
| Vendor support / SLA | Included or low add-on cost | Moderate — grammar support, ASR tuning | Higher — model updates, compliance review, escalation paths |
| Integration costs (back-end) | Moderate (database lookup) | Moderate | Moderate to high (data grounding, RAG integration, guardrail testing) |

⚠  A Note on Conversational AI Usage Costs

Conversational AI platforms typically charge based on the number of tokens (units of text) processed per interaction. A single phone call involving several exchanges can consume thousands of tokens. Pricing varies by vendor and model, and the market continues to evolve.

 

Unlike fixed annual licensing fees, API usage costs fluctuate with call volume and conversation length. Agencies should request written cost projections from vendors based on their actual call volumes, modeled across realistic low, average, and peak scenarios, before committing to a conversational AI deployment.

 

Agencies should also ask whether RAG retrieval operations carry separate per-query charges, as some platforms bill retrieval and generation independently.
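A back-of-envelope model illustrates why usage-based pricing behaves so differently from fixed licensing. Every number below (tokens per turn, price per 1,000 tokens, call volumes) is a placeholder assumption; agencies should substitute figures from actual vendor quotes.

```python
# Toy cost model for usage-based LLM pricing: cost scales linearly with
# call volume, turns per call, and tokens per turn.
def monthly_llm_cost(calls: int, turns_per_call: int,
                     tokens_per_turn: int, price_per_1k_tokens: float) -> float:
    total_tokens = calls * turns_per_call * tokens_per_turn
    return total_tokens / 1000 * price_per_1k_tokens

# Hypothetical inputs: 10,000 calls/month, 4 exchanges per call,
# 800 tokens per exchange, $0.01 per 1,000 tokens.
base = monthly_llm_cost(calls=10_000, turns_per_call=4,
                        tokens_per_turn=800, price_per_1k_tokens=0.01)
peak = monthly_llm_cost(calls=15_000, turns_per_call=4,
                        tokens_per_turn=800, price_per_1k_tokens=0.01)
print(f"base: ${base:,.2f}   peak (+50% volume): ${peak:,.2f}")
```

The point of the exercise is the shape, not the dollar figures: a 50% spike in call volume produces a 50% spike in the bill, and longer conversations multiply that again, which is why modeling low, average, and peak scenarios in writing matters before signing.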


Soft Costs by Technology

Soft costs are often overlooked in procurement decisions but can equal or exceed hard costs over a system’s lifecycle.


| Soft Cost Category | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| Staff time to manage system | Low — changes are infrequent and simple | Moderate — grammar updates, recognition review | High — continuous prompt management, RAG knowledge base maintenance, monitoring, review of AI outputs |
| Staff training | Minimal | Moderate | Moderate to high; staff need skills to manage AI behavior and evaluate output quality |
| Monitoring burden | Low — review logs periodically | Moderate — review misrecognition patterns | Ongoing — AI outputs must be reviewed regularly to catch errors before they affect the public |
| Risk of public misinformation | Very low (scripted content only) | Low (bounded responses) | Moderate — generated responses can contain errors; in regulated or legally sensitive contexts, consequences may include public harm or liability exposure |
| Incident response overhead | Low | Low to moderate | Moderate to high — AI errors may require immediate intervention and public correction |
| Vendor lock-in risk | Low | Moderate | High — changing LLM providers may require significant re-implementation of prompts, RAG architecture, and integrations |
| Budget predictability | High | High | Low — usage-based pricing creates variable monthly costs |


Governance and Compliance Considerations

 

Government agencies operate under legal and ethical obligations that differ from private-sector organizations. Technology decisions must account for these constraints regardless of cost or capability.

| Governance Factor | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| ADA / accessibility compliance | Well-established compliance path | Established; some accommodations needed | Requires specific design; may need parallel access path |
| Language access (LEP callers) | Pre-recorded multilingual menus | Per-language grammar development required | Strong multilingual capability; requires testing per language |
| Records retention compliance | Straightforward (call logs) | Moderate (intent + audio logs) | More complex (conversation transcripts, RAG traces, model version tracking) |
| Avoidance of out-of-scope guidance | Guaranteed by scripted content | Guaranteed within grammar | Must be engineered via RAG constraints and actively maintained |
| Public disclosure / transparency | Low complexity | Low to moderate | Higher — AI involvement should be disclosed per emerging standards; some jurisdictions have enacted or are considering requirements |
| Audit / discovery readiness | Simple | Moderate | More complex; logs include generated text and RAG retrieval records that may be scrutinized |
| Procurement / approval process | Standard IT procurement | Standard IT procurement | May require additional legal, ethics, or policy review in some jurisdictions |

 

Deployment Patterns Government Agencies Are Using Today

 

Agencies adopting newer technologies rarely replace existing systems entirely. The following patterns represent how government organizations are combining these technologies in practice.


Pattern 1: Touch-Tone Only

Many agencies continue to operate purely touch-tone IVR for phone self-service. This remains appropriate for agencies with high call volumes of routine inquiry types, limited IT or vendor support capacity, and stable, predictable inquiry patterns. Operational costs are well-understood and manageable.


Pattern 2: Speech Recognition IVR

Agencies with more diverse caller populations or frequent “zero-out” problems have adopted speech recognition IVR as an upgrade. The transition improves caller experience while maintaining controlled, auditable call flows. This approach is well-supported by established vendors with government experience.


Pattern 3: Hybrid (IVR + Conversational AI Layer)

Some agencies are deploying conversational AI for a defined subset of inquiry types - typically procedural or informational questions - while retaining touch-tone or speech IVR for structured transactions like payment processing or account lookups. This approach limits the scope of AI exposure and contains cost and risk. The RAG knowledge base in these deployments is typically scoped to the specific inquiry types handled by the AI layer.
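The hybrid routing decision in this pattern is itself a deterministic rule, which is part of what contains the risk. A sketch, with illustrative intent names that stand in for whatever the agency's front-end IVR actually recognizes:

```python
# Hybrid pattern sketch: structured transactions stay on the deterministic
# IVR path, while scoped informational questions go to the AI layer.
STRUCTURED_INTENTS = {"make_payment", "account_lookup"}
AI_SCOPED_INTENTS = {"procedural_question", "general_information"}

def route(intent: str) -> str:
    if intent in STRUCTURED_INTENTS:
        return "ivr_call_flow"          # scripted, auditable transaction
    if intent in AI_SCOPED_INTENTS:
        return "conversational_ai_layer"  # RAG-grounded, monitored
    return "live_operator"              # anything unrecognized escalates

print(route("make_payment"))         # → ivr_call_flow
print(route("procedural_question"))  # → conversational_ai_layer
```

Because the boundary between the two paths is an explicit allowlist rather than a model decision, the agency can state precisely which inquiry types are ever exposed to generated output.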


Pattern 4: Web Chatbot (Not Phone)

A number of agencies have deployed conversational AI in text-based web chat rather than voice. Transcripts are easier to review, the interaction is less time-pressured, and RAG knowledge base gaps are easier to identify in written logs. This lets agencies gain operational experience with the RAG and NLG layers at lower risk before extending the system to phone self-service, where the ASR layer is added.

 

Questions Agency Leaders Should Ask Before Selecting a Solution

 

About Costs

  • What is the total cost of ownership over 3-5 years, including implementation, annual platform fees, LLM usage costs, RAG build and maintenance, integration, and staff time?

  • For conversational AI: what is the projected monthly cost at our actual call volume, and how does that cost scale if call volume increases by 20%? By 50%?

  • Are LLM usage costs and RAG retrieval costs billed separately? Are either capped or variable?

  • What is the exit cost if we need to change vendors?

 

About Operations

  • What staff time is required to manage and maintain this system on an ongoing basis - including RAG knowledge base updates?

  • Who in our organization will own the RAG knowledge base, and how will policy changes be reflected in it?

  • Who will monitor AI output quality, and how often?

  • What happens when the system produces an incorrect response? What is the correction process?

  • What is the vendor’s SLA for uptime, and what are remedies if the system is unavailable?

 

About Architecture

  • How does the system use RAG? What document types and formats does it support?

  • How are ASR accuracy rates measured, and what are the benchmarks for the languages and caller populations we serve?

  • How does the system detect when the RAG knowledge base does not contain a clear answer, and what does it do in that case?

  • How are NLG outputs monitored for accuracy and appropriateness?

 

About Governance and Risk

  • How does the system prevent callers from receiving guidance outside the agency’s intended scope?

  • How are conversations logged, retained, and made available for audit or discovery - including RAG retrieval records?

  • Has the system been tested with callers representing our actual population (language diversity, accent variation, disability access)?

  • What disclosure will callers receive that they are interacting with an automated AI system?

  • Does our jurisdiction have any pending or enacted policies governing AI use in government agencies?

 

Summary

 

No single technology is universally appropriate for all government agencies. Touch-tone IVR remains reliable, low-cost, and easy to manage for agencies handling well-defined inquiry types. Speech recognition IVR offers improved caller experience with moderate additional cost and complexity. Conversational AI offers the most flexibility but introduces meaningful new costs - including the ongoing burden of RAG knowledge base maintenance and NLG output monitoring - along with operational requirements and governance obligations that are not always visible at the point of procurement.

 

The table below summarizes the overall profile of each technology.


| Factor | Touch-Tone IVR | Speech Recognition IVR | Conversational AI |
|---|---|---|---|
| Overall cost (hard + soft) | Low | Moderate | High, and variable |
| Cost predictability | High | High | Low — usage-based pricing |
| Operational burden | Low | Moderate | High |
| Staffing requirement | Low | Moderate | Moderate to high |
| Knowledge maintenance | None | Grammar updates | Ongoing RAG knowledge base ownership required |
| Output risk | Very low (scripted) | Low (bounded) | Moderate — requires active NLG monitoring |
| Public trust / accountability | Low risk | Low risk | Requires deliberate governance design |
| Caller experience | Functional but rigid | More flexible | Most flexible |
| Best suited for | Agencies with defined, high-volume inquiry types and limited ongoing IT capacity | Agencies seeking improved caller experience with controlled risk | Agencies with sufficient staff capacity to manage RAG, monitor NLG outputs, and sustain ongoing governance |

Disclaimer

This brief is intended as a factual reference to support deliberation by state and local government leadership. It does not constitute a recommendation to adopt or avoid any specific technology. Agencies should consult with their IT departments, legal counsel, procurement officers, and, where applicable, state oversight bodies before making technology decisions.

 

Relevant frameworks and guidance are available from the National Center for State Courts (ncsc.org), the National Association of State Chief Information Officers (nascio.org), and the Conference of State Court Administrators (cosca.ncsc.org).

Government Technology Brief  -  Prepared for State and Local Government Leaders

 
 
 
