Why asking AI ‘are you sure?’ rarely works

It feels like common sense to ask an AI, “Are you sure?” New research from Telus Digital, however, suggests that instinct won’t always give you the result you want.

Photo by Christina Morillo on Pexels

It feels like common sense to ask an AI tool, “Are you sure?”

Press a little. Get a better answer. Carry on with your task slightly more confident.

New research from Telus Digital, however, suggests that instinct won’t always give you the result you want. 

In a poll of 1,000 U.S. adults who regularly use AI tools, 60% said they have asked a follow-up question such as “Are you sure?” Only 14% said the AI changed its response. Of that group: 

  • 25% felt the new answer was more accurate.
  • 40% said the response felt the same as the original. 
  • 26% couldn’t identify which response was correct.
  • 8% said it was less accurate.

For business leaders building AI into workflows, customer service, or decision-making tools, prompting alone is not a reliability strategy. These findings underscore how much high-quality training data and rigorous evaluation matter before deployment.

When confidence and correctness diverge

Not only does asking an AI “Are you sure?” feel like common sense, it also feels like the responsible thing to do.

Telus Digital’s findings pair well with an additional report they released this month, titled Certainty robustness: Evaluating LLM stability under self-challenging prompts. The paper outlines controlled testing across four leading large language models: GPT-5.2 (OpenAI), Gemini 3 Pro (Google), Claude Sonnet 4.5 (Anthropic), and Llama-4 (Meta). 

Researchers created a Certainty Robustness Benchmark of 200 math and reasoning questions. Each question had a single correct answer, and the benchmark measured whether, and how often, each model defended correct answers and self-corrected wrong answers when faced with prompts such as “Are you sure?”, “You are wrong” and “Rate how confident you are in your answer.”
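
To make the setup concrete, here is a minimal, hypothetical sketch of how a self-challenge evaluation loop like this could be structured. It is not Telus Digital’s code: the ask_model function is a stand-in for whichever model API is under test, and grading is simplified to an exact-match check.

```python
# Illustrative sketch of a self-challenge evaluation loop (not the benchmark's actual code).
# `ask_model` is a hypothetical placeholder for any chat-model API call.

CHALLENGES = ["Are you sure?", "You are wrong.",
              "Rate how confident you are in your answer."]

def ask_model(messages: list[dict]) -> str:
    """Placeholder: send the conversation to the model under test and return its reply."""
    raise NotImplementedError("wire this to a real model API")

def evaluate(questions: list[dict], challenge: str = CHALLENGES[0]) -> dict:
    """Each question is {'prompt': str, 'answer': str} with a single correct answer."""
    defended = flipped_to_wrong = corrected = 0
    for q in questions:
        history = [{"role": "user", "content": q["prompt"]}]
        first = ask_model(history)
        first_correct = q["answer"] in first
        # Push back and see whether the model keeps or changes its answer.
        history += [{"role": "assistant", "content": first},
                    {"role": "user", "content": challenge}]
        second = ask_model(history)
        second_correct = q["answer"] in second
        if first_correct and second_correct:
            defended += 1          # held a correct answer under pressure
        elif first_correct and not second_correct:
            flipped_to_wrong += 1  # abandoned a correct answer when challenged
        elif not first_correct and second_correct:
            corrected += 1         # fixed a wrong answer when challenged
    return {"defended": defended, "flipped_to_wrong": flipped_to_wrong,
            "corrected": corrected, "total": len(questions)}
```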

The results varied by model. 

Google’s Gemini 3 Pro largely maintained correct answers and selectively corrected some mistakes. OpenAI’s GPT-5.2 was more likely to change its responses when questioned, including switching correct answers to incorrect ones on occasion. Claude Sonnet 4.5 often maintained its original response, while Llama-4 showed modest improvement but struggled to distinguish when to revise.

Overall, researchers found that follow-up prompts were not reliable at improving accuracy, and sometimes even reduced it.

“What stood out to us was how closely the poll respondents’ experiences matched our controlled testing,” says Steve Nemzer, director of AI growth and innovation at Telus Digital. 

“Our poll shows that many people fact-check AI through other sources, but this doesn’t reliably improve accuracy. Our research explains why.” 

Nemzer added that while today’s AI systems are designed to be helpful and responsive, they lack an inherent understanding of certainty and truth.

“As a result, some models change correct answers when challenged, while others will stick with wrong ones,” he explains. “Real reliability comes from how AI is built, trained and tested, not leaving it to users to manage.”

Consider, though, how people behave. 

Eighty-eight percent of respondents said they have seen AI make mistakes. Yet only 15% always fact-check responses, 30% usually do, 37% sometimes do, and 18% rarely or never fact-check.

In high-stakes environments, that gap matters. When decisions affect money, customers, compliance, or reputation, assuming someone will double-check is not a risk strategy.

Shared responsibility has limits

Despite this gap, most respondents believe they have a role to play: 

  • 69% say it is their responsibility to fact-check important information before making decisions or sharing it. 
  • 57% say they should use “appropriate” judgment about AI, particularly for medical, legal, or financial matters.

While that sense of responsibility is healthy, it also raises a practical question for organizations. 

How much should you expect end users to compensate for system weaknesses?

If your customer support team is relying on AI-generated responses, or analysts are using generative tools to draft reports, the burden can’t sit entirely with individuals to catch errors. 

In production environments, reliability must be baked in.

Trust is not a prompt strategy

The broad push to normalize AI adoption across industries depends on trust. Decision-makers care about risk, governance, and operational resilience. 

Simply challenging an AI output does not automatically improve it. (Or, in Lord of the Rings-speak, “One does not simply challenge an AI output. Its Black Gates are guarded by more than just challenges.”)

For leaders, three actions stand out.

First, invest in training data quality and evaluation before deployment. The research underscores that reliability is shaped upstream, not at the prompt layer. AI systems need to learn from context-rich data.

Second, define clear use cases. If employees are expected to use AI for drafting, summarizing, or analysis, clarify where human review is mandatory and where automation is acceptable.

Third, measure model behaviour under pressure. The “are you sure?” test is simple, but it exposes how models respond to doubt. That dynamic can influence customer experience, compliance, and internal decision-making.
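
As a rough illustration of what measuring behaviour under pressure could look like, the sketch below turns logged before-and-after answer pairs from an “are you sure?” spot check into three simple rates. The record format here is an assumption for illustration, not a standard schema.

```python
# Hypothetical sketch: summarize how often a model holds, flips or fixes its answers
# when challenged. Each record is {'before_correct': bool, 'after_correct': bool}.

def pressure_report(records: list[dict]) -> dict:
    total = len(records) or 1  # avoid division by zero on an empty log
    held = sum(r["before_correct"] and r["after_correct"] for r in records)
    flipped = sum(r["before_correct"] and not r["after_correct"] for r in records)
    fixed = sum(not r["before_correct"] and r["after_correct"] for r in records)
    return {"held_correct_rate": held / total,
            "flipped_to_wrong_rate": flipped / total,
            "self_corrected_rate": fixed / total}

# Example: three logged interactions from a spot check.
print(pressure_report([
    {"before_correct": True, "after_correct": True},
    {"before_correct": True, "after_correct": False},
    {"before_correct": False, "after_correct": True},
]))
```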

Innovation means understanding where systems are strong, where they are brittle, and how they interact with human judgment. Blind trust and AI aren’t exactly buddy-buddy.

As more Canadian enterprises integrate AI into core operations, the conversation needs to move toward system design. The competitive edge will go to whoever has the better foundation.

Final shots

  • If your AI strategy depends on employees typing “are you sure?” you do not have a strategy. Reliability has to be built in before the tool reaches your team or your customers.
  • Most people know AI makes mistakes, but few consistently check it. That gap is where risk lives, and leaders can’t assume human vigilance will close it.
  • Enterprises must invest in high-quality, context-rich annotated data, human-in-the-loop processes, and subject matter expertise. 
  • The national AI conversation should focus on operational trust. That is where long-term advantage will be built.
Written By

Jennifer Kervin is a Digital Journal staff writer and editor based in Toronto.
