The rise of large language models (LLMs) like ChatGPT has sparked speculation about whether AI is on the cusp of human-like reasoning abilities. But while these chatbots can hold fluent conversations and generate coherent text, some AI researchers argue their reasoning capabilities are an illusion.
In a new paper, researchers at Arizona State University systematically probe whether iterative prompting - repeatedly asking an LLM to critique and refine its answers - can improve performance on logical reasoning tasks. Their results reveal limitations in LLMs' ability to critique their own solutions.
Down the Rabbit Hole
The researchers chose graph coloring, a classic AI reasoning challenge, to test iterative prompting. The goal is to assign a color to each node in a graph so that no two adjacent nodes share the same color. It's the kind of constraint satisfaction problem humans solve using logical deduction.
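To make the setup concrete, here is a minimal sketch in Python of what checking a candidate coloring involves. This is our illustration, not the paper's code, and the graph and coloring formats are assumptions for the example:

```python
from typing import Dict, List

def coloring_errors(graph: Dict[str, List[str]],
                    coloring: Dict[str, str]) -> List[str]:
    """Return every violated edge constraint; an empty list means valid."""
    errors = []
    for node, neighbors in graph.items():
        for neighbor in neighbors:
            # Check each undirected edge once.
            if node < neighbor and coloring.get(node) == coloring.get(neighbor):
                errors.append(f"{node} and {neighbor} share color {coloring.get(node)}")
    return errors

# A triangle needs three distinct colors, so this coloring fails.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(coloring_errors(triangle, {"a": "red", "b": "green", "c": "red"}))
# -> ['a and c share color red']
```

Note the asymmetry: checking a coloring is mechanical, but finding one takes real deduction - which is exactly what makes the task a good probe.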
First, the researchers directly prompted the LLM GPT-4 with 100 random graph coloring problems. Without any feedback or iteration, GPT-4 solved only 16 of the 100 instances correctly. No big surprise there.
Things got more interesting when the researchers had GPT-4 critique its own solutions. The idea was that GPT-4 could propose a coloring, check it for errors, and then refine its solution based on those errors. But instead of improving, performance plummeted - GPT-4 solved just 1 out of 100 problems this way.
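In rough pseudocode, the self-critique setup looks like this. The prompt wording and the round budget are our assumptions, and `llm` stands in for any function that sends a prompt to the model and returns its reply:

```python
from typing import Callable

def self_critique_loop(llm: Callable[[str], str], problem: str,
                       max_rounds: int = 15) -> str:
    """GPT-4 proposes, critiques, and refines with no outside help."""
    solution = llm(f"Color this graph so no adjacent nodes match:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Problem:\n{problem}\nProposed coloring:\n{solution}\n"
                       "List any constraint violations, or reply VALID.")
        if "VALID" in critique:
            break  # the model trusts its own verdict - right or wrong
        solution = llm(f"Problem:\n{problem}\nReported errors:\n{critique}\n"
                       "Give a corrected coloring.")
    return solution
```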
Why the failure? The researchers found GPT-4 simply cannot verify its solutions accurately. Even when it generated a correct coloring, it would falsely claim errors and undo the valid solution. Like Alice falling down the rabbit hole, iterative self-critiquing led the LLM astray.
External Feedback
Next, the researchers had an external program - one guaranteed to be correct - double-check GPT-4's solutions. This time, iterative prompting did improve performance, to around 40% accuracy.
But here's the twist - the researchers found the feedback content didn't matter. Just telling GPT-4 "try again" worked nearly as well as providing specific error details. This suggests the performance gains come simply from making more attempts, not from GPT-4 internalizing logical critiques.
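Sketched in the same style, the externally verified loop reuses coloring_errors from above. The parse_coloring helper and the prompt wording are again our own assumptions; the detailed flag toggles between the two feedback conditions the paper compared:

```python
def parse_coloring(text: str) -> dict:
    """Hypothetical helper: parse replies formatted as one 'node: color' per line."""
    coloring = {}
    for line in text.splitlines():
        if ":" in line:
            node, color = line.split(":", 1)
            coloring[node.strip()] = color.strip()
    return coloring

def verified_loop(llm, graph, problem, detailed=True, max_rounds=15):
    """Iterate with an exact external checker deciding success."""
    solution = llm(f"Color this graph:\n{problem}")
    for _ in range(max_rounds):
        errors = coloring_errors(graph, parse_coloring(solution))
        if not errors:
            return solution  # the checker, not GPT-4, declares victory
        # The twist: a bare "try again" performs about as well as real error details.
        feedback = "; ".join(errors) if detailed else "That coloring is wrong. Try again."
        solution = llm(f"{problem}\nYour answer:\n{solution}\n"
                       f"Feedback: {feedback}\nGive a new coloring.")
    return None
```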
To confirm this, the researchers had GPT-4 generate multiple candidate solutions in one shot, with no feedback loop or verification between attempts. Given 15 tries per problem, GPT-4 again solved about 40% of the instances - matching the iterative approach.
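That baseline is even simpler to express - no conversation at all, just independent samples with the checker picking a winner (again reusing the helpers above):

```python
def best_of_n(llm, graph, problem, n=15):
    """Sample n independent answers; keep the first one the checker accepts."""
    for _ in range(n):
        candidate = llm(f"Color this graph:\n{problem}")  # fresh sample each call
        if not coloring_errors(graph, parse_coloring(candidate)):
            return candidate
    return None
```

If best_of_n matches verified_loop, the "dialogue" was never doing the work - the extra attempts were.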
Off With Their Heads!
Based on these experiments, the researchers conclude today's LLMs have limited ability to critically analyze and refine their own reasoning:
"Our results thus call into question claims about the self-critiquing capabilities of state of the art LLMs."
Their findings contribute to a growing debate about overselling the current capabilities of large language models. Unbridled enthusiasm risks misleading the public and policymakers about how far we are from genuine artificial general intelligence.
We are still in the early days of a technology that could ultimately transform society. But separating hype from reality is critical as we chart a responsible path forward. Like Alice's journey through Wonderland, discovering the limits of today's AI systems can help us tell what is real from what is illusion.
The Looking Glass
So, where does this leave us? Here are the key facts and implications of this research:
- LLMs cannot accurately evaluate their own solutions to reasoning problems. Self-critiquing does not improve performance - in these experiments, it made performance worse.
- External feedback from a correct program enables iterative prompting gains, but the content of the feedback may not matter much.
- Multiple independent tries, with no feedback loop at all, work nearly as well as iterative prompting. This suggests the performance gains come simply from having more attempts.
The takeaway? We should be cautious about claims that LLMs can refine solutions through logical self-reflection. Their reasoning capabilities remain quite brittle and limited.
Rather than obsessing over whether models are "truly" reasoning, we should focus evaluation on how they perform on the social priorities that matter - education, healthcare, science, and governance - moving beyond narrow AI benchmarks toward real-world usefulness.
There's no magic shortcut to artificial general intelligence. But models like ChatGPT hint at the transformative potential of AI while revealing the challenges ahead. With eyes open to both the possibilities and the limitations, we can judiciously guide this emerging technology toward serving humanity.
Try building your own AI assistant on CPROMPT.AI. CPROMPT democratizes AI by letting anyone generate prompt applications through a simple web interface - no coding required. Join the community reshaping the future of AI.