Artificial intelligence has advanced rapidly in recent years, with large language models (LLMs) such as GPT-3 demonstrating impressive natural language capabilities and models like DALL-E 2 doing the same for image generation. This has led to enthusiasm that LLMs may also excel at reasoning tasks like planning, logic, and arithmetic. However, a new study casts doubt on LLMs' ability to reliably self-critique and iteratively improve their reasoning, specifically in the context of AI planning.
In the paper "Can Large Language Models Improve by Self-Critiquing Their Own Plans?" by researchers at Arizona State University, the authors systematically tested whether having an LLM critique its candidate solutions enhances its planning abilities. Their results reveal limitations in using LLMs for self-verification in planning tasks.
Understanding AI Planning
To appreciate the study's findings, let's first understand the AI planning problem. In classical planning, the system is given:
- A domain describing the predicates and actions
- An initial state
- A goal state
The aim is to find a sequence of actions (a plan) that transforms the initial state into the goal state when executed. For example, in a Blocks World domain, the actions may involve picking up, putting down, stacking, or unstacking blocks.
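To make this concrete, here is a minimal sketch in Python of how a Blocks World planning problem might be represented. The predicate and action names are illustrative only; they are not taken from the paper or from any particular planning library.

```python
# Illustrative encoding of a classical planning task: a domain's
# predicates, an initial state, a goal state, and a candidate plan.
# All names here are hypothetical, chosen only to show the structure.

initial_state = {
    ("on-table", "A"),   # block A sits on the table
    ("on", "B", "A"),    # block B is stacked on A
    ("clear", "B"),      # nothing is on top of B
    ("arm-empty",),      # the robot arm is not holding anything
}

goal_state = {
    ("on", "A", "B"),    # we want A stacked on B
}

# A plan is simply an ordered sequence of actions.
candidate_plan = [
    ("unstack", "B", "A"),
    ("put-down", "B"),
    ("pick-up", "A"),
    ("stack", "A", "B"),
]
```

A plan is valid only if every action's preconditions hold when it is executed and the goal conditions hold in the final state, which is exactly what a verifier must check.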
The Study Methodology
The researchers created a planning system with two components:
- A generator LLM that proposes candidate plans
- A verifier LLM that checks if the plan achieves the goals
Both roles used the same model, GPT-4. If the verifier found the plan invalid, it would give feedback to prompt the generator to create a new candidate plan. This iterative process continued until the verifier approved a plan or a limit was reached.
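A rough sketch of that iterative loop is below. The two callables stand in for GPT-4 wrappers; their names, signatures, and the iteration cap are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the LLM+LLM backprompting loop described above.
MAX_ROUNDS = 15  # assumed cutoff; the paper also limits the number of rounds


def plan_with_self_critique(problem, generate_plan, critique_plan):
    """generate_plan(problem, feedback) -> plan
    critique_plan(problem, plan) -> (is_valid, feedback)
    Both are hypothetical wrappers around LLM API calls."""
    feedback = None
    for _ in range(MAX_ROUNDS):
        # The generator LLM proposes a plan, conditioned on the verifier's
        # feedback from the previous round (if any).
        plan = generate_plan(problem, feedback)

        # The verifier LLM judges the plan and describes any flaws it sees.
        is_valid, feedback = critique_plan(problem, plan)
        if is_valid:
            return plan  # the verifier approved this plan

    return None  # no plan approved within the round limit
```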
The team compared this LLM+LLM system against two baselines:
- LLM + External Verifier: GPT-4 generates plans that are checked by VAL, an established external plan validation tool.
- LLM alone: GPT-4 generates plans without critiquing or feedback.
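For the external-verification baseline, VAL takes a PDDL domain, a problem, and a plan file and reports whether the plan is valid. Below is a hedged sketch of invoking it from Python, assuming a locally built VAL binary named `Validate` is on the PATH; the exact output format depends on your VAL build.

```python
import subprocess

# Illustrative call to the VAL plan validator. Treat this as a sketch
# rather than a drop-in integration: binary name and output parsing
# may differ for your local build.


def validate_with_val(domain_file, problem_file, plan_file):
    result = subprocess.run(
        ["Validate", domain_file, problem_file, plan_file],
        capture_output=True,
        text=True,
    )
    # VAL prints a success message for valid plans and error details
    # (e.g., unsatisfied preconditions) for invalid ones.
    return "Plan valid" in result.stdout, result.stdout
```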
Self-Critiquing Underperforms External Verification
On a classical Blocks World benchmark, the LLM+LLM system solved 55% of problems correctly. The LLM+VAL system scored significantly higher, at 88% accuracy, while the LLM-only method trailed at 40%. In other words, self-critiquing offered only a modest gain over no feedback at all and fell well short of sound external verification. The researchers attribute the underperformance mainly to the LLM verifier's poor detection of invalid plans.
High False Positive Rate from LLM Verifier
Analysis revealed the LLM verifier incorrectly approved 38 invalid plans as valid, a 54% false positive rate that shows the verifier cannot reliably determine plan correctness. Flawed verification compromises the system's trustworthiness for planning applications where safety is paramount. In contrast, the external verifier VAL assessed plan validity exactly, underscoring the importance of sound, logical verification over LLM self-critiquing.
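To be clear about what that rate measures: it is the fraction of genuinely invalid plans that the verifier nonetheless approved. The small calculation below illustrates the math; the 38 wrongly approved plans come from the study, while the total count of invalid plans is a hypothetical figure chosen only to reproduce a roughly 54% rate.

```python
# False positive rate for a plan verifier: the share of genuinely
# invalid plans that the verifier wrongly labels as valid.

false_positives = 38        # invalid plans approved as valid (from the study)
total_invalid_plans = 70    # HYPOTHETICAL count, used only to illustrate the math

false_positive_rate = false_positives / total_invalid_plans
print(f"False positive rate: {false_positive_rate:.0%}")  # ~54% with these numbers
```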
Feedback Granularity Didn't Improve Performance
The researchers also tested whether more detailed feedback on invalid plans helps the LLM generator create better subsequent plans. However, binary feedback indicating only plan validity was as effective as highlighting specific plan flaws.
This suggests the LLM verifier's core limitation lies in the binary validity judgment itself rather than in feedback depth. Even if the verifier produced perfect critiques of invalid plans, it would still need to identify flawed plans correctly in the first place, and that is exactly where it falls short.
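To illustrate the difference between the two feedback conditions, here is a sketch of what binary versus detailed backprompts might look like. The wording is invented for illustration and is not quoted from the paper's prompts.

```python
# Illustrative backprompt templates for the two feedback conditions.
# The phrasing is hypothetical, not taken from the paper.

binary_feedback = (
    "The plan you proposed is invalid. Please provide a revised plan."
)

detailed_feedback = (
    "The plan you proposed is invalid. Step 3 (stack A on B) fails because "
    "block B is not clear at that point: block C is still on top of it. "
    "Please provide a revised plan."
)
```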
The Future of AI Planning Systems
This research provides valuable evidence that LLM self-critiquing alone may be insufficient for reasoning reliably about plan validity. Hybrid systems combining neural generation with logical verification seem most promising. The authors conclude, "Our systematic investigation offers compelling preliminary evidence to question the efficacy of LLMs as verifiers for planning tasks within an iterative, self-critiquing framework."
The study focused on planning, but the lessons likely extend to other reasoning domains like mathematics, logic, and game strategy. We should temper our expectations about unaided LLMs successfully self-reflecting on such complex cognitive tasks.
How CPROMPT Can Help
At CPROMPT.AI, we follow research on LLM reasoning and self-verification as we build a platform enabling everyone to create AI apps. While LLMs are exceptionally capable at language tasks, researchers are still exploring how best to integrate them into robust reasoning systems. Studies like this one provide valuable guidance as we develop CPROMPT.AI's capabilities.
If you're eager to start building AI apps today using prompts and APIs, visit CPROMPT.AI to get started for free! Our user-friendly interface allows anyone to turn AI prompts into customized web apps in minutes.
FAQ
Q: What are the critical limitations discovered about LLMs self-critiquing their plans?
The main limitations were a high false positive rate in identifying invalid plans and a failure to match planning systems that use external logical verification. In addition, providing more detailed feedback did not meaningfully improve the LLM's planning performance.
Q: What is AI planning, and what does it aim to achieve?
AI planning automatically generates a sequence of actions (a plan) to reach a desired goal from an initial state. It is a classic reasoning task in AI.
Q: What methods did the researchers use to evaluate LLM self-critiquing abilities?
They compared an LLM+LLM planning system against an LLM+external verifier system and an LLM-only system. This assessed both the LLM's ability to self-critique and the impact of self-critiquing on its planning performance.
Q: Why is self-critiquing considered difficult for large language models?
LLMs are trained mainly to generate language rather than formally reason about logic. Self-critiquing requires assessing if complex plans satisfy rules, preconditions, and goals, which may be challenging for LLMs' current capabilities.
Q: How could LLMs meaningfully improve at critiquing their plans?
Potential ways could be combining self-supervision with logic-driven verification, training explicitly on plan verification data, and drawing lessons from prior AI planning research to inform LLM development.
Glossary
- AI planning: Automatically determining a series of actions to achieve a goal
- Classical planning: Planning problems with predefined actions, initial states, and goal states
- Verifier: Component that checks if a candidate plan achieves the desired goals
- False positive: Incorrect classification of an invalid plan as valid