Advanced AI Models Struggle with Complex Planning, Study Reveals Critical Weaknesses
October 21, 2024

A recent study highlights that advanced AI language models, such as OpenAI's o1-preview, face significant challenges with complex planning tasks.
Conducted by researchers from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, the study utilized two benchmarks: BlocksWorld and TravelPlanner.
In the BlocksWorld benchmark, models like o1-mini and o1-preview excelled, achieving nearly 100% accuracy, while most other models fell below 50%.
However, in the more intricate TravelPlanner scenario, all models struggled: GPT-4o managed only a 7.8% success rate, while o1-preview did roughly twice as well at 15.6%.
Other models, including GPT-4o-Mini, Llama3.1, and Qwen2, scored between 0 and 2.2%, revealing a stark contrast to human-level planning capabilities.
The researchers pinpointed two critical weaknesses in AI planning: inadequate integration of rules and conditions, and a tendency to lose focus on the original problem as planning time extends.
To probe these failures, the team employed a technique known as permutation feature importance to evaluate how different input components affected the planning process. To address the first weakness, episodic memory updates were tested; these improved the models' understanding of constraints but did not lead them to consider individual rules thoroughly. Parametric memory updates were also introduced to strengthen the task's influence on planning, yet they failed to prevent the decline in effectiveness over longer plans. Although both improvement strategies showed some promise, neither resolved the core challenges in AI planning.
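Permutation feature importance, the analysis technique mentioned above, is a standard model-agnostic method: shuffle one input feature at a time and measure how much the model's score drops. The sketch below is a minimal, generic illustration of the idea, not the study's actual implementation; the toy model, data, and scoring function are all hypothetical.

```python
import numpy as np

def permutation_importance(model_fn, X, y, score_fn, n_repeats=5, seed=0):
    """Estimate each feature's importance as the average drop in score
    when that feature's column is randomly shuffled."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model_fn(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and the target
            drops.append(baseline - score_fn(y, model_fn(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Toy demo: the target depends only on feature 0, so only that
# feature should show a large importance.
X = np.random.default_rng(1).normal(size=(200, 3))
y = X[:, 0]
model_fn = lambda X: X[:, 0]                     # "model" that reads feature 0
score_fn = lambda y, p: -np.mean((y - p) ** 2)   # negative mean squared error
imp = permutation_importance(model_fn, X, y, score_fn)
```

In the study's setting, the "features" would be components of the planning input (rules, constraints, the task description), letting the researchers quantify which parts the models actually attend to.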
The code and data from this comprehensive study will soon be accessible on GitHub, allowing for further exploration and development.
Source
THE DECODER • Oct 20, 2024
Even OpenAI's o1-preview fails at travel planning