Advanced AI Models Struggle with Complex Planning, Study Reveals Critical Weaknesses

October 21, 2024
  • A recent study highlights that advanced AI language models, such as OpenAI's o1-preview, face significant challenges with complex planning tasks.

  • Conducted by researchers from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, the study utilized two benchmarks: BlocksWorld and TravelPlanner.

  • In the BlocksWorld benchmark, models like o1-mini and o1-preview excelled, achieving nearly 100% accuracy, while most other models fell below 50%.

  • In the more intricate TravelPlanner benchmark, however, all models struggled: GPT-4o managed only a 7.8% success rate, and o1-preview reached 15.6%.

  • Other models, including GPT-4o-Mini, Llama3.1, and Qwen2, scored between 0 and 2.2%, revealing a stark contrast to human-level planning capabilities.

  • The researchers pinpointed two critical weaknesses in AI planning: the models integrate rules and conditions inadequately, and they tend to lose focus on the original problem as plans grow longer.

  • To tackle these issues, the researchers tested two strategies. The first, parametric memory updates, increased the task's influence on planning but did not stop effectiveness from declining as plans grew longer.

  • The second, episodic memory updates, improved the models' understanding of constraints but did not lead them to consider individual rules thoroughly.

  • Both strategies showed some promise, but neither resolved the core challenges in AI planning.

  • The team used a technique known as 'Permutation Feature Importance' to evaluate how much each input component affected the planning process.

  • The code and data from this comprehensive study will soon be accessible on GitHub, allowing for further exploration and development.
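
For readers unfamiliar with Permutation Feature Importance, the general idea can be sketched as follows: shuffle one input feature at a time and measure how much the model's error grows. This is a generic, minimal illustration on a toy numeric model, not the study's code; how the researchers applied the technique to language-model inputs is not described in this summary. The names `permutation_importance` and `toy_predict` are hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Return, per feature, the mean increase in squared error after shuffling it."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline = mse(rows)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving every other feature untouched.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(permuted) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model: relies heavily on feature 0, weakly on feature 1,
# and ignores feature 2 entirely.
def toy_predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_predict(r) for r in rows]

scores = permutation_importance(toy_predict, rows, targets)
```

In this toy setup, the score for feature 0 should be the largest and the score for the ignored feature 2 should be zero, since shuffling an input the model never reads cannot change its predictions.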

Summary based on 1 source



Source

Even OpenAI's o1-preview fails at travel planning
