Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
- Minjie Zhu1,*
- Yichen Zhu2,*
- Jinming Li3
- Junjie Wen1
- Zhiyuan Xu2
- Zhengping Che2
- Chaomin Shen1
- Yaxin Peng3
- Dong Liu2
- Feifei Feng2
- Jian Tang2
1 School of Computer Science, East China Normal University, China
2 Midea Group, China
3 Department of Mathematics, School of Science, Shanghai University, China
* Equal contribution. This work was done during the internships of Minjie Zhu, Jinming Li, and Junjie Wen at Midea Group.
Abstract
Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place to tasks that require intent recognition and visual reasoning. Inspired by the dual-process theory in cognitive science, which posits two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture by classifying tasks and routing each decision to one of two systems based on the instruction type. RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated for the current user instruction, and 2) a slow-thinking system composed of a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize the user's intention or perform reasoning tasks. To assess our method, we built a dataset of real-world trajectories that covers actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, both in simulation and in real-world scenarios, confirm that our approach handles intricate tasks that demand intent recognition and reasoning.
Framework
Dual-process model research indicates that individuals engage with decisions in two primary ways: a rapid, instinctive, subconscious manner (referred to as “System 1” or “fast thinking”) and a measured, deliberate, conscious manner (“System 2” or “slow thinking”). Based on this theory, we propose a framework that mimics human cognitive architecture by classifying tasks and routing each decision to one of two systems based on the instruction type. Upon receiving an instruction, the robot processes it through DistilRoBERTa to obtain an embedding. Using embedding similarity search, we then assign the instruction to either the fast-thinking system or the slow-thinking system. The framework is shown below.
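The routing step described above can be illustrated with a minimal sketch. It is not the RFST implementation: the `embed` function is a toy bag-of-words stand-in for the DistilRoBERTa embedding, and the instruction bank, its labels, and the names `BANK`, `cosine`, and `route` are all hypothetical, chosen only to show how nearest-neighbor similarity search can assign an instruction to a system.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding (assumption, NOT DistilRoBERTa).

    Returns word counts, which act as a sparse vector.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bank of reference instructions, each labeled with a system.
BANK = [
    ("put the red block into the green bowl", "fast"),
    ("pick up the apple and place it on the plate", "fast"),
    ("i am thirsty, bring me something to drink", "slow"),
    ("place the block showing the answer to 3 plus 4", "slow"),
]


def route(instruction):
    """Return the label of the most similar reference instruction."""
    query = embed(instruction)
    _, label = max(BANK, key=lambda item: cosine(query, embed(item[0])))
    return label


if __name__ == "__main__":
    print(route("put the blue block into the red bowl"))
    print(route("i am hungry, bring me something to eat"))
```

In a real system, the stand-in `embed` would be replaced by the sentence encoder, and the reference embeddings would be computed once and cached rather than recomputed for every query.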
Experiments
We empirically assess the broad applicability of RFST across diverse tasks in both simulated and real-world settings.
Experiments in simulation
Success rates on VIMA-Bench over six tasks. Tasks 1 and 2 belong to the fast-thinking system, and Tasks 3-6 belong to the slow-thinking system. RFST achieves notably higher success rates than the other methods on the slow-thinking tasks.
Experiments in the real world
Experiments on the real robot. Orange bars: slow-thinking tasks. Blue bars: fast-thinking tasks. RFST enables real robots to execute complex tasks such as mathematical reasoning and intent recognition, which have been beyond the scope of conventional robotic manipulation techniques.