Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

Minjie Zhu¹,*

Yichen Zhu²,*

Jinming Li³

Junjie Wen¹

Zhiyuan Xu²

Zhengping Che²

Chaomin Shen¹

Yaxin Peng³

Dong Liu²

Feifei Feng²

Jian Tang²

¹ School of Computer Science, East China Normal University, China

² Midea Group, China

³ Department of Mathematics, School of Science, Shanghai University, China

* Equal contribution. This work was done during Minjie Zhu, Jinming Li, and Junjie Wen’s internship at Midea Group.


Abstract

Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place to tasks that require intent recognition and visual reasoning. Inspired by the dual-process theory in cognitive science, which posits two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and make decisions with one of two systems based on the instruction type. RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated based on the current user instruction, and 2) a slow-thinking system composed of a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize the user's intention or perform reasoning tasks. To assess our methodology, we built a dataset of real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, both in simulation and in real-world scenarios, confirm that our approach adeptly handles intricate tasks that demand intent recognition and reasoning.

Framework

Dual-process research indicates that individuals make decisions in two primary ways: a rapid, instinctive, subconscious manner (referred to as "System 1" or fast thinking) and a measured, deliberate, conscious manner ("System 2" or slow thinking). Based on this theory, we propose a framework that mimics human cognitive architecture to classify tasks and make decisions with one of two systems based on the instruction type. Upon receiving an instruction, the robot processes it through DistilRoBERTa to obtain an embedding. Using embedding similarity search, we classify the instruction as belonging to either the fast-thinking system or the slow-thinking system. The framework is shown in the figure below.
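The routing step described above (embed the instruction, then classify it by similarity to known examples) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the RFST implementation: `toy_embed` is a hashed bag-of-words stand-in for the DistilRoBERTa encoder so that the example runs without model weights, the `EXEMPLARS` instructions and their labels are invented, and `route` uses a plain nearest-neighbour rule under cosine similarity.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a sentence encoder such as DistilRoBERTa.

    Hashes each whitespace token into one of `dim` buckets and
    L2-normalises the result, so a dot product equals cosine similarity.
    """
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


# Invented exemplar instructions with the system each should activate.
EXEMPLARS = [
    ("put the red block into the green bowl", "fast"),
    ("pick up the blue cup", "fast"),
    ("i am thirsty, bring me something to drink", "slow"),
    ("place the block whose number equals two plus three", "slow"),
]


def route(instruction, exemplars=EXEMPLARS, embed=toy_embed):
    """Return (label, score) of the most similar exemplar instruction."""
    query = embed(instruction)
    best_label, best_score = None, -1.0
    for text, label in exemplars:
        score = cosine(query, embed(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


if __name__ == "__main__":
    label, score = route("pick up the green cup")
    print(f"system={label} similarity={score:.3f}")
```

In a real system, `embed` would be replaced by the actual sentence encoder and the exemplar set by labelled instructions from the training data. The nearest-neighbour rule is only one possible way to realise "embedding similarity search".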

Experiments

We empirically assess the broad applicability of RFST across diverse tasks in both simulated and real-world settings.

Experiments in simulation

Success rates on VIMA-Bench over six tasks. Tasks 1 and 2 belong to the fast-thinking system, and Tasks 3-6 belong to the slow-thinking system. RFST achieves notably higher success rates than the other methods on the slow-thinking tasks.

Experiments in the real world

Experiments on the real robot. Orange bars: slow-thinking tasks. Blue bars: fast-thinking tasks. RFST enables real robots to execute complex tasks such as mathematical reasoning and intent recognition, which were beyond the scope of conventional robotic manipulation techniques.

Citation

@article{zhu2024language,
  title={Language-Conditioned Robotic Manipulation with Fast and Slow Thinking},
  author={Zhu, Minjie and Zhu, Yichen and Li, Jinming and Wen, Junjie and Xu, Zhiyuan and Che, Zhengping and Shen, Chaomin and Peng, Yaxin and Liu, Dong and Feng, Feifei and others},
  journal={arXiv preprint arXiv:2401.04181},
  year={2024}
}
