Highlights:

  • A robot powered by Artificial Intelligence (AI) and trained on a multimodal embodied visual-language model with 562 billion parameters was unveiled this week by researchers from Google LLC and the Technical University of Berlin.
  • PaLM-E functions by observing its immediate surroundings through the robot’s camera, without relying on any preprocessed scene representation.

A robot powered by Artificial Intelligence (AI) and trained on a multimodal embodied visual-language model with 562 billion parameters was unveiled this week by researchers from Google LLC and the Technical University of Berlin.

The robot can perform various tasks based on human voice commands thanks to PaLM-E, a model that integrates AI-powered vision and language to enable autonomous robotic control, eliminating the need for ongoing retraining. In other words, the robot can understand what is being asked of it and then complete those tasks right away.

For instance, if the robot is instructed to “bring the chips from the drawer,” PaLM-E will immediately devise a plan of action based on the instruction and its field of vision, and the mobile robot platform will carry out that plan autonomously using its robotic arm.
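
As a rough illustration of that flow, the sketch below shows how a spoken instruction and a camera frame might be turned into a sequence of steps for the robot to carry out. The `PalmEPlanner` and robot interfaces are hypothetical stand-ins, not Google’s published API.

```python
# Hypothetical sketch of the instruction -> plan -> act flow described above.
# `PalmEPlanner`, the robot object, and their methods are illustrative
# placeholders, not Google's actual API.

from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Action:
    """One high-level step the platform can execute, e.g. 'open the drawer'."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


class PalmEPlanner:
    """Stand-in for a vision-language model that plans from text plus pixels."""

    def plan(self, instruction: str, camera_image: Any) -> List[Action]:
        # The real model encodes the instruction and the raw image together
        # and decodes a sequence of steps; only the interface shape is shown.
        raise NotImplementedError


def handle_command(model: PalmEPlanner, robot: Any, instruction: str) -> None:
    """Turn a spoken instruction into high-level steps and execute them."""
    image = robot.camera.capture()          # current field of view, raw pixels
    steps = model.plan(instruction, image)  # e.g. go to drawer, open it, pick chips
    for step in steps:
        robot.execute(step)                 # mobile base and arm carry out each step
```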

PaLM-E functions by observing its immediate surroundings through the robot’s camera, without relying on any preprocessed scene representation. It simply looks, takes in what it sees, and determines what it must do, so there is no need for a person to annotate the visual data first.

PaLM-E can respond to changes in the environment as it performs a task, according to Google’s researchers. For instance, if the robot goes to fetch the chips and someone else takes them from it and puts them on a table in the room, the robot will notice what happened, look for them, grab them, and then deliver them to the person who initially asked for them.
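
One way to get that behavior is a closed loop that re-observes the scene before every step. The sketch below extends the hypothetical planner above in that direction; it is illustrative only, not the published implementation.

```python
# Hypothetical closed-loop variant: take a fresh camera frame before every
# step and re-plan, so scene changes (like the chips being moved to a table)
# show up in the next plan. Names are illustrative, not the published system.

def run_until_done(model, robot, instruction: str, max_steps: int = 50) -> None:
    """Alternate between observing, planning, and executing a single step."""
    for _ in range(max_steps):
        image = robot.camera.capture()          # fresh observation each cycle
        steps = model.plan(instruction, image)  # plan against the current scene
        if not steps:                           # empty plan: task is complete
            return
        robot.execute(steps[0])                 # do one step, then look again
```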

PaLM-E is called an “embodied visual-language model” because it builds on the existing PaLM large language model and integrates sensory data and robotic control. It operates by making ongoing observations of its surroundings and encoding that data into a sequence of vectors, the same way it encodes words as “language tokens.” This lets it interpret sensory data in the same way it understands vocal commands.
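
Concretely, the idea is that an image is turned into a handful of vectors that sit in the same embedding space as word-token embeddings, so a prompt becomes one interleaved sequence. The sketch below only illustrates that interleaving; `token_embedder`, `image_encoder`, and `language_model` are assumed components, not the model’s actual code.

```python
# Minimal sketch of the interleaving idea: continuous observations (here, an
# image) are encoded into vectors in the same space as word-token embeddings,
# and the combined sequence is fed to the language model. `token_embedder`
# and `image_encoder` are hypothetical stand-ins for the real components.

import numpy as np


def build_multimodal_sequence(text_before: str, image, text_after: str,
                              token_embedder, image_encoder) -> np.ndarray:
    """Return one sequence of d-dimensional vectors mixing words and pixels."""
    prefix = token_embedder(text_before)   # shape: (num_prefix_tokens, d)
    vision = image_encoder(image)          # shape: (num_image_tokens, d)
    suffix = token_embedder(text_after)    # shape: (num_suffix_tokens, d)
    # To the language model, "word vectors" and "image vectors" are just
    # positions in the same input sequence.
    return np.concatenate([prefix, vision, suffix], axis=0)


# Usage (prompt wording invented):
#   sequence = build_multimodal_sequence(
#       "Q: How do I get the chips shown in", camera_frame, "? A:",
#       token_embedder, image_encoder)
#   answer = language_model.generate(sequence)
```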

According to the researchers, PaLM-E can “positively transfer” knowledge and skills from one task to another, allowing it to outperform single-task robot models. It also exhibits “multimodal chain-of-thought reasoning,” meaning it can evaluate a sequence of inputs that mix language and visual information, and “multi-image inference,” in which it draws on multiple images to make an inference or prediction.
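
Both behaviors boil down to prompts that interleave several images and pieces of text in a single sequence. The fragment below generalizes the previous sketch to an arbitrary mix of segments; it is illustrative only, and the example prompt wording is invented.

```python
# Hypothetical multi-image, chain-of-thought style prompt: several text and
# image segments are encoded into one interleaved sequence and the model is
# asked to reason step by step. Purely illustrative; the wording is invented.

import numpy as np


def build_interleaved_sequence(segments, token_embedder, image_encoder) -> np.ndarray:
    """`segments` is a list of ("text", str) or ("image", array) pairs."""
    parts = []
    for kind, content in segments:
        encoder = token_embedder if kind == "text" else image_encoder
        parts.append(encoder(content))     # each part: (n_tokens, d)
    return np.concatenate(parts, axis=0)


# Example multi-image prompt (invented):
#   segments = [
#       ("text", "Image A:"), ("image", frame_a),
#       ("text", "Image B:"), ("image", frame_b),
#       ("text", "Which image shows the chips, and how should the robot "
#                "reach them? Let's think step by step."),
#   ]
#   answer = language_model.generate(
#       build_interleaved_sequence(segments, token_embedder, image_encoder))
```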

Overall, PaLM-E represents a significant advance in autonomous robotics. Google stated that its next steps would be to investigate other applications in practical contexts like home automation and industrial robotics. The researchers also hoped their work would stimulate additional investigation into embodied AI and multimodal reasoning.