Highlights:

  • Google found that Gemini Robotics not only executes tasks it was not explicitly trained for but also adapts its approach to changing environmental conditions.
  • Once a task plan is formulated, Gemini Robotics-ER leverages Gemini 2.0’s coding capabilities to convert it into a configuration script, which programs the robot.

Google LLC has introduced Gemini Robotics and Gemini Robotics-ER, two new artificial intelligence models designed to power autonomous machines.

The models are built on the Gemini 2.0 series of large language models, which Google unveiled in December. Unlike traditional LLMs that process only text, Gemini 2.0 can handle multimodal data, including video. This capability enables Gemini Robotics and Gemini Robotics-ER to interpret visual input from a robot’s cameras to inform decision-making.
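To make that vision-plus-language input pattern concrete, the minimal sketch below sends a single camera frame and a text question to a Gemini 2.0 model through the publicly documented google-genai Python SDK. Gemini Robotics itself is not exposed through this API, and the file path, prompt and model name here are illustrative assumptions rather than anything Google has published for the robotics models.

```python
# Illustrative sketch only: Gemini Robotics is not served through the public
# API; this shows the general vision-plus-language prompting pattern using
# the google-genai Python SDK and a Gemini 2.0 model.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder credential

# A single frame captured from the robot's camera (path is hypothetical).
with open("camera_frame.jpg", "rb") as f:
    frame_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        "List the objects visible on the table and say which one is "
        "closest to the robot gripper.",
    ],
)
print(response.text)
```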

Google describes Gemini Robotics as a vision-language-action model. Robots using this AI can execute complex tasks based on natural language commands. For instance, a user could instruct the AI to fold paper into origami or place objects in a Ziploc bag.

Traditionally, programming industrial robots for new tasks required manual coding, a time-intensive process demanding specialized expertise. To simplify this, Google designed Gemini Robotics for versatility, allowing it to perform tasks it was not explicitly trained for—minimizing the need for manual programming.

To assess Gemini Robotics’ ability to handle new tasks, Google tested it using an AI generalization benchmark. The results showed that the model more than doubled the performance of previous vision-language-action models. Google also found that Gemini Robotics not only executes tasks it was not explicitly trained for but also adapts its approach to changing environmental conditions.

“If an object slips from its grasp, or someone moves an item around, Gemini Robotics quickly replans and carries on — a crucial ability for robots in the real world, where surprises are the norm,” Carolina Parada, head of robotics at Google DeepMind, detailed.

The second AI model introduced by Google, Gemini Robotics-ER, is designed for spatial reasoning—the complex set of calculations a robot performs before executing a task. For instance, picking up a coffee mug requires identifying the handle and determining the optimal approach angle for the robotic arm.
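As a rough illustration of that kind of calculation, the sketch below back-projects a grasp point reported as pixel coordinates (here, a hypothetical mug-handle location) into a 3D point in the camera frame and derives a simple approach angle. The camera intrinsics and depth reading are made-up values, and real grasp planning involves considerably more than this.

```python
# Illustrative sketch only: the grasp point is assumed to come from a
# spatial-reasoning model as pixel coordinates in the camera image;
# intrinsics and depth values here are hypothetical.
import math

def pixel_to_approach(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a known depth into a 3D point in the
    camera frame and return it with a simple approach angle."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    z = depth_m
    # Angle between the camera's optical axis and the ray to the target,
    # a crude stand-in for choosing the gripper's approach direction.
    approach_angle = math.degrees(math.atan2(math.hypot(x, y), z))
    return (x, y, z), approach_angle

# Example: the model reports the mug handle at pixel (412, 305) and the
# depth sensor reads 0.62 m; the intrinsics are invented for illustration.
point_3d, angle = pixel_to_approach(412, 305, 0.62,
                                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point_3d, angle)
```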

Once a task plan is formulated, Gemini Robotics-ER leverages Gemini 2.0’s coding capabilities to convert it into a configuration script, which programs the robot. If a task is too complex for Gemini Robotics-ER to handle independently, developers can refine its approach by providing just a few human demonstrations.
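The code-generation step might look roughly like the sketch below, which asks a Gemini 2.0 model to turn a plain-text plan into a control script. The robot_api module and its functions are hypothetical stand-ins, since Gemini Robotics-ER’s actual tooling has not been published, and any generated code would need review before running on hardware.

```python
# Illustrative sketch only: "robot_api" and its functions are hypothetical
# stand-ins for whatever control interface a given robot exposes.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder credential

task_plan = (
    "1. Locate the coffee mug on the table.\n"
    "2. Move the gripper above the mug handle.\n"
    "3. Close the gripper and lift the mug 10 cm."
)

prompt = (
    "You control a robot arm through a Python module named robot_api with "
    "functions move_to(x, y, z), close_gripper() and open_gripper(). "
    "Write a short Python script that executes the following plan:\n"
    + task_plan
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
)

# In practice the generated script would be reviewed or sandboxed before
# it is ever sent to real hardware.
print(response.text)
```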

“Gemini Robotics-ER can perform all the steps necessary to control a robot right out of the box, including perception, state estimation, spatial understanding, planning and code generation,” Parada wrote. “In such an end-to-end setting the model achieves a 2x-3x success rate compared to Gemini 2.0.”

Google plans to offer Gemini Robotics-ER to select partners, including Apptronik Inc., a humanoid robotics startup that recently secured $350 million in funding. Google invested as part of the financing round and is working with Apptronik to develop humanoid robots powered by Gemini 2.0.