Highlights:

  • According to Meta, ImageBind outperforms several standard models that each focus on a single type of data. The company also believes the neural network will help researchers discover new AI applications.
  • Meta’s new ImageBind model takes a different approach: the company says it stores multiple types of data in a single embedding rather than in separate ones.

Meta Platforms Inc. has released the code for ImageBind, an artificial intelligence model built internally to analyze six distinct data types.

According to Meta, ImageBind outperforms several standard models that each focus on a single type of data. The company also believes the neural network will help researchers discover new AI applications.

ImageBind can handle photos, text, audio, data from infrared sensors, and depth maps, which are three-dimensional representations of objects captured with a specialized camera. The sixth supported data type comes from IMUs, inertial measurement units that track an object’s position and related information such as its velocity.

Meta researchers stated, “ImageBind is part of Meta’s efforts to create multimodal AI systems that learn from all possible data types around them. As the number of modalities increase, ImageBind opens the floodgates for researchers to develop new, holistic systems, such as combining 3D and IMU sensors to design or experience immersive, virtual worlds.”

The data that AI models consume is stored as mathematical structures known as vectors. An embedding is a collection of such vectors that represents the data in an AI model’s internal knowledge bank. ImageBind’s fundamental innovation is the way it manages those embeddings.
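
To make the terminology concrete, here is a minimal sketch (not Meta’s code) in which an embedding is simply a table of vectors; the file names and four-dimensional values are invented for illustration:

```python
import numpy as np

# Invented example: an "embedding" here is just a table of vectors,
# one per item the model has ingested. Names and values are made up.
embedding = {
    "photo_of_dog.jpg": np.array([0.9, 0.1, 0.3, 0.0]),
    "bark.wav":         np.array([0.8, 0.2, 0.4, 0.1]),
    "photo_of_car.jpg": np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two vectors; values near 1.0 mean 'related'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related items sit close together in the vector space.
print(cosine(embedding["photo_of_dog.jpg"], embedding["bark.wav"]))          # ~0.98
print(cosine(embedding["photo_of_dog.jpg"], embedding["photo_of_car.jpg"]))  # ~0.11
```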

Multimodal models such as ImageBind are neural networks that handle multiple forms of data. A multimodal model typically stores each type of data it ingests in a distinct embedding. For example, a neural network that analyzes pictures and text might store the images in one embedding and the text in another, as in the sketch below.
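
This hypothetical sketch of the conventional design, with invented encoder stubs and dimensions, shows why separate per-modality embeddings are hard to compare:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image_path: str) -> np.ndarray:
    """Stub: a typical image encoder outputs vectors in its own space."""
    return rng.normal(size=512)    # e.g., a 512-dimensional image space

def text_encoder(text: str) -> np.ndarray:
    """Stub: the text encoder outputs vectors in a *different* space."""
    return rng.normal(size=768)    # e.g., a 768-dimensional text space

img_vec = image_encoder("photo.jpg")
txt_vec = text_encoder("a photo of a dog")

# The two vectors live in unrelated spaces with different widths, so they
# cannot be compared directly; bridging them takes extra machinery.
print(img_vec.shape, txt_vec.shape)  # (512,) (768,)
```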

Meta’s new ImageBind model takes a different approach. The company says it stores many types of data together in a single embedding rather than separately.
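
The sketch below illustrates the idea under stated assumptions: the `encode` stub and the 1,024-dimension width are placeholders, not details Meta has published. The point is that once every modality lands in one shared space, any two items become directly comparable:

```python
import numpy as np

DIM = 1024  # assumed width of the shared space; a placeholder, not a Meta spec
rng = np.random.default_rng(0)

def encode(modality: str, data) -> np.ndarray:
    """Stub for a per-modality encoder whose OUTPUT space is shared by all
    modalities. A real model would run a different network per modality."""
    vec = rng.normal(size=DIM)            # placeholder for real model output
    return vec / np.linalg.norm(vec)      # unit-normalize, a common convention

image_vec = encode("image", "dog.jpg")
audio_vec = encode("audio", "bark.wav")
depth_vec = encode("depth", "dog_depth.png")

# All vectors share one space, so cross-modal similarity is a single dot product.
print(float(image_vec @ audio_vec))
print(float(image_vec @ depth_vec))
```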

Storing data this way was feasible before ImageBind, but engineers had to assemble extremely complicated training datasets to build the capability into an AI model. According to Meta, creating such training datasets at a large scale is not practical.

ImageBind simplifies the job. It is centered on self-supervised learning, an approach to machine learning that substantially reduces the effort required to create training datasets. Meta states that ImageBind’s architecture enables it to outperform conventional neural networks in certain circumstances.
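
Meta has not detailed its training recipe in this announcement, but one widely used self-supervised technique is contrastive alignment, in which naturally paired samples (say, a video frame and its soundtrack) supervise each other without hand labeling. The sketch below implements the standard InfoNCE loss on invented data purely for illustration:

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray,
             temperature: float = 0.07) -> float:
    """InfoNCE contrastive loss over a batch: row i of `anchors` should match
    row i of `positives` and mismatch every other row."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature       # (batch, batch) similarity matrix
    # Log-softmax of each row; the diagonal holds the true pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
image_batch = rng.normal(size=(8, 64))                       # fake image embeddings
audio_batch = image_batch + 0.1 * rng.normal(size=(8, 64))   # their paired audio
print(info_nce(image_batch, audio_batch))  # lower loss = better-aligned pairs
```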

During an internal test, the company used ImageBind to classify a variety of audio and depth data. The model reportedly outperformed several AI systems designed to process a single data type. In addition, ImageBind reportedly set a performance record in a test involving “emergent zero-shot recognition” tasks, in which a model classifies data without having been trained on labeled examples for that task.
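
Zero-shot classification falls out of a joint embedding space naturally: embed the candidate labels as text, embed the sample, and pick the nearest label. The sketch below uses a random placeholder encoder, so the labels, dimensions, and printed output are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(data) -> np.ndarray:
    """Placeholder encoder into a shared 64-d space (random, so the printed
    winner is arbitrary; a trained model would pick the right label)."""
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

labels = ["dog barking", "car engine", "rainfall"]
label_vecs = np.stack([embed(f"a sound of {name}") for name in labels])

sample_vec = embed("clip.wav")        # could just as well be a depth map
scores = label_vecs @ sample_vec      # cosine similarity (all unit vectors)
print(labels[int(np.argmax(scores))])
```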

According to Meta, another advantage of ImageBind’s embedding architecture is that it lends itself to fairly complex computing tasks. Specifically, the model can analyze multiple categories of data simultaneously. ImageBind could, for instance, generate an image of a vehicle based on a design sketch and a textual description.
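
One simple way such combination can work in a shared space is vector arithmetic: fuse the embeddings of two inputs and hand the result to a downstream retrieval index or generator. The encoder stub and file names below are hypothetical; Meta has not published the exact mechanism here:

```python
import numpy as np

rng = np.random.default_rng(2)

def embed(data) -> np.ndarray:
    """Placeholder encoder into a shared 64-d space."""
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

sketch_vec = embed("car_design_sketch.png")      # a rough design drawing
text_vec = embed("a red vintage convertible")    # a textual description

query = sketch_vec + text_vec                    # fuse the two prompts
query /= np.linalg.norm(query)

# `query` could then drive a nearest-neighbor search over image embeddings,
# or condition an embedding-based image generator (both outside this sketch).
print(query.shape)  # (64,)
```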

ImageBind can similarly mix and match its four other supported data formats. Meta believes that support for additional data types may be introduced in the future. The company anticipates that computer scientists will utilize ImageBind to advance multimodal AI research and investigate new applications for the technology.

Meta’s researchers said, “There’s still a lot to uncover about multimodal learning. The AI research community has yet to effectively quantify scaling behaviors that appear only in larger models and understand their applications. ImageBind is a step toward rigorously evaluating them and showing novel applications in image generation and retrieval.”