Highlights:

  • Molmo was evaluated internally at the Allen Institute for AI against multiple proprietary large language models across eleven benchmarks.
  • Two of the models in the Llama 3.2 series focus on text processing tasks.

The Allen Institute for AI has launched Molmo, a suite of open-source language models that can process both images and text.

The launch took place against the backdrop of Meta Platforms Inc.’s Connect 2024 product event, where the company unveiled new mixed-reality hardware along with its own open-source Llama 3.2 language model series. Like Molmo, two of the models in that lineup incorporate multimodal processing features.

The Allen Institute for AI, or Ai2, is a Seattle-based nonprofit focused on machine learning research. The new Molmo series comprises four neural networks: the most sophisticated model has 72 billion parameters, the hardware-efficient model has one billion, and the other two have seven billion each.

All four models respond to natural language requests and also offer multimodal processing capabilities. Molmo can recognize, count, and describe objects in an image, and it can carry out related tasks such as explaining the data displayed in a chart.
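As an illustration, the sketch below shows how such a prompt might be issued in Python. It assumes the checkpoints are published on Hugging Face under an identifier such as allenai/Molmo-7B-D-0924 and that they expose custom processing and generation helpers loaded via trust_remote_code; the exact interface is an assumption and may differ.

    # Minimal sketch, assuming Molmo weights are hosted on Hugging Face under
    # an identifier like "allenai/Molmo-7B-D-0924" (assumed name) and that the
    # repository ships custom processing/generation helpers via trust_remote_code.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed repository identifier

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, device_map="auto"
    )

    # Pair an image with a natural language request, e.g. counting objects.
    inputs = processor.process(
        images=[Image.open("photo.jpg")],
        text="How many dogs are in this picture? Describe each one briefly.",
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )

    # Decode only the newly generated tokens, i.e. the model's answer.
    answer = output[0, inputs["input_ids"].size(1):]
    print(processor.tokenizer.decode(answer, skip_special_tokens=True))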

Molmo was evaluated internally at the Allen Institute for AI against multiple proprietary large language models across eleven benchmark tests. The version with 72 billion parameters scored 81.2, marginally surpassing OpenAI’s GPT-4o, while the two versions with seven billion parameters came within five points of the OpenAI model.

The smallest model in the series, with one billion parameters, has more limited processing power. According to the Allen Institute for AI, however, it can still outperform some systems with ten times as many parameters, and it is small enough to run on a mobile device.

One factor behind the Molmo series’ processing power is its training dataset, which contains several hundred thousand images, each paired with an extremely thorough description of its contents. By learning from those descriptions, the Allen Institute for AI claims, Molmo was able to outperform larger models trained on lower-quality data at object recognition tasks.

“We take a vastly different approach to sourcing data with an intense focus on data quality, and are able to train powerful models with less than 1M image text pairs, representing 3 orders of magnitude less data than many competitive approaches,” Molmo developers reported.

Molmo’s introduction coincided with the event at which Meta released Llama 3.2, its new family of language models. Like Molmo, the lineup comprises four open-source neural networks.

The first two models have 11 billion and 90 billion parameters, respectively, and their multimodal architecture enables them to process images as well as text. According to Meta, the models match GPT-4o mini, a scaled-down version of GPT-4o, on image recognition tests.
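For comparison, a similar interaction with one of the multimodal Llama 3.2 models might look like the sketch below, assuming the instruction-tuned 11-billion-parameter checkpoint is available through the Hugging Face transformers library under a name such as meta-llama/Llama-3.2-11B-Vision-Instruct; the identifier and interface details are assumptions here as well.

    # Minimal sketch, assuming the Llama 3.2 vision model is accessible through
    # the Hugging Face transformers library under an identifier such as
    # "meta-llama/Llama-3.2-11B-Vision-Instruct" (assumed name).
    import torch
    from PIL import Image
    from transformers import AutoProcessor, MllamaForConditionalGeneration

    MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed identifier

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = MllamaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Build a chat-style prompt that pairs an image with a recognition request.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects appear in this image?"},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(
        Image.open("photo.jpg"), prompt, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=150)
    print(processor.decode(output[0], skip_special_tokens=True))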

The other two models in the Llama 3.2 series focus on text processing. The more sophisticated of the two has three billion parameters, roughly three times as many as the other. According to Meta, both models outperform algorithms of similar size on a variety of tasks.