Highlights:

  • Anthropic’s initiative arises from increasing criticism of current AI model benchmark tests, including the twice-yearly MLPerf evaluations conducted by the nonprofit MLCommons.
  • Anthropic aims to inspire the AI research community to develop more rigorous benchmarks that emphasize societal implications and security considerations.

Anthropic PBC, a generative artificial intelligence startup, aims to demonstrate that its large language models are industry-leading. To achieve this, Anthropic has introduced a new program designed to encourage researchers to develop innovative benchmarks for assessing AI performance and impact.

The company announced the new program in a recent blog post, stating that it is ready to offer grants to third-party organizations that can develop improved methods for “measuring advanced capabilities in AI models.”

Anthropic’s initiative arises from increasing criticism of current benchmark tests for AI models, such as the MLPerf evaluations conducted twice a year by the nonprofit MLCommons. There is a general consensus that the most widely used benchmarks inadequately reflect how the average person interacts with AI systems in daily life.

For example, most benchmarks are too narrowly focused on individual tasks, whereas AI models like Anthropic’s Claude and OpenAI’s ChatGPT are designed to handle a variety of tasks. Additionally, there is a shortage of effective benchmarks that can evaluate the potential risks posed by AI.

Anthropic aims to inspire the AI research community to create more rigorous benchmarks that emphasize societal implications and security concerns. The company advocates for a comprehensive overhaul of current methodologies.

“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem. Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply,” the company stated.

For instance, the startup expressed interest in developing a benchmark capable of more effectively evaluating an AI model’s potential for malicious activities, such as cyberattacks, manipulation or deception of individuals, and enhancement of weapons of mass destruction. It aims to contribute to the creation of an “early warning system” to identify potentially risky models that could pose national security threats.

It also seeks more specialized benchmarks to assess AI systems’ capabilities in supporting scientific research, reducing inherent biases, self-regulating toxic content, and engaging in multilingual conversations.

The company envisions this process involving the development of new tools and infrastructure that empower domain experts to design tailored evaluations for specific tasks. These evaluations would then undergo extensive testing involving hundreds or even thousands of users. To kick-start these efforts, the company has hired a full-time program coordinator. In addition to offering grants, it plans to facilitate discussions between researchers and its own domain experts, including red team, fine-tuning, and trust and safety teams.

Moreover, the company mentioned that it might consider investing in or acquiring the most promising projects that emerge from the initiative. The company said, “We offer a range of funding options tailored to the needs and stage of each project.”

Anthropic is not alone among AI startups advocating for the adoption of more advanced benchmarks. Just last month, Sierra Technologies Inc. introduced a new benchmark test named “𝜏-bench.” This benchmark is specifically designed to assess the capabilities of AI agents, which extend beyond mere conversational engagement to performing tasks on behalf of users upon request.

However, there is reason to view any AI company’s push to establish new benchmarks with caution: such tests carry clear commercial advantages if they can be used to demonstrate the superiority of the company’s own AI models over rivals’.

In Anthropic’s case, the company stated in its blog post that it wants researchers’ benchmarks to align with its own AI safety classifications, which it developed in collaboration with third-party AI researchers. That raises the concern that researchers may feel pressured to adopt definitions of AI safety that do not necessarily match their own perspectives.

Nevertheless, Anthropic maintains that the initiative is intended to stimulate advancement throughout the broader AI industry, setting the stage for a future where more thorough evaluations become standard practice.