Highlights:

  • After uploading sample data, developers specify how many question-and-answer pairs the API should generate.
  • Databricks plans to roll out several API improvements early next year.

Databricks Inc. launched an application programming interface (API) to help users generate synthetic data for machine learning (ML) projects.

The API is accessible through Mosaic AI Agent Evaluation, a tool in the company’s data lakehouse platform that helps developers assess the latency, cost, and output quality of artificial intelligence applications. Mosaic AI Agent Evaluation debuted in June alongside Mosaic AI Agent Framework, a companion tool that eases the deployment of retrieval-augmented generation applications.

Synthetic data is information generated with AI specifically for use in developing neural networks. Compared with building training datasets manually, the approach is significantly faster and more economical. Databricks’ new API is designed to generate question-and-answer collections, which are useful for building applications powered by large language models.

There are three steps involved in creating a dataset using the API.

Developers must first upload a dataframe, or collection of files, that contains business information relevant to the task their AI application will perform. The dataframe must be in a format supported by Pandas or Apache Spark. Pandas is a widely used analytics library for the Python programming language, while Spark is the open-source data processing engine that powers Databricks’ platform.
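For illustration, here is a minimal sketch of that first step, assuming a Pandas dataframe with one row per document. The column names ("content" for the document text, "doc_uri" for its source location) and the file paths are illustrative assumptions rather than a confirmed schema:

```python
import pandas as pd

# Sketch of step one: gather business documents into a Pandas DataFrame.
# The column names and URIs below are illustrative assumptions, not a
# confirmed schema; check the Databricks documentation for the exact format.
docs = pd.DataFrame(
    [
        {
            "content": "Premium support plans guarantee a first response "
                       "within four business hours.",
            "doc_uri": "s3://example-bucket/support/premium_plan.md",
        },
        {
            "content": "Approved refunds are processed within five "
                       "business days.",
            "doc_uri": "s3://example-bucket/policies/refunds.md",
        },
    ]
)
```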

After uploading the sample data, developers specify how many questions and answers the API should generate. They can optionally provide additional guidance to shape the API’s output: a software team can describe the end users who will interact with the AI application, the task for which the questions will be used, and the style in which the questions should be written.
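As a rough sketch of how that second step might look in code, the example below assumes the API is exposed through a generate_evals_df helper in Databricks’ agents package; the import path, parameter names, and guideline format are assumptions based on public documentation and may differ:

```python
import pandas as pd

# Assumed import path; verify against current Databricks documentation.
from databricks.agents.evals import generate_evals_df

# One-row stand-in for the document frame built in the previous sketch.
docs = pd.DataFrame(
    [
        {
            "content": "Approved refunds are processed within five business days.",
            "doc_uri": "s3://example-bucket/policies/refunds.md",
        }
    ]
)

evals = generate_evals_df(
    docs,
    num_evals=25,  # how many question-answer pairs to generate
    # Optional guidance: who the end users are and what the agent does...
    agent_description=(
        "A customer-support assistant that answers questions about "
        "billing and refund policies."
    ),
    # ...and the style the generated questions should follow.
    question_guidelines=(
        "# User personas\n"
        "- A customer asking about a refund\n"
        "# Example questions\n"
        "- How long does a refund take?\n"
        "# Additional guidelines\n"
        "- Keep questions short and conversational."
    ),
)

print(evals.head())
```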

Inaccurate training data can lower the quality of an AI model’s output. For that reason, businesses often have subject matter experts check a synthetic dataset for inaccuracies before supplying it to a neural network. According to Databricks, the API was designed to make this review step easier.

“Importantly, the generated synthetic answer is a set of facts that are required to answer the question rather than a response written by the LLM,” Databricks engineers reported. “This approach has the distinct benefit of making it faster for an SME to review and edit these facts vs. a full, generated response.”
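To make the distinction concrete, a generated record might pair a question with a short list of required facts rather than a polished paragraph. The field names below are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical shape of one generated record: the answer is a list of
# required facts, not a fully written LLM response. Field names are
# illustrative assumptions only.
example_eval = {
    "request": "How long do refunds take once they are approved?",
    "expected_facts": [
        "Refunds are processed after approval.",
        "Processing takes up to five business days.",
    ],
}

# A subject matter expert only needs to confirm or edit each short fact,
# which is faster than proofreading a full generated paragraph.
for fact in example_eval["expected_facts"]:
    print("-", fact)
```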

Databricks plans to roll out several API improvements early next year. A new graphical interface will let dataset reviewers more quickly evaluate question-answer pairs for problems and add more pairs where needed. The company will also include a tool for tracking how a business’s synthetic datasets evolve over time.