Highlights:
- DatologyAI’s Series A round was led by Viv Faga and Astasia Myers of Felicis Ventures, with participation from existing investors Radical Ventures and Amplify Partners and new investors Elad Gil, M12 and the Amazon Alexa Fund, according to the company.
- Smaller AI models incur significantly lower compute costs, a crucial consideration given that some AI companies spend millions of dollars each month training and deploying their models.
Artificial intelligence data curation startup DatologyAI recently announced that it has completed a USD 46 million Series A funding round. The round comes three months after the company disclosed that it had raised USD 11.65 million in seed funding.
The Series A was led by Viv Faga and Astasia Myers of Felicis Ventures, with participation from existing investors Radical Ventures and Amplify Partners, as well as new investors Elad Gil, M12 and the Amazon Alexa Fund, according to the company. Altogether, DatologyAI has raised nearly USD 57.7 million in funding.
The startup aims to democratize data research to address one of the primary challenges in generative AI development: the need for large, well-suited datasets to train large language models such as OpenAI’s GPT-4 and Google LLC’s Gemini Pro.
DatologyAI offers tools that automate a significant portion of the dataset-creation process. The platform pinpoints the information within a dataset that is most relevant to the objectives of a given AI model. Its tools can also recommend ways to enrich a dataset with supplementary information, determine how best to batch that information, and segment it into more manageable portions to streamline model training.
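DatologyAI has not published how its curation works under the hood, but a minimal Python sketch of what a filter-then-segment step might look like, with a hypothetical `relevance_score` function standing in for a real scoring model, could be:

```python
from typing import Callable, Iterable

def curate(
    documents: Iterable[str],
    relevance_score: Callable[[str], float],  # hypothetical stand-in for a learned scorer
    threshold: float = 0.5,
    chunk_size: int = 2048,
) -> list[str]:
    """Keep only documents the scorer deems relevant, then split them
    into fixed-size chunks that are easier to batch during training."""
    kept = [doc for doc in documents if relevance_score(doc) >= threshold]
    chunks = []
    for doc in kept:
        for start in range(0, len(doc), chunk_size):
            chunks.append(doc[start:start + chunk_size])
    return chunks

# Toy usage: a keyword heuristic stands in for a real relevance model.
score = lambda doc: 1.0 if "finance" in doc.lower() else 0.0
print(curate(["Quarterly finance report ...", "Unrelated chatter ..."], score))
```

In a production pipeline, the scoring model, threshold and chunk size would all be tuned to the objectives of the target model.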
The startup notes that building datasets for generative AI is difficult because developers must ensure their models neither produce toxic content nor inherit biases from the material they are trained on. The problem is that prejudicial patterns in the data can be hard for humans to spot, in part because AI training datasets tend to be vast and intricate, spanning many formats and containing considerable noise and extraneous information that does not necessarily improve the model.
“Models are what they eat, and the data models ingest determines everything about their capabilities,” the company explained in a brief blog post about the recent round.
Founder and Chief Executive Ari Morcos argues that more efficient training datasets make it possible to improve the quality and performance of AI models without requiring the models themselves to be excessively large and costly to train and deploy.
Smaller AI models incur significantly lower compute costs, a crucial consideration given that some AI companies spend millions of dollars each month training and deploying their models.
For AI developers, the challenge lies in the sheer abundance of information, which often leaves them unsure where to start. Rather than navigate that complexity, they may simply select a random subset of the available data. While this saves time and effort, it typically means the model is trained on redundant data, which slows training and raises costs; some of that data may even degrade the model’s performance.
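To see why redundancy matters, here is a purely illustrative deduplication pass, not DatologyAI’s method, that drops exact repeats before training; production systems typically go further with near-duplicate detection such as MinHash:

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates so the model never trains twice on the same text."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        # Normalize lightly so trivially identical texts hash the same.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The sky is blue.", "the sky is blue.  ", "Water boils at 100 C."]
print(deduplicate(docs))  # keeps two of the three documents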
DatologyAI’s tools let developers pinpoint the most pertinent information within a dataset. Less relevant data is filtered out, leaving a smaller, higher-quality set of samples that is ready for training.
The company’s toolkit can also help annotate unlabeled data, a painstaking task that is often carried out manually, and it can flag potentially harmful data as well as anomalies in model behavior.
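DatologyAI has not described how its flagging works; as an illustrative sketch, the simple blocklist check below stands in for the trained toxicity classifier such a pipeline would actually use:

```python
def flag_for_review(documents: list[str], blocklist: set[str]) -> list[tuple[str, bool]]:
    """Pair each document with a flag marking whether it needs human review.
    A real pipeline would use a trained classifier; a blocklist is a stand-in."""
    results = []
    for doc in documents:
        tokens = set(doc.lower().split())
        results.append((doc, bool(tokens & blocklist)))
    return results

samples = ["A helpful tutorial on sorting.", "An offensive rant full of insults."]
for doc, needs_review in flag_for_review(samples, blocklist={"offensive"}):
    print(needs_review, "-", doc)
```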
The startup said the Series A funding will allow it to “substantially scale the size of our team,” with an emphasis on hiring more engineers and researchers. It also plans to expand its compute capacity to “push the frontier of what is possible with data curation.”