Highlights:
- Cloudflare claims its software can detect bots attempting to scrape content for LLM training projects, even those trying to evade detection.
- Cloudflare plans to update the feature regularly to adapt to changes in AI scraping bots’ technical patterns and new crawler developments.
Recently, Cloudflare Inc. introduced a new no-code feature designed to deter AI developers from scraping content from websites.
The feature is included in the company’s flagship CDN, which a significant portion of the world’s websites use to improve page load speeds. Cloudflare has made the new scraping prevention feature available in both the free and paid versions of its CDN.
Numerous AI firms utilize public web content to train their large language models. While entities like OpenAI and Google LLC allow website operators to opt out of scraping, not all LLM developers offer this choice. This is the challenge that Cloudflare aims to tackle with its scraping prevention tool.
The feature employs artificial intelligence to identify automated attempts to extract content. Cloudflare claims that its software can detect bots attempting to scrape content for LLM training projects, even those trying to evade detection.
“Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent. We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot,” Cloudflare engineers wrote in a recent blog post.
Cloudflare identified a crawler that Perplexity AI Inc., a well-funded search engine startup, uses to collect content. A media outlet recently reported that the bot mimics regular user traffic when scraping websites, making it difficult for website operators to block Perplexity AI from accessing their content.
Cloudflare assigns a score between 1 and 99 to every website visit processed through its platform. A lower score indicates a higher likelihood of the request being generated by a bot. According to Cloudflare, requests made by the bot collecting content for Perplexity AI consistently receive a score below 30.
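To make the scoring scheme concrete, the following is a minimal Python sketch of how a site operator might act on such a score. It is illustrative only: the `Visit` class, the `decide` function and the choice of 30 as a cutoff are assumptions drawn from the figures quoted above, not Cloudflare's actual API or configuration.

```python
from dataclasses import dataclass

# Assumption: the cutoff mirrors the "below 30" figure quoted for the
# Perplexity AI crawler; an operator could pick any other threshold.
BOT_THRESHOLD = 30


@dataclass
class Visit:
    """Hypothetical record of one request and its bot score."""
    path: str
    bot_score: int  # 1..99; a lower score means more likely a bot


def is_likely_bot(visit: Visit, threshold: int = BOT_THRESHOLD) -> bool:
    """Return True when the score falls strictly below the threshold."""
    if not 1 <= visit.bot_score <= 99:
        raise ValueError("bot_score must be between 1 and 99")
    return visit.bot_score < threshold


def decide(visit: Visit) -> str:
    """Map a visit to a simple allow/block decision."""
    return "block" if is_likely_bot(visit) else "allow"


if __name__ == "__main__":
    for v in (Visit("/article", 12), Visit("/article", 85)):
        print(v.path, v.bot_score, decide(v))
```

In this sketch a score of exactly 30 is allowed, because the comparison is strict; a stricter or looser policy only requires changing the threshold.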
Cloudflare’s engineers explained, “When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint.”
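The general idea of trusting a fingerprint according to how it has behaved across many requests can be sketched as follows. This is a toy illustration, not Cloudflare's method: the `FingerprintTrust` class, the `looked_automated` signal and the smoothed ratio used as a trust value are all hypothetical.

```python
from collections import defaultdict


class FingerprintTrust:
    """Toy per-fingerprint trust tally (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._seen = defaultdict(int)     # requests observed per fingerprint
        self._flagged = defaultdict(int)  # requests that looked automated

    def observe(self, fingerprint: str, looked_automated: bool) -> None:
        """Record one request carrying the given fingerprint."""
        self._seen[fingerprint] += 1
        if looked_automated:
            self._flagged[fingerprint] += 1

    def trust(self, fingerprint: str) -> float:
        """Return a value in (0, 1); higher means more trusted.

        Add-one smoothing keeps an unseen fingerprint at a neutral 0.5
        instead of dividing by zero.
        """
        seen = self._seen.get(fingerprint, 0)
        flagged = self._flagged.get(fingerprint, 0)
        return (seen - flagged + 1) / (seen + 2)


if __name__ == "__main__":
    tally = FingerprintTrust()
    for _ in range(8):
        tally.observe("scraping-framework", looked_automated=True)
        tally.observe("ordinary-browser", looked_automated=False)
    print(tally.trust("scraping-framework"), tally.trust("ordinary-browser"))
```

The more traffic a network observes, the more observations feed each tally, which is why request volume matters for this kind of estimate.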
Cloudflare plans to continually update the feature to adapt to evolving technical signatures of AI scraping bots and the emergence of new crawlers. As part of this effort, the company is introducing a tool that allows website operators to report encounters with new bots.