Major websites are blocking AI crawlers from accessing their content
Nearly 20% of the top 1000 websites in the world are blocking crawler bots that gather web data for AI services, according to new data from Originality.AI, an AI content detector.
Why it matters: In the absence of clear legal or regulatory rules governing AI’s use of copyrighted material, websites big and small are taking matters into their own hands.
Driving the news: OpenAI introduced its GPTBot crawler early in August, declaring that the data gathered “may potentially be used to improve future models,” promising that paywalled content would be excluded and instructing websites in how to bar the crawler.
- Soon after, several high-profile news sites, including the New York Times, Reuters and CNN, began blocking GPTBot, and many more have since followed. (Axios is among them.)
By the numbers: Of the 1000 most visited websites in the world, the share blocking OpenAI’s GPTBot crawler rose from 9.1% on Aug 22 to 12% on Aug 29, per Originality.AI’s data.
- The biggest sites blocking GPTBot are Amazon, Quora and Indeed. Bigger websites are more likely to have already blocked AI bots, the data shows.
- The Common Crawl Bot — another crawler that regularly gathers web data used by some AI services — is being blocked 6.77% of the time across the top 1000 sites.
How it works: Any page you can access from a web browser can also be “scraped” by a crawler — which operates just like a browser but stores the material in a database instead of displaying it to a user.
- That’s how search engines like Google gather their information.
- Site owners have always had the ability to post instructions that tell these crawlers to go away — but cooperation is strictly voluntary, and bad actors can ignore the instructions.
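Those instructions typically live in a site's robots.txt file. As a rough illustration, the Python sketch below uses the standard library's robots.txt parser to show how a well-behaved crawler might check them before fetching a page. The robots.txt content, the example.com URL and the `is_allowed` helper are hypothetical, and the `CCBot` token for Common Crawl's crawler is an assumption not stated in the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site that bars two AI-related
# crawlers but leaves everything open to all other user agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # hypothetical URL
    for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Nothing in this check is enforced by the site: it only reports what the file asks for, and a crawler that skips the check can still request the page.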
The big picture: Google and other web firms see their data crawlers’ work as fair use, but many publishers and intellectual property holders have long objected, and Google has faced multiple lawsuits over the practice.
- The rise of large language models and generative AI has pushed this question back into the spotlight, as AI companies send out their own crawlers to collect data to train their models and provide fodder for their chatbots.
Reality check: Some publishers saw at least some value in letting search crawlers access their sites since Google and other search sites sent users to their ad-supported sites.
- But in the AI era, publishers are more aggressively blocking crawlers because there’s no upside, for now, in handing over their data to AI companies.
- Many media companies are in talks with AI firms about licensing their data for a fee, but those talks are in early stages.
- In the interim, some websites and intellectual property holders are taking or considering legal action against AI companies that may have used their data without permission.
Our thought bubble: Media outfits that feel they got taken by Google over the past two decades are eyeing the rapid commercialization of AI services like OpenAI’s with hostility and a “we won’t get fooled again” attitude.
- OpenAI is reportedly on track to bring in more than $1 billion in revenue over the next year, per The Information.
Zoom in: News companies specifically are struggling to find the right balance between embracing AI and resisting it.
- On one hand, the industry is desperate to find innovative ways to improve profit margins in its labor-intensive business.
- On the other, introducing AI into a newsroom’s workflow, at a time when trust in news companies is at a historic low, presents challenging ethical questions.
What to watch: If too much of the web blocks AI crawlers, their owners could find it harder to refine and update their AI products — and good data is getting tougher to find.
- Originality.AI found that the rate of blocking the GPTBot among the top 1000 websites is increasing roughly 5% per week.