Google has unveiled a feature called Google-Extended that lets website publishers opt out of having their data used to train Google's AI models. The new control allows sites to remain crawlable and indexable by Google's web crawlers while keeping their data out of the ongoing development of AI models, including Bard and the Vertex AI generative APIs.
The move effectively gives web publishers a switch for controlling whether their content feeds Google's AI training. Google had previously confirmed that it trains its AI chatbot, Bard, on publicly available data scraped from the web.
To implement Google-Extended, publishers use the robots.txt file, which tells web crawlers which parts of a site they may access. Google says it will explore additional machine-readable methods of offering choice and control to web publishers as AI applications expand, with more details to come.
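Based on how robots.txt directives generally work, a publisher wanting to opt out of AI training while staying in Google Search could add an entry like the following; the exact rules (which paths to disallow) are an illustrative assumption:

```
# Block the Google-Extended token so site data is not used
# for training Bard and Vertex AI generative APIs.
User-agent: Google-Extended
Disallow: /

# Ordinary Google Search crawling is governed by separate
# user-agent rules, so indexing is unaffected.
User-agent: Googlebot
Allow: /
```

Because Google-Extended is a standalone user-agent token rather than a separate crawler, blocking it does not change how Googlebot crawls or ranks the site.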
Many websites have already blocked the web crawlers that entities such as OpenAI use to scrape training data, with notable examples including The New York Times, CNN, Reuters, and Medium. Blocking Google entirely has been harder, however, because doing so would also remove a site from Google search results. Consequently, some publishers, such as The New York Times, have turned to legal measures instead, updating their terms of service to prohibit companies from using their content for AI training.