Sites scramble to block ChatGPT web crawler after instructions emerge

UngodlyAudrey🏳️‍⚧️@beehaw.org · 1 year ago

The Doctor@beehaw.org · 1 year ago

Very early on, at least, their spiders respected robots.txt.

I know there are folks who block all of the Big G's crawlers in their robots.txt files on principle; might want to ask them whether it works or not.
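For anyone who wants to sanity-check their own rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration. `GPTBot` and `Googlebot` are, as far as I know, the user-agent tokens OpenAI and Google publish for their crawlers; they are not named in this thread. Note that this only tells you what a well-behaved crawler should do with the file, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two named crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/post/1"))
```

The only real proof is still the server access log: if the rules are honored, the named crawlers should show up requesting robots.txt and little else.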

chameleon@kbin.social · 1 year ago

I do, and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha though: they still put the URLs in search; they just appear with a useless description. Their suggestion to avoid that can be summarized as “don’t block us, let us crawl and just tell us not to use the result, just trust us!”, when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed. Their logic on password protecting kind of makes sense, but my concern isn’t security, it’s that I don’t like them (or Bing or Yandex).

Another gotcha I’ve seen linked is that their ad targeting bot for Google AdSense (different crawler) doesn’t respect a * exclusion, but that kind of makes sense since it will only ever visit your site if you place AdSense ads on it.
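To make that second gotcha concrete: a crawler that ignores the `*` group is only covered by rules that name it explicitly. Below is a rough sketch that checks whether a robots.txt names a given agent on its own `User-agent` line. The helper name `names_agent_explicitly` is made up for this example, and `Mediapartners-Google` is, as far as I know, the token Google documents for the AdSense crawler; treat both as assumptions rather than anything stated in this thread.

```python
def names_agent_explicitly(robots_txt: str, agent: str) -> bool:
    """Return True if some User-agent line names `agent` itself, not just '*'."""
    for raw in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            if value.strip().lower() == agent.lower():
                return True
    return False


# Hypothetical files: one relies on the wildcard, one names the crawler.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

WITH_EXPLICIT_GROUP = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Disallow: /
"""

if __name__ == "__main__":
    print(names_agent_explicitly(WILDCARD_ONLY, "Mediapartners-Google"))
    print(names_agent_explicitly(WITH_EXPLICIT_GROUP, "Mediapartners-Google"))
```

This is a deliberately simple line scan, not a full robots.txt parser; it only answers whether the agent is listed by name.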

And I suppose they’ll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.

The Doctor@beehaw.org · 1 year ago

Now that’s a dirty trick.