OpenAI, the maker of machine learning models trained on public web data, has published the specifications for its web crawler so that publishers and site owners can opt out of having their content scraped.
The newly released technical document describes how to identify OpenAI’s web crawler, GPTBot, through its user agent token and string, which the company’s software sends in the HTTP request header whenever it asks a server for a web page.
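As a rough illustration of what identifying the crawler by its user agent can look like on the server side, here is a minimal Python sketch. It assumes the token appears somewhere in the User-Agent header value and that matching it case-insensitively is acceptable. The function name `is_gptbot` and the sample header values are hypothetical, made up for this example and not taken from OpenAI's document.

```python
from typing import Mapping

# Token named in the article; case-insensitive matching is an assumption.
CRAWLER_TOKEN = "GPTBot"


def is_gptbot(headers: Mapping[str, str]) -> bool:
    """Return True if the request's User-Agent header contains the crawler token."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "")
    return CRAWLER_TOKEN.lower() in user_agent.lower()


# Hypothetical header values, for illustration only.
sample_crawler = {"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}
sample_browser = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"}

print(is_gptbot(sample_crawler))
print(is_gptbot(sample_browser))
```

A substring check like this only reflects what the client claims about itself; any software can send the same header, so it is a way to recognise a well-behaved crawler rather than to authenticate one.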
Web publishers can thus add an entry into their…