Apple has published a detailed research paper confirming that its Apple Intelligence models are trained only on data that publishers have licensed or explicitly allowed through standard web crawler permissions. The company stresses that it does not use content scraped without consent and that it honours robots.txt and crawler preference files to respect publisher wishes.

Apple’s data sourcing policy
In its “Apple Intelligence Foundation Language Models – Tech Report 2025,” Apple explains that training combines three data streams: licensed publisher content, high‑quality open‑source datasets, and publicly available data collected by Applebot. If a publisher’s crawler preference file disallows access, Applebot will not gather content from that site or from the specified pages.
Licensed and curated content
Apple reports that it approached major publishers in 2023 to licence content for model training, with offers in the millions of dollars. Those agreements ensure that publishers receive compensation when their articles help power Apple Intelligence features such as summarization, translation, and generative suggestions.
Respecting publisher preferences
The company’s crawler, Applebot, adheres to the robots.txt standard and to any crawler preference file a site provides. Publishers that block Applebot entirely, or that disallow specific paths, will find those directives honoured by Apple’s systems. The aim is to avoid unsanctioned scraping of sensitive or paywalled content.
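To illustrate how robots.txt directives gate a compliant crawler, here is a minimal sketch using Python’s standard `urllib.robotparser`. The rules, user-agent grouping, and URLs below are hypothetical examples for demonstration, not Apple’s actual directives or any publisher’s real configuration.

```python
from urllib import robotparser

# Hypothetical robots.txt: Applebot may crawl the site except /premium/,
# while all other crawlers are denied entirely.
rules = """
User-agent: Applebot
Disallow: /premium/

User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks each URL before fetching it.
print(parser.can_fetch("Applebot", "https://example.com/news/story"))     # True
print(parser.can_fetch("Applebot", "https://example.com/premium/story"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/news/story"))     # False
```

A crawler that honours these signals, as Apple says Applebot does, simply skips any URL for which `can_fetch` returns `False`; paywalled or restricted sections never enter the training corpus.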
Contrast with other AI firms
OpenAI has publicly indicated that it respects robots.txt, though it has not committed to refusing all prohibited material. An OpenAI blog entry from May 2024 noted that the company considers such signals, but again stopped short of claiming full compliance. According to industry research by TollBit, almost 13 percent of AI scrapes ignored robots.txt altogether in Q1 2025, roughly triple the rate observed in Q4 2024.
Responsible AI principles in practice
Apple’s report frames these practices as part of its Responsible AI principles, applied from data selection through model evaluation. It outlines processes for flagging low-quality or bias-prone sources and for human review in high-impact scenarios. The company also explains how its on-device and Private Cloud Compute architectures protect user information and preserve privacy.

Apple’s open disclosure of its training practices, and the rigour evident in its crawler governance, position the company among the leaders in ethical AI development. As the controversy over data rights and artificial intelligence continues, Apple presents itself as an example of how innovation can coexist with respect for publishers’ interests and the law.