Thursday, August 8, 2025

How LLMs Like ChatGPT Really Crawl the Web And Cache Content: An AI Agent Behavior Experiment

Sherin Thomas (boatbuilder)
[Illustration: an AI robot examines a website through a magnifying glass, seeing only a reflection, symbolizing cached content access]

When you ask ChatGPT or Perplexity about a website, what happens behind the scenes? Do these AI tools visit your site every single time, or are they pulling from some cache? To find out, we ran an experiment, and it revealed some surprising truths about how these AI agents actually behave when crawling and retrieving web content. Note that this study looks at the crawling and traffic behavior of AI agents and bots. Human visitors arriving from LLMs like ChatGPT can already be identified via referral headers or UTM parameters; that is not what we are discussing here.

What we found was that both ChatGPT and Perplexity frequently cite websites without actually visiting them, relying instead on cached content. Even more intriguing, they sometimes masquerade as regular browsers rather than identifying themselves as AI bots.

The Experiment: A fake e-commerce store

To understand AI crawling behavior without interference from real-world variables, we created an e-commerce website populated entirely with invented product names: unique tokens that nobody would ever search for organically. Think of products with names so unusual that they couldn't possibly exist elsewhere on the internet.
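The key property of those product names is that they are effectively unguessable, so any query or crawl mentioning them must trace back to our site. A minimal sketch of how such tokens could be generated (the actual names and the `zorv` prefix here are hypothetical, not the ones we used):

```python
import secrets
import string

def make_product_token(prefix="zorv", n_chars=10):
    """Generate a product name that is effectively unguessable and
    unsearchable, so any hit on it must come from our experiment site."""
    alphabet = string.ascii_lowercase
    suffix = "".join(secrets.choice(alphabet) for _ in range(n_chars))
    return f"{prefix}-{suffix}"

# Produces names like "zorv-qkxwmelpud": 26**10 possibilities per prefix,
# so accidental collisions with real products are vanishingly unlikely.
print(make_product_token())
```

Using `secrets` rather than `random` is a small hedge: it guarantees the tokens aren't reproducible from a seed someone else might stumble on.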

The site was then submitted to Google Search Console and Bing Webmaster Tools before entering a quiet period of approximately one month. During this time, no intentional visits occurred, creating a controlled environment where any subsequent traffic could be directly attributed to AI agent activity.

Discovery Patterns: How AI Agents Find New Content

We found an interesting disconnect between traditional search engine indexing and AI agent discovery. While both Google and Bing successfully indexed the experimental site, Perplexity initially couldn't locate it, despite unique tokens that should have made discovery straightforward. We believe Perplexity doesn't rely directly on the Google or Bing indexes for content discovery; instead, it appears to maintain its own internal index or use a different discovery mechanism altogether. ChatGPT, in contrast, evidently does lean on existing search engines (more than one), which let it find and access the experimental URL on the first attempt.

Can We Just Track LLMs With User Agent String?

So, the first attempt: can we just track all LLM traffic by looking at the user-agent string? I mean, they should say who they are when they visit, right? Right? It turns out they don't. Perplexity exhibits two distinct crawling patterns. When initially provided with a direct URL, it crawls using its official bot-branded user agent:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

However, subsequent crawls within the same conversation session used a generic browser user agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36

It is evident that Perplexity employs different crawling strategies depending on the context and timing of requests. The initial bot-branded crawl likely serves content discovery and indexing, while subsequent generic browser crawls may be used for content updates or verification.

ChatGPT follows the same pattern, but with even more user-agent strings:

  • Bot-branded crawling for initial discovery:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot

  • Mobile Safari simulation for follow-up requests:

Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1

  • When "agent" mode is enabled, it simulates a generic Chrome browser:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36

The use of different user agents is primarily for evading blockers. Per OpenAI support: OpenAI intentionally uses common browser user agents for ChatGPT to avoid being blocked or flagged by website security systems that target unknown or automated traffic. Mimicking a regular browser (like Chrome on Mac) helps ensure smooth access without triggering additional security measures. Additionally, OpenAI does not provide dedicated IP ranges for ChatGPT, as the service operates across dynamic, cloud-based infrastructure that can change over time.
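Given all of the above, the best a server-side filter can do with user agents alone is catch the self-identified crawls. A minimal sketch, matching only the two bot tokens observed in this experiment (the follow-up Chrome and Mobile Safari requests are, by design, indistinguishable from human traffic by UA):

```python
# Bot-branded substrings observed in this experiment. This catches only
# self-identified crawls; the mimicked-browser follow-ups slip through.
AI_BOT_TOKENS = ("PerplexityBot", "ChatGPT-User")

def classify_user_agent(ua: str) -> str:
    """Return the bot name if the UA self-identifies, else 'unknown/browser'."""
    for token in AI_BOT_TOKENS:
        if token in ua:
            return token
    return "unknown/browser"

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
      "PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)")
print(classify_user_agent(ua))  # -> PerplexityBot
```

Running this over an access log will give you a lower bound on AI crawler traffic, never the full picture.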

But maybe there is a way

So, how can we reliably identify agent or bot traffic? For ChatGPT, OpenAI provides a documented method (not yet tested in our setup; we plan to update this). Genuine ChatGPT agent visits include signature-based HTTP headers: specifically, a Signature-Agent header set to "https://chatgpt.com" (including the quotes) and message signatures per RFC 9421. Cloudflare users can automatically allowlist this verified bot, while other platforms can validate the signature using OpenAI's public key. This makes it possible to clearly distinguish ChatGPT agent traffic from generic bots or browsers.
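As a first pass before doing full cryptographic validation, you can check the Signature-Agent header itself. To be clear, this sketch is only a pre-filter: a spoofer can set this header, so real verification means validating the RFC 9421 message signature (the Signature / Signature-Input headers) against OpenAI's published key, which is beyond this snippet:

```python
def looks_like_chatgpt_agent(headers: dict) -> bool:
    """First-pass filter for ChatGPT agent traffic.

    NOT cryptographic verification: anyone can set this header.
    Full verification requires validating the RFC 9421 message
    signature against OpenAI's public key.
    """
    # Per the documented behavior, the value includes the double quotes.
    return headers.get("Signature-Agent") == '"https://chatgpt.com"'

print(looks_like_chatgpt_agent({"Signature-Agent": '"https://chatgpt.com"'}))  # True
```

Requests that pass this check are candidates for signature validation; everything else can be classified by the user-agent heuristics above.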

Ok. It's All Good then? Can We Reliably Track LLM Visits?

Sorry to be the bearer of bad news: not really. Perhaps the most significant finding from this experiment is the cache-first behavior exhibited by both AI systems. On multiple occasions, both Perplexity and ChatGPT cited our fake store as a source without generating any corresponding server hits. This shows that both systems maintain substantial content caches.

Entering Cache

The implications are profound: when an AI system cites your website as a source, it may not be accessing your current content at all. Instead, it could be working with cached versions that are minutes, hours, or even days old. Traditional search engines cache content too, but there a human still has to click the blue link to see the content, and that click registers a visit in your analytics. AI agents and LLMs, by contrast, answer the user's question straight from their cache, and the website owner never hears about it. So unless LLM providers ship something like a search console, we have no reliable way to measure this traffic volume.

The experimental observations also reveal interesting patterns in cache refresh behavior. After approximately 40 minutes of inactivity, new queries to ChatGPT did trigger fresh crawls of the experimental site. However, subsequent related questions within that window were answered from the newly cached content without additional server hits. This suggests that AI systems employ time-based cache invalidation rather than request-based refreshing. The practical implication is that the visit count your analytics tool reports (even on the server side) can be off by a few orders of magnitude.

The caching also appears to be global rather than user-specific. Cache expiry doesn't seem to be tied to particular user accounts, suggesting that these systems maintain shared content repositories that serve many users. That approach likely improves response times and reduces server load, but it also means content freshness is managed at the system level rather than per user.

Cache on URL

We also have reason to believe that caches are keyed to individual URLs. See the exchange below that I had with ChatGPT. Note that this came after a few earlier exchanges about the store and product name.

Q1: “What do you know about <product name>?”

Observation: no server hits. ChatGPT recognized the site's existence but wasn't confident about the reference. The answer cited my website, but also other websites where remotely similarly named products exist.

Q2: “I am talking about the <store name> one. I need to know what merchandise is available for <product reference> and how much they cost.”

Observation: no server hits. The answer said it didn't know about any merchandise. The conversation history showed /shop-url, but the answer didn't include the merchandise details from that page, probably because it only considered / when answering.

Q3: “Are you sure? I know there are some in the shop – </shop url directly presented>”

Observation: with the direct URL, ChatGPT returned the correct information, again without a server hit. In a later run (about 40 minutes later, presumably after cache invalidation), the same query did produce a server hit.

Many answers appear to be served from a global cache, with periodic refreshes that generate bot-branded or generic browser UA hits. I call this a global cache because the expiry mechanism doesn't seem to be associated with any particular user account. My guess is that they simply keep the last n URLs hit in the cache, where n is a very large number.
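That "last n URLs" guess is essentially an LRU (least-recently-used) cache keyed by URL. A sketch under that assumption, purely to make the hypothesis concrete:

```python
from collections import OrderedDict

class LRUUrlCache:
    """Sketch of the hypothesized shared 'last n URLs' cache.

    Keyed per URL (matching the per-URL behavior we observed) and
    shared across users; n is assumed to be very large in practice.
    """

    def __init__(self, max_urls=1_000_000):
        self.max_urls = max_urls
        self._store = OrderedDict()  # url -> content, oldest first

    def put(self, url, content):
        self._store[url] = content
        self._store.move_to_end(url)          # most recently hit
        if len(self._store) > self.max_urls:
            self._store.popitem(last=False)   # evict least recently hit

    def get(self, url):
        if url in self._store:
            self._store.move_to_end(url)      # a read also counts as a hit
            return self._store[url]
        return None
```

Keying by exact URL would also explain Q2 above: with only / in the cache, /shop-url is simply a miss until the user supplies it directly.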

Concluding

When we tested how AI tools like ChatGPT and Perplexity interact with websites, we discovered that these large language models often cite content without actually visiting the site in real time, relying heavily on cached versions instead. Our experiment with a uniquely named fake e-commerce store showed that both tools use a mix of bot-branded and generic browser user agents, making simple tracking by user-agent string unreliable. ChatGPT does offer a signature-based verification method for its agent visits, but actual server hits remain scarce because answers are frequently generated from cached content in globally shared repositories. This means site analytics may undercount "visits" by orders of magnitude, as LLMs serve fresh answers without freshly accessing web pages each time. Unless AI platforms provide analytics tools similar to search consoles, truly accurate tracking of LLM-driven traffic will remain out of reach.
