Thursday, August 8, 2025

How LLMs Like ChatGPT Really Crawl the Web And Cache Content: An AI Agent Behavior Experiment

Sherin Thomas (boatbuilder)
[Illustration: an AI robot examines a website through a magnifying glass, seeing only a reflection, symbolizing cached content access]

When you ask ChatGPT or Perplexity about a website, what happens behind the scenes? Do these AI tools visit your site every single time, or are they pulling from some cache? To find out, we ran an experiment, and it revealed some surprising truths about how these AI agents actually behave when crawling and retrieving web content. Note that this study looks at the crawling and traffic behavior of AI agents and bots. Human visitors arriving from LLMs like ChatGPT can already be identified via referral headers or UTM parameters; that is not what we are discussing here.

What we found was that both ChatGPT and Perplexity frequently cite websites without actually visiting them, relying instead on cached content. Even more intriguing, they sometimes masquerade as regular browsers rather than identifying themselves as AI bots.

The Experiment: A fake e-commerce store

To understand AI crawling behavior without interference from real-world variables, we created an e-commerce website populated entirely with invented product names: unique tokens that nobody would ever search for organically. Think of products with names so unusual that they couldn't possibly exist elsewhere on the internet.
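The key property of those product names is that they are effectively unguessable, so any query or crawl mentioning them must trace back to our site. A minimal sketch of how such tokens could be generated (the actual names and the `zorv` prefix here are hypothetical, not the ones we used):

```python
import secrets
import string

def make_product_token(prefix="zorv", n_chars=10):
    """Generate a product name that is effectively unguessable and
    unsearchable, so any hit on it must come from our experiment site."""
    alphabet = string.ascii_lowercase
    suffix = "".join(secrets.choice(alphabet) for _ in range(n_chars))
    return f"{prefix}-{suffix}"

# Produces names like "zorv-qkxwmelpud": 26**10 possibilities per prefix,
# so accidental collisions with real products are vanishingly unlikely.
print(make_product_token())
```

Using `secrets` rather than `random` is a small hedge: it guarantees the tokens aren't reproducible from a seed someone else might stumble on.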

The site was then submitted to Google Search Console and Bing Webmaster Tools before entering a quiet period of approximately one month. During this time, no intentional visits occurred, creating a controlled environment where any subsequent traffic could be directly attributed to AI agent activity.

Discovery Patterns: How AI Agents Find New Content

We found an interesting disconnect between traditional search engine indexing and AI agent discovery. While both Google and Bing successfully indexed the experimental site, Perplexity initially couldn't locate it, despite unique tokens that should have made discovery straightforward. We believe Perplexity doesn't rely directly on the Google or Bing indexes for content discovery; instead, it appears to maintain its own internal index or use a different discovery mechanism altogether. ChatGPT, in contrast, evidently does lean on existing search engines (more than one), which let it find and access the experimental URL on the first attempt.

Can We Just Track LLMs With User Agent String?

So, the first attempt: can we just track all LLM traffic by looking at the user-agent string? I mean, they should say who they are when they visit, right? Right? It turns out they don't. Perplexity exhibits two distinct crawling patterns. When initially provided with a direct URL, it crawls using its official bot-branded user agent:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

However, subsequent crawls within the same conversation session used a generic browser user agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36

It is evident that Perplexity employs different crawling strategies depending on the context and timing of requests. The initial bot-branded crawl likely serves content discovery and indexing, while subsequent generic browser crawls may be used for content updates or verification.

ChatGPT follows the same pattern, but with even more user-agent strings:

  • Bot-branded crawling for initial discovery:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot

  • Mobile Safari simulation for follow-up requests:

Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1

  • When "agent" mode is enabled, it simulates a generic Chrome browser:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36

The use of different user agents is primarily for evading blockers. Per OpenAI support: OpenAI intentionally uses common browser user agents for ChatGPT to avoid being blocked or flagged by website security systems that target unknown or automated traffic. Mimicking a regular browser (like Chrome on Mac) helps ensure smooth access without triggering additional security measures. Additionally, OpenAI does not provide dedicated IP ranges for ChatGPT, as the service operates across dynamic, cloud-based infrastructure that can change over time.
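Given all of the above, the best a server-side filter can do with user agents alone is catch the self-identified crawls. A minimal sketch, matching only the two bot tokens observed in this experiment (the follow-up Chrome and Mobile Safari requests are, by design, indistinguishable from human traffic by UA):

```python
# Bot-branded substrings observed in this experiment. This catches only
# self-identified crawls; the mimicked-browser follow-ups slip through.
AI_BOT_TOKENS = ("PerplexityBot", "ChatGPT-User")

def classify_user_agent(ua: str) -> str:
    """Return the bot name if the UA self-identifies, else 'unknown/browser'."""
    for token in AI_BOT_TOKENS:
        if token in ua:
            return token
    return "unknown/browser"

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
      "PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)")
print(classify_user_agent(ua))  # -> PerplexityBot
```

Running this over an access log will give you a lower bound on AI crawler traffic, never the full picture.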

But maybe there is a way

So, how can we reliably identify agent or bot traffic? For ChatGPT, OpenAI provides a documented method (not yet tested in our setup; we plan to update this). Genuine ChatGPT agent visits include signature-based HTTP headers: specifically, a Signature-Agent header set to "https://chatgpt.com" (including the quotes) and message signatures per RFC 9421. Cloudflare users can automatically allowlist this verified bot, while other platforms can validate the signature using OpenAI's public key. This makes it possible to clearly distinguish ChatGPT agent traffic from generic bots or browsers.
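As a first pass before doing full cryptographic validation, you can check the Signature-Agent header itself. To be clear, this sketch is only a pre-filter: a spoofer can set this header, so real verification means validating the RFC 9421 message signature (the Signature / Signature-Input headers) against OpenAI's published key, which is beyond this snippet:

```python
def looks_like_chatgpt_agent(headers: dict) -> bool:
    """First-pass filter for ChatGPT agent traffic.

    NOT cryptographic verification: anyone can set this header.
    Full verification requires validating the RFC 9421 message
    signature against OpenAI's public key.
    """
    # Per the documented behavior, the value includes the double quotes.
    return headers.get("Signature-Agent") == '"https://chatgpt.com"'

print(looks_like_chatgpt_agent({"Signature-Agent": '"https://chatgpt.com"'}))  # True
```

Requests that pass this check are candidates for signature validation; everything else can be classified by the user-agent heuristics above.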

Ok. It's All Good then? Can We Reliably Track LLM Visits?

Sorry to be the bearer of bad news: not really. Perhaps the most significant finding from this experiment is the cache-first behavior exhibited by both AI systems. On multiple occasions, both Perplexity and ChatGPT cited our fake store as a source without generating any corresponding server hits. This shows that both systems maintain substantial content caches.

Entering Cache

The implications are profound: when an AI system cites your website as a source, it may not be accessing your current content at all. Instead, it could be working with cached versions that are minutes, hours, or even days old. Traditional search engines cache content too, but there a human still has to click the blue link to see the content, and that click registers a visit in your analytics. AI agents and LLMs, by contrast, answer the user's question straight from their cache, and the website owner never hears about it. So unless LLM providers ship something like a search console, we have no reliable way to measure this traffic volume.

The experimental observations also reveal interesting patterns in cache refresh behavior. After approximately 40 minutes of inactivity, new queries to ChatGPT did trigger fresh crawls of the experimental site. However, subsequent related questions within that window were answered from the newly cached content without additional server hits. This suggests that AI systems employ time-based cache invalidation rather than request-based refreshing. The practical implication is that the visit count your analytics tool reports (even on the server side) can be off by a few orders of magnitude.

The caching also appears to be global rather than user-specific. Cache expiry doesn't seem to be tied to particular user accounts, suggesting that these systems maintain shared content repositories that serve many users. That approach likely improves response times and reduces server load, but it also means content freshness is managed at the system level rather than per user.

Cache on URL

We also have reason to believe that caches are keyed to individual URLs. See the exchange below that I had with ChatGPT. Note that this came after a few earlier exchanges about the store and product name.

Q1: “What do you know about <product name>?”

Observation: no server hits. ChatGPT recognized the site's existence but wasn't confident about the reference. The answer cited my website, but also other websites where remotely similarly named products exist.

Q2: “I am talking about the <store name> one. I need to know what merchandise is available for <product reference> and how much they cost.”

Observation: no server hits. The answer said it didn't know about any merchandise. The conversation history showed /shop-url, but the answer didn't include the merchandise details from that page, probably because it only considered / when answering.

Q3: “Are you sure? I know there are some in the shop – </shop url directly presented>”

Observation: with the direct URL, ChatGPT returned the correct information, again without a server hit. In a later run (about 40 minutes later, presumably after cache invalidation), the same query did produce a server hit.

Many answers appear to be served from a global cache, with periodic refreshes that generate bot-branded or generic browser UA hits. I call this a global cache because the expiry mechanism doesn't seem to be associated with any particular user account. My guess is that they simply keep the last n URLs hit in the cache, where n is a very large number.
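That "last n URLs" guess is essentially an LRU (least-recently-used) cache keyed by URL. A sketch under that assumption, purely to make the hypothesis concrete:

```python
from collections import OrderedDict

class LRUUrlCache:
    """Sketch of the hypothesized shared 'last n URLs' cache.

    Keyed per URL (matching the per-URL behavior we observed) and
    shared across users; n is assumed to be very large in practice.
    """

    def __init__(self, max_urls=1_000_000):
        self.max_urls = max_urls
        self._store = OrderedDict()  # url -> content, oldest first

    def put(self, url, content):
        self._store[url] = content
        self._store.move_to_end(url)          # most recently hit
        if len(self._store) > self.max_urls:
            self._store.popitem(last=False)   # evict least recently hit

    def get(self, url):
        if url in self._store:
            self._store.move_to_end(url)      # a read also counts as a hit
            return self._store[url]
        return None
```

Keying by exact URL would also explain Q2 above: with only / in the cache, /shop-url is simply a miss until the user supplies it directly.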

Concluding

When we tested how AI tools like ChatGPT and Perplexity interact with websites, we discovered that these large language models often cite content without actually visiting the site in real time, relying heavily on cached versions instead. Our experiment with a uniquely named fake e-commerce store showed that both tools use a mix of bot-branded and generic browser user agents, making simple tracking by user-agent string unreliable. ChatGPT does offer a signature-based verification method for its agent visits, but actual server hits remain scarce because answers are frequently generated from cached content in globally shared repositories. This means site analytics may undercount "visits" by orders of magnitude, as LLMs serve fresh answers without freshly accessing web pages each time. Unless AI platforms provide analytics tools similar to search consoles, truly accurate tracking of LLM-driven traffic will remain out of reach.
