Wednesday, September 9, 2025

Setting Up Honeypots to Monitor AI Agent Crawling Behavior for SEO

Sherin Thomas (boatbuilder)
Illustration of a honey jar attracting AI icons like chatbot, search, eye, and speech bubble, with the text ‘Honeypot for AI SEO’ above.

To understand the evolution of Search and specifically AI-powered search, we conducted a series of experimental study to understand how modern AI agents interact with web content and what insights this behavior could reveal. This research emerged from our need to understand their potential impact on content discovery and SEO strategies. This particular blog post is about one of the attempt we did by setting up "honeypots" on an experimental website.

Background and Motivation

Traditional web crawling has been dominated by predictable search engine bots like Googlebot, which follow established patterns and protocols. However, the emergence of AI agents represents a fundamental shift in how automated systems interact with web content. These agents demonstrate sophisticated reasoning capabilities, can adapt their strategies based on content context, and employ complex query decomposition techniques known as "query fan-out".

Recent academic research has begun exploring AI agent detection in cybersecurity contexts, with studies like the LLM Agent Honeypot project documenting over 8 million interaction attempts and identifying potential AI-driven attacks. Our study builds on this foundation but focuses specifically on understanding AI search agent behavior for SEO and content optimization insights.

Market Context and Timing

The timing of our experiment proved significant. Google's search market share has declined to 89.83% as of August 2025, marking the first time below 90% since 2015. Meanwhile, the AI search engine market is projected to reach $50.88 billion by 2033, growing at 13.6% CAGR. This shift suggested that understanding AI agent behavior would become increasingly valuable for digital content strategies.

Experimental Design and Methodology

Technical Implementation

We implemented our experimental setup using a Next.js application with custom middleware designed to detect incoming AI agent requests. Our approach was built on top of the study we did on understanding user-agent string AI agents and bots uses when crawling. Because these agents and bots don't use any deterministically identifiable user-agent strings in the http request headers, we first made a request from both Perplexity (Comet assistant) and ChatGPT (with agent mode), and took the user-agent string these agents used at that point and changed our middleware code to return specially crafted content for these agents.

middleware.js code that returns custom text for AI Agents to redirect them to honeypot

When our system detected an AI agent, instead of serving standard webpage content, we returned specialized instructions directing the agent to a dedicated honeypot endpoint. This honeypot redirection content explained that the honeypot API was optimized for agent access and provided structured data more suitable for AI processing than traditional HTML scraping.

Data Verification

To ensure we captured real-time agent behavior rather than cached search results, we implemented different pricing data between our website display and honeypot API responses. This verification method was crucial for confirming that agents were accessing the honeypot API instead of the html data which may have cached at some layer.

Key Findings and Insights

We configured our API to require agent identification as a parameter, enabling precise behavior tracking and attribution. This data collection approach revealed patterns in query types, agent preferences, and interaction frequencies that would otherwise remain invisible to content creators and SEO professionals.

url params collected by the honeypot in json

Our experiments showed ChatGPT agents consistently requested user permission before making API calls. When presented with our honeypot instructions, ChatGPT would display the API information to the user and ask for explicit consent to make the API request. This behavior suggests a more conservative approach to external resource access, likely due to safety protocols implemented by OpenAI.

ChatGPT asking for user confirmation before proceeding with instructions from a webpage

In contrast, Perplexity agents executed API calls directly without user confirmation. When encountering our honeypot content, these agents immediately followed the provided instructions and made requests to our API endpoint. This more aggressive data collection approach aligns with Perplexity's real-time search integration model.

Fanout Query Discovery

By monitoring API requests, we were able to capture the actual sub-queries generated by AI agents, a process recognized in the SEO community as “fanout.” This means that when an AI agent is assigned a complex user query, it breaks it down into multiple more targeted component queries to fulfill the broad search intent. While we can’t fully guarantee that the queries triggered by a "web search" tool are always identical to those sent to our honeypot API, our evidence and intuition both suggest that the underlying fanout logic is consistent regardless of the endpoint. The prompt guidance provided to agents for decomposing queries is very likely the same process used when they decide which calls to make to any data source, including our API. We plan to validate this further by applying our setup to a live site with substantial AI search engine traffic, to better align our observations with real-world usage.

Attribution Accuracy

Unlike traditional analytics, our system provides highly granular attribution, pinpointing not only which AI agents have accessed our content, but also how and when each one engaged with it. Because our API endpoint can require agents to submit identifying information as part of the request, we gain precise visibility into the “who” and “what” behind each interaction.

Error message to Agents those doesn't identify themselves to the honeypot

This level of control means we can design analytics tailored to the needs of site owners, for example, by collecting agent names or other metadata provided via system/user prompt. It’s important to recognize, however, that this mechanism also surfaces privacy and security concerns: in theory, a malicious API might elicit sensitive agent-held data. Fortunately, frameworks such as OpenAI’s Model Spec are being developed specifically to guide agents and platforms in building robust safeguards for user and agent privacy, while enabling responsible attribution.

Custom Content For Agents

Perhaps the most transformative result from our experiment is the demonstration that we can serve custom, optimized content to AI agents based on their programmatic queries. For example,

HTML view for humans with all the textual information that is unnecessary for an AI agent

The HTML version includes marketing language, visual styling, and contextual information designed to persuade and engage human visitors. This format requires AI agents to parse through unnecessary text and interpret ambiguous elements like "limited-time" offers.

Structured data for AI Agents served from the API that is crafted only for Agents

The JSON response delivers only the essential data points in a structured format that AI agents can instantly process without interpretation. This eliminates parsing overhead and reduces the risk of hallucination or missed information.

This tailored approach allows site owners to present the most relevant information in a format that’s both easy for agents to process and free from ambiguity, greatly reducing the risk of hallucination or missed details that can occur when agents try to interpret unstructured HTML. For example, by supplying agents with structured JSON or targeted data, we minimize guesswork and empower better, more accurate AI-powered results for end users. The potential here is significant: you can maintain separate, highly-tuned “information delivery” pipelines for human readers and for AI agents, each one designed for optimal performance depending on the consumer.

Market Timing and Future Implications

While AI agents currently represent a small fraction of total web traffic, our experiment was conducted during a period of significant market evolution. The AI search engine market's projected growth and the decline in traditional search dominance suggest that understanding agent behavior will become increasingly valuable.

The behavioral differences we observed between ChatGPT and Perplexity agents highlight the importance of understanding each platform's unique approach to content access and user interaction. These differences have implications for how content creators might optimize for different AI search platforms.

Conclusions

Our experimental study revealed significant insights into how AI agents interact with web content and process information for search applications. The distinct behavioral patterns we observed between different AI agents, the effectiveness of instructions in webpages, content strategies that consider agent friendly data formats instead of HTML, and the value of real-time behavior monitoring all suggest important implications for the future of content optimization and search strategy.

The ability to capture fanout queries, track agent attribution, and understand content processing preferences provides a foundation for adapting to the evolving search landscape. While AI agents currently represent a small portion of web traffic, understanding and optimizing for these interactions will become increasingly important as agentic search continues to grow.

This research represents an early exploration of AI agent behavior in search contexts, demonstrating how monitoring techniques can be adapted for understanding AI interactions with web content. As the search ecosystem continues to evolve, studies like this will be crucial for helping content creators and SEO professionals adapt to these new technological realities.

Stay ahead of the AI search revolution

LLM-based search is transforming user behavior rapidly. Subscribe to get exclusive insights from our experiments, discoveries, and strategies that keep you competitive in this evolving landscape.

No spam. Unsubscribe anytime. Updates only when we have valuable insights to share.