Level 3
May 13, 2026
Question

LLMs Crawling Origin /content/ URLs: An LLMO Visibility Issue

Large language models (LLMs) crawl and ingest origin-level URLs such as /content/yoursite even when those URLs are rewritten or hidden from human users. This is happening especially for home page URLs.

How can we identify where these requests are coming from, and how can we minimize them?

3 replies

avesh_narang
Level 4
May 14, 2026

LLMs discover origin URLs through indirect exposure and pattern-based crawling. Even if /content/... is hidden from users, it often leaks via the HTML source (canonical tags, etc.), APIs, JavaScript, or historical indexing. To identify the source, look in your logs for known AI crawler user agents such as PerplexityBot and ClaudeBot. To minimize it, ensure that only the CDN can reach the origin endpoints, so that nothing can bypass the CDN. Also sanitize your output by removing origin paths from canonical tags, APIs, etc. Additionally, you can disallow these paths in robots.txt.

Thanks 

Adobe Employee
May 14, 2026

@gvaem 

The likely cause is that the origin-style URL (/content/...) is still discoverable and/or publicly reachable. Even if human users only see rewritten URLs, crawlers can still request /content/... directly if it is exposed through page metadata, sitemaps, canonical tags, redirects, structured data, or prior indexing. See: SEO and URL Management Best Practices for Adobe Experience Manager as a Cloud Service

To identify the source, we recommend checking CDN / Dispatcher logs for requests to /content/... and reviewing user agent, referrer, IP/network, and status code patterns. See: Understand Cloud Service Content Requests
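
As a rough illustration of that log review, the Python sketch below counts requests to /content/ paths made by AI crawlers, grouped by user agent, path, status code, and referrer. It assumes access logs in Apache "combined" format, which may not match your CDN or Dispatcher log layout, and the list of user-agent markers is an illustrative assumption rather than an official list.

```python
import re
from collections import Counter

# Assumed substrings that identify AI crawlers; extend for your own traffic.
AI_AGENT_MARKERS = ("GPTBot", "CCBot", "ClaudeBot", "Claude-Web", "PerplexityBot")

# Apache "combined" log format (an assumption about how your logs look).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_origin_hits(lines, prefix="/content/"):
    """Count AI-crawler requests whose path starts with the origin prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or not match["path"].startswith(prefix):
            continue  # unparseable line, or not an origin-style path
        agent = match["agent"].lower()
        marker = next((m for m in AI_AGENT_MARKERS if m.lower() in agent), None)
        if marker is None:
            continue  # not one of the crawlers we are tracking
        hits[(marker, match["path"], match["status"], match["referrer"])] += 1
    return hits
```

Calling `summarize_origin_hits(open("access.log"))` (the file name is hypothetical) and printing `hits.most_common(20)` should show which origin paths are requested most often and whether they return 200, which suggests they are publicly reachable.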

To minimize this, the most effective actions are:

  • Block or restrict direct public access to /content/... paths
  • Ensure canonical tags, sitemap entries, hreflang, and structured data use only the public rewritten URL
  • Use CDN traffic rules to log first, then block or rate-limit suspicious requests
  • Use robots.txt / X-Robots-Tag only as secondary controls, not as the primary protection layer

chaudharynick
Level 4
May 15, 2026

Hi @gvaem

Even if your URL rewriting is working for the browser's address bar, the origin path is likely hidden in plain sight.

  • The Canonical Tag Leak: Check your <link rel="canonical">. If your SEO logic isn't properly using the ResourceResolver mapping, it might be outputting the full JCR path (e.g., /content/mysite/en/home.html) instead of the vanity URL.

  • Sitemaps: This is the #1 culprit. If your sitemap generator isn't configured to use externalized URLs, it provides a literal roadmap for crawlers to ingest your /content structure.

  • Schema.org / JSON-LD: Modern sites use structured data for SEO. If your metadata scripts are pulling the resource path directly from the JCR without passing it through a mapping service, the crawler sees the origin path in the script block.

  • AEM Link Checker Transformer: If the Link Checker isn't configured to "Rewrite All," it might leave certain paths (like those in JS variables or data-attributes) as internal paths.

  • Log Analysis: Search your CDN (CloudFront/Akamai) or Dispatcher access logs for specific AI User-Agents (e.g., GPTBot, CCBot, Claude-Web) hitting paths starting with /content. This will tell you exactly which pages are acting as the entry point.
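
The canonical, JSON-LD, and attribute checks in the list above can be partly automated. The following is a minimal Python sketch, not an AEM tool: it parses a rendered page with the standard-library HTML parser and reports every attribute value or JSON-LD block that contains a /content/ path. The class and function names are hypothetical, and the regular expression will also flag legitimate public URLs that happen to contain /content/, so treat the output as a list to review.

```python
import re
from html.parser import HTMLParser

# Matches an origin-style path up to the next quote, bracket, or whitespace.
ORIGIN_PATH = re.compile(r'/content/[^\s"\'<>]*')


class OriginLeakFinder(HTMLParser):
    """Collects (location, value) pairs where a /content/ path appears."""

    def __init__(self):
        super().__init__()
        self.leaks = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script" and attr_map.get("type") == "application/ld+json":
            self._in_json_ld = True
        rel = attr_map.get("rel")
        for name, value in attrs:
            if value and ORIGIN_PATH.search(value):
                # Label <link> tags by rel so canonical/alternate leaks stand out.
                where = f"{tag}[rel={rel}]" if tag == "link" and rel else f"{tag}[{name}]"
                self.leaks.append((where, value))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False

    def handle_data(self, data):
        # Only inspect text inside JSON-LD blocks, not visible page copy.
        if self._in_json_ld:
            for path in ORIGIN_PATH.findall(data):
                self.leaks.append(("json-ld", path))


def find_origin_leaks(html_text):
    finder = OriginLeakFinder()
    finder.feed(html_text)
    finder.close()
    return finder.leaks
```

Running `find_origin_leaks` over the HTML the CDN actually serves (rather than the author-side preview) is what matters here, since that is what a crawler receives.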

Once you've identified the source, you need to apply a multi-layered defense.

A. Strict Dispatcher Filters

The most effective way is to ensure that /content/mysite is never accessible directly via a public request.

  • Update your filter.any to allow only the vanity URLs.

  • Specifically block access to the /content root for any request that doesn't have a valid internal header (if applicable).

  • Rule of thumb: If the user (or bot) doesn't need to see /content, the Dispatcher should return a 404 or 403.
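
To confirm that the filters behave as intended, you can probe the public hostname from outside. The Python sketch below is one way to do that, assuming the rule of thumb above (403 or 404 means blocked). The function names and the probe user agent are made up for illustration, and the fetch function is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent="origin-path-probe/0.1", timeout=10):
    """Return the HTTP status for url. Redirects are followed, so this is
    the status of the final response."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses arrive as exceptions


def is_blocked(status):
    """Blocked means the Dispatcher/CDN answered 403 or 404."""
    return status in (403, 404)


def audit_origin_paths(base_url, paths, fetch=fetch_status):
    """Map each path to (status, blocked) for the given public base URL."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = (status, is_blocked(status))
    return results
```

For example, `audit_origin_paths("https://www.example.com", ["/content/mysite/en/home.html"])` (hostname hypothetical) should report the path as blocked once the filter is in place. Only probe sites you are responsible for.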

B. Use X-Robots-Tag at the CDN/Dispatcher Level

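Instead of relying only on a robots.txt file (which bots can ignore), inject headers into the response. You can configure the Dispatcher to add an X-Robots-Tag: noindex, nofollow header specifically when the request path starts with /content. Keep in mind that this header is also advisory, and a crawler only sees it on URLs it actually fetches, so it complements blocking rather than replacing it.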
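
Once the header is configured, a small check like the following Python sketch can confirm that a response for a /content path carries both directives. It is only a verification helper with hypothetical function names; it takes a plain mapping of response headers and does not handle the user-agent-prefixed form of the header (for example a value scoped to a single bot).

```python
def robots_directives(header_value):
    """Split an X-Robots-Tag value into a set of lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def has_noindex_nofollow(headers):
    """True if the headers mapping carries X-Robots-Tag with both directives."""
    value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    return {"noindex", "nofollow"} <= robots_directives(value)
```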

C. Update robots.txt for AI Agents

While traditional SEO bots follow standard rules, you should explicitly target AI crawlers.

User-agent: GPTBot
Disallow: /content/

User-agent: CCBot
Disallow: /content/

User-agent: Claude-Web
Disallow: /content/
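
Before deploying rules like these, it can help to test which agents they actually cover. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `blocked_agents` is hypothetical, and the parser's user-agent matching is only an approximation of how any particular crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules from this post, verbatim.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /content/

User-agent: CCBot
Disallow: /content/

User-agent: Claude-Web
Disallow: /content/
"""


def blocked_agents(robots_txt, path, agents):
    """Return {agent: True if robots.txt disallows path for that agent}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}
```

With these rules, `blocked_agents(ROBOTS_TXT, "/content/mysite/en/home.html", ["GPTBot", "ClaudeBot"])` reports GPTBot as blocked but ClaudeBot as not blocked, because no group names that agent. Any crawler you care about, such as the PerplexityBot and ClaudeBot agents mentioned earlier in this thread, needs its own User-agent group or a wildcard group.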