Level 3

Question

LLMs Crawling Origin /content/ URLs: An LLMO Visibility Issue

May 13, 2026
4 replies
89 views

Large Language Models (LLMs) crawl and ingest origin-level URLs like /content/yoursite even when those URLs are rewritten or hidden from human users. This is happening especially for home page URLs.

How can we identify where this is coming from, and how can we minimize it?

avesh_narang

Level 4

LLMs discover origin URLs through indirect exposure and pattern-based crawling. Even if /content/.. is hidden from users, it often leaks via the HTML source (canonical tags, etc.), APIs, JavaScript, or historical indexing. To identify the source, look for requests from known AI crawlers such as PerplexityBot and ClaudeBot. To minimize it, ensure that only the CDN can reach the origin endpoints, and sanitize the output by removing origin paths from canonical tags, APIs, and so on. Additionally, you can disallow these paths in robots.txt.

Thanks

SubbaraoGa1

Adobe Employee

@gvaem

The likely cause is that the origin-style URL (/content/...) is still discoverable and/or publicly reachable. Even if human users only see rewritten URLs, crawlers can still request /content/... directly if it is exposed through page metadata, sitemaps, canonical tags, redirects, structured data, or prior indexing. See: SEO and URL Management Best Practices for Adobe Experience Manager as a Cloud Service.

To identify the source, we recommend checking CDN / Dispatcher logs for requests to /content/... and reviewing user agent, referrer, IP/network, and status code patterns. See: Understand Cloud Service Content Requests.

To minimize this, the most effective actions are:

Block or restrict direct public access to /content/... paths
Ensure canonical tags, sitemap entries, hreflang, and structured data use only the public rewritten URL
Use CDN traffic rules to log first, then block or rate-limit suspicious requests
Use robots.txt / X-Robots-Tag only as secondary controls, not as the primary protection layer
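
As a rough way to audit the second point, a rendered page can be scanned for origin paths in canonical tags, hreflang alternates, and JSON-LD. The sketch below is only an illustration: the /content/ prefix, the function name find_origin_leaks, and the choice of tags to inspect are assumptions, and it does not cover sitemaps or paths embedded in JavaScript.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlsplit

ORIGIN_PREFIX = "/content/"  # assumption: origin paths start with this prefix


def _strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def _is_origin_url(url):
    """True if the URL's path component starts with the origin prefix."""
    try:
        return urlsplit(url).path.startswith(ORIGIN_PREFIX)
    except ValueError:
        return False  # not parseable as a URL


class _LeakScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.leaks = []      # list of (location, value) tuples
        self._jsonld = None  # text buffer while inside a JSON-LD <script>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        script_type = (attrs.get("type") or "").lower()
        if tag == "link" and rel in ("canonical", "alternate"):
            href = attrs.get("href") or ""
            if _is_origin_url(href):
                self.leaks.append((f"link rel={rel}", href))
        elif tag == "script" and script_type == "application/ld+json":
            self._jsonld = []

    def handle_data(self, data):
        if self._jsonld is not None:
            self._jsonld.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._jsonld is not None:
            text, self._jsonld = "".join(self._jsonld), None
            try:
                data = json.loads(text)
            except ValueError:
                return  # malformed JSON-LD is skipped in this sketch
            self.leaks.extend(
                ("json-ld", s) for s in _strings(data) if _is_origin_url(s)
            )


def find_origin_leaks(html):
    """Return (location, value) pairs where an origin path is exposed."""
    scanner = _LeakScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.leaks
```

Running find_origin_leaks over the HTML of the home page and a few other public pages should return an empty list once every exposed URL goes through the public mapping.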

chaudharynick

Level 4

Hi @gvaem

Even if your URL rewriting is working for the browser's address bar, the origin path is likely hidden in plain sight.

The Canonical Tag Leak: Check your <link rel="canonical">. If your SEO logic isn't properly using the ResourceResolver mapping, it might be outputting the full JCR path (e.g., /content/mysite/en/home.html) instead of the vanity URL.
Sitemaps: This is the #1 culprit. If your sitemap generator isn't configured to use externalized URLs, it provides a literal roadmap for crawlers to ingest your /content structure.
Schema.org / JSON-LD: Modern sites use structured data for SEO. If your metadata scripts are pulling the resource path directly from the JCR without passing it through a mapping service, the crawler sees the origin path in the script block.
AEM Link Checker Transformer: If the Link Checker isn't configured to "Rewrite All," it might leave certain paths (like those in JS variables or data-attributes) as internal paths.
Log Analysis: Search your CDN (CloudFront/Akamai) or Dispatcher access logs for specific AI User-Agents (e.g., GPTBot, CCBot, Claude-Web) hitting paths starting with /content. This will tell you exactly which pages are acting as the entry point.
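
The log analysis step could be sketched roughly as follows. This assumes Apache-style "combined" access log lines (CDN logs are often JSON and would need a different parser), and the function name ai_hits_on_origin and the agent list are illustrative; extend the list with whichever crawler tokens you care about.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, i.e.
# ... "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# User-agent tokens named in this thread; not an exhaustive list.
AI_AGENTS = ("GPTBot", "CCBot", "Claude-Web")


def ai_hits_on_origin(lines, agents=AI_AGENTS, prefix="/content"):
    """Count requests per (agent, path) where an AI crawler hit an origin path."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if not match or not match.group("path").startswith(prefix):
            continue
        user_agent = match.group("ua").lower()
        for agent in agents:
            if agent.lower() in user_agent:
                hits[(agent, match.group("path"))] += 1
                break
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python ai_hits.py < access.log
    for (agent, path), count in ai_hits_on_origin(sys.stdin).most_common(20):
        print(f"{count:6d}  {agent:12s}  {path}")
```

The paths with the highest counts are the first candidates to trace back to a leaking canonical tag, sitemap entry, or script block.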

Once you've identified the source, you need to apply a multi-layered defense.

A. Strict Dispatcher Filters

The most effective way is to ensure that /content/mysite is never accessible directly via a public request.

Update your filter.any to allow only the vanity URLs.
Specifically block access to the /content root for any request that doesn't have a valid internal header (if applicable).
Rule of thumb: If the user (or bot) doesn't need to see /content, the Dispatcher should return a 404 or 403.
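
Once the filters are in place, a quick external probe can confirm the rule of thumb above. This is a minimal sketch, not a Dispatcher configuration: the function names, the base URL, and the example paths are assumptions, and redirects are deliberately not followed so that a 301 from /content/... is reported as such.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx status is reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request to the given URL."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx responses end up here


def check_origin_blocked(base_url, paths, fetch=fetch_status):
    """Map each path to (status, blocked); blocked means a 403 or 404."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = (status, status in (403, 404))
    return results


if __name__ == "__main__":
    # Hypothetical host and paths; replace with your own public domain.
    report = check_origin_blocked(
        "https://www.example.com",
        ["/content/mysite/en/home.html", "/content/mysite/en.html"],
    )
    for path, (status, blocked) in report.items():
        print(f"{status}  {'blocked' if blocked else 'EXPOSED'}  {path}")
```

Run it from outside your network so the request goes through the CDN the same way a crawler's would.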

B. Use `X-Robots-Tag` at the CDN/Dispatcher Level

Instead of just relying on a robots.txt file (which bots can ignore), inject headers into the response. You can configure the Dispatcher to add an X-Robots-Tag: noindex, nofollow header specifically when the request path starts with /content.
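
In AEM this would be done in the Dispatcher/web server or CDN configuration, not in application code. Purely to illustrate the logic (add the header only when the request path starts with /content), here is a small Python WSGI middleware sketch; the name add_x_robots_tag is hypothetical.

```python
def add_x_robots_tag(app, prefix="/content"):
    """Wrap a WSGI app so responses for origin paths carry X-Robots-Tag."""

    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Only responses for paths under the origin prefix get the header.
            if environ.get("PATH_INFO", "").startswith(prefix):
                headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

The same condition (path prefix match, then header injection) is what the equivalent Dispatcher or CDN rule needs to express.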

C. Update `robots.txt` for AI Agents

While traditional SEO bots follow standard rules, you should explicitly target AI crawlers.

User-agent: GPTBot
Disallow: /content/

User-agent: CCBot
Disallow: /content/

User-agent: Claude-Web
Disallow: /content/
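
To sanity-check rules like these before deploying them, Python's standard urllib.robotparser can evaluate them locally. This is a small sketch: the helper name is_allowed and the example paths are assumptions, and it only shows how a compliant parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /content/

User-agent: CCBot
Disallow: /content/

User-agent: Claude-Web
Disallow: /content/
"""


def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed("GPTBot", "/content/mysite/en/home.html"))  # False
    print(is_allowed("GPTBot", "/en/home.html"))                 # True
```

Note that matching is per user-agent token: with the rules above, a crawler that identifies itself with a different token (for example ClaudeBot instead of Claude-Web) is not covered, so each token you want to exclude needs its own group or a User-agent: * rule.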

gvaem (Author)

Level 3

Thank you for the information @chaudharynick
