Level 3
May 13, 2026
Question

LLMs Crawling Origin /content/ URLs: An LLMO Visibility Issue


Large Language Models (LLMs) crawl and ingest origin-level URLs like /content/yoursite even when those URLs are rewritten or hidden from human users. This is happening especially for home page URLs.

How can we identify where this exposure is coming from, and how can we minimize it?

1 reply

avesh_narang
Level 4
May 14, 2026

LLMs discover origin URLs through indirect exposure and pattern-based crawling. Even if /content/.. is hidden from users, it often leaks through the HTML source (canonical tags, etc.), APIs, JavaScript, or historical indexing. To identify where this is happening, look for requests from known AI crawlers such as PerplexityBot and ClaudeBot. To minimize it, ensure that only the CDN can reach the origin endpoints, so that nothing bypasses the CDN. Also sanitize your output by removing origin paths from canonical tags, API responses, and similar places. Additionally, you can disallow these paths in robots.txt.
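
As a rough illustration of the identification and robots.txt steps, here is a minimal Python sketch. It rests on several assumptions that are not from your setup: access logs in the common "combined" format, the two crawler names mentioned above as user-agent substrings, and /content/ as the origin prefix. The names origin_hits, OriginPathFinder, and robots_blocks are hypothetical helpers made up for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumption: user-agent substrings to look for; extend as needed.
AI_CRAWLER_TOKENS = ("PerplexityBot", "ClaudeBot")

# Assumption: "combined" log format, i.e.
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def origin_hits(lines, origin_prefix="/content/"):
    """Count AI crawler requests to origin paths, keyed by (crawler, path, referer)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or not match.group("path").startswith(origin_prefix):
            continue
        agent = match.group("agent").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("path"), match.group("referer"))] += 1
    return hits


class OriginPathFinder(HTMLParser):
    """Collect (tag, attribute, value) wherever an attribute value contains the origin prefix."""

    def __init__(self, origin_prefix="/content/"):
        super().__init__()
        self.origin_prefix = origin_prefix
        self.leaks = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and self.origin_prefix in value:
                self.leaks.append((tag, name, value))


def robots_blocks(robots_lines, agent, url):
    """Return True if the given robots.txt rules disallow `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(agent, url)
```

Running origin_hits over the CDN and origin logs shows which crawler requested which /content/ path and with what referer. Feeding rendered pages to OriginPathFinder lists the tags (for example a canonical link) that still expose the origin path. robots_blocks lets you check a draft rule such as "Disallow: /content/" before deploying it.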

Thanks