Google indexing site pages to component level | Community
October 16, 2015
Solved

The problem: Google is crawling and indexing our site (good), but it is indexing certain pages all the way down to the component level (bad). This is likely because it follows JavaScript links for AJAX-generated pagination and the like. Clicking any of these results lands on a component rendered on its own, without page-level templating or the other components on the page.

To reproduce: Google "rand blog rafiq college ratings"

The third result (and several others on the first page of results) links all the way down to the component level:

  http://www.rand.org/content/rand/blog/jcr:content/par/bloglist.ajax.topic.postsecondary-education-programs

We would expect this to link to the actual site page:

  http://www.rand.org/blog.html

 

We've thought of two possible solutions, neither of them good:

1.  Add an Apache-level rewrite that sends anything matching .*/jcr:content(/.*) to the parent page.  This would almost certainly fix links from Google, but it would break site functionality that addresses components directly, such as AJAX calls (pagination) and alternate page renderings (XML).

2.  Add affected pages to robots.txt.  Even worse: those pages then don't get indexed at all, and the list would likely be impossible to keep up to date.
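
To make option 1 concrete, here is a minimal Python sketch of the URL mapping such a rewrite would perform. It is not Apache configuration, and the function name `parent_page_url` and the `CONTENT_ROOT` value are assumptions for illustration: it assumes pages stored under /content/rand are published with that prefix stripped and ".html" appended, which matches the two example URLs above but may not hold site-wide.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: repository root that is stripped from public page URLs.
CONTENT_ROOT = "/content/rand"

# Matches the component-level suffix: "/jcr:content" and everything after it.
COMPONENT_SUFFIX = re.compile(r"/jcr:content(/.*)?$")


def parent_page_url(url: str) -> str:
    """Map a component-level URL to its page-level URL; leave other URLs alone."""
    parts = urlsplit(url)
    page_path, matches = COMPONENT_SUFFIX.subn("", parts.path)
    if matches == 0:
        # Not a component-level URL, so nothing to rewrite.
        return url
    if page_path.startswith(CONTENT_ROOT + "/"):
        page_path = page_path[len(CONTENT_ROOT):]
    # Drop any query string or fragment; they belong to the component request.
    return urlunsplit((parts.scheme, parts.netloc, page_path + ".html", "", ""))


if __name__ == "__main__":
    print(parent_page_url(
        "http://www.rand.org/content/rand/blog/jcr:content/par/"
        "bloglist.ajax.topic.postsecondary-education-programs"
    ))
```

Because the pattern matches every /jcr:content URL, the same mapping would also catch the site's own AJAX pagination requests, which is exactly the drawback described in option 1.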

Any good strategies out there to force Google to index at the page level *only*?

Best answer by Mshaji (Community Advisor), October 16, 2015

First and foremost, as a best practice, all of your CQ5 author and publish servers should be behind a firewall and not publicly accessible. Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, there won’t be any way for Google to index them.

Please review the following link:

http://crxdelight.com/2012/02/04/how-to-protect-your-cq-instances-from-google-searches/
