SOLVED

Google indexing site pages to component level

The problem:  Google is crawling and indexing our site (good), but it is indexing certain pages all the way down to the component level.  This is likely because the crawler follows JavaScript links for AJAX-generated pagination and the like.  Following any of these links renders a component on its own, without page-level templating or the other components on the page (bad).  To reproduce: Google "rand blog rafiq college ratings".

The third result (and several others on the first page of results) links all the way down to the component level:

  http://www.rand.org/content/rand/blog/jcr:content/par/bloglist.ajax.topic.postsecondary-education-pr...

We would expect this to link to the actual site page:

  http://www.rand.org/blog.html

We've thought of two possible solutions, neither of them good:

1.  Add an Apache-level rewrite so that anything matching .*/jcr:content(/.*) is rewritten to the parent page.  This would almost surely fix the links from Google, but it would break site functionality that addresses components directly, such as AJAX calls (pagination) or alternate page renderings (XML).

2.  Add the affected pages to robots.txt.  Even worse: those pages would then not be indexed at all, and the list would likely be impossible to keep up to date.
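
For illustration only, the mapping that the rewrite in option 1 would have to perform can be sketched in Python. This is a rough sketch under assumptions, not a tested Apache rule: the function name parent_page_url, the CONTENT_ROOT prefix (inferred from the two URLs above), and the handling of the escaped "_jcr_content" form are all hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: public page URLs drop this repository prefix
# (inferred from /content/rand/blog/... vs. /blog.html above).
CONTENT_ROOT = "/content/rand"

# Everything from the first "/jcr:content" segment onward addresses a
# component. Assumption: the escaped form "/_jcr_content" is treated the same.
COMPONENT_SUFFIX = re.compile(r"/(?:jcr:content|_jcr_content)(?:[/.].*)?$")


def parent_page_url(url: str) -> str:
    """Map a component-level URL to its parent page; leave other URLs unchanged."""
    parts = urlsplit(url)
    page_path, hits = COMPONENT_SUFFIX.subn("", parts.path)
    if hits == 0:
        # Not a component-level URL, nothing to rewrite.
        return url
    if page_path.startswith(CONTENT_ROOT + "/"):
        page_path = page_path[len(CONTENT_ROOT):]
    # Query string and fragment are dropped on purpose: they belong to
    # the component request, not to the page.
    return urlunsplit((parts.scheme, parts.netloc, page_path + ".html", "", ""))
```

Applied to every request, this mapping would also catch the site's own AJAX and XML requests, which is exactly the drawback described in option 1.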

Any good strategies out there to force Google to index at the page level *only*?

1 Accepted Solution

First and foremost, as a best practice, all of your CQ5 author and publish servers should be placed behind a firewall and not be publicly accessible. Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, there won't be any way for Google to index them.

Please review the following link:

http://crxdelight.com/2012/02/04/how-to-protect-your-cq-instances-from-google-searches/