
SOLVED

Google crawl not indexing aem pages which require login


Level 4

Hi Team,

 

The issue we are facing is that Google's crawler is unable to index our site's post-login pages. Has anyone else faced this issue? I could not find anything concrete so far to fix it; any pointers are appreciated.

We have configured a sitemap for our site, and it contains all of the post-login pages. It looks like when Google tries to crawl a post-login page, our dispatcher settings redirect the request to the login page (since the page requires authentication), so Google never indexes those pages.
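For illustration, the redirect behavior described above is typically produced by a web-server rewrite rule in front of the dispatcher along these lines (the paths, cookie name, and rule are hypothetical, a sketch of the pattern rather than this site's actual config): any request without a session cookie is 302-redirected to the login page, and since Googlebot never carries a session cookie, it only ever sees the login page.

```apache
# Hypothetical vhost rewrite: unauthenticated requests (no login-token
# cookie) to gated paths are redirected to the login page. Googlebot
# carries no session cookie, so it always lands on /login.html.
RewriteCond %{HTTP_COOKIE} !login-token
RewriteCond %{REQUEST_URI} ^/content/mysite/secure/
RewriteRule ^ /content/mysite/login.html [R=302,L]
```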


1 Accepted Solution


Correct answer by
Level 10

@manisha594391 I checked with a team that was planning to crawl gated content. @BrianKasingli is absolutely correct: there is currently no way to crawl gated content pages directly. We created some users and an API on our end to let them authenticate, and asked them to pass their requests to us with certain headers so we could identify them, but they did not have that capability.


Note: As suggested earlier, a third-party crawling engine that can pass authentication while crawling can be used; this is how they resolved the issue.
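The header-based approach described above could look roughly like this at the web-server tier (the header name, token, and paths are all hypothetical placeholders, a sketch rather than a real configuration): requests that carry the agreed shared-secret header skip the login redirect, while everyone else is still sent to the login page.

```apache
# Hypothetical: a trusted third-party crawler sends X-Crawler-Token
# with an agreed secret; only requests WITHOUT it are redirected.
RewriteCond %{HTTP:X-Crawler-Token} !=s3cr3t-shared-value
RewriteCond %{REQUEST_URI} ^/content/mysite/secure/
RewriteRule ^ /content/mysite/login.html [R=302,L]
```

In practice the secret should be long, rotated regularly, and ideally combined with an IP allowlist for the crawler, since anyone who learns the header value can read the gated content.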

Special thanks to @VeenaVikraman for quick response.

 


12 Replies


Level 10

@manisha594391 A page must be publicly available to be crawled. If you hit the dispatcher URL and get redirected to the login page, those pages will not be crawled or indexed.

But why would we want to index pages that require authentication in the first place? That should not normally be the case.

Below are the bare minimum requirements for any page to be crawled:
- Accessible website: The website must be accessible to Google's web crawlers (Googlebot). It should not block Googlebot's access via robots.txt or other methods.
- XML sitemap: Providing an XML sitemap helps Google discover and crawl pages more efficiently, especially for large or complex websites.
- HTTPS security: Google prioritizes secure (HTTPS) websites in search results. Implementing HTTPS encryption can positively impact crawling and indexing.
- robots.txt: While not always necessary, a properly configured robots.txt file can guide Googlebot on which parts of the site to crawl and which to ignore.
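As a minimal sketch of the robots.txt and sitemap points above (the domain and paths are placeholders): the file sits at the site root, blocks crawlers from the gated section, and advertises the sitemap of public pages.

```
# https://www.example.com/robots.txt (hypothetical)
User-agent: *
Disallow: /content/mysite/secure/
Sitemap: https://www.example.com/sitemap.xml
```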

 

Pages must also not sit behind a 302 (temporary) redirect, though a 301 (permanent) redirect will work.
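To check which kind of redirect a page actually returns, you can fetch the URL without following redirects and inspect the raw status code. A self-contained sketch (the toy local server below stands in for the real site, so the paths are made up):

```python
import http.client
import http.server
import threading

class RedirectHandler(http.server.BaseHTTPRequestHandler):
    """Toy server: /temporary answers 302, everything else 301."""
    def do_GET(self):
        self.send_response(302 if self.path == "/temporary" else 301)
        self.send_header("Location", "/target")
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def redirect_status(host, port, path):
    """Fetch path WITHOUT following redirects; return the raw status code."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("GET", path)
    status = conn.getresponse().status
    conn.close()
    return status

# Start the toy server on a free port in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), RedirectHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

temp_status = redirect_status("127.0.0.1", port, "/temporary")  # 302
perm_status = redirect_status("127.0.0.1", port, "/permanent")  # 301
server.shutdown()
```

Against a live site you would point `redirect_status` at your dispatcher host instead; `curl -I <url>` gives the same information from the command line.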


Level 4

Thanks @Imran__Khan for your input. 

Our site was migrated from Magento, where Google was able to index the post-login pages; however, we are facing challenges implementing the same in AEM.


Community Advisor

Google typically indexes web pages that are accessible to its crawlers. However, for pages that require authorization (such as login pages or those behind a paywall), Google's crawlers cannot access the content in the same way they do for public pages. To index content from pages that require authorization, website owners must provide Google with an alternative means of accessing this content.
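One documented mechanism along these lines is Google's structured data for paywalled content: the full content is served to Googlebot, and JSON-LD on the page marks which part is gated so Google does not treat it as cloaking. This only helps if you are willing to let Googlebot see the content; the CSS selector and headline below are placeholders.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example gated article",
  "isAccessibleForFree": "False",
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": "False",
    "cssSelector": ".gated-content"
  }
}
```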


Level 4

Hi @BrianKasingli, thanks for the reply! Could you please point me to any examples or links for alternative ways of exposing post-login content to Google's crawlers?


Community Advisor

I don't think there's a way to expose authenticated pages to search indexes and crawlers. However, if your content is not sensitive, you can make all those pages publicly accessible and add JavaScript to hide the content when users are not logged in. This method still poses privacy issues, though: people can read the content if they understand how to manipulate the JavaScript. Alternatively, you can try something called a Closed User Group (CUG) in AEM.
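For the CUG option, a sketch of what the repository content could look like, using Oak's rep:cugPolicy convention (the path and the "premium-users" group are hypothetical): only principals listed in the policy may read the subtree, so crawlers are denied along with anonymous users.

```xml
<!-- Hypothetical .content.xml under /content/mysite/secure:
     only members of the "premium-users" group may read this subtree. -->
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0"
          xmlns:rep="internal"
          jcr:mixinTypes="[rep:CugMixin]">
    <rep:cugPolicy jcr:primaryType="rep:CugPolicy"
                   rep:principalNames="[premium-users]"/>
</jcr:root>
```

Note that a CUG restricts access rather than enabling crawling, so it is an alternative to the JavaScript-hiding approach, not a way to get the pages indexed.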


Level 4

Hi @Imran__Khan, I looked into this post, but it does not have much clarity on what needs to be done. It is quite a generic post, I believe.


Level 10

@manisha594391 Agree!!!

According to multiple blogs, gated content pages can be crawled using tools such as the ones mentioned in this post:

https://www.google.com/amp/s/www.thinkific.com/blog/gated-content-strategy/amp/

Stay tuned, let me get more insight on this.



Level 4

Thanks @Imran__Khan for looking into this. Could you please point me to any Adobe article that confirms this?

Also, I am still investigating other third-party tools. I will let you know if I find anything concrete in that area.


Level 10

@manisha594391 There is no official note on this from Adobe, which is why there are tools on the market for crawling gated content. If required, you can open an Adobe support ticket, and their response will confirm that it is not supported.

There is nothing related to crawling gated content on Adobe's official website. If required, you can check the link below:

https://experienceleague.adobe.com/search.html#q=Crawl%20aem&sort=relevancy