
SOLVED

How to prevent non-prod domains from getting crawled and indexed by search engines?


Level 4

When I search for my website along with the environment name, the non-prod domain shows up crawled and indexed in the search engine. How do I avoid this? Please help.

 

Thanks,

Rakesh


1 Accepted Solution


Correct answer by
Community Advisor

Hello @rakesh_h2 There are multiple ways to prevent your non-prod domains from being crawled by search engines, mainly Google.

 

1. Disallowing everything in robots.txt - this is a must-do for non-prod environments!

2. Using 'noindex', 'nofollow' meta tags on pages - this is impractical in your case because you want to restrict the whole domain rather than a few pages. This approach is usually used to keep specific pages on live domains out of the index.

3. Not providing a sitemap.xml - also not very practical, because there is usually business logic and dispatcher configuration in place to generate/deliver sitemap.xml.

4. Use Google Search Console - this might turn out to be your best bet. Register on the console (free of cost) and create a property to verify your non-prod domains. Once your domains are verified, place a removal request. This will remove your non-prod domains (they will no longer appear in Google search). Once they stop appearing in Google search, remove the site verification code (or keep it; it's up to you).



Now your robots.txt with

User-agent: *
Disallow: /

will work perfectly fine. Make sure this is in place after you request the removal, otherwise the domains will get re-indexed and start appearing in Google search again.

Thanks
- Bilal

 

 

 


10 Replies


Community Advisor

Hi @rakesh_h2, let me share a few approaches:

1. You can restrict your lower environments to certain whitelisted IP addresses (see the dispatcher sketch after this list).

2. You can also add a robots.txt to your dispatcher:
https://www.domain.com/robots.txt

 

User-agent: *
Disallow: /

 

3. You can implement dispatcher-level basic auth (also shown in the sketch below).
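
A minimal sketch of approaches 1 and 3 at the dispatcher (Apache httpd) level, assuming the non-prod site has its own vhost; the IP range, realm name, and htpasswd path are illustrative:

# Option 1: allow only whitelisted IPs on the non-prod vhost (illustrative range)
<Location "/">
    Require ip 203.0.113.0/24
</Location>

# Option 3: protect the whole non-prod vhost with basic auth (illustrative htpasswd path)
<Location "/">
    AuthType Basic
    AuthName "Non-prod environment"
    AuthUserFile /etc/httpd/conf.d/nonprod.htpasswd
    Require valid-user
</Location>

Both snippets go only into the non-prod vhost, so production stays publicly accessible.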

 


Level 4

@Mahedi_Sabuj I have a robots.txt in DAM, and it is the same for all environments.

User-agent: *

Disallow: /bin/

Sitemap: ......./in.sitemap.xml


Community Advisor

Hi @rakesh_h2, right now you disallow only /bin/ for all the environments.

Lower Environment: You need to change robots.txt for the lower environments to disallow everything (/).

User-agent: *
Disallow: /
Sitemap: ......./in.sitemap.xml

Production Environment: You can keep disallowing only /bin/ on the production environment.

User-agent: *
Disallow: /bin/
Sitemap: ......./in.sitemap.xml

 


Level 4

@Mahedi_Sabuj Ok.

1. So I have to keep the respective robots.txt file in the DAM of the respective AEM environment and publish it? Is that it?

2. I also have a custom domain defined for each environment, like dev.mysite.com, qa.mysite.com, etc. Would the above configuration work in that case?


Community Advisor

1. So I have to keep the respective robots.txt file in the DAM of the respective AEM environment and publish it? Is that it? [MS]: Yes, you are correct (see the dispatcher sketch below for how the published file can be exposed at /robots.txt).

2. I also have a custom domain defined for each environment, like dev.mysite.com, qa.mysite.com, etc. Would the above configuration work in that case? [MS]: Yes, it should work. We follow the same structure for our project as well.
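
For reference, a minimal sketch of how a robots.txt published from DAM is commonly exposed at the domain root via each environment's dispatcher vhost; the DAM path below is illustrative, not your actual one:

# Serve the published DAM asset when /robots.txt is requested (illustrative path)
RewriteEngine On
RewriteRule ^/robots\.txt$ /content/dam/mysite/robots.txt [PT,L]

Because the rewrite lives in each environment's own vhost, dev.mysite.com and qa.mysite.com each serve whatever robots.txt has been published on that environment.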


Level 4

@Mahedi_Sabuj When would we keep multiple robots.txt files based on environment and configure the Sling resolver factory to pick the right file?


Level 4

@Mahedi_Sabuj I added the below in robots.txt and published it. I still see my non-prod domain listed in the search results (it has been 6 days since I did this).

Disallow: /

 

Thanks,

Rakesh


Community Advisor

Hi @rakesh_h2, you need to raise a Removals request from Google Search Console, as @bilal_ahmad mentioned.


Level 4

@rakesh_h2 There are multiple approaches to restrict non-prod domains from getting crawled. One of them is to add the required robots meta tag to the page source, with custom logic so the tag is rendered only on non-prod environments (a sketch follows below).


<meta name="robots" content="noindex, nofollow, noarchive, nosnippet, nocache" />
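
A minimal sketch of such custom logic, assuming a Sling model that checks the instance run modes (the class name and the "prod" run mode are illustrative):

import javax.annotation.PostConstruct;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.OSGiService;
import org.apache.sling.settings.SlingSettingsService;

// Illustrative Sling model: tells the page component whether to render the robots meta tag
@Model(adaptables = SlingHttpServletRequest.class)
public class RobotsMetaModel {

    @OSGiService
    private SlingSettingsService slingSettings;

    private boolean nonProd;

    @PostConstruct
    protected void init() {
        // Assumption: production publishers carry a "prod" run mode; everything else is non-prod
        nonProd = !slingSettings.getRunModes().contains("prod");
    }

    public boolean isNonProd() {
        return nonProd;
    }
}

The page component's head can then render the meta tag shown above only when isNonProd is true, so production pages stay indexable.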

 

More approaches are very well explained in this link.
https://www.albinsblog.com/2021/01/different-approaches-to-block-non-prod-urlsfrom-search-indexing.h...

 

Hope this helps.

 

Regards,

Ayush
