How to avoid non-prod domains from getting crawled and indexed on search engines? | Adobe Higher Education
Level 3
July 13, 2023
Answered

How to avoid non-prod domains from getting crawled and indexed on search engines?

  • July 13, 2023
  • 3 Replies
  • 5195 Views

When I search for my website along with the environment name, the non-prod domain shows up crawled and indexed in the search engine. How do I avoid this? Please help.

 

Thanks,

Rakesh

Best answer by bilal_ahmad

Hello @rakesh_h2, there are multiple ways to prevent your non-prod domains from getting crawled by search engines, mainly Google.

 

1. Disallowing everything in robots.txt - this is a must-do for non-prod environments!

2. Using 'noindex', 'nofollow' meta tags in pages (not practical in your case, because you want to restrict the whole domain rather than a few pages). This approach is usually used to keep specific pages on a live domain out of the index.

3. Not providing a sitemap.xml - also not very practical, because there is usually business logic and dispatcher configuration in place to generate/deliver sitemap.xml.

4. Using Google Search Console - this might turn out to be your best bet. Register on the console (free of cost) and create a property to verify your non-prod domains. Once your domains are verified, place a request for removal. This removes your non-prod domains from the index (they will no longer appear in Google search). Once they stop appearing in Google search, you can remove the site verification code (or keep it; it's up to you).


Now your robots.txt with

User-agent: *
Disallow: /

will work perfectly fine. Make sure this stays in place after the removal, or the pages will get re-indexed and start appearing in Google search again.

Thanks
- Bilal

 

 

 

3 Replies

Mahedi_Sabuj
Community Advisor
July 13, 2023

Hi @rakesh_h2, let me share a few approaches:

1. You can restrict your lower environments to certain whitelisted IP addresses (see the sketch after this list).

2. You can also add a robots.txt to your dispatcher:
https://www.domain.com/robots.txt

 

User-agent: *
Disallow: /

 

3. You can implement dispatcher-level basic auth, also shown in the sketch below.
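
A minimal sketch of how approaches 1 and 3 can be combined in the non-prod dispatcher vhost, assuming Apache 2.4 with mod_auth_basic, mod_authn_file, and mod_authz_core loaded; the domain, IP range, and htpasswd path are placeholders, not values from this thread:

# Hypothetical non-prod vhost: let requests from known IP ranges
# through, and ask everyone else for basic auth credentials.
<VirtualHost *:443>
    ServerName dev.example.com

    <Location "/">
        AuthType Basic
        AuthName "Non-prod environment"
        AuthBasicProvider file
        AuthUserFile /etc/httpd/nonprod.htpasswd
        <RequireAny>
            Require ip 203.0.113.0/24
            Require valid-user
        </RequireAny>
    </Location>
</VirtualHost>

Since crawlers cannot authenticate, nothing behind such a vhost gets fetched, let alone indexed.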

 

Mahedi Sabuj
Level 3
July 13, 2023

@mahedi_sabuj I have a robots.txt in the DAM, and it is the same for all the environments.

User-agent: *

Disallow: /bin/

Sitemap: ......./in.sitemap.xml

Mahedi_Sabuj
Community Advisor
July 13, 2023

Hi @rakesh_h2, right now you disallow only /bin/ for all the environments.

Lower Environments: You need to change the robots.txt for the lower environments to disallow everything (/).

User-agent: *
Disallow: /
Sitemap: ......./in.sitemap.xml

Production Environment: You can keep /bin/ for the production environment only.

User-agent: *
Disallow: /bin/
Sitemap: ......./in.sitemap.xml
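
Since the robots.txt lives in the DAM and is shared by every environment, one way to get a different file per environment is to override it at the dispatcher level. A minimal sketch, assuming one Apache vhost per environment with mod_alias available; the file path and domain are placeholders, not from this thread:

# Hypothetical snippet for the non-prod vhost only: serve a static
# "disallow everything" robots.txt from disk instead of the DAM asset.
Alias /robots.txt /etc/httpd/static/robots-nonprod.txt
<Directory "/etc/httpd/static">
    Require all granted
</Directory>

You can then confirm which file each environment actually serves with curl -s https://dev.example.com/robots.txt.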

 

Mahedi Sabuj
ayush-anand
Level 4
July 13, 2023

@rakesh_h2 There are multiple approaches to restrict non-prod domains from getting crawled. One of them is to render the required robots meta tag in the page source using custom logic that enables it only on non-prod environments.


<meta name="robots" content="noindex, nofollow, noarchive, nosnippet, nocache" />
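
If adding conditional logic to page templates is inconvenient, the same effect can be achieved server-side by sending the equivalent X-Robots-Tag response header from the non-prod web server; search engines honor it like the meta tag. A minimal sketch, assuming Apache with mod_headers enabled (an alternative added here for illustration, not something from the original reply):

# Hypothetical non-prod vhost snippet: mark every response as
# non-indexable without touching any page templates.
Header set X-Robots-Tag "noindex, nofollow"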

 

More approaches are explained very well at this link:
https://www.albinsblog.com/2021/01/different-approaches-to-block-non-prod-urlsfrom-search-indexing.html?m=1

 

Hope this helps.

 

Regards,

Ayush


Level 2
August 13, 2024

Hi Bilal,

 

I have followed all the steps you mentioned except the 4th one, but I can still see the pages/domain showing up in Google search.

 

Thanks,

Abhishek