When I search for my website along with the environment name, the non-prod domain shows up: it is getting crawled and indexed by search engines. How do I avoid this? Please help.
Thanks,
Rakesh
Hello @rakesh_h2, there are multiple ways to keep your non-prod domains from getting crawled by search engines, mainly Google.
1. Disallow everything in robots.txt - this is a must-do for non-prod environments!
2. Use 'noindex', 'nofollow' meta tags in the pages (not practical in your case, because you want to restrict the whole domain rather than a few pages). You usually use this approach for specific pages on live domains that should not be indexed.
3. Do not provide a sitemap.xml - also not very practical, because there is usually business logic and dispatcher configuration in place to generate/deliver sitemap.xml.
4. Use Google Search Console - this might turn out to be your best bet. Register on the console (free of cost) and create a property to verify your non-prod domains. Once your domains are verified, place a removal request. This will remove your non-prod domains (they will no longer appear in Google search). Once they stop appearing in Google search, you can remove the site verification code (or keep it, it's up to you).
Now your robots.txt with
User-agent: *
Disallow: /
will work perfectly fine. Make sure you have this in place once the removal goes through, otherwise the pages will get re-indexed and start appearing in Google search again.
Thanks
- Bilal
Hi @rakesh_h2, let me share a few approaches:
1. You can restrict your lower environments to certain whitelisted IP addresses.
2. You can also add a robots.txt to your dispatcher:
https://www.domain.com/robots.txt
User-agent: *
Disallow: /
3. You can implement dispatcher level basic auth.
@Mahedi_Sabuj I have a robots.txt in DAM which is the same for all the environments.
User-agent: *
Disallow: /bin/
Sitemap: ......./in.sitemap.xml
Hi @rakesh_h2, right now you disallow only /bin/ for all the environments.
Lower Environments: You need to change robots.txt for the lower environments to disallow everything (/).
User-agent: *
Disallow: /
Sitemap: ......./in.sitemap.xml
Production Environment: You can keep Disallow: /bin/ for the production environment only.
User-agent: *
Disallow: /bin/
Sitemap: ......./in.sitemap.xml
@Mahedi_Sabuj Ok.
1. So I have to keep the respective robots.txt file in the DAM of each AEM environment and publish it? Is that it?
2. I also have a custom domain defined for each environment, like dev.mysite.com, qa.mysite.com, etc. Would the above configuration work in that case?
1. So I have to keep the respective robots.txt file in the DAM of each AEM environment and publish it? Is that it? [MS]: Yes, you are correct.
2. I also have a custom domain defined for each environment, like dev.mysite.com, qa.mysite.com, etc. Would the above configuration work in that case? [MS]: Yes, it should work. We follow the same structure for our project as well.
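If maintaining a separate robots.txt file in the DAM of every environment becomes hard to keep in sync, another option is to generate robots.txt from code based on the run modes. This is only a rough sketch of that idea, not something discussed above; the servlet path, the "prod" run-mode name, and the sitemap URL are assumptions you would have to adapt:

import java.io.IOException;
import java.util.Set;

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.sling.settings.SlingSettingsService;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

// Hypothetical servlet that serves environment-aware robots.txt content.
// Map /robots.txt to this path at the dispatcher (rewrite or Sling mapping).
@Component(service = Servlet.class, property = {
        "sling.servlet.paths=/bin/robots",   // assumed path, adjust to your setup
        "sling.servlet.methods=GET"
})
public class RobotsTxtServlet extends SlingSafeMethodsServlet {

    @Reference
    private transient SlingSettingsService slingSettings;

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        response.setContentType("text/plain");
        Set<String> runModes = slingSettings.getRunModes();

        if (runModes.contains("prod")) {   // "prod" is an assumed custom run mode
            // Production: allow crawling, block only /bin/
            response.getWriter().println("User-agent: *");
            response.getWriter().println("Disallow: /bin/");
            response.getWriter().println("Sitemap: https://www.mysite.com/in.sitemap.xml"); // placeholder URL
        } else {
            // Non-prod: block everything
            response.getWriter().println("User-agent: *");
            response.getWriter().println("Disallow: /");
        }
    }
}

Because the decision is driven by the environment's run modes rather than the host name, the same logic also covers custom domains like dev.mysite.com and qa.mysite.com.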
@Mahedi_Sabuj When do we keep multiple robots.txt files based on environments and configure the Sling resolver factory to pick the right file?
@Mahedi_Sabuj I put the below in robots.txt and published it. I still see my non-prod domain listed in the search results. (It's been 6 days since I did.)
Disallow: /
Thanks,
Rakesh
Hi @rakesh_h2, you need to raise a removal request from Google Search Console, as @bilal_ahmad mentioned.
@rakesh_h2 There are multiple approaches to restrict non-prod domains from getting crawled. One of them is to render the required robots meta tag in the page source, using custom logic to enable it only on non-prod environments (a sketch of such logic follows the tag below).
<meta name="robots" content="noindex, nofollow, noarchive, nosnippet, nocache" />
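The custom logic can be as small as a Sling Model that checks the run modes; the page component's head then renders the meta tag only when indexing should be blocked. Below is a minimal sketch under the assumption that your production publishers carry a custom "prod" run mode; the class name and run-mode name are made up for illustration:

import java.util.Set;

import javax.annotation.PostConstruct;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.OSGiService;
import org.apache.sling.settings.SlingSettingsService;

// Hypothetical model: exposes whether the robots "noindex" meta tag should be rendered.
@Model(adaptables = SlingHttpServletRequest.class)
public class RobotsMetaModel {

    @OSGiService
    private SlingSettingsService slingSettings;

    private boolean blockIndexing;

    @PostConstruct
    protected void init() {
        Set<String> runModes = slingSettings.getRunModes();
        // Any environment without the assumed "prod" run mode gets the noindex tag.
        blockIndexing = !runModes.contains("prod");
    }

    public boolean isBlockIndexing() {
        return blockIndexing;
    }
}

In the page component's HTL you would then wrap the meta tag in a data-sly-test on blockIndexing so it is rendered only on non-prod environments.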
More approaches are very well explained in this link.
https://www.albinsblog.com/2021/01/different-approaches-to-block-non-prod-urlsfrom-search-indexing.h...
Hope this helps.
Regards,
Ayush
Hi Bilal,
I have followed all the steps you mentioned except the 4th one, but I can still see the pages/domain showing in Google search.
Thanks,
Abhishek