Disallow AEM publish domain url in robots.txt file | Community
supriya-hande
Level 4
January 9, 2024

Hello Everyone,

 

I am following this guide to create a robots.txt file on the production author instance: https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html

Our publish URLs are being crawled and are showing up in Google Search Console. We want to disallow the publish domain URL in robots.txt and allow only our live website URL, e.g. https://example.com.au/
How can we achieve this?


6 replies

Jagadeesh_Prakash
Community Advisor
January 9, 2024

If you want to disallow the entire AEM publish domain, you can use the following:

User-agent: *
Disallow: /

 

 

supriya-hande
Level 4
January 10, 2024

Hi @jagadeesh_prakash, thanks for your reply.

 

We can disallow everything by just specifying /
but how do we allow the dispatcher URL, e.g. https://example.com.au/ ?

I don't understand how to allow the public-facing URLs.

Jagadeesh_Prakash
Community Advisor
January 10, 2024

@supriya-hande 

If you want to allow crawling only on a specific domain (the dispatcher domain) and disallow it on other domains (e.g., the publisher domain), you can use the Disallow directive to block all paths on the publisher domain and then use the Allow directive for the paths on the dispatcher domain. Here's an example:

 

User-agent: *
Disallow: /
Allow: /allowed-path/

User-agent: Googlebot
Disallow: /
Allow: /allowed-path/

User-agent: Bingbot
Disallow: /
Allow: /allowed-path/

 

You can see that the dispatcher and publisher will have separate paths. Give it a try; it should work.

Madhur-Madan
Community Advisor
January 9, 2024

Hi @supriya-hande ,

Create a Servlet to Serve robots.txt:

  • Create a servlet in AEM that dynamically generates the robots.txt file.
  • This servlet will inspect the request host and serve different content based on it.

Implement Logic in the Servlet:

  • In the servlet logic, check the incoming request's URL.
  • If the request URL matches the publish domain, disallow crawling.
  • For other URLs (like the live website URL), allow crawling.

import org.apache.commons.lang3.StringUtils;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

import java.io.IOException;
import java.io.PrintWriter;

@Component(
    service = {javax.servlet.Servlet.class},
    property = {
        "sling.servlet.resourceTypes=sling/servlet/default",
        "sling.servlet.selectors=robots",
        "sling.servlet.extensions=txt"
    }
)
public class RobotsTxtServlet extends SlingSafeMethodsServlet {

    private static final String LIVE_DOMAIN = "example.com.au"; // Your live website domain
    private static final String DISALLOW_ALL = "User-agent: *\nDisallow: /";

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");
        PrintWriter out = response.getWriter();

        String host = request.getHeader("Host");
        if (StringUtils.isNotBlank(host) && host.equalsIgnoreCase(LIVE_DOMAIN)) {
            // Allow crawling for the live website domain
            out.println("User-agent: *");
            out.println("Allow: /");
        } else {
            // Disallow crawling for all other domains
            out.println(DISALLOW_ALL);
        }
        out.close();
    }
}

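
One detail that may be worth checking in the servlet above: a Host header can carry a port (for example example.com.au:443), in which case a plain equality test against the bare domain fails and the live site would receive the disallow rules. Below is a minimal sketch of normalizing the header before comparing, written in plain Java with no AEM dependencies so it can be tried in isolation. The class name RobotsHostPolicy, the www. variant, and the host publish.internal.example are assumptions for illustration and are not part of the original reply.

```java
import java.util.Locale;

public class RobotsHostPolicy {

    // Assumed live domain, taken from the example used in this thread.
    private static final String LIVE_DOMAIN = "example.com.au";

    /** Lowercases the header and strips an optional ":port" suffix; returns "" for null. */
    public static String normalizeHost(String hostHeader) {
        if (hostHeader == null) {
            return "";
        }
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.lastIndexOf(':');
        // A value ending in "]" is a bare IPv6 literal, so its last colon is not a port separator.
        if (colon >= 0 && !host.endsWith("]")) {
            host = host.substring(0, colon);
        }
        return host;
    }

    /** Returns the robots.txt body for the given Host header value. */
    public static String robotsFor(String hostHeader) {
        String host = normalizeHost(hostHeader);
        // Treating the "www." variant as live is an assumption; drop it if not needed.
        boolean live = host.equals(LIVE_DOMAIN) || host.equals("www." + LIVE_DOMAIN);
        return live
                ? "User-agent: *\nAllow: /\n"
                : "User-agent: *\nDisallow: /\n";
    }

    public static void main(String[] args) {
        System.out.print(robotsFor("example.com.au:443"));
        System.out.print(robotsFor("publish.internal.example")); // hypothetical publish host
    }
}
```

If the dispatcher rewrites the Host header before forwarding to publish, the comparison would need to use whatever header still carries the original host; that depends on the dispatcher configuration and is not covered here.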
Sudheer_Sundalam
Community Advisor
January 9, 2024

@supriya-hande ,

 

I think the underlying problem in your use case is incorrect or incomplete externalization of domain URLs. Neither the sitemap.xml nor the links within the web pages should reference the publish domain directly anywhere.

Re-check the sitemap creation process for correct externalization of URLs. https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/developing/platform/externalizer.html?lang=en

 

 

supriya-hande
Level 4
January 10, 2024

@sudheer_sundalam We have generated the sitemap correctly. I don't see any publish URL reference in sitemap.xml.
When I hit the sitemap URL I can see URLs like: https://example.com.au/products/our-products-and-services

Kamal_Kishor
Community Advisor
January 10, 2024

@supriya-hande : What do you see when you hit sitemap url for any of your site using website url?

For instance, when you hit - https://example.com.au/sitemap.xml  OR  https://example.com/au/en/sitemap.xml do you see URLs with publisher domain? If yes, then you need to fix how you are generating your sitemap.
Else, if you can share how sitemap.xml is being generated for your sites, I can check and provide more details.
Thanks.
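
The check described above can also be automated rather than done by eye. The following is a rough sketch, assuming the sitemap body has already been downloaded into a string: it parses the XML and lists every loc entry whose host is not the expected live domain. The class name SitemapHostCheck and the host publish.internal.example are hypothetical and only serve the illustration.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapHostCheck {

    /** Returns every <loc> URL in the sitemap whose host differs from expectedHost. */
    public static List<String> foreignUrls(String sitemapXml, String expectedHost) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // sitemap elements live in the sitemaps.org namespace
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
            NodeList locs = doc.getElementsByTagNameNS("*", "loc");
            List<String> foreign = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                String url = locs.item(i).getTextContent().trim();
                String host = URI.create(url).getHost();
                if (host == null || !host.equalsIgnoreCase(expectedHost)) {
                    foreign.add(url);
                }
            }
            return foreign;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com.au/products/our-products-and-services</loc></url>"
                + "<url><loc>https://publish.internal.example/content/site/en.html</loc></url>"
                + "</urlset>";
        // Any URL printed here points at a host other than the live domain.
        for (String url : foreignUrls(xml, "example.com.au")) {
            System.out.println(url);
        }
    }
}
```

An empty result suggests the sitemap itself is not leaking the publish domain, which would point the investigation back at robots.txt or at links inside the pages.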

supriya-hande
Level 4
January 10, 2024

Hi @kamal_kishor, when I hit the sitemap URL I can see URLs like: https://example.com.au/products/our-products-and-services

The sitemap is being generated correctly.

Kamal_Kishor
Community Advisor
January 10, 2024

@supriya-hande : thanks for sharing this info.

I think correcting your robots.txt should resolve the issue, since the sitemap.xml served on the site domain is already generating correct URLs.

Can you try updating your robots.txt to something like this? Again, the important part here is the Sitemap URLs. You can provide an individual URL for every site, or just one entry if all the sites share the same domain and are separated by sub-domains.

User-agent: *
Allow: /
Sitemap: https://www.example.com/site-a/sitemap.xml
Sitemap: https://example.com/site-b/sitemap.xml
Sitemap: https://www.example.com/site-c/sitemap.xml
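
If the file is generated in code rather than maintained by hand, the same layout can be assembled from a list of sitemap URLs. This is only a sketch under assumptions: the class name RobotsTxtBuilder and the choice to omit Sitemap lines when crawling is disallowed are illustrative decisions, not something stated in this thread.

```java
import java.util.Arrays;
import java.util.List;

public class RobotsTxtBuilder {

    /** Builds a robots.txt body; Sitemap lines are emitted only when crawling is allowed. */
    public static String build(boolean allowCrawling, List<String> sitemapUrls) {
        StringBuilder sb = new StringBuilder("User-agent: *\n");
        sb.append(allowCrawling ? "Allow: /" : "Disallow: /").append('\n');
        if (allowCrawling) {
            for (String url : sitemapUrls) {
                sb.append("Sitemap: ").append(url).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sitemaps = Arrays.asList(
                "https://www.example.com/site-a/sitemap.xml",
                "https://example.com/site-b/sitemap.xml");
        System.out.print(build(true, sitemaps));  // live domain
        System.out.print(build(false, sitemaps)); // any other domain
    }
}
```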
TarunKumar
Community Advisor
January 10, 2024

Hi @supriya-hande ,

This is probably happening because your externalizer is not working properly in the context of the publish environment.
Check the method below and try to troubleshoot it.

externalizer.externalLink(resourceResolver, externalizerKey, url);
kautuk_sahni
Community Manager
January 11, 2024

@supriya-hande Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you found a solution yourself, please share it with the community.

Kautuk Sahni