Disallow AEM publish domain url in robots.txt file

Level 4

Hello everyone,

I am following this tutorial to create a robots.txt file on our prod author: https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html

The publish URLs of our website are being crawled and are showing up in Google Search Console. We want to disallow the publish domain URL in the robots.txt file and allow only our live website URL, e.g. https://example.com.au/
How can we achieve this?

14 Replies

Community Advisor

If you want to disallow the entire AEM publish domain, you can use the following:

User-agent: *
Disallow: /

Level 4

Hi @Jagadeesh_Prakash, thanks for your reply.

We can disallow everything by just giving /
but how do we allow the dispatcher URL, e.g. https://example.com.au/ ?

I don't understand how to allow the public-facing URLs.

Community Advisor

@supriya-hande 

If you want to allow crawling only on a specific domain (the dispatcher domain) and disallow it on other domains (e.g., the publisher domain), you can use the Disallow directive to block all paths on the publisher domain and then use the Allow directive for the paths on the dispatcher domain. Here's an example:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

Allow: /allowed-path/

The dispatcher and the publisher will have separate paths. Give it a try; it should work.

Level 4

Hi @Jagadeesh_Prakash, how do I allow a domain URL in robots.txt?
I want to allow everything from the dispatcher and disallow everything from the publish URL.

Community Advisor

@supriya-hande It seems domains cannot be specified in robots.txt, so I was suggesting a path-based solution.

Community Advisor

Hi @supriya-hande ,

Create a Servlet to Serve robots.txt:

  • Create a servlet in AEM that dynamically generates the robots.txt file.
  • This servlet will inspect the host of the incoming request and serve different content based on it.

Implement Logic in the Servlet:

  • In the servlet logic, check the incoming request's URL.
  • If the request URL matches the publish domain, disallow crawling.
  • For other URLs (like the live website URL), allow crawling.

import org.apache.commons.lang3.StringUtils;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

import java.io.IOException;
import java.io.PrintWriter;

@Component(
        service = {javax.servlet.Servlet.class},
        property = {
                "sling.servlet.resourceTypes=sling/servlet/default",
                "sling.servlet.selectors=robots",
                "sling.servlet.extensions=txt"
        }
)
public class RobotsTxtServlet extends SlingSafeMethodsServlet {

    private static final String LIVE_DOMAIN = "example.com.au"; // Your live website domain
    private static final String DISALLOW_ALL = "User-agent: *\nDisallow: /";

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");

        PrintWriter out = response.getWriter();
        String host = request.getHeader("Host");

        if (StringUtils.isNotBlank(host) && host.equalsIgnoreCase(LIVE_DOMAIN)) {
            // Allow crawling for the live website domain
            out.println("User-agent: *");
            out.println("Allow: /");
        } else {
            // Disallow crawling for all other domains
            out.println(DISALLOW_ALL);
        }

        out.close();
    }
}
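One caveat with comparing the raw Host header: it can carry a port (for example example.com.au:443), in which case a plain equality check against the domain fails. Below is a minimal sketch of the same decision pulled into a plain Java helper that strips the port first. RobotsTxtPolicy, normalizeHost and bodyForHost are hypothetical names, not part of Sling or AEM, and the sketch assumes the dispatcher passes the original Host header through to publish.

```java
// Hypothetical helper for illustration; not part of Sling or AEM.
final class RobotsTxtPolicy {

    static final String ALLOW_ALL = "User-agent: *\nAllow: /\n";
    static final String DISALLOW_ALL = "User-agent: *\nDisallow: /\n";

    private RobotsTxtPolicy() {
    }

    /** Trims a Host header value and strips an optional ":port" suffix. */
    static String normalizeHost(String hostHeader) {
        if (hostHeader == null) {
            return "";
        }
        String host = hostHeader.trim();
        int colon = host.lastIndexOf(':');
        // Only cut at a colon that comes after any closing bracket,
        // so a bare IPv6 literal such as "[::1]" is left untouched.
        if (colon >= 0 && host.indexOf(']') < colon) {
            host = host.substring(0, colon);
        }
        return host;
    }

    /** Returns the robots.txt body to serve for the given Host header. */
    static String bodyForHost(String hostHeader, String liveDomain) {
        return normalizeHost(hostHeader).equalsIgnoreCase(liveDomain)
                ? ALLOW_ALL
                : DISALLOW_ALL;
    }
}
```

Inside doGet, the servlet could then write RobotsTxtPolicy.bodyForHost(request.getHeader("Host"), LIVE_DOMAIN) to the response instead of branching inline, which also makes the rule easy to unit test.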

Community Advisor

@supriya-hande,

I think the underlying problem in your use case is incorrect or incomplete externalization of domain URLs. Neither the sitemap.xml nor the links within the web pages should reference the publish domain anywhere.

Re-check the sitemap creation process for correct externalization of URLs. https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/developing/platfo...

Level 4

@Sudheer_Sundalam We have generated the sitemap correctly. I don't see any publish URL references in sitemap.xml.
When I hit the sitemap URL, I see URLs like: https://example.com.au/products/our-products-and-services

Community Advisor

@supriya-hande: What do you see when you open the sitemap URL for any of your sites using the website URL?

For instance, when you hit https://example.com.au/sitemap.xml or https://example.com/au/en/sitemap.xml, do you see URLs with the publisher domain? If yes, then you need to fix how you are generating your sitemap.
Otherwise, if you can share how sitemap.xml is being generated for your sites, I can check and provide more details.
Thanks.

Community Advisor

@supriya-hande: Thanks for sharing this info.

I think correcting your robots.txt should resolve your issue, since the sitemap.xml served on the site domain is generating correct URLs.

Can you try updating your robots.txt to something like this? Again, the important part here is the Sitemap URLs. You can provide an individual URL for every site, or just one entry if all the sites share the same domain and the sites are separated based on sub-domains.

User-agent: *
Allow: /
Sitemap: https://www.example.com/site-a/sitemap.xml
Sitemap: https://example.com/site-b/sitemap.xml
Sitemap: https://www.example.com/site-c/sitemap.xml
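If the file is generated rather than hand-written, for instance by a servlet like the one earlier in this thread, the same content can be assembled in code. This is a minimal sketch, assuming the sitemap URLs passed in are already externalized to the live domain; RobotsTxtWithSitemaps is a hypothetical helper name, not an AEM API.

```java
// Hypothetical helper for illustration; not part of Sling or AEM.
final class RobotsTxtWithSitemaps {

    private RobotsTxtWithSitemaps() {
    }

    /** Builds an allow-all robots.txt body with one Sitemap line per URL. */
    static String build(String... sitemapUrls) {
        StringBuilder body = new StringBuilder("User-agent: *\nAllow: /\n");
        for (String url : sitemapUrls) {
            body.append("Sitemap: ").append(url).append('\n');
        }
        return body.toString();
    }
}
```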

Community Advisor

First and foremost, as a best practice, all of your CQ5 author and publish servers should be put behind a firewall and not be publicly accessible. Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, there won’t be any way for Google to index them.

Please refer to the original source: https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/google-indexing-site-pages...

Community Advisor

Hi @supriya-hande ,

This is happening because your externalizer is not working properly in the context of the publish environment.
Check the method below and try to troubleshoot it.

externalizer.externalLink(resourceResolver, externalizerKey, url);
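To make the failure mode concrete, here is a self-contained stand-in for what externalization does: look up a base URL by key and prepend it to a path. This is not the AEM Externalizer API; DomainMappingSketch and the host publish.internal.example are made up for illustration. The point is that if the key used for public links maps to the publish host, every generated link carries that host.

```java
// Illustrative stand-in only: this is NOT com.day.cq.commons.Externalizer.
final class DomainMappingSketch {

    private final java.util.Map<String, String> baseUrlByKey;

    DomainMappingSketch(java.util.Map<String, String> baseUrlByKey) {
        this.baseUrlByKey = baseUrlByKey;
    }

    /** Joins the base URL configured for the key with a site-relative path. */
    String externalLink(String key, String path) {
        String base = baseUrlByKey.get(key);
        if (base == null) {
            throw new IllegalArgumentException("No domain configured for key: " + key);
        }
        // Avoid a doubled or missing slash between base URL and path.
        String trimmedBase = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        String normalizedPath = path.startsWith("/") ? path : "/" + path;
        return trimmedBase + normalizedPath;
    }
}
```

With "publish" mapped to https://example.com.au, the path /products/our-products-and-services resolves to the live URL; with "publish" mapped to a made-up host such as http://publish.internal.example:4503, the same call produces the kind of link that should never reach a crawler.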

Administrator

@supriya-hande Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you have found a solution yourself, please share it with the community.



Kautuk Sahni