Disallow AEM publish domain url in robots.txt file

January 9, 2024

Hello Everyone,

I am following this tutorial to create a robots.txt file on our prod author instance: https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html

The publish URLs of our website are getting crawled and are showing up in Google Search Console. We want to disallow the publish domain URLs in robots.txt and allow only our live website URL, e.g. https://example.com.au/
How can we achieve this?

Jagadeesh_Prakash

Community Advisor

If you want to disallow the entire AEM publish domain, you can use the following:

User-agent: *
Disallow: /

supriya-hande (Author)

Level 4

Hi @jagadeesh_prakash, thanks for your reply.

We can disallow everything by just giving /
but how do we allow the dispatcher URL, e.g. https://example.com.au/?

I don't understand how to allow the public-facing URLs.

Jagadeesh_Prakash

Community Advisor

@supriya-hande

If you want to allow crawling only on a specific domain (the dispatcher domain) and block it on other domains (e.g. the publisher domain), use the Disallow directive to block all paths and then add an Allow directive for the paths that are served on the dispatcher domain. A crawler obeys only the most specific User-agent group that matches it, so the Allow line has to appear in every group. Here's an example:

User-agent: *
Disallow: /
Allow: /allowed-path/

User-agent: Googlebot
Disallow: /
Allow: /allowed-path/

User-agent: Bingbot
Disallow: /
Allow: /allowed-path/

The dispatcher and the publisher expose different paths, so this should work. Give it a try.
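
To see why a Disallow: / line and an Allow: /allowed-path/ line can coexist in one group, here is a minimal sketch of the precedence rule described in RFC 9309: the longest matching path wins, and Allow wins a tie. The class and method names are hypothetical, and wildcards (* and $) are deliberately left out.

```java
import java.util.List;

// Sketch only: resolves Allow vs. Disallow inside ONE User-agent group.
// Longest matching path prefix wins; on equal length, Allow wins.
final class RobotsRuleSketch {

    // One rule line from a group; allow == false means "Disallow".
    static final class Rule {
        final boolean allow;
        final String pathPrefix;

        Rule(boolean allow, String pathPrefix) {
            this.allow = allow;
            this.pathPrefix = pathPrefix;
        }
    }

    static boolean isAllowed(List<Rule> group, String path) {
        Rule best = null;
        for (Rule rule : group) {
            // Empty rules ("Disallow:") and non-matching prefixes are skipped.
            if (rule.pathPrefix.isEmpty() || !path.startsWith(rule.pathPrefix)) {
                continue;
            }
            int len = rule.pathPrefix.length();
            if (best == null
                    || len > best.pathPrefix.length()
                    || (len == best.pathPrefix.length() && rule.allow)) {
                best = rule;
            }
        }
        // No matching rule means the path may be crawled.
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        List<Rule> group = List.of(
                new Rule(false, "/"),               // Disallow: /
                new Rule(true, "/allowed-path/"));  // Allow: /allowed-path/
        System.out.println(isAllowed(group, "/allowed-path/page.html"));
        System.out.println(isAllowed(group, "/content/site/en.html"));
    }
}
```

With these two rules, a path under /allowed-path/ is crawlable because the Allow prefix is longer than "/", while every other path falls back to Disallow: /.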

Madhur-Madan

Community Advisor

Hi @supriya-hande ,

Create a Servlet to Serve robots.txt:

Create a servlet in AEM that dynamically generates the robots.txt file.
This servlet will inspect the Host header of the request and serve different content based on it.

Implement Logic in the Servlet:

In the servlet logic, check the host of the incoming request.
If the host is not the live website domain (for example, the publish domain), disallow crawling.
If the host is the live website domain, allow crawling.

import org.apache.commons.lang3.StringUtils;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

import java.io.IOException;
import java.io.PrintWriter;

@Component(
        service = {javax.servlet.Servlet.class},
        property = {
                "sling.servlet.resourceTypes=sling/servlet/default",
                "sling.servlet.selectors=robots",
                "sling.servlet.extensions=txt"
        }
)
public class RobotsTxtServlet extends SlingSafeMethodsServlet {

    private static final String LIVE_DOMAIN = "example.com.au"; // Your live website domain
    private static final String DISALLOW_ALL = "User-agent: *\nDisallow: /";

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");

        PrintWriter out = response.getWriter();
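        // The Host header may carry a port (e.g. example.com.au:443); strip it before comparing
        String host = StringUtils.substringBefore(request.getHeader("Host"), ":");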

        if (StringUtils.isNotBlank(host) && host.equalsIgnoreCase(LIVE_DOMAIN)) {
            // Allow crawling for the live website domain
            out.println("User-agent: *");
            out.println("Allow: /");
        } else {
            // Disallow crawling for all other domains
            out.println(DISALLOW_ALL);
        }

        out.close();
    }
}
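
The host comparison in the servlet is easier to verify when it is pulled out into a plain helper that needs no AEM runtime. The following is a sketch only: the names HostPolicySketch, normalizeHost and robotsFor are hypothetical, and treating the www. variant as the live site is an assumption that may not match your setup.

```java
import java.util.Locale;

// Sketch only: the host check from the servlet as a dependency-free helper.
final class HostPolicySketch {

    static final String LIVE_DOMAIN = "example.com.au"; // same value as in the servlet

    // Lower-cases the Host header value and drops a trailing ":port".
    // IPv6 literals such as [::1] are not handled in this sketch.
    static String normalizeHost(String hostHeader) {
        if (hostHeader == null) {
            return "";
        }
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.lastIndexOf(':');
        return colon >= 0 ? host.substring(0, colon) : host;
    }

    // Returns the robots.txt body for the given Host header.
    static String robotsFor(String hostHeader) {
        String host = normalizeHost(hostHeader);
        // Assumption: the www. variant serves the same live site.
        boolean live = host.equals(LIVE_DOMAIN) || host.equals("www." + LIVE_DOMAIN);
        return live
                ? "User-agent: *\nAllow: /\n"
                : "User-agent: *\nDisallow: /\n";
    }

    public static void main(String[] args) {
        System.out.print(robotsFor("Example.com.au:443"));
        // "publish.example.internal" is a made-up publish host for illustration.
        System.out.print(robotsFor("publish.example.internal:4503"));
    }
}
```

Keeping the decision in a pure function means it can be unit-tested without mocking SlingHttpServletRequest, and the servlet only has to write the returned string to the response.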

Sudheer_Sundalam

Community Advisor

@supriya-hande ,

I think the underlying problem in your use case is incorrect or incomplete externalization of domain URLs. Neither the sitemap.xml nor the links within the web pages should reference the publish domain anywhere.

Re-check the sitemap creation process for correct externalization of URLs. https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/developing/platform/externalizer.html?lang=en

supriya-hande (Author)

Level 4

@sudheer_sundalam We have generated the sitemap correctly. I don't see any publish URL references in sitemap.xml.
When I hit the sitemap URL I see URLs like: https://example.com.au/products/our-products-and-services

Kamal_Kishor

Community Advisor

@supriya-hande: What do you see when you hit the sitemap URL for any of your sites using the website URL?

For instance, when you hit https://example.com.au/sitemap.xml or https://example.com/au/en/sitemap.xml, do you see URLs with the publisher domain? If yes, then you need to fix how you are generating your sitemap.
Otherwise, if you can share how sitemap.xml is being generated for your sites, I can check and provide more details.
Thanks.
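
The check described above (do any sitemap URLs carry the publisher domain?) can also be automated. This is a rough sketch that assumes the loc values have already been extracted from sitemap.xml into a list; the class name and the publish host publish.example.internal are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Sketch only: flags sitemap URLs whose host is not the live domain.
final class SitemapHostCheckSketch {

    static List<String> foreignUrls(List<String> locs, String liveDomain) {
        List<String> result = new ArrayList<>();
        for (String loc : locs) {
            String host = URI.create(loc).getHost();
            // A missing host (relative URL) is reported as well.
            if (host == null || !host.equalsIgnoreCase(liveDomain)) {
                result.add(loc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> locs = List.of(
                "https://example.com.au/products/our-products-and-services",
                "https://publish.example.internal/content/site/en.html");
        System.out.println(foreignUrls(locs, "example.com.au"));
    }
}
```

An empty result suggests the sitemap is externalized correctly; any URL in the result points at a host other than the live domain.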

supriya-hande (Author)

Level 4

Hi @kamal_kishor, when I hit the sitemap URL I see URLs like: https://example.com.au/products/our-products-and-services

The sitemap is being generated correctly.

Kamal_Kishor

Community Advisor

@supriya-hande : thanks for sharing this info.

Since your sitemap.xml already generates correct URLs with the site domain, I think correcting your robots.txt should resolve the issue.

Can you try updating your robots.txt to something like this? The important part here is the Sitemap URLs. You can provide an individual URL for every site, or just one entry if all the sites share the same domain and sites are separated based on sub-domains.

User-agent: *
Allow: /
Sitemap: https://www.example.com/site-a/sitemap.xml
Sitemap: https://www.example.com/site-b/sitemap.xml
Sitemap: https://www.example.com/site-c/sitemap.xml
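
If the robots.txt is generated rather than hand-written, the Sitemap lines can be appended from a list. A minimal sketch, assuming the sitemap URLs are supplied by configuration; the class name is hypothetical. It rejects relative URLs because the Sitemap directive expects a full URL.

```java
import java.net.URI;
import java.util.List;

// Sketch only: builds an allow-all robots.txt with one Sitemap line per URL.
final class RobotsTxtBuilderSketch {

    static String build(List<String> sitemapUrls) {
        StringBuilder sb = new StringBuilder("User-agent: *\nAllow: /\n");
        for (String url : sitemapUrls) {
            // The Sitemap directive needs an absolute URL, scheme included.
            if (!URI.create(url).isAbsolute()) {
                throw new IllegalArgumentException("Sitemap URL must be absolute: " + url);
            }
            sb.append("Sitemap: ").append(url).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(List.of(
                "https://www.example.com/site-a/sitemap.xml",
                "https://www.example.com/site-c/sitemap.xml")));
    }
}
```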

TarunKumar

Community Advisor

Hi @supriya-hande ,

This is happening because your externalizer is not working properly in the context of the publish environment.
Check the call below and troubleshoot it.

externalizer.externalLink(resourceResolver, externalizerKey, url);

kautuk_sahni

Community Manager

@supriya-hande Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you have found a solution yourself, please share it with the community.

Kautuk Sahni
