Adobe Experience Manager Sites & More

supriya-hande · 1/9/24

Hello Everyone,

I am following this URL to create a robots.txt file on the prod author: https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html

For our website, the publish URLs are getting crawled and are showing up in Google Search Console. So we want to disallow the publish domain URL in the robots.txt file and only allow our live website URL, e.g. https://example.com.au/
How can we achieve this?

Jagadeesh_Prakash · 1/9/24

If you want to disallow the entire AEM publish domain, you can use the following:

User-agent: *
Disallow: /

supriya-hande · 1/9/24

Hi @Jagadeesh_Prakash, thanks for your reply.

We can disallow everything by just giving /
but how do we allow the dispatcher URL, like https://example.com.au/ ?

I don't understand how to allow the public-facing URLs.

Jagadeesh_Prakash · 1/9/24

@supriya-hande

If you want to allow crawling only for a specific domain (the dispatcher domain) and disallow it for other domains (e.g., the publisher domain), you can use the Disallow directive to block all paths on the publisher domain and then use the Allow directive for the paths on the dispatcher domain. Each Allow line has to sit inside a User-agent group. Here's an example:

User-agent: *
Allow: /allowed-path/
Disallow: /

User-agent: Googlebot
Allow: /allowed-path/
Disallow: /

User-agent: Bingbot
Allow: /allowed-path/
Disallow: /

The dispatcher and the publisher will have separate paths. Give it a try; it should work.

supriya-hande · 1/10/24

Hi @Jagadeesh_Prakash, how do I allow a domain URL in robots.txt?
I want to allow everything from the dispatcher and disallow everything from the publish URL.

Jagadeesh_Prakash · 1/10/24

@supriya-hande It seems domains cannot be specified in robots.txt, so I was suggesting the path-based solution.
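
Since robots.txt rules match URL paths rather than host names, it can help to see how a crawler might evaluate a path against Allow and Disallow lines. The sketch below is a simplified illustration, assuming the longest-match precedence described in RFC 9309 (the most specific rule wins, and Allow wins a tie). `RobotsPathRules` and `Rule` are hypothetical names, not part of AEM or any crawler library, and wildcards are not handled.

```java
import java.util.List;

// Hypothetical helper for illustration only: rules are matched against the URL path,
// never against the host name.
final class RobotsPathRules {

    // One Allow or Disallow line from a single User-agent group.
    static final class Rule {
        final boolean allow;
        final String pathPrefix;

        Rule(boolean allow, String pathPrefix) {
            this.allow = allow;
            this.pathPrefix = pathPrefix;
        }
    }

    // Longest matching prefix wins; on a tie Allow wins; no match means allowed.
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule rule : rules) {
            // An empty pattern matches nothing in this sketch.
            if (rule.pathPrefix.isEmpty() || !path.startsWith(rule.pathPrefix)) {
                continue;
            }
            boolean longer = best == null
                    || rule.pathPrefix.length() > best.pathPrefix.length();
            boolean tieWonByAllow = best != null
                    && rule.pathPrefix.length() == best.pathPrefix.length()
                    && rule.allow;
            if (longer || tieWonByAllow) {
                best = rule;
            }
        }
        return best == null || best.allow;
    }
}
```

With the rules `Disallow: /` and `Allow: /allowed-path/`, a path under /allowed-path/ is allowed and every other path is blocked, whichever host serves the file.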

Madhur-Madan · 1/9/24

Hi @supriya-hande ,

Create a Servlet to Serve robots.txt:

Create a servlet in AEM that dynamically generates the robots.txt file.
This servlet will inspect the host the request was sent to and serve different content based on it.

Implement Logic in the Servlet:

In the servlet logic, check the Host header of the incoming request.
If the host matches the live website domain, allow crawling.
For any other host (such as the publish domain), disallow crawling.

import org.apache.commons.lang3.StringUtils;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

import java.io.IOException;
import java.io.PrintWriter;

@Component(
        service = {javax.servlet.Servlet.class},
        property = {
                "sling.servlet.resourceTypes=sling/servlet/default",
                "sling.servlet.selectors=robots",
                "sling.servlet.extensions=txt"
        }
)
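/**
 * Serves a host-dependent robots.txt. With this registration the servlet answers
 * requests of the form <resource-path>.robots.txt, so /robots.txt presumably has to
 * be mapped to such a path (for example by a rewrite rule on the web server).
 */
public class RobotsTxtServlet extends SlingSafeMethodsServlet {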

    private static final String LIVE_DOMAIN = "example.com.au"; // Your live website domain
    private static final String DISALLOW_ALL = "User-agent: *\nDisallow: /";

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");

        PrintWriter out = response.getWriter();
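        // The Host header may carry a port (e.g. "example.com.au:443"), so strip it before comparing
        String host = StringUtils.substringBefore(request.getHeader("Host"), ":");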

        if (StringUtils.isNotBlank(host) && host.equalsIgnoreCase(LIVE_DOMAIN)) {
            // Allow crawling for the live website domain
            out.println("User-agent: *");
            out.println("Allow: /");
        } else {
            // Disallow crawling for all other domains
            out.println(DISALLOW_ALL);
        }

        out.close();
    }
}
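
The host check in the servlet above can also be pulled out into a plain method so that it can be exercised without a running Sling instance. This is only a sketch under assumptions: `RobotsTxtBody` and `forHost` are hypothetical names, and the live domain is the example.com.au placeholder used in this thread.

```java
import java.util.Locale;

// Hypothetical helper: decides the robots.txt body from the Host header value alone.
final class RobotsTxtBody {

    static final String LIVE_DOMAIN = "example.com.au"; // placeholder live domain
    static final String ALLOW_ALL = "User-agent: *\nAllow: /\n";
    static final String DISALLOW_ALL = "User-agent: *\nDisallow: /\n";

    static String forHost(String hostHeader) {
        if (hostHeader != null) {
            String host = hostHeader.trim().toLowerCase(Locale.ROOT);
            // Drop an optional port such as ":443" before comparing.
            int colon = host.indexOf(':');
            if (colon >= 0) {
                host = host.substring(0, colon);
            }
            if (host.equals(LIVE_DOMAIN)) {
                return ALLOW_ALL;
            }
        }
        // Missing header or any other host (for example the publish domain).
        return DISALLOW_ALL;
    }
}
```

A servlet could then write `RobotsTxtBody.forHost(request.getHeader("Host"))` to the response. Defaulting to the disallow body means an unexpected host is blocked rather than exposed.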

Sudheer_Sundalam · 1/9/24

@supriya-hande ,

I think the underlying problem in your use case is incorrect or incomplete externalization of domain URLs. Neither the sitemap.xml nor the links within the web pages should contain direct references to the publish domain.

Re-check the sitemap creation process for correct externalization of URLs. https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/developing/platfo...

supriya-hande · 1/10/24

@Sudheer_Sundalam We have generated the sitemap correctly. I don't see any publish URL references in sitemap.xml.
When I hit the sitemap URL I can see URLs like: https://example.com.au/products/our-products-and-services

Kamal_Kishor · 1/9/24

@supriya-hande : What do you see when you hit the sitemap URL for any of your sites using the website URL?

For instance, when you hit https://example.com.au/sitemap.xml OR https://example.com/au/en/sitemap.xml, do you see URLs with the publisher domain? If yes, then you need to fix how you are generating your sitemap.
Otherwise, if you can share how sitemap.xml is being generated for your sites, I can check and provide more details.
Thanks.

supriya-hande · 1/9/24

Hi @Kamal_Kishor, when I hit the sitemap URL I can see URLs like: https://example.com.au/products/our-products-and-services

The sitemap is being generated correctly.

Kamal_Kishor · 1/10/24

@supriya-hande : thanks for sharing this info.

I think correcting your robots.txt should resolve your issue, since the sitemap.xml files on the site domain already contain the correct URLs.

Can you try updating your robots.txt to something like this? Again, the important part here is the Sitemap URLs. You can provide an individual URL for every site, OR you can have just one entry if all the sites share the same domain and are separated by sub-domains.

User-agent: *
Allow: /
Sitemap: https://www.example.com/site-a/sitemap.xml
Sitemap: https://example.com/site-b/sitemap.xml
Sitemap: https://www.example.com/site-c/sitemap.xml
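
If the list of sites changes over time, a robots.txt like the one above could be assembled from a list of sitemap URLs instead of being edited by hand. The following is a minimal sketch, assuming the sitemap URLs are already externalized to the live domain; `RobotsTxtWithSitemaps` and `build` are hypothetical names.

```java
import java.util.List;

// Hypothetical helper: renders an allow-all robots.txt followed by one Sitemap line per URL.
final class RobotsTxtWithSitemaps {

    static String build(List<String> sitemapUrls) {
        StringBuilder body = new StringBuilder("User-agent: *\nAllow: /\n");
        for (String url : sitemapUrls) {
            // Sitemap lines take absolute URLs and are independent of the User-agent groups.
            body.append("Sitemap: ").append(url).append('\n');
        }
        return body.toString();
    }
}
```

Calling `build` with the three URLs from the example reproduces that file line for line.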

Kamal_Kishor · 1/10/24

First and foremost, as a best practice, all of your CQ5 author and publish servers should be put behind a firewall and not be publicly accessible. Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, there won’t be any way for Google to index them.

Please refer to the original source: https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/google-indexing-site-pages...

TarunKumar · 1/9/24

Hi @supriya-hande ,

This is happening because your externalizer is not resolving links correctly for the publish environment.
Check the method below and try to troubleshoot it.

externalizer.externalLink(resourceResolver, externalizerKey, url);
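
To make the idea behind that call concrete, here is a stand-in written in plain Java. It is not the AEM Externalizer and ignores resource resolver mappings; it only assumes that a domain key is mapped to a base URL (in AEM that mapping comes from the Externalizer's OSGi configuration) and that the content path is appended to it. `ExternalizerSketch` and `register` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for illustration only: maps a domain key to a base URL and prefixes paths with it.
final class ExternalizerSketch {

    private final Map<String, String> domains = new HashMap<>();

    // Registers a key such as "publish" with the public base URL it should resolve to.
    ExternalizerSketch register(String key, String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        domains.put(key, trimmed);
        return this;
    }

    // Builds an absolute URL for the path using the base URL configured for the key.
    String externalLink(String key, String path) {
        String base = domains.get(key);
        if (base == null) {
            throw new IllegalArgumentException("No domain configured for key: " + key);
        }
        return base + (path.startsWith("/") ? path : "/" + path);
    }
}
```

If the key used on the publish instance points at the publish host instead of the live domain, every generated link carries the publish host, which is the kind of misconfiguration worth ruling out here.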

kautuk_sahni · 1/11/24

@supriya-hande Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you have found a solution yourself, please share it with the community.

Disallow AEM publish domain url in robots.txt file
