Hello Everyone,
I am following this tutorial to create a robots.txt file on our prod author instance: https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html
For our website, the publish URLs are being crawled and shown in Google Search Console. We want to disallow the publish domain URLs in robots.txt and allow only our live website URL, e.g. https://example.com.au/
How can we achieve this?
If you want to disallow the entire AEM publish domain, you can use the following:
User-agent: *
Disallow: /
Hi @Jagadeesh_Prakash, thanks for your reply.
We can disallow everything just by giving /,
but how do we allow the dispatcher URL, e.g. https://example.com.au/?
I don't understand how to allow the public-facing URLs.
If you want to allow crawling only from a specific domain (dispatcher domain) and disallow crawling from other domains (e.g., publisher domain), you can use the Disallow directive to block all paths on the publisher domain and then use the Allow directive for the paths on the dispatcher domain. Here's an example:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
User-agent: Bingbot
Disallow: /
Allow: /allowed-path/
You can see that the dispatcher and the publisher will have separate paths. Give it a try; it should work.
Hi @Jagadeesh_Prakash, how do I allow a domain URL in robots.txt?
I want to allow everything from the dispatcher URL and disallow everything from the publish URL.
@supriya-hande It seems domains cannot be specified in robots.txt, so I was giving a path-based solution.
Hi @supriya-hande ,
Create a Servlet to Serve robots.txt:
Implement Logic in the Servlet:
import org.apache.commons.lang3.StringUtils;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

import java.io.IOException;
import java.io.PrintWriter;

// Bound to the default resource type with the "robots" selector and "txt" extension,
// so it responds to requests of the form <resource-path>.robots.txt
@Component(
    service = {javax.servlet.Servlet.class},
    property = {
        "sling.servlet.resourceTypes=sling/servlet/default",
        "sling.servlet.selectors=robots",
        "sling.servlet.extensions=txt"
    }
)
public class RobotsTxtServlet extends SlingSafeMethodsServlet {

    private static final String LIVE_DOMAIN = "example.com.au"; // Your live website domain
    private static final String DISALLOW_ALL = "User-agent: *\nDisallow: /";

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");
        PrintWriter out = response.getWriter();

        // The Host header tells us which domain the crawler used to reach this instance
        String host = request.getHeader("Host");
        if (StringUtils.isNotBlank(host) && host.equalsIgnoreCase(LIVE_DOMAIN)) {
            // Allow crawling for the live website domain
            out.println("User-agent: *");
            out.println("Allow: /");
        } else {
            // Disallow crawling for all other domains
            out.println(DISALLOW_ALL);
        }
        out.close();
    }
}
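The host check in the servlet above can also be pulled out into a small plain-Java helper so it can be unit tested without Sling. This is only a sketch: the class name RobotsTxtPolicy and its methods are hypothetical, and it assumes the Host header may arrive with a ":port" suffix (for example example.com.au:443), which a direct string comparison would not match.

```java
import java.util.Locale;

/** Hypothetical helper (not part of AEM or Sling): picks a robots.txt body from a Host header. */
final class RobotsTxtPolicy {
    // Assumption: the public site is served from this host (taken from the thread's example).
    static final String LIVE_DOMAIN = "example.com.au";

    static final String ALLOW_ALL = "User-agent: *\nAllow: /\n";
    static final String DISALLOW_ALL = "User-agent: *\nDisallow: /\n";

    private RobotsTxtPolicy() {}

    /** Lower-cases the Host header value and strips a single trailing ":port", if present. */
    static String normalizeHost(String hostHeader) {
        if (hostHeader == null) {
            return "";
        }
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.indexOf(':');
        // Exactly one colon means "hostname:port"; several colons (IPv6 literals) are left alone.
        if (colon > 0 && colon == host.lastIndexOf(':')) {
            host = host.substring(0, colon);
        }
        return host;
    }

    /** Returns the allow-all body for the live domain and the disallow-all body for any other host. */
    static String bodyFor(String hostHeader) {
        return LIVE_DOMAIN.equals(normalizeHost(hostHeader)) ? ALLOW_ALL : DISALLOW_ALL;
    }
}
```

Inside doGet you could then write the result of RobotsTxtPolicy.bodyFor(request.getHeader("Host")) to the response. Whether the dispatcher forwards the original Host header to publish depends on your setup, so that is worth confirming first.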
I think the underlying problem in your use case is incorrect or incomplete externalization of domain URLs. Neither the sitemap.xml nor the links within the web pages should reference the publish domain directly.
Re-check the sitemap creation process for correct externalization of URLs. https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/developing/platfo...
@Sudheer_Sundalam We have generated the sitemap correctly. I don't see any publish URL references in sitemap.xml.
When I hit sitemap url I can see urls like: https://example.com.au/products/our-products-and-services
@supriya-hande: What do you see when you hit the sitemap URL for any of your sites using the website URL?
For instance, when you hit https://example.com.au/sitemap.xml or https://example.com/au/en/sitemap.xml, do you see URLs with the publisher domain? If yes, then you need to fix how you are generating your sitemap.
Otherwise, if you can share how sitemap.xml is being generated for your sites, I can check and provide more details.
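If the sitemap is large, a quick script can answer the question above instead of reading it by eye. The sketch below is plain Java and assumes you have already extracted the <loc> values from sitemap.xml into a list; the class name SitemapHostCheck and the publish host used in the usage note are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists sitemap URLs whose host is not the expected public domain. */
final class SitemapHostCheck {

    private SitemapHostCheck() {}

    static List<String> urlsOutsideHost(List<String> locs, String expectedHost) {
        List<String> offenders = new ArrayList<>();
        for (String loc : locs) {
            String host;
            try {
                host = URI.create(loc.trim()).getHost();
            } catch (IllegalArgumentException e) {
                host = null; // not a parseable URL, report it as well
            }
            // Relative or malformed entries have no host and are reported too.
            if (host == null || !host.equalsIgnoreCase(expectedHost)) {
                offenders.add(loc);
            }
        }
        return offenders;
    }
}
```

For example, calling urlsOutsideHost with https://example.com.au/products/our-products-and-services and a made-up http://publish.internal:4503/content/site/en.html against the expected host example.com.au would return only the second URL.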
Thanks.
Hi @Kamal_Kishor, when I hit the sitemap URL I can see URLs like: https://example.com.au/products/our-products-and-services
The sitemap is generating correctly.
@supriya-hande: Thanks for sharing this info.
Since sitemap.xml on the site domain is generating correct URLs, I think correcting your robots.txt should resolve the issue.
Can you try updating your robots.txt to something like this? Again, the important part here is the Sitemap URLs. You can either provide an individual URL for every site, or have just one entry if all the sites share the same domain and the sites are separated based on sub-domains.
User-agent: *
Allow: /
Sitemap: https://www.example.com/site-a/sitemap.xml
Sitemap: https://example.com/site-b/sitemap.xml
Sitemap: https://www.example.com/site-c/sitemap.xml
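If the robots.txt body is generated in code rather than uploaded as a static file, the same layout can be produced with a small helper. This is a minimal sketch in plain Java; RobotsTxtBuilder is a hypothetical name, and the check for absolute URLs reflects the sitemap convention that Sitemap entries are full URLs rather than paths.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical helper: builds an allow-all robots.txt body with one Sitemap line per URL. */
final class RobotsTxtBuilder {

    private RobotsTxtBuilder() {}

    static String allowAllWithSitemaps(List<String> sitemapUrls) {
        StringBuilder sb = new StringBuilder("User-agent: *\nAllow: /\n");
        for (String url : sitemapUrls) {
            // Reject entries such as "/sitemap.xml" that lack a scheme and host.
            if (!URI.create(url).isAbsolute()) {
                throw new IllegalArgumentException("Sitemap URL must be absolute: " + url);
            }
            sb.append("Sitemap: ").append(url).append('\n');
        }
        return sb.toString();
    }
}
```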
First and foremost, as a best practice, all of your CQ5 author and publish servers should be put behind a firewall and not be publicly accessible. Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, there won't be any way for Google to index them.
Please refer to the original source: https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/google-indexing-site-pages...
Hi @supriya-hande ,
This is happening because your externalizer is not resolving links correctly for the publish environment.
Check the method below and try to troubleshoot it.
externalizer.externalLink(resourceResolver, externalizerKey, url);
@supriya-hande Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you have found out solution yourself, please share it with the community.