Facing Issues for Robots.txt in AEM Cloud SDK

Level 2

Dear All,

I have set up the AEM Cloud SDK locally and am trying to implement robots.txt by following the blog post below.

 

https://www.aemtutorial.info/2020/07/

 

I am facing two issues.

 

1) Issue 1: I created a file under the site's content root, as shown below.

/content/mysite/robots.txt

When I request robots.txt on author/publish, e.g. http://localhost:4503/content/mysite/robots.txt, the file is downloaded instead of being displayed in the browser.

 

2) Issue 2: When I request robots.txt through the dispatcher, I do not see any content from robots.txt either, as shown below.

 

sunitaBorn_0-1622106004936.png

 

The dispatcher log shows the request as blocked: blocked [publishfarm/-] 0ms "localhost:8082".

 

[27/May/2021:08:39:31 +0000] "GET /content/dam/mysite/robots.txt HTTP/1.1" - blocked [publishfarm/-] 0ms "localhost:8082"
172.17.0.1 "localhost:8082" - [27/May/2021:08:39:31 +0000] "GET /content/dam/mysite/robots.txt HTTP/1.1" 404 196 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
172.17.0.1 "localhost:8082" - [27/May/2021:08:40:25 +0000] "GET /content/mysite/robots.txt HTTP/1.1" 404 196 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
[27/May/2021:08:40:25 +0000] "GET /content/mysite/robots.txt HTTP/1.1" - blocked [publishfarm/-] 0ms "localhost:8082"
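
For anyone debugging something similar, a small script can help pick the blocked requests out of a longer dispatcher log. This is only a rough sketch: the regular expression is an assumption based on the log lines above (the user agent in the sample is shortened), and `blocked_requests` is a hypothetical helper, not part of any AEM tooling.

```python
import re

# Assumed shape of a "blocked" dispatcher log entry, based on the lines above:
# "<METHOD> <path> HTTP/<version>" - blocked [<farm>/...]
BLOCKED = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" - blocked \[(?P<farm>[^/\]]+)/'
)

def blocked_requests(log_lines):
    """Return (method, path, farm) for every line marked as blocked."""
    hits = []
    for line in log_lines:
        match = BLOCKED.search(line)
        if match:
            hits.append((match["method"], match["path"], match["farm"]))
    return hits

sample_log = [
    '[27/May/2021:08:39:31 +0000] "GET /content/dam/mysite/robots.txt HTTP/1.1" - blocked [publishfarm/-] 0ms "localhost:8082"',
    '172.17.0.1 "localhost:8082" - [27/May/2021:08:40:25 +0000] "GET /content/mysite/robots.txt HTTP/1.1" 404 196 "-" "Mozilla/5.0"',
    '[27/May/2021:08:40:25 +0000] "GET /content/mysite/robots.txt HTTP/1.1" - blocked [publishfarm/-] 0ms "localhost:8082"',
]

for method, path, farm in blocked_requests(sample_log):
    print(f"{farm}: {method} {path}")
```

The access-log line with the 404 is skipped on purpose; only the entries that a farm filter rejected are reported.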

 

My robots.txt file is below

#Any search crawler can crawl our site
User-agent: *

#Allow only below mentioned paths
Allow: /en/
Allow: /fr/
Allow: /gb/
Allow: /in/
#Disallow everything else
Disallow: /

 

Can anybody please help me with this? Thanks a lot. Note that I am using the AEM Cloud SDK locally.
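
As a side note, the rules in a robots.txt like the one above can be sanity-checked locally before worrying about how the file is served. The sketch below uses Python's standard `urllib.robotparser`; `is_allowed` and the sample paths are made up for illustration. The blank line after `User-agent` is left out here, since some parsers (Python's included) treat a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in the post, without the blank line after User-agent.
ROBOTS_TXT = """\
# Any search crawler can crawl our site
User-agent: *
# Allow only below mentioned paths
Allow: /en/
Allow: /fr/
Allow: /gb/
Allow: /in/
# Disallow everything else
Disallow: /
"""

def is_allowed(path, user_agent="*"):
    """Check a path against ROBOTS_TXT using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)

# Hypothetical paths, chosen only to exercise the rules.
for path in ("/en/home.html", "/in/about.html", "/de/home.html", "/"):
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")
```

Different crawlers resolve Allow/Disallow conflicts differently, so this only shows how one parser reads the file, not how every search engine will.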

7 Replies

Community Advisor

Hi @sunitaBorn 

 

Please see my answers below:

#1: To have the file open in the browser instead of downloading, configure the ContentDispositionFilter with the file path as an excluded path. Add the config below under the config.publish run mode so that the file opens on all publish instances, while on author it will still be downloaded. If you also want it to open on author, add the same configuration to the config folder itself. It is only required on publish, though, since only the publish instances are exposed to the public.

org.apache.sling.security.impl.ContentDispositionFilter.xml

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
jcr:primaryType="sling:OsgiConfig"
sling.content.disposition.all.paths="{Boolean}false"
sling.content.disposition.excluded.paths="[/content/mysite/robots.txt]"/>

 

#2: I see the path is blocked in the "publishfarm" farm. You need to allow access to the file location, i.e. /content/mysite/*, so that the file can load. Ideally, there should also be a rewrite at the dispatcher, because the file will always be requested as www.website.com/robots.txt and must be served from its actual location. So you need to apply the rewrite below at the dispatcher as well:

 

RewriteCond %{REQUEST_URI} ^/robots.txt$
RewriteRule (.*) /content/mysite/robots.txt [PT,L]
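
One small detail about the condition above: the dot in `^/robots.txt$` is a regex wildcard, so the pattern is slightly looser than it looks. The sketch below illustrates this with Python's `re` module, on the assumption that it behaves like Apache's PCRE for a pattern this simple; the sample URIs are made up.

```python
import re

# Pattern as written in the RewriteCond: the dot matches any character.
loose = re.compile(r"^/robots.txt$")
# Escaping the dot restricts the match to a literal "." character.
strict = re.compile(r"^/robots\.txt$")

def matches(pattern, uri):
    """Return True when the compiled pattern matches the request URI."""
    return pattern.search(uri) is not None

for uri in ("/robots.txt", "/robotsXtxt", "/content/mysite/robots.txt"):
    print(f"{uri}: loose={matches(loose, uri)} strict={matches(strict, uri)}")
```

In practice the loose form rarely causes trouble, but writing `^/robots\.txt$` makes the intent explicit.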

 

Thanks!

Level 2

Thanks @Asutosh_Jena_. Issue 1 is now fixed, but I am still struggling with issue 2.

 

My cloud setup is as follows.

 

I have the conf.d and conf.dispatcher.d folders, as shown below.

 

sunitaBorn_0-1622121393124.png

 

sunitaBorn_1-1622121447635.png

I have done the steps below and am still seeing the error.

 

 

Step 1: I created a mysite.farm, as shown below.

 

sunitaBorn_2-1622121651517.png

 

Step 2: I then included this farm inside enabled_farms, as shown below.

 

../available_farms/default.farm
../available_farms/mysite.farm

 

sunitaBorn_3-1622121824366.png

 

Step 3: I created a mysite.vhost inside available_vhosts, as shown below.

sunitaBorn_4-1622122009845.png

 

Step 4: I added the two rewrite lines from above to the mysite.vhost file, as shown below (the RewriteCond/RewriteRule pair for robots.txt near the end).

#
# This is the default publish virtualhost definition for Apache.
#
# DO NOT EDIT this file, your changes will have no impact on your deployment.
#
# Instead create a copy in the folder conf.d/available_vhosts and edit the copy.
# Finally, change to the directory conf.d/enabled_vhosts, remove the symbolic
# link for default.vhost and create a symbolic link to your copy.
#

# Include customer defined variables
Include conf.d/variables/custom.vars

<VirtualHost *:80>
    ServerName "publish"
    # Put names of which domains are used for your published site/content here
    ServerAlias "*"
    # Use a document root that matches the one in conf.dispatcher.d/default.farm
    DocumentRoot "${DOCROOT}"
    # URI dereferencing algorithm is applied at Sling's level, do not decode parameters here
    AllowEncodedSlashes NoDecode
    # Add header breadcrumbs for help in troubleshooting
    <IfModule mod_headers.c>
        Header add X-Vhost "publish"
    </IfModule>
    <Directory />
        <IfModule disp_apache2.c>
            # Some items cache with the wrong mime type
            # Use this option to use the name to auto-detect mime types when cached improperly
            ModMimeUsePathInfo On
            # Use this option to avoid cache poisioning
            # Sling will return /content/image.jpg as well as /content/image.jpg/ but apache can't search /content/image.jpg/ as a file
            # Apache will treat that like a directory. This assures the last slash is never stored in cache
            DirectorySlash Off
            # Enable the dispatcher file handler for apache to fetch files from AEM
            SetHandler dispatcher-handler
        </IfModule>
        Options FollowSymLinks
        AllowOverride None
        # Insert filter
        SetOutputFilter DEFLATE
        # Don't compress images
        SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
        # Prevent clickjacking
        Header always append X-Frame-Options SAMEORIGIN
    </Directory>
    <Directory "${DOCROOT}">
        AllowOverride None
        Require all granted
    </Directory>
    <IfModule disp_apache2.c>
        # Enabled to allow rewrites to take affect and not be ignored by the dispatcher module
        DispatcherUseProcessedURL On
        # Default setting to allow all errors to come from the aem instance
        DispatcherPassError 0
    </IfModule>
    <IfModule mod_rewrite.c>
        RewriteEngine on
        Include conf.d/rewrites/rewrite.rules

        # Rewrite index page internally, pass through (PT)
        RewriteRule "^(/?)$" "/index.html" [PT]

        RewriteCond %{REQUEST_URI} ^/robots.txt$
        RewriteRule (.*) /content/myraitt/us/en/robots.txt [PT,L]

    </IfModule>
</VirtualHost>

 

Step 5: I also changed the following in mysite_rewrite_rules, as shown below.

 

#

# Examples:
# This ruleset would look for robots.txt and fetch it from the dam only if the domain is exampleco-dev.adobecqms.net

RewriteCond %{SERVER_NAME} localhost:8080 [NC]
RewriteRule ^/robots.txt$ /content/dam/myraitt/robots.txt [NC,PT]

# This ruleset would look for favicon.ico in exampleco's base dam folder if the domain is exampleco-brand1-dev.adobecqms.net

 

sunitaBorn_5-1622122218893.png

 

I am still getting the error below, blocked by publishfarm. Is there anything I am missing here?

 


172.17.0.1 "localhost:8080" - [27/May/2021:13:12:36 +0000] "GET /content/dam/myraitt/robots.txt HTTP/1.1" 404 196 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
[27/May/2021:13:12:36 +0000] "GET /content/dam/myraitt/robots.txt HTTP/1.1" - blocked [publishfarm/-] 0ms "localhost:8080"

 

Community Advisor

Hi @sunitaBorn 

 

Can you please share the contents of the files in the filters folder? The path is blocked, which is why you are not able to access it.

Level 2

Hi @Asutosh_Jena_, please see my files from the filters folder.

 

sunitaBorn_1-1622127552933.png

 

 

*********************default_filters.any*************************

#
# This is the default filter ACL specifying what requests are handled by the dispatcher.
#
# DO NOT EDIT this file, your changes will have no impact on your deployment.
#
# Instead modify filters.any.
#

# deny everything and allow specific entries
# Start with everything blocked as a safeguard and open things customers need and what's safe OOTB
/0001 { /type "deny" /url "*" }

# Open consoles if this isn't a production environment by uncommenting the next few lines
# /002 { /type "allow" /url "/crx/*" } # allow content repository
# /003 { /type "allow" /url "/system/*" } # allow OSGi console

# allow non-public content directories if this isn't a production environment by uncommenting the next few lines
# /004 { /type "allow" /url "/apps/*" } # allow apps access
# /005 { /type "allow" /url "/bin/*" } # allow bin path access

# This rule allows content to be access
/0010 { /type "allow" /extension '(css|eot|gif|ico|jpeg|jpg|js|gif|pdf|png|svg|swf|ttf|woff|woff2|html)' /path "/content/*" }
# disable this rule to allow mapped content only

# Enable specific mime types in non-public content directories
/0011 { /type "allow" /method "GET" /extension '(css|eot|gif|ico|jpeg|jpg|js|gif|png|svg|swf|ttf|woff|woff2)' }

# Enable clientlibs proxy servlet
/0012 { /type "allow" /method "GET" /url "/etc.clientlibs/*" }

# Enable basic features
/0013 { /type "allow" /method "GET" /url '/libs/granite/csrf/token.json' /extension 'json' } # AEM provides a framework aimed at preventing Cross-Site Request Forgery attacks
/0014 { /type "allow" /method "POST" /url "/content/*.form.html" } # allow POSTs to form selectors under content

/0015 { /type "allow" /method "GET" /path "/libs/cq/personalization" } # enable personalization
/0016 { /type "allow" /method "POST" /path "/content/*.commerce.cart.json" } # allow POSTs to update the shopping cart

# Deny content grabbing for greedy queries and prevent un-intended self DOS attacks
/0017 { /type "deny" /selectors '(feed|rss|pages|languages|blueprint|infinity|tidy|sysview|docview|query|[0-9-]+|jcr:content)' /extension '(json|xml|html|feed)' }

# Deny authoring query params
/0018 { /type "deny" /method "GET" /query "debug=*" }
/0019 { /type "deny" /method "GET" /query "wcmmode=*" }

# Allow current user
/0020 { /type "allow" /url "/libs/granite/security/currentuser.json" }

# Allow index page
/0030 { /type "allow" /url "/index.html" }

# Allow IMS Authentication
/0031 { /type "allow" /method "GET" /url "/callback/j_security_check" }

# AEM Forms specific filters
# to allow AF specific endpoints for prefill, submit and sign
/0032 { /type "allow" /path "/content/forms/af/*" /method "POST" /selectors '(submit|internalsubmit|agreement|signSubmit|prefilldata)' /extension '(jsp|json)' }

# to allow AF specific endpoints for thank you page
/0033 { /type "allow" /path "/content/forms/af/*" /method "GET" /selectors '(guideThankYouPage|guideAsyncThankYouPage)' /extension '(html)'}

# to allow AF specific endpoints for lazy loading
/0034 { /type "allow" /path "/content/forms/af/*" /method "GET" /extension '(jsonhtmlemitter)'}

# to allow fp related functionalities
/0035 { /type "allow" /path "/content/forms/*" /selectors '(fp|attach|draft|dor|api)' /extension '(html|jsp|json|pdf)' }

# to allow forms access via dam path
/0036 { /type "allow" /path "/content/dam/formsanddocuments/**/jcr:content" /method "GET"}

# to allow invoke service functionality (FDM)
/0037 { /type "allow" /path "/content/forms/*" /selectors '(af)' /extension '(dermis)' }

# AEM Screens Filters
# to allow AEM Screens channels selectors
/0050 { /type "allow" /method "GET" /url "/screens/channels.json" }

# to allow AEM Screens Content and selectors
/0051 { /type "allow" /method '(GET|HEAD)' /url "/content/screens/*" }

# AEM Sites Filters
# to allow site30 theme servlet
/0052 { /type "allow" /extension "theme" /path "/content/*" }

# Allow GraphQL & preflight requests
# GraphQL also supports "GET" requests, if you intend to use "GET" add a rule in filters.any
/0060 { /type "allow" /method '(POST|OPTIONS)' /url "/content/_cq_graphql/*/endpoint.json" }

# GraphQL Persisted Queries & preflight requests
/0061 { /type "allow" /method '(GET|POST|OPTIONS)' /url "/graphql/execute.json*" }

 

*******************filters.any*******************

#
# This file contains the filter ACL, and can be customized.
#
# By default, it includes the default filter ACL.
#

$include "./default_filters.any"

#/006 { /type "allow" /url "/content/myraitt/us/en/robots.txt"} # allow robots.txt path access
#/007 { /type "allow" /url "/robots.txt"}
#/008 { /type "allow" /url "/content/dam/myraitt/robots.txt"}

/0010 { /type "allow" /extension '(css|eot|gif|ico|jpeg|jpg|js|gif|pdf|png|svg|swf|ttf|woff|woff2|html|txt)' /path "/content/*" }

Community Advisor

Hi @sunitaBorn, please add the rule below to the filters.any file, between rules 0008 and 0010.

/0009 { /glob "/content/*.txt" /type "allow" }

Restart the dispatcher and test it.

 

Thanks!

Level 2

Hi @Asutosh_Jena_,

 

When I used /0009 { /glob "/content/*.txt" /type "allow" } in filters.any, I got the error below.

 

Cloud manager validator 2.0.30
2021/06/01 16:20:38 Dispatcher configuration validation failed:
conf.dispatcher.d\filters\filters.any:9: filter must not use glob pattern to allow requests
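
To catch this kind of rule before running the validator, a quick local scan of the filter file is one option. The sketch below is only loosely modeled on the error message above and is not the Cloud Manager validator itself; `glob_allow_rules` is a hypothetical helper, and it assumes each rule sits on a single line as in the files posted here.

```python
import re

# Matches a single-line rule such as: /0009 { ... }
RULE = re.compile(r'^\s*(/\w+)\s*\{(.*)\}')

def glob_allow_rules(filter_text):
    """Return (line_number, rule_name) for allow rules that use /glob."""
    flagged = []
    for number, line in enumerate(filter_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip commented-out rules
        match = RULE.match(line)
        if not match:
            continue
        body = match.group(2)
        if "/glob" in body and re.search(r'/type\s+"allow"', body):
            flagged.append((number, match.group(1)))
    return flagged

# Sample content assembled from the rules discussed in this thread.
sample = '''\
$include "./default_filters.any"
#/008 { /type "allow" /url "/content/dam/myraitt/robots.txt"}
/0009 { /glob "/content/*.txt" /type "allow" }
/009 { /type "allow" /extension '(txt)' /path "/content/myraitt/*" }
'''

print(glob_allow_rules(sample))
```

Only the `/glob` allow rule is reported; the `/extension` plus `/path` form passes.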

 

 

sunitaBorn_0-1622544802529.png

 

When I commented out /009 { /glob "/content/*.txt" /type "allow" } and used /009 { /type "allow" /extension '(txt)' /path "/content/myraitt/*" } instead, the build succeeded and robots.txt is served with a 200, as shown below.
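
For readers wondering why the request was blocked in the first place: the dispatcher applies the last filter rule that matches a request, and the default /0010 rule does not list txt among its extensions, so only the initial deny-all rule matched. The sketch below is a heavily simplified model of that behavior, not the real dispatcher matching logic; the rule tuples, `decide`, and the use of `fnmatchcase` for paths are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Simplified model: each rule is (type, path_glob, allowed_extensions).
# None means the rule does not constrain the extension.
DEFAULT_RULES = [
    ("deny", "*", None),  # /0001: deny everything
    ("allow", "/content/*", {"css", "eot", "gif", "ico", "jpeg", "jpg", "js",
                             "pdf", "png", "svg", "swf", "ttf", "woff",
                             "woff2", "html"}),  # /0010: no "txt" here
]
TXT_RULE = ("allow", "/content/myraitt/*", {"txt"})  # /009 from the post above

def decide(path, rules):
    """Return the type of the last rule that matches the request path."""
    extension = path.rsplit(".", 1)[-1] if "." in path else ""
    verdict = "deny"
    for rule_type, path_glob, extensions in rules:
        if not fnmatchcase(path, path_glob):
            continue
        if extensions is not None and extension not in extensions:
            continue
        verdict = rule_type  # later matches override earlier ones
    return verdict

request = "/content/myraitt/us/en/robots.txt"
print("default rules only:", decide(request, DEFAULT_RULES))
print("with txt rule:", decide(request, DEFAULT_RULES + [TXT_RULE]))
```

With only the default rules, the deny-all rule is the last match; once the txt rule is appended, it becomes the last match and the request is allowed.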

 

sunitaBorn_1-1622545415948.png

 

sunitaBorn_2-1622545438769.png