SOLVED

robots.txt is throwing 404 on cloud dispatcher

Level 6

Hi Folks,

 

I need to serve a robots.txt file from the site root.

Everything works fine on the local SDK publish instance. I am following this guide:

https://www.aemtutorial.info/2020/07/robotstxt-file-in-aem-websites.html

 

robots.txt is placed under

/content/dam/test-project/robots.txt

 

A rule to shorten the URL is added to the resource resolver configuration:

/content/dam/test-project/robots.txt:/robots.txt

 

A rule is added to the dispatcher filter to allow it:

/0073 { /type "allow"  /url "/robots.txt"}

 

When I access robots.txt through the cloud dispatcher, it returns a 404.

 

Any help is highly appreciated.

 

Thanks,

Pradeep

1 Accepted Solution

Correct answer by
Level 4

Hi @pradeepdubey82 

You said it works on your local SDK. Does it work with a direct publish request (http://localhost:4503/robots.txt), with the local dispatcher (http://localhost/robots.txt), or with the local dispatcher plus the actual host mapping in /etc/hosts, using the same URL as requested from your CDN (http://<website.com>/robots.txt)?

Do not forget to check publish access while you are not logged in to the publisher; it might be an issue with anonymous user access restrictions.

If the first does not work locally (if it works only with /content/dam/...), then your Sling resolver mapping configuration is not correct (or you may need to change the order of the rules). Use the jcrresolver interface to check the configuration and fix it.

If the first works but the second does not, then the dispatcher allow or rewrite rules are wrong. If the first and second work but the third does not, then the issue is still in the dispatcher, in the per-host configuration files.

If all three work, run the same checks on your remote environment. Direct access to the publisher (https://adobecqms...:4503/robots.txt) will tell you whether the problem is in your AEM access rules for the anonymous user or still somewhere in the CDN / dispatcher.

If it is still the dispatcher, try increasing the log level for the rewrite (and maybe access) logs on the dispatcher and check the logs afterwards. There should be a visible explanation of why the request was blocked.
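The elimination order above can be written down as a small decision function. This is only a sketch of the reasoning in this post, not an Adobe tool: the function name `diagnose` and its three boolean inputs are hypothetical, and you would fill them in by requesting the three local URLs yourself while not logged in.

```python
def diagnose(direct_publish_ok: bool,
             local_dispatcher_ok: bool,
             host_mapped_ok: bool) -> str:
    """Map the three local checks to the most likely place to look.

    direct_publish_ok   -- http://localhost:4503/robots.txt worked
    local_dispatcher_ok -- http://localhost/robots.txt worked
    host_mapped_ok      -- the real hostname via /etc/hosts worked
    """
    if not direct_publish_ok:
        # Publish itself cannot resolve the short URL for an anonymous user.
        return "sling resolver mapping or anonymous access on publish"
    if not local_dispatcher_ok:
        # Publish is fine, so the request is lost in the dispatcher layer.
        return "dispatcher filter or rewrite rules"
    if not host_mapped_ok:
        # Only the host-specific setup differs from the previous check.
        return "per-host dispatcher configuration"
    return "local setup is fine; repeat the checks on the remote environment"


print(diagnose(True, False, False))
```

Checking in this order means each failed step narrows the problem to the one layer that was added since the previous step.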

Btw, if you have the rule

RewriteRule ^/robots.txt$ /content/dam/test-project/robots.txt [NC,PT,L]

then you don't need to update the mappings in the resolver configuration. localhost:4503/robots.txt won't work, but requests through the dispatcher will work anyway.
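A side note on the pattern in that rule: the dot in `^/robots.txt$` is unescaped, so it matches any single character. This does not explain a 404, since the pattern still matches /robots.txt, but it is slightly looser than intended. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine (an assumption; the two agree for a pattern this simple) and models the [NC] flag with `re.IGNORECASE`. The names `AS_WRITTEN`, `ESCAPED` and `matches` are made up for illustration.

```python
import re

# The rule's pattern exactly as written in the thread; [NC] modeled as IGNORECASE.
AS_WRITTEN = re.compile(r"^/robots.txt$", re.IGNORECASE)
# Hypothetical tightened variant with the dot escaped.
ESCAPED = re.compile(r"^/robots\.txt$", re.IGNORECASE)


def matches(pattern: re.Pattern, path: str) -> bool:
    """Return True if the whole request path matches the pattern."""
    return pattern.match(path) is not None


for path in ("/robots.txt", "/ROBOTS.TXT", "/robotsXtxt"):
    print(path, matches(AS_WRITTEN, path), matches(ESCAPED, path))
```

If you want the rule to match only the literal file name, escaping the dot as `^/robots\.txt$` is the usual way to do it.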

12 Replies

Employee

Hi @pradeepdubey82 

First, check the dispatcher log to see whether the request reaches the publish instance.

If the request is blocked by the dispatcher, then some rule after your allow rule must be blocking the robots.txt URL. Maybe the ".txt" extension is being blocked by the dispatcher.

If the dispatcher is sending the request to the publisher, then the issue is with the JCR resource resolver configuration.

Check whether that configuration is deployed properly, or whether there is an issue with the OSGi configuration file.

Community Advisor

Hi @pradeepdubey82, either this path (/content/dam/...) is incorrect or your file is not published. All the rules seem right.

 

Thanks

-Bilal

Level 6

In the repository browser on the cloud publisher I can see that the robots.txt file is there, with type nt:file.

I can see this error in the dispatcher log:

"GET /robots.txt" - 0ms [publishfarm/-] [actionblocked] publish-

 

Where should I allow this? Any idea which file it belongs in?
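For anyone scanning a larger log for the same symptom, here is a minimal sketch that pulls out the requests marked `[actionblocked]`. The regular expression is an assumption based only on the single log line quoted above (a quoted method and path, followed later by the marker); the real dispatcher log format may carry more fields, and the names `BLOCKED` and `blocked_requests` are hypothetical.

```python
import re

# Assumed shape, taken from the one line quoted in this thread:
#   "GET /robots.txt" - 0ms [publishfarm/-] [actionblocked] publish-
BLOCKED = re.compile(r'"(?P<method>[A-Z]+) (?P<path>[^"\s]+)".*\[actionblocked\]')


def blocked_requests(lines):
    """Return (method, path) pairs for log lines marked [actionblocked]."""
    found = []
    for line in lines:
        match = BLOCKED.search(line)
        if match:
            found.append((match.group("method"), match.group("path")))
    return found


sample = ['"GET /robots.txt" - 0ms [publishfarm/-] [actionblocked] publish-']
print(blocked_requests(sample))
```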

 

Employee

Check your dispatcher filter rules to see whether any rule after your allow rule is blocking the /robots.txt URL.
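The reason the order matters is that, when several filter rules match a request, the last matching one decides. The sketch below is a deliberately simplified model of that behaviour, not the dispatcher's real implementation: it looks only at a glob on the URL (real filters can also match on method, path, extension and so on), it assumes a request is denied when nothing matches, and the `*.txt` deny rule is a hypothetical example of a later rule that would undo the allow.

```python
from fnmatch import fnmatchcase


def evaluate_filters(rules, url):
    """Return "allow" or "deny" for url; the last matching rule wins.

    rules is an ordered list of (rule_type, url_glob) pairs.
    """
    decision = "deny"  # assumption: deny when no rule matches
    for rule_type, pattern in rules:
        if fnmatchcase(url, pattern):
            decision = rule_type  # a later match overrides an earlier one
    return decision


rules = [
    ("deny", "*"),             # typical first rule: deny everything
    ("allow", "/robots.txt"),  # the allow rule from this thread
    ("deny", "*.txt"),         # hypothetical later rule that re-blocks it
]

print(evaluate_filters(rules, "/robots.txt"))      # all three rules applied
print(evaluate_filters(rules[:2], "/robots.txt"))  # without the later deny
```

So searching the filter files for rules numbered higher than your allow rule is a reasonable first step.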

Community Advisor

Hey @pradeepdubey82, try to find out whether it is being blocked by your custom filter (check whether the extension is allowed). Also try other filter properties such as /method or /path.

 

Thanks

-Bilal

Level 6

I have searched the entire codebase for txt or .txt in the dispatcher files; nothing is blocking it.

I am now checking other filter settings such as method, extension and path.

 

Community Advisor

@pradeepdubey82 

 

Try updating your filter rule to the following and check:

 

/0073 { /type "allow"  /url "/content/dam/test-project/robots.txt"}

 

Level 6

Added a rule in the filter file:

/0074 { /type "allow" /url "/robots.txt"}

 

Added a rule in the rewrite file:

RewriteRule ^/robots.txt$ /content/dam/test-project/robots.txt [NC,PT]

 

No luck.

 

Please advise if there is any other place I am missing.

 

Community Advisor

@pradeepdubey82 ,

 

First check whether you can access it directly on the publish URL:

http://publishserver:port/content/dam/test-project/robots.txt

If you can, update your filter as shown below, with the RewriteRule in place.

 

Add a rule to the filter file:

Either 

     /0074 { /type "allow" /url "/content/dam/test-project/robots.txt"} 

Or

     /0074 { /type "allow" /url "*/robots.txt"} -- just for testing.

 

Update the rule in the rewrite file:

RewriteRule ^/robots.txt$ /content/dam/test-project/robots.txt [NC,PT,L]

 

If you have access to the dispatcher.log file, change the log level to debug and check the logs to make sure the dispatcher is not rejecting the robots.txt path.

 

Also, clear the dispatcher cache every time you change the dispatcher/Apache rules, especially the filter rules.

 

 

Level 6

I am still getting the error below in the dispatcher log, and a 404 in the browser with both the full path and the shortened path.

 

"GET /content/dam/test-project/robots.txt" - 0ms [publishfarm/-] [actionblocked]

 

Rewrite rule applied:

RewriteRule ^/robots.txt$ /content/dam/test-project/robots.txt [NC,PT,L]

 

Filter rule applied:

/0072 { /type "allow" /url "/content/dam/test-project/robots.txt"}

 

 

 

 

Level 6

The solution: on the cloud dev environment, the deployed changes (especially the dispatcher ones) were not taking effect. When we deployed to stage, it started working.

The cloud environment behaves oddly, and it is difficult to understand why.

Thanks all for looking into it.

 

Cheers,

Pradeep