We have a third-party search engine that crawls pages in AEM, indexes them, and then serves the search results. A few pages sit behind authentication, and we need the crawler to reach them, which means skipping SAML authentication for those pages. I tried writing Servlet Filters so that the request would reach the Filter before authentication, but per https://github.com/Adobe-Consulting-Services/acs-aem-samples/issues/63 that approach no longer works in the latest AEM versions. I am looking for a better solution.
Update: when I say "behind authentication", I mean the OOTB Authentication Enabled checkbox in the page properties.
1. The third-party search engine will crawl the pages.
2. Its requests carry an identifier in the User-Agent header that lets us tell whether a request comes from the crawler.
3. If the User-Agent contains the identifier, I need to skip the SAML authenticator and allow the service to crawl the page.
4. If the User-Agent does not contain the identifier, send the request through SAML authentication.
My initial thought was to write a Filter to achieve this, but it doesn't seem to work. Any help is appreciated; if you need more information, I can provide it.
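To make the requirement concrete, here is a minimal pure-Java sketch of the User-Agent check in steps 2-4. The hook point itself (for example, a custom Sling AuthenticationHandler registered ahead of the SAML one) is out of scope here, and `AcmeSearchBot` is an invented identifier, not the real crawler's token:

```java
// Hypothetical helper that a pre-authentication component could call to
// decide whether a request comes from the search crawler.
public class CrawlerDetector {

    // Placeholder identifier; replace with the token your crawler
    // actually sends in its User-Agent header.
    private static final String CRAWLER_IDENTIFIER = "AcmeSearchBot";

    /** Returns true when the User-Agent header carries the crawler identifier. */
    public static boolean isCrawler(String userAgentHeader) {
        return userAgentHeader != null
                && userAgentHeader.contains(CRAWLER_IDENTIFIER);
    }
}
```

Note that a User-Agent header is trivially spoofed, so any real solution should pair this check with something stronger (an IP allowlist or a shared secret).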
I think you can try the Rewrite option in the dispatcher, allowing or blocking requests by User-Agent.
A rewrite rule can, for example, block a given User-Agent outright.
You can take a similar approach to let the crawler pass or to redirect other clients to the login page.
Hope this is helpful.
Reference for blocking the request.
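The example mentioned above did not come through in the thread, so here is a hedged sketch of what such a rule might look like in the dispatcher's Apache vhost (mod_rewrite must be enabled; `BadBot` and `/content/private` are placeholder values):

```apache
RewriteEngine On
# Deny requests to the gated path whose User-Agent contains "BadBot"
# (placeholder identifier; [NC] makes the match case-insensitive).
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule ^/content/private - [F]
```

Inverting the condition (`!BadBot`) would instead block everyone except the crawler, which is closer to what the question asks for.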
Thanks @Mani_kumar_. The issue I am having right now is before the dispatcher; I am testing this on my localhost. I want the request to reach the filter so that I can check the User-Agent value and act accordingly, but before that the SAML authenticator kicks in and redirects me to the login page.
Ideally, when the crawling service tries to hit this private page in AEM, we should let it crawl. So when that request reaches AEM, I need to position the filter (or some other service) before authentication. I am just wondering what the right way to do that is.
Just putting my thoughts:
1. As these pages are behind authentication, AEM always looks for a valid login-token before allowing the page to render.
2. You could write something (for example, a placeholder component that triggers a servlet) to generate a valid token whenever you identify a special request header or User-Agent.
3. Skipping the proper authentication process is never advisable, especially for a bot (in my view).
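The "special request header" idea in point 2 could be sketched like this: before generating a login token for a request that claims to be the crawler, verify a pre-shared secret header rather than trusting the easily spoofed User-Agent alone. The header name and secret below are invented for illustration; in practice the secret would come from an OSGi configuration, not a hard-coded constant:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical gate in front of token generation: only issue a token
// when the request presents the correct pre-shared secret header.
public class CrawlerSecretCheck {

    // Invented header name and secret value for illustration only.
    public static final String SECRET_HEADER = "X-Crawler-Secret";
    private static final String EXPECTED_SECRET = "change-me";

    /** Constant-time comparison so the secret cannot be probed byte by byte. */
    public static boolean isAuthorizedCrawler(String presentedSecret) {
        if (presentedSecret == null) {
            return false;
        }
        return MessageDigest.isEqual(
                presentedSecret.getBytes(StandardCharsets.UTF_8),
                EXPECTED_SECRET.getBytes(StandardCharsets.UTF_8));
    }
}
```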
Putting a thought:
1. Set up a robots.txt file for your AEM site to prevent crawlers from accessing private/gated content.
2. Configure the Apache Sling Authentication service to require authentication for all requests.
3. Create a custom servlet that will respond to requests from crawlers and serve them the appropriate content.
4. Create a custom authentication handler to handle authentication requests from crawlers.
5. Set up an access control list (ACL) to specify which content can be accessed by crawlers.
6. Monitor your site’s access logs to identify any suspicious requests or crawler activity.
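For step 1, a starting robots.txt might look like the sketch below. The path is a placeholder for your own site structure, and keep in mind that robots.txt is purely advisory: only well-behaved crawlers honor it, so it complements rather than replaces real access control:

```
# Placeholder path -- adjust to your site's structure.
User-agent: *
Disallow: /content/mysite/private/
```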