
SOLVED

How can I enable private/gated content crawling in AEM ?


Community Advisor

Hi Team

 

We have a third-party search engine that will crawl the pages in AEM, index them, and then serve the search results. A few of those pages are behind authentication, and we need the crawler to be able to crawl them. We could achieve this by skipping the SAML authentication for these pages. I tried writing Servlet Filters so that the request would reach the filter before authentication, but as per https://github.com/Adobe-Consulting-Services/acs-aem-samples/issues/63 this no longer works in the latest AEM versions. I am looking for a better solution.

 

Update: when I say "behind authentication", I mean the OOTB Authentication Enabled checkbox in the page properties.

 

My requirements:

 

1. The third-party search engine will crawl the pages.

2. Its requests carry an identifier in the User-Agent, which allows us to identify whether a request comes from the crawler.

3. If the User-Agent has the identifier, skip the SAML authentication handler and allow the service to crawl the page.

4. If the User-Agent does not have the identifier, send the request to SAML authentication.

 

My initial thought was to write a Filter to achieve this (a sketch of the attempt is below), but it doesn't seem to work. Any help is appreciated; if you need more information, I can provide it.
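For context, here is a minimal sketch of the kind of filter I tried; the "my-crawler-token" User-Agent marker and the content path are placeholders. Note that a REQUEST-scoped Sling filter runs inside Sling's request processing, i.e. after authentication has already happened, which appears to be exactly why this approach never sees the unauthenticated request.

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.apache.sling.servlets.annotations.SlingServletFilter;
import org.apache.sling.servlets.annotations.SlingServletFilterScope;
import org.osgi.service.component.annotations.Component;

@Component(service = Filter.class)
@SlingServletFilter(scope = SlingServletFilterScope.REQUEST,
        pattern = "/content/mysite/private/.*") // placeholder path
public class CrawlerUserAgentFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {
        String userAgent = ((HttpServletRequest) request).getHeader("User-Agent");
        if (userAgent != null && userAgent.contains("my-crawler-token")) {
            // The intent was to skip SAML here, but by the time a REQUEST-scope
            // Sling filter runs, authentication has already been performed and
            // the redirect to the login page has already happened.
        }
        chain.doFilter(request, response);
    }

    @Override
    public void init(FilterConfig filterConfig) { /* no-op */ }

    @Override
    public void destroy() { /* no-op */ }
}
```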

 

Thanks

Veena ✌

1 Accepted Solution

7 Replies


Community Advisor

Hi @VeenaVikraman 

I think you can try a rewrite rule at the dispatcher, allowing or blocking requests based on the User-Agent.

Please see the example in the link below, which blocks a robot by its User-Agent:

https://httpd.apache.org/docs/2.4/rewrite/access.html

You can use a similar approach to pass the crawler through or redirect other requests to the login page.

 

Hope this is helpful.

Reference for blocking the request (the original reply attached a screenshot of a sample rule).
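Building on that reference, here is a minimal sketch of such a rule in the Apache/dispatcher vhost configuration; the "my-crawler-token" marker, the content path, and the login URL are placeholders. Bear in mind that a User-Agent check alone is easy to spoof, so you would typically pair it with an IP allow-list for the crawler.

```
RewriteEngine On

# Requests whose User-Agent contains the crawler token pass through untouched.
RewriteCond %{HTTP_USER_AGENT} my-crawler-token [NC]
RewriteRule ^/content/mysite/private/ - [L]

# All other requests to the gated tree are redirected to the login page.
RewriteRule ^/content/mysite/private/ /login.html [R=302,L]
```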

 


Community Advisor

Thanks @Mani_kumar_. The issue I am having right now is before the dispatcher; I am testing this on my localhost at the moment. I want the request to reach the filter so that I can check the User-Agent value and act accordingly, but before that the SAML authentication handler kicks in and redirects me to the login page.

Ideally, when the crawling service tries to hit this private page in AEM, we should let it crawl. So when that request comes to AEM, I should position the filter, or some other service, before authentication. I am just wondering what the right way to do that would be.


Community Advisor

Hi @VeenaVikraman 
You need to extend the SAML authentication handler then. Probably you need to implement the AuthenticationHandler interface with your own extended implementation that takes precedence over the SAML one; a rough sketch is below.
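A minimal sketch of that idea, assuming a dedicated read-only "crawler-service" user and a secret "my-crawler-token" marker in the crawler's User-Agent (both placeholders). How the crawler user actually gets logged in without a password (login token, trusted credentials, service-user setup) is deployment-specific and deliberately left out, so treat this as a starting point, not a vetted implementation:

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.auth.core.spi.AuthenticationHandler;
import org.apache.sling.auth.core.spi.AuthenticationInfo;
import org.osgi.service.component.annotations.Component;

@Component(
    service = AuthenticationHandler.class,
    property = {
        // Only intercept the gated content tree (placeholder path).
        AuthenticationHandler.PATH_PROPERTY + "=/content/mysite/private",
        // Rank above the SAML handler so this one is consulted first.
        "service.ranking:Integer=10000"
    })
public class CrawlerAuthenticationHandler implements AuthenticationHandler {

    private static final String CRAWLER_TOKEN = "my-crawler-token"; // placeholder

    @Override
    public AuthenticationInfo extractCredentials(HttpServletRequest request,
            HttpServletResponse response) {
        String userAgent = request.getHeader("User-Agent");
        if (userAgent != null && userAgent.contains(CRAWLER_TOKEN)) {
            // Hand Sling an AuthenticationInfo for the crawler user. Supplying
            // credentials it can actually log in with (e.g. a repository login
            // token) is the part you still have to solve for your setup.
            return new AuthenticationInfo("CRAWLER", "crawler-service");
        }
        // Not the crawler: return null so the SAML handler takes over.
        return null;
    }

    @Override
    public boolean requestCredentials(HttpServletRequest request,
            HttpServletResponse response) {
        // Never challenge here; the SAML handler drives the login redirect.
        return false;
    }

    @Override
    public void dropCredentials(HttpServletRequest request,
            HttpServletResponse response) {
        // Nothing to clean up for the crawler pseudo-login.
    }
}
```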

Arun Patidar


Community Advisor

Thanks @arunpatidar. I also thought that might work. Can you direct me to some sample code, if any is available?


Correct answer by
Community Advisor


Adobe Champion

Just putting in my thoughts:

 

1. As these pages are behind authentication, AEM always looks for a valid login-token before allowing the page to render.

2. You may write something (a placeholder component that triggers a servlet, for example) to generate a valid token whenever you identify a special request header or User-Agent; see the sketch after this list.

3. Skipping the authentication process altogether is never advisable, especially for a bot (in my view).
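To illustrate point 2, here is a rough sketch of such a token-minting servlet, assuming AEM's com.day.crx.security.token.TokenUtil and a dedicated "crawler-service" user; the header name, shared secret, and servlet path are all hypothetical, and whether exposing such an endpoint is acceptable needs a security review:

```java
import java.io.IOException;

import javax.jcr.RepositoryException;
import javax.servlet.Servlet;
import javax.servlet.ServletException;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.sling.jcr.api.SlingRepository;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import com.day.crx.security.token.TokenUtil;

@Component(service = Servlet.class,
        property = {"sling.servlet.paths=/bin/crawler/token"}) // hypothetical path
public class CrawlerTokenServlet extends SlingSafeMethodsServlet {

    @Reference
    private SlingRepository repository;

    @Override
    protected void doGet(SlingHttpServletRequest request,
            SlingHttpServletResponse response) throws ServletException, IOException {
        // Only mint a token when the shared-secret header matches.
        if (!"my-shared-secret".equals(request.getHeader("X-Crawler-Token"))) {
            response.sendError(SlingHttpServletResponse.SC_FORBIDDEN);
            return;
        }
        try {
            // Issues a repository login token for the crawler user and sets
            // the login-token cookie on the response.
            TokenUtil.createCredentials(request, response, repository,
                    "crawler-service", true);
            response.setStatus(SlingHttpServletResponse.SC_OK);
        } catch (RepositoryException e) {
            throw new ServletException("Could not create crawler token", e);
        }
    }
}
```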


Level 7

Putting in a thought:

1. Set up a robots.txt file for your AEM site to prevent crawlers from accessing private/gated content (a sample follows after this list).

2. Configure the Apache Sling Authentication service to require authentication for all requests.

3. Create a custom servlet that will respond to requests from crawlers and serve them the appropriate content.

4. Create a custom authentication handler to handle authentication requests from crawlers.

5. Set up an access control list (ACL) to specify which content can be accessed by crawlers.

6. Monitor your site's access logs to identify any suspicious requests or crawler activity.
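For point 1, a small robots.txt sketch (paths and the crawler name are examples). Keep in mind robots.txt only deters well-behaved bots; it is a hint, not an access control:

```
# Keep general crawlers out of the gated tree.
User-agent: *
Disallow: /content/mysite/private/

# Allow your own crawler, assuming it honors robots.txt.
User-agent: my-crawler
Allow: /content/mysite/private/
```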