
SOLVED

Solution suggestions request: a third-party search engine needs to crawl all pages of the website, including pages behind authentication (private/gated/paywalled content).


Community Advisor

Hi Team

 

I recently raised a question, https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-can-i-enable-private-g..., about an issue I was facing. I am now rethinking the solution approach I was asked to follow, and I would like some expert opinions on the best approach for this issue.

 

Problem: We have a third-party search engine API that needs to crawl and index all our pages. We have integrated SAML authentication, and some of the pages require authentication. We need the search engine to crawl these pages too.

 

Solution suggested: The solution I was asked to implement is as follows:

 

  1.  The search engine's API requests will carry a unique User-Agent. For example, the User-Agent will contain some token XYZ.
  2.  We will check each incoming request and, if it is identified as coming from the crawler, allow it to access the pages for indexing (bypassing the authentication).

My plan was to write a Filter to do this. However, some SMEs pointed out that relying on the User-Agent alone is a potential security issue: anyone who knows the User-Agent of the crawling API can send the same header and get past the check.
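
To make the SMEs' concern concrete, here is a minimal sketch of the kind of check such a Filter would perform. It is plain Java rather than an actual AEM/Sling filter, and the class name, method name, and the `XYZ` token are placeholders for illustration only. The point is that the check sees nothing but a header value that the client chooses.

```java
import java.util.Locale;

/** Sketch of a User-Agent based crawler check; names and token are hypothetical. */
final class UserAgentCrawlerCheck {

    // Placeholder token; the real crawler's User-Agent string is not given in this thread.
    static final String CRAWLER_TOKEN = "XYZ";

    /** Returns true when the User-Agent header contains the crawler token (case-insensitive). */
    static boolean looksLikeCrawler(String userAgentHeader) {
        if (userAgentHeader == null) {
            return false;
        }
        return userAgentHeader.toLowerCase(Locale.ROOT)
                .contains(CRAWLER_TOKEN.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The genuine crawler passes the check.
        System.out.println(looksLikeCrawler("XYZ-SearchBot/1.0"));
        // Any other client that copies the token into its header passes as well.
        System.out.println(looksLikeCrawler("curl/8.0 XYZ"));
        // An ordinary browser User-Agent does not pass.
        System.out.println(looksLikeCrawler("Mozilla/5.0"));
    }
}
```

Because the second call is indistinguishable from the first as far as this check is concerned, the User-Agent can at most serve as a hint and not as the only gate in front of protected pages.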

 

I am looking for the right or better option for this problem. I am open to all suggestions. 

 

I have a few suggestions from some SMEs that I am currently looking at:

 

1. Sandeep has suggested an approach that I think could be a good way to achieve this. Read here 

2. I am also checking whether this is possible: https://stackoverflow.com/a/1382668/8671041 

 

Thanks

Veena ✌

 

1 Accepted Solution


Correct answer by
Community Advisor

Hi @Veena_Vikram ,

Combining IP whitelisting with user authorization would help to achieve this.

1. Create a specific user with the required access rights for the third party.
2. Whitelist all IP ranges from which the third-party application sends requests.
3. The third-party application sends its requests along with the user authorization.
4. The request is processed based on the IP and the user authorization.
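
The four steps above could be sketched roughly as follows. This is only an illustration of the decision logic in plain Java, not an AEM implementation: it assumes the "user authorization" is sent as an HTTP Basic `Authorization` header, and the IP addresses, user name, and password are made-up placeholders. In a real setup the credentials would be verified against the repository user from step 1 instead of constants in code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Set;

/** Sketch: allow a request only if the source IP is allowlisted AND the credentials match. */
final class CrawlerAccessDecision {

    // Hypothetical placeholder values, not taken from the thread.
    private static final Set<String> ALLOWED_IPS = Set.of("203.0.113.10", "203.0.113.11");
    private static final String CRAWLER_USER = "search-crawler";
    private static final String CRAWLER_PASSWORD = "change-me";

    static boolean isAllowed(String remoteIp, String authorizationHeader) {
        // Step 2: the request must come from an allowlisted address.
        if (remoteIp == null || !ALLOWED_IPS.contains(remoteIp)) {
            return false;
        }
        // Step 3: the request must carry Basic credentials.
        if (authorizationHeader == null
                || !authorizationHeader.regionMatches(true, 0, "Basic ", 0, 6)) {
            return false;
        }
        byte[] decoded;
        try {
            decoded = Base64.getDecoder().decode(authorizationHeader.substring(6).trim());
        } catch (IllegalArgumentException e) {
            return false; // not valid Base64
        }
        // Step 4: compare "user:password" without short-circuiting on the first differing byte.
        byte[] expected = (CRAWLER_USER + ":" + CRAWLER_PASSWORD).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(decoded, expected);
    }

    /** Helper showing how a client would build the header value. */
    static String basicHeader(String user, String password) {
        byte[] raw = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        String header = basicHeader("search-crawler", "change-me");
        System.out.println(isAllowed("203.0.113.10", header)); // allowlisted IP, valid credentials
        System.out.println(isAllowed("198.51.100.7", header)); // valid credentials, unknown IP
    }
}
```

Requiring both conditions means that a leaked password alone, or a request from the right network without credentials, is not enough to reach the protected pages.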

 



Community Advisor

Hello @Veena_Vikram 

 

I am not an AEM expert, so I don't know how this can be implemented in AEM.

 

But you already have most of a solution: instead of checking the User-Agent, check the IP of the visitor/crawler. Get the range of IPs from the third-party crawler and add them to an approved list. If a request comes from one of those IPs, bypass authentication; otherwise, ask for authentication.
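
As a rough, AEM-independent illustration of the approved-list idea, the sketch below checks an IPv4 address against a list of ranges in CIDR notation. The class name and the example ranges are hypothetical, and IPv6 is left out to keep it short.

```java
import java.util.List;

/** Sketch: test whether an IPv4 address falls into one of the approved CIDR ranges. */
final class Ipv4Allowlist {

    private final List<String> cidrRanges;

    Ipv4Allowlist(List<String> cidrRanges) {
        this.cidrRanges = List.copyOf(cidrRanges);
    }

    /** Returns true if the address is inside any approved range; malformed input is rejected. */
    boolean contains(String ip) {
        if (ip == null) {
            return false;
        }
        long address;
        try {
            address = toLong(ip);
        } catch (IllegalArgumentException e) {
            return false;
        }
        for (String cidr : cidrRanges) {
            if (inRange(address, cidr)) {
                return true;
            }
        }
        return false;
    }

    private static boolean inRange(long address, String cidr) {
        String[] parts = cidr.split("/");
        long base = toLong(parts[0]);
        // A bare address without "/prefix" is treated as a single host (/32).
        int prefix = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 32;
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Invalid prefix length in " + cidr);
        }
        long mask = prefix == 0 ? 0L : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (address & mask) == (base & mask);
    }

    /** Converts dotted-quad notation to a 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] octets = ip.trim().split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String octet : octets) {
            int n = Integer.parseInt(octet); // NumberFormatException is an IllegalArgumentException
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("Octet out of range in " + ip);
            }
            value = (value << 8) | n;
        }
        return value;
    }

    public static void main(String[] args) {
        // Example ranges only; the real ones would come from the crawler vendor.
        Ipv4Allowlist approved = new Ipv4Allowlist(List.of("203.0.113.0/24", "198.51.100.7"));
        System.out.println(approved.contains("203.0.113.42")); // inside the /24 range
        System.out.println(approved.contains("203.0.114.1"));  // outside every range
    }
}
```

One caveat: if the site sits behind a proxy, load balancer, or CDN, the address the application sees may be that of the intermediary, so the check has to use whatever trusted source of the original client IP the infrastructure provides.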

 

 


Community Advisor

We achieved this the other way around.

We had a solution that replicates/posts page data to the search engine (Solr) via replication.

So we had no crawling, but the data is updated with each page publication.

 

We also had a UI for bulk publishing to Solr.

 

I am not sure whether this is possible for you, but another solution could be to use basic authentication for the crawler to bypass the filter.


Community Advisor

Thanks @arunpatidar. We don't need to submit data to the search engine. We need to allow it to crawl all our pages, irrespective of whether they are public or CUG-enabled.
