Solution Suggestions Request: A third-party search engine needs to crawl all website pages, including the pages behind authentication (private, gated, or paywalled content)
VeenaVikraman
Community Advisor
December 9, 2022
Solved

  • 3 replies
  • 1336 views

Hi Team,

I recently raised a question (https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-can-i-enable-private-gated-content-crawling-in-aem/td-p/562158) about an issue I was facing. I am now rethinking the solution approach I was asked to follow, and I would like some expert opinions on the best way to handle it.

Problem: We have a third-party search engine API that needs to crawl and index all our pages. The site uses SAML authentication, and some pages have authentication enabled. We need the search engine to crawl those pages too.

Solution suggested: The solution I was asked to implement is as follows:

  1.  The search engine's API requests will carry a unique User-Agent, e.g. one containing a marker string such as XYZ.
  2.  We check every incoming request and, if it is identified as coming from the crawler, let it bypass authentication so the pages can be indexed.

My plan was to write a Filter to do this. However, some SMEs pointed out that relying on the User-Agent alone is a potential security issue: anyone who learns the crawler's User-Agent string can send it and bypass authentication as well.

I am looking for a better option for this problem, and I am open to all suggestions.

I am currently looking at a few suggestions from SMEs:

1. Sandeep has suggested one approach, which I think is a good way to achieve this. Read here

2. I am also checking whether this is possible: https://stackoverflow.com/a/1382668/8671041

Thanks

Veena ✌

3 replies

Manoj_Kumar
Community Advisor
December 9, 2022

Hello @veenavikraman,

I am not an AEM expert, so I don't know how this would be implemented in AEM.

But you already have most of a solution: instead of checking the User-Agent, check the IP address of the visitor/crawler. Get the range of IPs used by the third-party crawler and add them to an approved list. If a request comes from one of those IPs, bypass authentication; otherwise, ask for authentication.
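
The IP check described above could be sketched in Java roughly as follows. This is only a minimal illustration, not AEM-specific code: the class name `CrawlerIpAllowlist` and the example ranges (taken from the address blocks reserved for documentation) are hypothetical, and the real ranges would have to come from the search vendor. It also assumes the filter sees the crawler's real address; behind a dispatcher, CDN, or load balancer the client address usually arrives in a forwarding header such as X-Forwarded-For, which should only be trusted when it is set by your own proxy.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

/** Hypothetical helper: checks a client IP against an approved list of CIDR ranges. */
final class CrawlerIpAllowlist {
    private final List<String> cidrRanges;

    CrawlerIpAllowlist(List<String> cidrRanges) {
        this.cidrRanges = List.copyOf(cidrRanges);
    }

    /** Returns true if the address falls inside at least one approved range. */
    boolean isAllowed(String clientIp) {
        for (String cidr : cidrRanges) {
            if (matches(cidr, clientIp)) {
                return true;
            }
        }
        return false;
    }

    /** Compares the first `prefix` bits of the network address and the candidate address. */
    static boolean matches(String cidr, String ip) {
        try {
            String[] parts = cidr.split("/");
            // Pass IP literals only: a hostname here would trigger a DNS lookup.
            byte[] network = InetAddress.getByName(parts[0]).getAddress();
            byte[] candidate = InetAddress.getByName(ip).getAddress();
            if (network.length != candidate.length) {
                return false; // IPv4 range vs. IPv6 address, or the reverse
            }
            int maxBits = network.length * 8;
            int prefix = parts.length > 1 ? Integer.parseInt(parts[1]) : maxBits;
            if (prefix < 0 || prefix > maxBits) {
                return false;
            }
            int fullBytes = prefix / 8;
            for (int i = 0; i < fullBytes; i++) {
                if (network[i] != candidate[i]) {
                    return false;
                }
            }
            int remainingBits = prefix % 8;
            if (remainingBits == 0) {
                return true;
            }
            int mask = (0xFF << (8 - remainingBits)) & 0xFF;
            return (network[fullBytes] & mask) == (candidate[fullBytes] & mask);
        } catch (UnknownHostException | NumberFormatException e) {
            return false; // malformed input is treated as "not allowed"
        }
    }
}
```

A filter would call `isAllowed(...)` with the resolved client address and fall back to the normal login flow whenever it returns false.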

 

 

Manoj  | https://themartech.pro
arunpatidar
Community Advisor
December 9, 2022

We achieved this the other way around.

We had a solution that posted page data to the search engine (Solr) as part of replication, so there was no crawling; the index was updated with each page publication.

We also had a UI for bulk publishing to Solr.

I am not sure whether this is possible for you, but another option is to have the crawler use basic authentication to get past the filter.
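
For the basic authentication option, the crawler would send an `Authorization: Basic ...` header and the filter would decode it before checking the credentials. The sketch below shows only the header handling as defined in RFC 7617; the class name `BasicAuthHeader` is hypothetical, and how the decoded credentials are then validated (for example against a dedicated crawler user) is left open. Basic credentials are only base64-encoded, not encrypted, so this assumes the crawler talks to the site over HTTPS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

/** Hypothetical helper for building and parsing HTTP Basic Authorization headers. */
final class BasicAuthHeader {

    /** Builds the header value a crawler would send, e.g. "Basic dXNlcjpwYXNz". */
    static String encode(String user, String password) {
        String raw = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Parses an Authorization header value.
     * Returns {user, password}, or empty if the header is missing, not Basic, or malformed.
     */
    static Optional<String[]> decode(String headerValue) {
        if (headerValue == null || !headerValue.regionMatches(true, 0, "Basic ", 0, 6)) {
            return Optional.empty();
        }
        try {
            byte[] decoded = Base64.getDecoder().decode(headerValue.substring(6).trim());
            String raw = new String(decoded, StandardCharsets.UTF_8);
            int colon = raw.indexOf(':'); // the user id cannot contain ':', the password can
            if (colon < 0) {
                return Optional.empty();
            }
            return Optional.of(new String[] { raw.substring(0, colon), raw.substring(colon + 1) });
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not valid base64
        }
    }
}
```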

Arun Patidar
VeenaVikraman
Community Advisor
December 12, 2022

Thanks @arunpatidar. We don't need to submit data to the search engine. We need to allow it to crawl all our pages, regardless of whether they are public or CUG-enabled.

BrijeshYadav
Accepted solution
Level 5
December 9, 2022

Hi @veenavikraman,

IP whitelisting combined with user authorization would help to achieve this.

1. Create a dedicated user for the third party with the required access rights.
2. Whitelist the IP ranges from which the third-party application sends its requests.
3. The third-party application sends its requests along with that user's authorization.
4. The request is processed based on both the IP and the user authorization.
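
The four steps above boil down to one rule: crawler access is granted only when both checks pass. A minimal Java sketch of that decision is shown below. The class name `CrawlerAccessPolicy` and the two pluggable checks are hypothetical; in AEM the credential check would presumably be delegated to the repository login for the dedicated crawler user, and that user's read permissions on the CUG pages would decide what it can see.

```java
import java.util.function.BiPredicate;
import java.util.function.Predicate;

/** Hypothetical decision helper: crawler access requires an approved IP AND valid credentials. */
final class CrawlerAccessPolicy {
    private final Predicate<String> ipAllowed;                  // step 2: IP allowlist check
    private final BiPredicate<String, String> credentialsValid; // steps 1 and 3: dedicated user check

    CrawlerAccessPolicy(Predicate<String> ipAllowed, BiPredicate<String, String> credentialsValid) {
        this.ipAllowed = ipAllowed;
        this.credentialsValid = credentialsValid;
    }

    /**
     * Step 4: returns true only if the request comes from an approved IP and
     * carries valid credentials. Any other request should go through the
     * site's normal login flow instead.
     */
    boolean grantsCrawlerAccess(String clientIp, String user, String password) {
        if (clientIp == null || !ipAllowed.test(clientIp)) {
            return false;
        }
        if (user == null || password == null) {
            return false;
        }
        return credentialsValid.test(user, password);
    }
}
```

Compared with the User-Agent check, a spoofed header alone is no longer enough: an attacker would need both the crawler's credentials and a source address inside the approved ranges.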