
SOLVED

Solution suggestions request: A third-party search engine needs to crawl all pages of the website, including pages behind authentication (private/gated/paywalled content).


Community Advisor

Hi Team

 

So I recently raised a question (https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-can-i-enable-private-g...) about an issue I was facing. I am rethinking the solution approach I was asked to follow, and I would like some expert opinions on the best way to handle it.

 

Problem: We have a third-party search engine API that needs to crawl all our pages and index them. We have integrated SAML authentication, and some pages are behind that authentication. We need the search engine to be able to crawl those pages as well.

 

Suggested solution: The solution I was asked to implement is as follows:

 

1. The search engine's crawler requests will carry a unique User-Agent; for example, the header will contain some token XYZ.
2. We check every incoming request, and if it is identified as coming from the crawler, we allow it to access the pages for indexing (i.e., bypass authentication).

My thought process was to write a servlet Filter to do exactly that. However, some SMEs pointed out that relying only on the User-Agent is a potential security issue: anyone who knows the crawler's User-Agent string could spoof it and reach the gated pages.
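For context, this is roughly the kind of Filter I had in mind. It is only a sketch under my own assumptions: the User-Agent token (XYZ-Crawler) and the request attribute it sets are placeholders, and, as noted above, any client that sends the same User-Agent would pass this check.

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.osgi.service.component.annotations.Component;

/**
 * Sketch of the suggested approach: identify the crawler purely by its
 * User-Agent header. "XYZ-Crawler" is a placeholder token.
 * Weakness: any client sending the same User-Agent is treated as the crawler.
 */
@Component(
        service = Filter.class,
        property = {
                "sling.filter.scope=REQUEST",
                "service.ranking:Integer=10000"
        })
public class CrawlerUserAgentFilter implements Filter {

    private static final String CRAWLER_UA_TOKEN = "XYZ-Crawler"; // placeholder

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String userAgent = httpRequest.getHeader("User-Agent");

        boolean isCrawler = userAgent != null && userAgent.contains(CRAWLER_UA_TOKEN);
        if (isCrawler) {
            // Mark the request so downstream logic (e.g. whatever enforces the
            // gate) could decide to skip the login redirect. Attribute name is
            // a placeholder for illustration only.
            request.setAttribute("isTrustedCrawler", Boolean.TRUE);
        }
        chain.doFilter(request, response);
    }

    @Override
    public void init(FilterConfig filterConfig) { /* no-op */ }

    @Override
    public void destroy() { /* no-op */ }
}
```

This is exactly what makes the SMEs uneasy: the only "secret" here is a header value that the client fully controls.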

 

I am looking for the right, or at least a better, option for this problem. I am open to all suggestions.

 

I have a few suggestions from SMEs that I am currently looking at:

 

1. Sandeep has suggested one approach which I think could be a good way to achieve this. Read here

2. I am also checking whether this is possible: https://stackoverflow.com/a/1382668/8671041

 

Thanks

Veena ✌

 


4 Replies


Community Advisor

Hello @VeenaVikraman 

 

I am not an AEM expert, so I don't know exactly how this would be implemented in AEM.

 

But you already have most of a solution: instead of checking the User-Agent, check the IP of the visitor/crawler. Get the range of IPs the third-party crawler uses and add them to an approved list. If a request comes from one of those IPs, bypass authentication; otherwise, ask for authentication.
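A minimal sketch of that IP check, assuming the crawler's published IPv4 ranges are known (the CIDR ranges below are placeholders):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

/**
 * Sketch of an IPv4 allowlist check. The CIDR ranges are placeholders;
 * in practice they would come from the crawler vendor's documentation.
 */
public class CrawlerIpAllowlist {

    // Placeholder ranges - replace with the ranges published by the crawler vendor.
    private static final List<String> ALLOWED_CIDRS =
            List.of("203.0.113.0/24", "198.51.100.0/25");

    public static boolean isAllowed(String remoteIp) {
        for (String cidr : ALLOWED_CIDRS) {
            if (matches(cidr, remoteIp)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the IPv4 address falls inside the given CIDR block. */
    private static boolean matches(String cidr, String ip) {
        try {
            String[] parts = cidr.split("/");
            int prefixLength = Integer.parseInt(parts[1]);
            int network = toInt(InetAddress.getByName(parts[0]).getAddress());
            int candidate = toInt(InetAddress.getByName(ip).getAddress());
            int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
            return (network & mask) == (candidate & mask);
        } catch (UnknownHostException | ArrayIndexOutOfBoundsException | NumberFormatException e) {
            return false; // malformed input -> treat as not allowed
        }
    }

    /** Packs a 4-byte IPv4 address into an int for masking. */
    private static int toInt(byte[] address) {
        int result = 0;
        for (byte b : address) {
            result = (result << 8) | (b & 0xFF);
        }
        return result;
    }
}
```

One practical note: if requests reach the publish instance through the Dispatcher or a load balancer, the real client IP may only be available in a forwarded header (e.g. X-Forwarded-For), so the check has to run where that original IP is visible and trustworthy.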

 

 


     Manoj


Community Advisor

We achieved this the other way around: we had a solution that replicates/posts page data to the search engine (Solr) as part of replication.

So there was no crawling; the index is updated with each page publication.

 

We also had a UI for bulk publishing to Solr.
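For illustration, a rough sketch of that push using SolrJ; the Solr URL, collection and field names are placeholders, and in our setup the call was triggered from the replication/publication flow:

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/**
 * Sketch: push a published page directly to Solr instead of letting a
 * crawler fetch it. URL, collection and field names are placeholders.
 */
public class SolrPagePublisher {

    private static final String SOLR_URL = "http://solr.example.com:8983/solr/pages"; // placeholder

    public void publishPage(String pagePath, String title, String text)
            throws IOException, SolrServerException {
        try (SolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pagePath);   // page path as unique id
            doc.addField("title", title);   // extracted page title
            doc.addField("content", text);  // extracted page text
            client.add(doc);
            client.commit();
        }
    }
}
```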

 

I am not sure if this is possible in your case, but another option is to use Basic authentication for the crawler so it can get past the filter.
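If you go the Basic-authentication route, the header check itself is simple; a sketch (the credentials are placeholders, and a real implementation should validate against the repository's user store rather than hard-coded values):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

import javax.servlet.http.HttpServletRequest;

/**
 * Sketch: recognise the crawler via HTTP Basic authentication.
 * The expected credentials are placeholders only.
 */
public class CrawlerBasicAuthCheck {

    private static final String EXPECTED_USER = "search-crawler"; // placeholder
    private static final String EXPECTED_PASSWORD = "changeit";   // placeholder

    public static boolean isCrawler(HttpServletRequest request) {
        String header = request.getHeader("Authorization");
        if (header == null || !header.startsWith("Basic ")) {
            return false;
        }
        // Basic credentials are base64("user:password")
        String decoded = new String(
                Base64.getDecoder().decode(header.substring("Basic ".length())),
                StandardCharsets.UTF_8);
        return decoded.equals(EXPECTED_USER + ":" + EXPECTED_PASSWORD);
    }
}
```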



Arun Patidar


Community Advisor

Thanks @arunpatidar. We don't need to submit data to the search engine; we need to allow it to crawl all our pages, regardless of whether they are public or CUG-enabled.


Correct answer by
Community Advisor

Hi @VeenaVikraman ,

IP whitelisting along with user authorization would help you achieve this:

1. Create a specific user for the third party with the required access rights.
2. Whitelist the IP range(s) from which the third-party application sends its requests.
3. The third-party application sends its requests along with that user's authorization.
4. The request is processed only if both the IP and the user authorization check out (see the sketch below).