Solution suggestions request: a third-party search engine needs to crawl all pages of the website, including pages behind authentication (private/gated/paywalled content).
Hi Team
I recently raised a question, https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-can-i-enable-private-gated-content-crawling-in-aem/td-p/562158, about an issue I was facing. I am now rethinking the solution approach I was asked to follow and would like some expert opinions on the best way to handle it.
Problem: A third-party search engine API needs to crawl and index all our pages. The site is integrated with SAML authentication, and some pages require the user to be authenticated. We need the search engine to crawl these pages too.
Suggested solution: The approach I was asked to implement is as follows:
- The search engine's API requests will carry a unique User-Agent. For example, the User-Agent will contain some token such as XYZ.
- We will check every incoming request and, if it is identified as coming from the crawler, allow it to bypass authentication so it can access the pages for indexing.
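The check described in the two steps above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the actual AEM implementation: the class name `CrawlerUserAgentCheck` and the token value `XYZ` are placeholders, and in AEM the same decision would presumably live inside a servlet Filter that reads the `User-Agent` request header.

```java
// Hypothetical helper; in AEM this decision would sit inside a servlet Filter.
final class CrawlerUserAgentCheck {

    // Assumed marker. The real crawler's User-Agent token is not known here.
    static final String CRAWLER_TOKEN = "XYZ";

    // Returns true when the User-Agent header value contains the crawler token.
    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false; // no header, so treat as a normal request
        }
        return userAgent.contains(CRAWLER_TOKEN);
    }

    public static void main(String[] args) {
        System.out.println(isCrawler("XYZ-Crawler/1.0")); // the crawler
        System.out.println(isCrawler("Mozilla/5.0"));     // a regular browser
        System.out.println(isCrawler("curl/8.0 XYZ"));    // any client sending the same token
    }
}
```

Note that the third call also passes the check, because the header value is entirely under the client's control.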
My plan was to write a Filter to do this. However, some SMEs pointed out that relying on the User-Agent alone is a potential security issue: anyone who knows the crawler's User-Agent could send it and bypass authentication.
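To make the concern concrete, here is a rough sketch of what pairing the User-Agent with a second factor could look like. Everything in it is an assumption for illustration only: the class name `CrawlerTokenCheck`, the idea of the crawler sending a shared secret in a custom request header, and the `XYZ` token. Whether the search engine can send such a header at all would need to be confirmed with the vendor.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical helper: User-Agent match plus a shared secret supplied by the crawler.
final class CrawlerTokenCheck {

    // Assumed marker, as in the User-Agent example from the post.
    static final String CRAWLER_TOKEN = "XYZ";

    // userAgent:      value of the User-Agent header
    // presentedToken: secret sent by the client (e.g. in a custom header; assumption)
    // expectedToken:  secret configured on the server side (e.g. via OSGi config; assumption)
    static boolean isTrustedCrawler(String userAgent, String presentedToken, String expectedToken) {
        if (userAgent == null || presentedToken == null || expectedToken == null) {
            return false;
        }
        if (!userAgent.contains(CRAWLER_TOKEN)) {
            return false;
        }
        // MessageDigest.isEqual compares in constant time, which avoids leaking
        // how many leading bytes of the secret matched.
        return MessageDigest.isEqual(
                presentedToken.getBytes(StandardCharsets.UTF_8),
                expectedToken.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String secret = "example-secret"; // placeholder value
        System.out.println(isTrustedCrawler("XYZ-Crawler/1.0", "example-secret", secret));
        System.out.println(isTrustedCrawler("XYZ-Crawler/1.0", "wrong-guess", secret));
    }
}
```

With this shape, knowing the User-Agent alone is no longer enough to get past the check, although the secret then has to be stored and rotated safely on both sides.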
I am looking for a better option for this problem and am open to all suggestions.
I have a few suggestions from SMEs that I am currently looking at:
1. Sandeep has suggested one approach, which I think is a good way to achieve this. Read here
2. I am also checking whether this is possible: https://stackoverflow.com/a/1382668/8671041
Thanks
Veena ✌
