Indexing the whole site to SOLR | Community
Level 2
September 27, 2021
Solved

  • 2 replies
  • 1108 views

We are currently trying to index our site (40K pages) into a SOLR server using a scheduler job.

I have tried web page scraping with HTMLUnit and Jsoup, but both approaches take 10+ s to build the model object that has to be sent to SOLR.

I was able to build the model object using ModelExporter (getting jcr:content as JSON) in under 1 s. This works fine for a single page, but when I run it from the scheduler (which iterates over the pages), it takes 2-3 s per page, so indexing the full site takes about 24 hours.

Does anyone have an idea how to do this optimally, or know of any AEM server activity that can speed this up?


2 replies

Kiran_Vedantam
Community Advisor
September 27, 2021
Level 2
September 27, 2021

Hi @kiran_vedantam, our old approach (before using the model exporter) was based on the links above. It took 8 s to get the page data, so we moved to the model exporter.

joerghoh
Adobe Employee
Accepted solution
September 27, 2021

I don't think that there is a faster way to extract this information in a structured way, but you can always run this process in a multi-threaded way. And instead of just thinking of the initial filling of the index, please consider the cases of updates during regular operation.
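
The multi-threaded approach suggested above could be sketched roughly as follows. This is a minimal illustration in plain Java, not the poster's actual scheduler code: the class name `ParallelIndexer`, the `buildDocument` function (standing in for the per-page ModelExporter call) and the `sendBatch` consumer (standing in for the call that posts documents to SOLR) are all hypothetical. In AEM, a ResourceResolver should not be shared between threads, so each task would need to obtain its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

public class ParallelIndexer {

    /**
     * Builds one document per page path on a fixed-size thread pool and hands
     * the results to the sender in batches. Returns the number of documents sent.
     *
     * buildDocument and sendBatch are hypothetical placeholders for the
     * per-page export and the SOLR update call.
     */
    public static int indexAll(List<String> pagePaths,
                               Function<String, String> buildDocument,
                               Consumer<List<String>> sendBatch,
                               int threads,
                               int batchSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every page; the slow per-page work runs concurrently.
            List<Future<String>> futures = new ArrayList<>();
            for (String path : pagePaths) {
                Callable<String> task = () -> buildDocument.apply(path);
                futures.add(pool.submit(task));
            }

            // Collect results in submission order and flush each full batch.
            List<String> batch = new ArrayList<>();
            int sent = 0;
            for (Future<String> future : futures) {
                batch.add(future.get());
                if (batch.size() == batchSize) {
                    sendBatch.accept(new ArrayList<>(batch));
                    sent += batch.size();
                    batch.clear();
                }
            }

            // Flush the last, possibly partial, batch.
            if (!batch.isEmpty()) {
                sendBatch.accept(new ArrayList<>(batch));
                sent += batch.size();
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the per-page cost stays at 2-3 s, a pool of N threads would at best cut the total time by roughly a factor of N, provided the AEM instance can handle the extra load. The same method could be called with only the paths of changed pages to cover updates during regular operation, rather than re-indexing the whole site each time.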