SOLVED

Indexing the whole site to SOLR

Level 2

We are currently trying to index our site (about 40K pages) into a Solr server using a scheduler job.

I have tried web page scraping with HtmlUnit and Jsoup, but both approaches take more than 10 s per page to build the model object that is sent to Solr.

I was able to build the model object using ModelExporter (getting jcr:content as JSON) in under 1 s. This works fine for a single page, but when I run it from the scheduler (which iterates over the pages), it takes 2-3 s per page, so indexing the full site takes about 24 hours.

Does anyone have an idea how to do this optimally, or know of any AEM server-side change that could speed this up?

1 Accepted Solution

Correct answer by
Employee Advisor

I don't think there is a faster way to extract this information in a structured form, but you can always run the process in multiple threads. Also, rather than thinking only about the initial population of the index, please consider how updates will be handled during regular operation.
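
As a rough illustration of the multi-threaded idea, here is a minimal sketch in plain Java. It is not taken from the thread: the class `ParallelIndexer`, the method `indexAll`, and the `exporter` and `batchSink` parameters are hypothetical names. The assumption is that `exporter` would wrap whatever produces the page document (for example the Model Exporter call) and that `batchSink` would push a batch of documents to Solr. Page exports run on a fixed-size thread pool, while batches are sent from the calling thread only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

public class ParallelIndexer {

    /**
     * Exports every page path on a fixed-size thread pool and hands the
     * resulting documents to batchSink in batches of at most batchSize.
     * Returns the number of documents handed to the sink.
     * exporter and batchSink are placeholders (assumptions) for the real
     * page export and the real Solr update call.
     */
    public static <D> int indexAll(List<String> pagePaths,
                                   Function<String, D> exporter,
                                   Consumer<List<D>> batchSink,
                                   int threads,
                                   int batchSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one export task per page; the pool runs them concurrently.
            List<Future<D>> futures = new ArrayList<>();
            for (String path : pagePaths) {
                Callable<D> task = () -> exporter.apply(path);
                futures.add(pool.submit(task));
            }

            // Collect results in submission order and flush full batches.
            List<D> batch = new ArrayList<>(batchSize);
            int sent = 0;
            for (Future<D> future : futures) {
                batch.add(future.get());
                if (batch.size() == batchSize) {
                    batchSink.accept(new ArrayList<>(batch));
                    sent += batch.size();
                    batch.clear();
                }
            }

            // Flush the final, possibly smaller batch.
            if (!batch.isEmpty()) {
                batchSink.accept(new ArrayList<>(batch));
                sent += batch.size();
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

If something like this is used inside AEM, keep in mind that a ResourceResolver (and the JCR session behind it) is not thread-safe, so each export task should work with its own resolver instead of sharing one across threads. Sending documents in batches rather than one request per page may also reduce the overhead on the Solr side. The thread count and batch size are tuning values to measure, not recommendations.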

3 Replies

Level 2

Hi @Kiran_Vedantam, our old approach (before using Model Exporter) was the one from the links above. It took 8 s to get the page data, hence we moved to Model Exporter.