SOLVED

Indexing the whole site to SOLR

Nithyasri_K
Level 2

We are currently trying to index our site (about 40K pages) to a SOLR server using a scheduler job.

I tried scraping the rendered pages with HtmlUnit and Jsoup, but both approaches take more than 10 s per page to build the model object that is sent to SOLR.

With ModelExporter (getting jcr:content as JSON) I can build the model object in under 1 s. This works fine for a single page, but when it runs from the scheduler (which iterates over the pages), each page takes 2-3 s.

At that rate (40,000 pages at 2-3 s each), full-site indexing takes about 24 hours.

Does anyone have an idea how to do this optimally, or know of any AEM server-side setting or activity that could speed it up?

1 Accepted Solution
Jörg_Hoh
Correct answer by
Employee

I don't think there is a faster way to extract this information in a structured form, but you can always run the process multi-threaded. Also, don't think only about the initial population of the index; consider how updates will be handled during regular operation.
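As a rough illustration of the multi-threaded suggestion, here is a minimal sketch in plain Java. It is not code from this thread: the class `ParallelIndexer`, the method `buildAll`, and the `buildDocument` function are hypothetical names, and `buildDocument` only stands in for whatever per-page export step you already have (for example the ModelExporter call). Sending the documents to SOLR is left out. Keep in mind that JCR sessions and resource resolvers are not thread-safe, so in a real AEM job each worker would need its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelIndexer {

    /**
     * Builds one document per page path on a fixed-size thread pool.
     * buildDocument is a placeholder for the existing per-page export step.
     */
    public static <T> List<T> buildAll(List<String> pagePaths,
                                       Function<String, T> buildDocument,
                                       int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per page; each task only calls the supplied function.
            List<Callable<T>> tasks = new ArrayList<>();
            for (String path : pagePaths) {
                tasks.add(() -> buildDocument.apply(path));
            }

            // invokeAll blocks until all tasks are done and returns the
            // futures in the same order as the submitted tasks.
            List<T> results = new ArrayList<>();
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get()); // rethrows a task failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelIndexer.buildAll(paths, path -> exportPage(path), 8)` (with `exportPage` being your own method) would then process up to eight pages at a time. How much this helps depends on where the 2-3 s per page is actually spent, so the thread count is something to measure rather than assume.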


3 Replies
Nithyasri_K
Level 2

Hi @Kiran_Vedantam, our old approach (before using the model exporter) came from the links above. It took 8 s to get the page data, so we moved to the model exporter.
