SOLVED

Issues while getting page content

Level 1

Hello guys, 

I have the following situation here: 

I'm trying to obtain the content of each page in order to index it into Solr documents. This will be used later, so that when a client searches the site for a word, they get the pages which contain that text (inside the page body, for example).

Now, the problem is that while reindexing the content to Solr (which means reading the page content), I'm getting a desynchronization between Solr and what the page actually contains. More exactly: multiple Solr documents contain the same content.

Even more details:
-> For this implementation I used SlingRequestProcessor and SlingInternalRequest, and neither seems to offer any way to synchronize (such as waiting for the response). SlingInternalRequest appears to offer more options for synchronization, but only in appearance.

-> The actual result looks something like this:

  • the result (page content) coming from one request is used for multiple Solr documents, instead of each request's own content being mapped to the appropriate Solr document

 

My questions are:

  • What are you guys using to read the content of a page (in HTML format)?
  • Are there other (Sling) implementations which offer the possibility of synchronization, or of waiting for the response?

1 Accepted Solution

Correct answer by
Level 1
 Shashi_Mulug, thank you for responding!

Yes, it is the second case. My question is not about the mechanism, but about how to safely obtain the page content in an iterative manner (if you have 100 pages, you have to make 100 HTTP requests to the instance).

It seems that using SlingRequestProcessor and SlingInternalRequest does not work properly, because they make the request in a hit-and-run manner.

So the solution I found is to make external requests using java.net.http.HttpClient.

It offers a method that waits for the response: https://docs.oracle.com/en/java/javase/11/docs/api/java.net.http/java/net/http/HttpClient.html#send(...)
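
For anyone landing here later, this is roughly what that approach can look like. It is only a minimal sketch, not the exact code from my project: the class name PageContentFetcher and its methods are made up for illustration, and it assumes the pages can be fetched with a plain GET (an instance that requires login would also need credentials on the request).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (name and shape are assumptions, not from the thread).
final class PageContentFetcher {

    private final HttpClient client;

    PageContentFetcher(HttpClient client) {
        this.client = client;
    }

    // Fetches one page. HttpClient.send blocks until the response has arrived,
    // so the returned body always belongs to the URI that was passed in.
    String fetch(URI pageUri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(pageUri)
                .timeout(Duration.ofSeconds(30)) // assumed timeout, tune as needed
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode()
                    + " for " + pageUri);
        }
        return response.body();
    }

    // Fetches the pages one after another and keeps each body paired with its
    // own URI, so one response cannot end up in several index documents.
    Map<URI, String> fetchAll(List<URI> pageUris)
            throws IOException, InterruptedException {
        Map<URI, String> contentByPage = new LinkedHashMap<>();
        for (URI pageUri : pageUris) {
            contentByPage.put(pageUri, fetch(pageUri));
        }
        return contentByPage;
    }
}
```

Each entry of the returned map can then be turned into one Solr document. Because the loop is sequential and send only returns once the response is complete, 100 pages means 100 requests made one after another, which is slower than firing them in parallel but keeps every body tied to the page it came from.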

2 Replies

Community Advisor

@dariuspuscas, are you using Solr as a replacement for AEM's internal Lucene index, or as an externally integrated search server for web search?

If it is the second case, you need to generate a feed from AEM on a regular basis (for example, a daily feed), populate the feed XML with all the required data from each page (page title, description, content, etc.), and then import that feed XML into Solr daily.
