Solved

Issues while getting page content

  • August 3, 2023
  • 1 reply
  • 760 views

Hello guys, 

I have the following situation here: 

I'm trying to obtain the content of each page in order to index it into Solr documents. This will be used later, so that when searching the site for a word, the client gets the page that contains that text inside the page body, for example.

 

Now, the problem is that while reindexing the content to Solr (which means reading page content), I'm getting a certain desynchronization between Solr and what the page actually contains; more exactly, multiple Solr documents contain the same content.

Even more details:
-> for this implementation I used SlingRequestProcessor and SlingInternalRequest, and they do not offer any real possibility for synchronization (like waiting for the response). SlingInternalRequest seems to offer more options for synchronization, but that is just an appearance (see the sketch after this list).

-> the actual result looks something like this:

  • the result (page content) coming from one request is used for multiple Solr documents, instead of each request's content being mapped to its own Solr document
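
For context, the internal-rendering approach mentioned above usually looks roughly like this. A minimal sketch, assuming AEM's RequestResponseFactory combined with SlingRequestProcessor; in a real OSGi component both services would come from @Reference, and error handling is omitted:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.engine.SlingRequestProcessor;

import com.day.cq.contentsync.handler.util.RequestResponseFactory;
import com.day.cq.wcm.api.WCMMode;

public class InternalPageRenderer {

    private final RequestResponseFactory requestResponseFactory;
    private final SlingRequestProcessor requestProcessor;

    public InternalPageRenderer(RequestResponseFactory requestResponseFactory,
                                SlingRequestProcessor requestProcessor) {
        this.requestResponseFactory = requestResponseFactory;
        this.requestProcessor = requestProcessor;
    }

    /** Renders one page to HTML through the internal Sling request pipeline. */
    public String renderPage(ResourceResolver resolver, String pagePath) throws Exception {
        Map<String, Object> params = new HashMap<>();
        HttpServletRequest request =
                requestResponseFactory.createRequest("GET", pagePath + ".html", params);
        WCMMode.DISABLED.toRequest(request); // render without authoring decoration

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        HttpServletResponse response = requestResponseFactory.createResponse(out);

        requestProcessor.processRequest(request, response, resolver);
        response.getWriter().flush();

        return out.toString("UTF-8");
    }
}
```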

 

My questions are:

  • What do you guys use for reading the content of a page (HTML format)?
  • Are there other (Sling) implementations that offer the possibility of synchronization, or of waiting for the response?



1 reply

Shashi_Mulugu
Community Advisor
August 4, 2023

@dariuspuscas are you using Solr as a replacement for AEM's internal Lucene index, or as an external integrated search server for web search?

If it's the second case, you need to generate a feed from AEM on a regular basis (it could be a daily feed). In that feed XML, populate all the required data from the page: page title, description, content, etc. Then use this feed XML to import into Solr daily.
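
For illustration, a minimal sketch of such a feed generator, assuming CQ's PageManager API; the class name SolrFeedGenerator, the feed/doc/field element names, and the field set are my own choices here, not a fixed format:

```java
import java.io.StringWriter;
import java.util.Iterator;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

import org.apache.sling.api.resource.ResourceResolver;

import com.day.cq.wcm.api.Page;
import com.day.cq.wcm.api.PageManager;

public class SolrFeedGenerator {

    /** Walks the page tree and writes one <doc> element per page. */
    public String buildFeed(ResourceResolver resolver, String rootPath) throws Exception {
        PageManager pageManager = resolver.adaptTo(PageManager.class);
        Page root = pageManager.getPage(rootPath);
        if (root == null) {
            return "";
        }

        StringWriter buffer = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("feed");
        writePage(xml, root);
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return buffer.toString();
    }

    private void writePage(XMLStreamWriter xml, Page page) throws Exception {
        xml.writeStartElement("doc");
        writeField(xml, "id", page.getPath());
        writeField(xml, "title", page.getTitle());
        writeField(xml, "description", page.getDescription());
        xml.writeEndElement();

        // Recurse into child pages so the whole subtree ends up in the feed.
        for (Iterator<Page> children = page.listChildren(); children.hasNext(); ) {
            writePage(xml, children.next());
        }
    }

    private void writeField(XMLStreamWriter xml, String name, String value) throws Exception {
        xml.writeStartElement("field");
        xml.writeAttribute("name", name);
        xml.writeCharacters(value == null ? "" : value);
        xml.writeEndElement();
    }
}
```

The resulting XML could then be pushed to Solr's update handler on a daily schedule, for example from a scheduled Sling job.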

dariuspuscas
Author
Accepted solution
August 11, 2023
@Shashi_Mulugu, thank you for responding!

Yes, it is the second case. My question is not about the mechanism, but about how to obtain the page content safely, in an iterative manner (if you have 100 pages, you have to make 100 HTTP requests to the instance).

It seems that using SlingRequestProcessor and SlingInternalRequest does not work properly here, because they make the request in a fire-and-forget ("hit and run") manner.

 

So the solution that I found is to make external requests using java.net.http.HttpClient.

It offers a method that blocks until the response arrives: https://docs.oracle.com/en/java/javase/11/docs/api/java.net.http/java/net/http/HttpClient.html#send(java.net.http.HttpRequest,java.net.http.HttpResponse.BodyHandler)
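
For reference, a minimal, self-contained sketch of that synchronous call (the instance URL and page path are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:4502/content/mysite/en/home.html"))
                .GET()
                .build();

        // send() blocks until the full response has arrived, so the body below
        // belongs to exactly this request and no other.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```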