Pages are not getting indexed while crawling with Apache Nutch

Hi,

I am trying to index pages using Apache Nutch, but pages whose content comes from an external API are not getting indexed. Can someone help me resolve this issue?

Thanks in advance.

1 Reply

Hi there,

Just to understand a little more: are you saying that the external API content is not loaded at crawl time and is therefore not available to the crawler? And that it is not loaded because the event that triggers loading it does not fire while the crawl is running?

If you are crawling the AEM site to create search indexes, I would recommend the accepted pattern in which AEM pushes the content to the indexer as part of a publish replication agent. This also lets you sanitize and clean the content before sending it for indexing. In fact, this is an accepted design solution for use cases where you need to keep AEM content in an external system such as Solr.
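
To make the "sanitize, then push" idea a bit more concrete, here is a rough sketch in Python. It is only an illustration of the shape of the step, not AEM replication agent code. The function names, the field names (`id`, `title`, `content`) and the Solr core name `aem_pages` are all assumptions made up for this example. The only real interface it relies on is Solr's JSON update handler, which accepts a JSON array of documents.

```python
import json
import re
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def sanitize_html(html):
    """Hypothetical helper: reduce rendered page HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_solr_update(path, title, html):
    """Hypothetical helper: build a JSON body for Solr's update handler.

    The field names are assumptions; they must match your Solr schema.
    """
    doc = {"id": path, "title": title, "content": sanitize_html(html)}
    return json.dumps([doc])


def push_to_solr(update_url, payload):
    """POST the JSON payload to Solr and return the HTTP status code.

    update_url is an assumption, for example
    http://localhost:8983/solr/aem_pages/update?commit=true
    """
    request = urllib.request.Request(
        update_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

In a setup like this, the page HTML (including whatever the external API contributes) would be rendered on the AEM side first, so the indexer receives the final text and the crawler does not need to trigger any client-side loading.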

Hope it helps!