SOLVED

Exporting Content into 3rd party search

Level 5

Hey All,

 

I am trying to figure out the best way to export pages and their content to a third-party search provider. The data needs to look like this:

 

{
  "op": "add",
  "path": "/path/to/page",
  "value": {
    "attributes": {
      "title": "Page Title",
      "url": "https://www.website.com/path/to/page.html",
      "description": "This is basically all of the content on the page. So if there are 2 different text areas on the page, it should put that content inside this description."
    }
  }
}

 

 

I know I can ask the page for everything, including a description, but that doesn't account for the components on the page. For example, let's say I put a new text component on the page and add text that I want to be searchable; the page properties wouldn't include that text. I started to look into jsoup (and HttpClient, since the site is a SPA) to crawl the page, but is that the best option?

 

Thanks to anyone who has an opinion.

1 Accepted Solution

Correct answer by
Level 5

This is only good if you are running outside of AEM. I realized a couple of things; one of them was that I could use HtmlUnit and jsoup to scrape and parse. In the end, I decided to use Content Services and the JSON model to parse the content.

5 Replies

Community Advisor

Hi @Sean-McK 

 

You can use the AEM JSON Exporter functionality that comes OOTB from core components to expose the content as a service (JSON).

 

https://experienceleague.adobe.com/docs/experience-manager-64/developing/components/json-exporter.ht....

 

Hope it helps!

Thanks,
Kiran Vedantam

Level 5

Hey Kiran,

 

Thanks for the reply. I think the issue with that option is finding out which components and which fields to pull. For example, in a text component the field name is text (easy), but what if I create a custom component? Let's say my custom component has 3 fields (title, subtitle, and botteminfo). When I read that JSON, I would have to know to bring in those 3 fields. I guess I could create a config that lists the field names to "index", and when the exporter gets the page it can look for all of those field names.
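
A minimal sketch of that config idea, in Python for brevity rather than as AEM code. The set `INDEXED_FIELDS`, the helper `collect_text`, and the sample model are all hypothetical; the sample is only loosely shaped like a page's JSON export, and a real export will differ in structure and property names.

```python
import json
import re

# Hypothetical config: property names whose values should be indexed.
INDEXED_FIELDS = {"text", "title", "subtitle", "botteminfo"}

# Naive tag stripper; good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def collect_text(node, fields=INDEXED_FIELDS):
    """Recursively gather string values stored under the configured field names."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in fields and isinstance(value, str):
                # Rich-text fields usually hold HTML, so drop the markup
                # and collapse the remaining whitespace.
                cleaned = " ".join(TAG_RE.sub(" ", value).split())
                if cleaned:
                    found.append(cleaned)
            else:
                # Not an indexed string: keep walking into nested content.
                found.extend(collect_text(value, fields))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item, fields))
    return found


# Made-up model for illustration only (assumption, not real exporter output).
sample_model = json.loads("""
{
  ":items": {
    "root": {
      ":items": {
        "text": {"text": "<p>First text area.</p>", ":type": "example/components/text"},
        "custom": {
          "title": "Custom title",
          "subtitle": "A subtitle",
          "botteminfo": "<b>Footer</b> info"
        }
      }
    }
  }
}
""")

description = " ".join(collect_text(sample_model))
print(description)
```

With this approach, adding a new custom component only means adding its field names to the config; the traversal itself does not need to know about component types.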

 

Community Advisor

Hi @Sean-McK 

In your scenario, you have two options to consider: a push model and a pull model. The push model sends updated data to the search service when a page is published, or from an EventHandler in AEM. The pull model, on the other hand, configures a crawler on the search service side that periodically (for example, every hour) retrieves data from AEM.
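
As a rough illustration of the push side, here is a small Python sketch that builds the "add" operation from the original question for one published page. The function names `build_add_operation` and `on_page_published`, the `send` callable, and the default base URL are assumptions made up for this example; they are not part of AEM or of any search vendor's API.

```python
import json


def build_add_operation(page_path, title, description,
                        base_url="https://www.website.com"):
    """Build the 'add' operation in the shape shown in the question."""
    return {
        "op": "add",
        "path": page_path,
        "value": {
            "attributes": {
                "title": title,
                # Assumes public URLs are the content path plus ".html".
                "url": f"{base_url}{page_path}.html",
                "description": description,
            }
        },
    }


def on_page_published(page_path, title, description, send):
    """Hypothetical hook, called by whatever reacts to a publish event.

    `send` is any callable that delivers the serialized payload
    to the search service (for example, an HTTP client wrapper).
    """
    payload = json.dumps(build_add_operation(page_path, title, description))
    send(payload)
    return payload


# Usage: collect payloads in a list instead of calling a real service.
sent = []
on_page_published("/path/to/page", "Page Title",
                  "All text on the page.", sent.append)
print(sent[0])
```

In a pull setup, the same builder could be called by a scheduled job that walks the pages on an interval instead of reacting to each publication.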

Level 6

We faced a similar scenario. We wanted search on our websites, but without using AEM for it.

So we used Typesense and Scrapy to get it done.
Scrapy scrapes all the data available on the web pages and stores it in Typesense.

 
