Hey All,
I am trying to figure out the best way to export pages and their content to a third-party search provider. The data needs to look like this:
{
  "op": "add",
  "path": "/path/to/page",
  "value": {
    "attributes": {
      "title": "Page Title",
      "url": "https://www.website.com/path/to/page.html",
      "description": "This is basically all of the content on the page. So if there are two different text areas on the page, that content should go inside this description."
    }
  }
}
I know I can ask the page for everything, including a description, but that doesn't account for everything on the page. For example, let's say I put a new text component on the page and added text that I want to be searchable; that data wouldn't get pulled. I started to look into Jsoup (and HttpClient, since my site is an SPA) to crawl the page, but is that the best option?
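For reference, here is a minimal sketch of the crawl approach with Jsoup (the URL is a placeholder; note that Jsoup does not execute JavaScript, so for a client-rendered SPA you would need something like HtmlUnit to render the page first):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageCrawler {
    public static void main(String[] args) throws Exception {
        // Fetch the published page. Jsoup does NOT run JavaScript, so an
        // SPA must be server-rendered for this to see the real content.
        Document doc = Jsoup.connect("https://www.website.com/path/to/page.html")
                .userAgent("search-indexer")
                .get();

        String title = doc.title();
        // Collapse all visible body text into one description string.
        String description = doc.body().text();

        System.out.println(title + " -> " + description);
    }
}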
Thanks to anyone who has an opinion.
Hi @Sean-McK
You can use the AEM JSON Exporter functionality that comes OOTB with the Core Components to expose the content as a service (JSON).
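For example, appending the .model.json selector to a page path returns the Sling Model exporter output for the page and its component tree. A small sketch of fetching it (the host and path are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModelJsonFetch {
    public static void main(String[] args) throws Exception {
        // The .model.json selector exposes the page content as JSON.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://www.website.com/path/to/page.model.json")).build();
        HttpResponse<String> res = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(res.body());
    }
}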
Hope it helps!
Thanks,
Kiran Vedantam
Hey Kiran,
Thanks for the reply. I think the issue with that option is knowing which components and which fields to pull. For example, in a text component the field name is text (easy), but what about a custom component? Let's say my custom component has three fields (title, subtitle, and bottomInfo); when I consume that JSON, I would have to know to bring in those three fields. I guess I could create a config that takes field names to "index", and when it gets the page it can look for all of those field names.
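Something like this sketch could work for that config idea (assuming Jackson on the classpath; the field list is illustrative): walk the .model.json tree and collect the values of any configured property name into the description:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class IndexFieldCollector {

    // Hypothetical config: property names whose values should be indexed.
    private static final Set<String> INDEXED_FIELDS =
            Set.of("text", "title", "subtitle", "bottomInfo");

    public static String collectDescription(String modelJson) throws Exception {
        JsonNode root = new ObjectMapper().readTree(modelJson);
        List<String> parts = new ArrayList<>();
        walk(root, parts);
        return String.join(" ", parts);
    }

    // Depth-first walk over the component tree; grab any configured field.
    private static void walk(JsonNode node, List<String> parts) {
        node.fields().forEachRemaining(e -> {
            if (INDEXED_FIELDS.contains(e.getKey()) && e.getValue().isTextual()) {
                parts.add(e.getValue().asText());
            }
            walk(e.getValue(), parts);
        });
        if (node.isArray()) {
            node.forEach(child -> walk(child, parts));
        }
    }
}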
Hi @Sean-McK
In your scenario, you have two options to consider: the push model and the pull model. The push model involves sending updated data to the search service upon page publication, for example via an EventHandler in AEM. The pull model involves configuring a crawler on the search service side, which periodically (e.g., every hour) retrieves data from AEM.
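For the push model, here is a rough sketch of an OSGi EventHandler listening to replication (publish) events; the sendToSearchService/removeFromSearchService calls are hypothetical placeholders for your search provider's API:

import com.day.cq.replication.ReplicationAction;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

@Component(service = EventHandler.class,
        property = EventConstants.EVENT_TOPIC + "=" + ReplicationAction.EVENT_TOPIC)
public class SearchPushHandler implements EventHandler {

    @Override
    public void handleEvent(Event event) {
        ReplicationAction action = ReplicationAction.fromEvent(event);
        if (action == null) {
            return;
        }
        switch (action.getType()) {
            case ACTIVATE:
                // Page was published: push its content to the search service.
                sendToSearchService(action.getPath()); // hypothetical helper
                break;
            case DEACTIVATE:
            case DELETE:
                // Page was unpublished or deleted: remove it from the index.
                removeFromSearchService(action.getPath()); // hypothetical helper
                break;
            default:
                break;
        }
    }

    private void sendToSearchService(String path) { /* call search provider API */ }
    private void removeFromSearchService(String path) { /* call search provider API */ }
}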
We faced a similar scenario. We wanted search on our websites, but without using AEM.
So we used Typesense and Scrapy to get it done.
Scrapy scrapes all the data available on the web pages and stores it in Typesense.
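In case anyone wants the same idea on a Java stack, here is a rough equivalent of that pipeline using Jsoup in place of Scrapy, posting to Typesense's documents REST endpoint (the host, collection name, and API key are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TypesensePush {
    public static void main(String[] args) throws Exception {
        // Scrape the rendered page (Jsoup does not execute JavaScript).
        Document page = Jsoup.connect("https://www.website.com/path/to/page.html").get();

        // Build one JSON document for the page's collection.
        String doc = String.format(
                "{\"title\":\"%s\",\"url\":\"%s\",\"description\":\"%s\"}",
                page.title(),
                page.location(),
                page.body().text().replace("\"", "\\\""));

        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://localhost:8108/collections/pages/documents"))
                .header("X-TYPESENSE-API-KEY", "xyz") // placeholder key
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(doc))
                .build();

        HttpResponse<String> res = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(res.statusCode() + " " + res.body());
    }
}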
This is only good if you are running outside of AEM. I realized a couple of things... one of them was using HtmlUnit and Jsoup to scrape and parse. I decided to use Content Services and the JSON model to parse instead.