How to efficiently store/organize 50k nodes in AEM?

Question

As above. We are transferring some content (JSON that will be stored as a string) into AEM and we're looking for ideas. This JSON data will be used as a datasource for AEM pages.

Here's what we've come up with so far:

(my idea) the data has a year and a category. I'm thinking the nodes can be stored/organized this way => /content/my-parent-node/year-here/category-here/my-node-here.

I'm only thinking of going this route because it's easier to navigate in CRXDE. I'm guessing traversing the nodes (programmatically) will be simple as well, because it's organized by year and category and all of our searches include only those two. I do not know if there's any extra lag/penalty from organizing it this way.
(colleague's idea) store every node in the parent folder (example: /content/my-parent-node/my-node-here). He also wants to use a hashed value (MD5/SHA and using year+category+node name combo as input to the MD5 function) as node name.

From what he had told me, he got the idea from Redis (but my quick net search does not mention any advantages to this and no results of Redis being paired with AEM and/or JCR).
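To make the colleague's proposal concrete, here is a minimal sketch of how a hashed node name could be derived from year + category + node name. The class name, the method names, and the "/" separator between the three parts are illustrative assumptions, and SHA-256 is used here in place of MD5 simply because both produce a deterministic hex string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of AEM or Sling.
public final class HashedNodeName {

    // Parent folder taken from the example path in the question.
    static final String DATA_ROOT = "/content/my-parent-node/";

    private HashedNodeName() {}

    // Builds a deterministic node name from the three parts.
    // The "/" separator is an assumption; it keeps ("2021", "1x") and ("20211", "x") distinct.
    static String nodeName(String year, String category, String name) {
        String key = year + "/" + category + "/" + name;
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Full path of the flat sibling node under the parent folder.
    static String nodePath(String year, String category, String name) {
        return DATA_ROOT + nodeName(year, category, name);
    }
}
```

Because the name is computed from the same three values every time, the node can still be fetched directly by path without a query. The trade-off is the one described in the question: the names are unreadable when browsing in CRXDE.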

A negative (work-wise) for me is that I'll be supporting this setup once the project is finished. It will be extra work for me to get at the node content. It will be close to impossible to load/traverse 50k nodes in AEM (specifically CRXDE). We have a folder that has almost 3K nodes and that already takes a long time to load in CRXDE.

BUT I do not mind going this route if it's really the better option but he only has provided "trust me it's better" answer.

your thoughts? Thank you.

edit1:

(colleague's suggestion) - I'm not sure yet how he's going to implement it, but based on what I've seen him do, he is a fan of SQL2
datasource = the contents of a single page will be dynamically sourced from one of the 50k nodes. We will be using suffixes to pass parameters to AEM.

example: www.my-host.com/students/2021/inquiry/paige-smith (this will display all the inquiries made by Paige Smith in the year 2021. My suffixes for this URL are 2021 + inquiry + paige-smith)
performance is obviously important but, as I said above, I do not know if my or my colleague's suggestion has any bearing on performance, as we've never stored that many records before. Maybe the performance advantage of one over the other doesn't really matter? From what I can gather from @joerghoh's reply, it seems to me that a single folder with all the nodes in it performs better?

Based on Google stats, this page (the ColdFusion version of it anyway, but we're in the process of migrating it to AEM) is one of the organization's top 5 most visited pages.
Ordering may not matter. We just need access to the data as quickly and efficiently as possible. If I can access the data easily via CRXDE with very minimal penalty, even better. (only saying this as I just recalled seeing an Adobe page that says SQL2 has minimal advantage compared to queries)

joerghoh · Accepted Answer

Thanks.

So if I understand you correctly, you want to provide the "id" of the node you need to look up data from as a suffix.

In that case you should be able to deduce the path of the data node from the suffix itself in a performant way. I do not recommend running a query for it, especially if it's possible to build a path directly from the suffix.

For example:

String DATA_ROOT = "/var/datanodes/";
String suffix = ... // extract the suffix from the path;
// Sanitize the suffix so it cannot be used to traverse to any location in the repo
Resource dataResource = resourceResolver.getResource(DATA_ROOT + suffix);

(That's the simple way. If you want to provide multiple suffix values without any path elements in them, you need more sanitizing.)
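The sanitizing step mentioned in the snippet above could look roughly like this. It is only a sketch under assumptions: the class name, the method name, and the allowed character set (lowercase letters, digits, hyphens) are hypothetical choices, and the helper is plain Java so that it can be tried outside AEM.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of AEM or Sling.
public final class DataPathBuilder {

    // Same root as in the snippet above.
    static final String DATA_ROOT = "/var/datanodes/";

    // Assumed whitelist for one path segment: lowercase letters, digits and hyphens,
    // not starting with a hyphen. This rejects "..", ".", empty segments and dots in general.
    private static final Pattern SEGMENT = Pattern.compile("[a-z0-9][a-z0-9-]*");

    private DataPathBuilder() {}

    // Returns the repository path for a suffix such as "/2021/inquiry/paige-smith",
    // or Optional.empty() if any segment is not on the whitelist.
    static Optional<String> toDataPath(String suffix) {
        if (suffix == null) {
            return Optional.empty();
        }
        String trimmed = suffix.startsWith("/") ? suffix.substring(1) : suffix;
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // limit -1 keeps trailing empty segments so "a/b/" is rejected too
        String[] segments = trimmed.split("/", -1);
        for (String segment : segments) {
            if (!SEGMENT.matcher(segment).matches()) {
                return Optional.empty();
            }
        }
        return Optional.of(DATA_ROOT + String.join("/", segments));
    }
}
```

In a Sling servlet or model the suffix would typically come from request.getRequestPathInfo().getSuffix(), and the returned path would be passed to resourceResolver.getResource(...) as in the snippet above. An empty result can be treated as "not found" without touching the repository at all.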

This is a performant way to look up the data nodes. In every case you use the node names as IDs, and you can build the paths very easily. The best approach is to keep all your data nodes in a single structure (as siblings, using oak:Unstructured nodes); while this makes CRX/DE unusable for that folder, it's not a problem from an API point of view (as long as you are able to avoid any traversal of all these sibling nodes).

Umesh_Thakur · Answer

Hi @jayv25585659,

Besides the approach mentioned above, there are several other options. Since this data will be used as a datastore for the page, you could use the ACS AEM Commons Generic Lists to store all 50k entries, and traverse the list in your code when needed.

Another solution is to store the .json file somewhere in the repository, for example in the DAM, then read the file through code and provide the data wherever it is needed.

Or convert the JSON into an Excel file, store it somewhere (ideally in the DAM), and read it with the help of the Apache POI library. File storage would be a good way in my view.

Hope this helps.

Umesh Thakur