SOLVED

How to efficiently store/organize 50k nodes in AEM?


Level 9

As above. We are transferring some content (JSON that will be stored as a string) into AEM, and we're looking for ideas. This JSON data will be used as a datasource for AEM pages.

 

Here's what we've come up with so far:

 

  • (my idea) The data has a year and a category. I'm thinking the nodes can be stored/organized this way => /content/my-parent-node/year-here/category-here/my-node-here.

    I'm only thinking of going this route because it's easier to navigate in CRXDE. I'm guessing traversing the nodes programmatically will be simple as well, because they are organized by year and category and all of our searches use only those two. I do not know if there's any extra lag/penalty from organizing it this way.

  • (colleague's idea) Store every node directly in the parent folder (example: /content/my-parent-node/my-node-here). He also wants to use a hashed value as the node name (MD5/SHA, with the year + category + node name combination as input to the hash function).

    From what he has told me, he got the idea from Redis (but my quick web search did not turn up any advantages to this, nor any results for Redis being paired with AEM and/or JCR).

    A negative (work-wise) for me is that I'll be supporting this setup once the project is finished. It will be extra work for me to get at the node content. It will be close to impossible to load/traverse 50k nodes in AEM (specifically CRXDE). We have a folder that has almost 3K nodes, and that already takes a long time to load in CRXDE.

    BUT I do not mind going this route if it's really the better option; so far he has only provided a "trust me, it's better" answer.
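
To make the two proposals concrete, here is a minimal sketch of how each one would derive a node path from the same three values. The class and method names are hypothetical, and the "+" separator used as hash input is an assumption (the colleague's exact input format is not known). In both cases the path is computed directly from year, category, and node name, so neither layout needs a query for a lookup; they differ mainly in how readable the result is in CRXDE.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper comparing the two node-naming proposals. */
public final class NodeNaming {

    // Parent path taken from the examples in the question.
    private static final String PARENT = "/content/my-parent-node";

    private NodeNaming() {
    }

    /** Proposal 1: one folder level per year and per category. */
    public static String hierarchicalPath(String year, String category, String nodeName) {
        return PARENT + "/" + year + "/" + category + "/" + nodeName;
    }

    /** Proposal 2: all nodes as siblings, named by an MD5 hash of the three values. */
    public static String hashedPath(String year, String category, String nodeName) {
        // Assumption: the hash input joins the values with "+".
        String key = year + "+" + category + "+" + nodeName;
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return PARENT + "/" + hex;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```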

 

your thoughts? Thank you.

 

edit1:

  • (colleague's suggestion) - I'm not sure yet how he's going to implement it, but based on what I've seen him do, he is a fan of SQL2.

  • datasource = the contents of a single page will be dynamically sourced from one of the 50k nodes. We will be using suffixes to pass parameters to AEM.

    example: www.my-host.com/students/2021/inquiry/paige-smith (this will display all the inquiries made by Paige Smith in the year 2021. My suffixes for this URL are 2021 + inquiry + paige-smith)

  • Performance is obviously important, but as I said above, I do not know if my or my colleague's suggestion has any bearing on performance, as we've never stored this many records before. Maybe the performance advantage of one over the other doesn't really matter? From what I can gather from @Jörg_Hoh's reply, it seems that a single folder with all the nodes in it performs better?

    Based on Google stats, this page (the ColdFusion version of it anyway, but we're in the process of migrating it to AEM) is one of the organization's top 5 most visited pages.

  • Ordering may not matter. We just need access to the data as quickly and efficiently as possible. If I can also access the data easily via CRXDE with minimal penalty, all the better. (I'm only saying this because I just recalled seeing an Adobe page that says SQL2 has minimal advantage compared to queries.)
1 Accepted Solution


Correct answer by
Employee Advisor

Thanks. 

 

So if I understand you correctly, you want to provide the "id" of the node you need to look up data from as a suffix.

In that case you should be able to derive the path of the data node from the suffix itself in a performant way. I do not recommend running a query for it, especially if it's possible to build the path directly from the suffix.

For example:

 

String DATA_ROOT = "/var/datanodes/";
String suffix = ... // extract the suffix from the path;
// Sanitize the suffix so it cannot be used to traverse to any location in the repo
Resource dataResource = resourceResolver.getResource(DATA_ROOT + suffix);

(That's the simple way. If you want to pass multiple suffix values that contain no path elements, you will need more sanitizing.)
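
A minimal sketch of the sanitizing step, assuming the suffix always has exactly three slug-like segments (year, category, node name) as in the URL example from the question. The class name `DataPathResolver`, the segment pattern, and the three-segment rule are assumptions for illustration and are not part of any AEM or Sling API:

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a request suffix to a path below a fixed data root. */
public final class DataPathResolver {

    // Data root from the example above, without the trailing slash.
    private static final String DATA_ROOT = "/var/datanodes";

    // Assumption: every segment is a lower-case slug (letters, digits, hyphens).
    private static final Pattern SEGMENT = Pattern.compile("[a-z0-9][a-z0-9-]*");

    private DataPathResolver() {
    }

    /**
     * Returns the data node path for a suffix such as "/2021/inquiry/paige-smith",
     * or an empty Optional if the suffix is missing or contains anything unexpected.
     */
    public static Optional<String> toDataPath(String suffix) {
        if (suffix == null || !suffix.startsWith("/")) {
            return Optional.empty();
        }
        // Limit -1 keeps trailing empty segments so "/a/b/c/" is rejected.
        String[] segments = suffix.substring(1).split("/", -1);
        // Assumption: exactly three segments (year, category, node name).
        if (segments.length != 3) {
            return Optional.empty();
        }
        StringBuilder path = new StringBuilder(DATA_ROOT);
        for (String segment : segments) {
            // Rejects "..", ".", empty segments, and characters outside the whitelist.
            if (!SEGMENT.matcher(segment).matches()) {
                return Optional.empty();
            }
            path.append('/').append(segment);
        }
        return Optional.of(path.toString());
    }
}
```

The resulting path could then be passed to `resourceResolver.getResource(...)`; an empty result would typically be answered with a 404 rather than a lookup.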

 

This is a performant way to look up the data nodes. In every case you will use the node names as an id, and you can build the paths very easily. The best way is to have all your data nodes in a single structure (as siblings, using oak:Unstructured nodes); while this makes CRX/DE unusable for that folder, it's not a problem from an API point of view (as long as you are able to avoid any traversal of all these sibling nodes).

 


5 Replies


Community Advisor

Hi @jayv25585659,

Besides the approaches mentioned above, there are several other options. Since this will be used as a data store for the page, you could use ACS AEM Commons Generic Lists to store all 50k entries, and traverse them in your code when needed.

Another solution would be to store the .json file somewhere in the repo, for example in the DAM, then read the file through code and provide the data wherever it is needed.

Or convert the JSON into an Excel file, store it somewhere (for example in the DAM), and read it with the help of the Apache POI library.

 

File storage would be a good approach in my view.

Hope this helps.

Umesh Thakur


Employee Advisor

What do you mean by "datasource for pages"? Is this a product information system use case or something similar?

 

Generally, the structure is determined by a multitude of parameters. While access control doesn't sound like an important one in this case, I wonder about the access pattern. Do you need random access to each of these 50k nodes, or are they always handled all in one iteration? In case of random access, do you look them up by a fixed "name" (a path? Are there duplicates?), or do you need to search by some value within these nodes? Does ordering matter? How often do you update? Do you need to handle duplicates?

In the end it's always a matter of priorities: Is a performant lookup by the system important? Or is it more important that you can navigate the nodes in CRX/DE?

 

If navigating with CRX/DE is important, you should try to limit yourself to ~500 nodes per folder. If you choose to store them as "oak:Unstructured" you lose ordering but gain performance when adding/removing nodes.

 

 


Level 9

I've edited my original question to give it better context. Thank you.
