How to efficiently store/organize 50k nodes in AEM?
As above. We are transferring some content (JSON that will be stored as a string) into AEM, and we're looking for ideas. This JSON data will be used as a datasource for AEM pages.
Here's what we've come up with so far:
- (my idea) The data has a year and a category. I'm thinking the nodes can be stored/organized this way => /content/my-parent-node/year-here/category-here/my-node-here.
I'm only thinking of going this route because it's easier to navigate in CRXDE. I'm guessing traversing the nodes programmatically will be simple as well, because they're organized by year and category and all of our searches use only those two. I do not know if there's any extra lag/penalty from organizing it this way.
- (colleague's idea) Store every node in the parent folder (example: /content/my-parent-node/my-node-here). He also wants to use a hashed value (MD5/SHA, with the year + category + node name combo as input to the hash function) as the node name.
From what he has told me, he got the idea from Redis (but my quick net search turned up no advantages to this and no results of Redis being paired with AEM and/or JCR).
A negative (work-wise) for me is that I'll be supporting this setup once the project is finished. It will be extra work for me to get the node content. It will be close to impossible to load/traverse 50k nodes in AEM (specifically CRXDE). We have a folder that has almost 3K nodes, and that already takes a long time to load in CRXDE.
BUT I do not mind going this route if it's really the better option. So far, though, he has only provided a "trust me, it's better" answer.
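To make the two layouts concrete, here is a rough sketch of how each one would turn a record into a node path. It is plain Java with no JCR calls. The class and method names are made up for illustration, and joining year, category, and name without a separator before hashing is my assumption, since the exact input format hasn't been decided.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of AEM or JCR: it only builds path strings.
class NodePaths {
    // Parent path taken from the examples in this post.
    static final String ROOT = "/content/my-parent-node";

    // My idea: a readable hierarchy, /root/year/category/name.
    static String hierarchicalPath(String year, String category, String name) {
        return ROOT + "/" + year + "/" + category + "/" + name;
    }

    // Colleague's idea: everything directly under the parent, with a hash as node name.
    // Assumption: the hash input is year + category + name, concatenated as-is.
    static String hashedPath(String year, String category, String name) {
        return ROOT + "/" + md5Hex(year + category + name);
    }

    // Hex-encodes the MD5 digest of the UTF-8 bytes of the input.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hierarchicalPath("2021", "inquiry", "paige-smith"));
        System.out.println(hashedPath("2021", "inquiry", "paige-smith"));
    }
}
```

Either way the path can be computed directly from year, category, and name, so a lookup by all three values does not strictly need a query in either layout. The difference is that the first path is readable in CRXDE, while the second is a 32-character hex string under one very large parent.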
Your thoughts? Thank you.
edit1:
- (colleague's suggestion) I'm not sure yet how he's going to implement it, but based on what I've seen him do, he is a fan of SQL2.
- datasource = the contents of a single page will be dynamically sourced from one of the 50k nodes. We will be using suffixes to pass parameters to AEM.
example: www.my-host.com/students/2021/inquiry/paige-smith (this will display all the inquiries made by Paige Smith in the year 2021; the suffixes for this URL are 2021 + inquiry + paige-smith)
- Performance is obviously important, but as I said above, I do not know if my or my colleague's suggestion has any bearing on performance, as we've never stored that much data before. Maybe the performance advantage of one over the other doesn't really matter? From what I can gather from @joerghoh's reply, it seems to me that a single folder with all the nodes in it performs better?
Based on Google stats, this page (the ColdFusion version of it anyway; we're in the process of migrating it to AEM) is one of the organization's top 5 most visited pages.
- Ordering may not matter. We just need access to the data as quickly and efficiently as possible. If I can access the data easily via CRXDE with very minimal penalty, all the better. (Only saying this as I just recalled seeing an Adobe page that says SQL2 has minimal advantage compared to queries.)
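For the suffix-based lookup in the example URL, this is roughly what I picture for the hierarchical layout: split the suffix into year, category, and name, validate each part, and build the node path directly. It is a sketch in plain Java. `SuffixLookup` and `toNodePath` are hypothetical names, the validation rules are my assumptions, and in a real Sling servlet or model the suffix string would come from the request (for example via `RequestPathInfo.getSuffix()`) rather than being passed in by hand.

```java
import java.util.Optional;

// Hypothetical helper: maps a request suffix to a node path without running a query.
class SuffixLookup {
    // Parent path taken from the examples in this post.
    static final String ROOT = "/content/my-parent-node";

    // Expects a suffix like "/2021/inquiry/paige-smith".
    // Returns empty when the suffix does not have exactly three valid segments.
    static Optional<String> toNodePath(String suffix) {
        if (suffix == null) {
            return Optional.empty();
        }
        // Drop leading and trailing slashes, then split into segments.
        String[] parts = suffix.replaceAll("^/+|/+$", "").split("/");
        if (parts.length != 3) {
            return Optional.empty();
        }
        String year = parts[0];
        String category = parts[1];
        String name = parts[2];
        // Assumed rules: a four-digit year, lowercase alphanumerics and hyphens otherwise.
        // This also rejects empty segments and things like "..".
        if (!year.matches("\\d{4}")
                || !category.matches("[a-z0-9-]+")
                || !name.matches("[a-z0-9-]+")) {
            return Optional.empty();
        }
        return Optional.of(ROOT + "/" + year + "/" + category + "/" + name);
    }

    public static void main(String[] args) {
        System.out.println(toNodePath("/2021/inquiry/paige-smith"));
        System.out.println(toNodePath("/2021/inquiry"));
    }
}
```

With the path in hand, the node could be read by a direct path lookup instead of a search, which is the part I suspect matters most for a heavily visited page.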