SOLVED

Approach for storing a large number of nodes in JCR

Level 4

Hi,

I am working on an unusual requirement: I need to store analytics data in the JCR as nodes and properties. We are not using Adobe Analytics; we have our own analytics built on an SAP-related tool, so all CQ needs to do is post the analytics data to SAP through a web service. The amount of data we store will be huge. Every day at some time a scheduler will run and post the recorded data from the JCR nodes to SAP. The data relates to executables that users click and download; the details we are going to record include username, usertype, executable name, download start time, etc.

As I understand it, a node can have only 1000 child nodes. How can I arrange the storage of these details in the JCR so that it overcomes this 1000-child-node limitation, and where should such records be stored (under /etc, /content, or somewhere else)? I also wanted to know whether there is any way to optimize the retrieval of values from these nodes.

Thanks,

Samir 

1 Accepted Solution

Correct answer by
Employee Advisor

Hi Samir,

So you collect tracking data inside AEM and then export it regularly into SAP? In that case you are using the JCR repository as storage for quite transient data. I don't think that is a good idea from a conceptual point of view.

* Do you want to collect this data on publish systems? In that case you will probably store the data on each publish instance, and your export (or SAP) then needs to consolidate it. Not a real problem, but you might lose data unless you make all publish instances highly available.

* You put a lot of pressure on the repository. Write performance has improved with TarMK, but do you really want to store each data point in the repository? Please do a performance test upfront and check your KPIs.

* Your incoming data is not structured at all, so ordering doesn't matter. Just use an oak:Unstructured node as the parent and you're fine. Just don't expect to be able to inspect this folder with CRXDE Lite :-)

I would choose a different approach, maybe setting up a queueing service (e.g. RabbitMQ) to which each download is submitted. Then your AEM instances are stateless again and are not burdened with storing and exporting this transient data. A separate application can then fetch the data points from the queue and feed them directly to SAP (either live or batched, as you like).
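
The decoupling described above can be illustrated with a minimal sketch. It uses an in-process java.util.concurrent.BlockingQueue purely as a stand-in for an external broker such as RabbitMQ; it is not RabbitMQ client code, and the class, method, and field names (DownloadEventQueue, DownloadEvent, submit, drainBatch) are hypothetical. The point is only the shape: the download handler hands the event off and returns, and a separate exporter pulls events in batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Stand-in for a broker: request threads enqueue events, an exporter drains them in batches. */
final class DownloadEventQueue {

    /** One tracked download; the field names are illustrative only. */
    static final class DownloadEvent {
        final String username;
        final String fileName;
        final long startedAtMillis;

        DownloadEvent(String username, String fileName, long startedAtMillis) {
            this.username = username;
            this.fileName = fileName;
            this.startedAtMillis = startedAtMillis;
        }
    }

    // Unbounded in-memory queue; a real broker would add persistence and delivery guarantees.
    private final BlockingQueue<DownloadEvent> queue = new LinkedBlockingQueue<>();

    /** Called on each download; returns immediately without touching the repository. */
    void submit(DownloadEvent event) {
        queue.add(event);
    }

    /** Called by the exporter: removes up to maxBatch waiting events, in arrival order. */
    List<DownloadEvent> drainBatch(int maxBatch) {
        List<DownloadEvent> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }
}
```

An in-memory queue loses its contents on restart, which is one reason an external queueing service is suggested here rather than something inside the AEM instance.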

Jörg

13 Replies

Employee

Hi Samir,

What version of AEM are you using? Ignoring the use case, the 1000 child node issue applied to CRX2, not Oak [0].

If you are using Oak, you need to create indexes based on the queries you plan to run.

Regards,

Opkar

[0] https://cqdump.wordpress.com/2015/07/09/1000-nodes-per-folder-and-oak-orderable-nodes/

Level 4

Hi Opkar,

I am using AEM 6.0, but I faced this 1000-node issue when creating products in my JCR. That is why I am looking for a feasible approach that lets me save all the records.

Thanks

Employee

Apologies, but I'm not sure whether you are saying you had an issue with 1000 child nodes or you think there is an issue with 1000 child nodes.

When you have such a large data set, you need to do some kind of bucketing to reduce the number of nodes that have a large number of children. Usually this is based on some property within the data, or it can be based on time, e.g. day, month, year. But it really depends on your data. Also note that if there is any chance the data will ever be browsed by a human, having thousands of child nodes is a very poor user experience.
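
As a rough illustration of time-based bucketing, the sketch below derives a storage path from the download time, so that each hour gets its own parent folder. The root path /var/downloads, the yyyy/MM/dd/HH layout, and the class and method names are assumptions made for this example and do not come from the thread; the code only builds path strings and does not touch the repository.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Builds time-bucketed paths so that no single parent collects every record. */
final class DownloadBucketPath {

    // Hypothetical layout: <root>/yyyy/MM/dd/HH/<recordId>, formatted in UTC.
    private static final DateTimeFormatter BUCKET =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    private DownloadBucketPath() { }

    /** Returns the path of the hourly bucket (folder) for the given download time. */
    static String bucketFor(String root, Instant downloadTime) {
        return root + "/" + BUCKET.format(downloadTime);
    }

    /** Returns the full path for one record inside its hourly bucket. */
    static String pathFor(String root, Instant downloadTime, String recordId) {
        return bucketFor(root, downloadTime) + "/" + recordId;
    }
}
```

For example, a download at 2016-03-14T09:26:53Z with root /var/downloads would land in /var/downloads/2016/03/14/09. Whether hourly granularity is fine enough depends on the download volume; a busier site might add a minute level.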

Regards,

Opkar

Level 4

Hi Opkar,

I already faced the 1000-child-node issue once, in a requirement where I had to create product nodes under /etc. We ultimately distributed the products by creating nodes for their starting letter and so on, thereby overcoming the 1000-child-node problem. The same kind of requirement has come up again, so I wanted to know whether there is a better way, for example one that other CQ people are using.

Regards,

Samir

Level 5

Hi Samir,

Please elaborate on the 1000-node issue you are facing in AEM 6.x. Please share an error log or additional details (performance issue, random issue, etc.). This gives the exact context of the issue.

And what is your total product node size?

Are you using any OOTB e-commerce importers, or is this a custom product node?

Regards

Sri

Level 9

Hi Samir,

In my personal view, the 1k child nodes issue is less important than how the data modeling is done. If you model the data correctly, considering both read and write operations, things will be clearer. I have a few questions here.

 Every day at some time scheduler will run and post the recorded data from JCR Nodes to SAP

In which format are you going to provide the JCR node data to SAP?

we are going to record include like username, usertype, executable name, download start time etc.

How are you going to record this information?

 is there any way to ensure the optimization of retrieval of values from these nodes

I would say yes, if your analytics data is organized correctly. First of all, use a hierarchy based on date, time, or some other information. If the data is huge and you want to get all data recorded on a particular day, organize your data so that the query runs over fewer nodes. For that, date and time could be the key.

Let me know if I am clear here.

Jitendra

Level 4

Hi Jorg,

That was a great explanation. I never thought it would have so much impact.

Yes, this is for the publish instance. As of now we have only one publish instance, and we are not worrying about additional publish instances in the future, so that part is easy for us.

I have never used the queuing service you suggested, so I don't know how to implement it. I went through some online documentation that uses the com.rabbitmq.client API, as in this tutorial https://www.rabbitmq.com/tutorials/tutorial-two-java.html, but couldn't make it work.

In case I am unable to implement the queuing approach you suggested, could I implement it as a listener instead: create a node to store all the reporting data under, and give my listener the path of that node to listen to? Also, if we implement a scheduler that reports the data to SAP daily and delete all these nodes immediately after reporting, will that help (assuming we are allowed to remove this data after reporting)? That way we would not have too much data to worry about.

Also, can you suggest where to store this data (downloadfilename, username, iscompanyemployee, dateofdownload, timeofdownload, groupofuserdownloading, etc.) in the JCR? Under /etc/ or somewhere else, based on the impact?

Thanks & Regards,

Samir

Level 4

Hi Jitendra,

To your questions on the requirement:

In which format are you going to provide the JCR node data to SAP?

The format will probably be XML, as we plan to post the data to a SOAP web service. So I assume we ultimately need to convert the data in the node's properties to XML to form the request body and post it.
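
The property-to-XML step mentioned above might look roughly like the sketch below. It is only an illustration: the element name downloadRecord, the class name, and the use of property names as element names are assumptions, and the real request structure (including the SOAP envelope) would be dictated by the SAP service's contract, which is not shown in this thread. The record is passed in as a plain map so the example does not depend on the JCR API.

```java
import java.util.Map;

/** Turns one record's properties into a flat XML fragment for a request body. */
final class DownloadRecordXml {

    private DownloadRecordXml() { }

    /** Escapes the five characters that are special in XML text and attributes. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Writes each property as <name>value</name> inside a hypothetical
     * <downloadRecord> element. Keys must already be valid XML element names.
     */
    static String toXml(Map<String, String> properties) {
        StringBuilder xml = new StringBuilder("<downloadRecord>");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            xml.append('<').append(e.getKey()).append('>')
               .append(escape(e.getValue()))
               .append("</").append(e.getKey()).append('>');
        }
        return xml.append("</downloadRecord>").toString();
    }
}
```

Passing a LinkedHashMap keeps the elements in insertion order, which matters if the receiving schema expects a fixed element sequence.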

How are you going to record this information?

Recording here simply means storing it in the JCR. Whenever a user clicks one of our files to download, we simply gather some details about the user (name, group, date, time, city, country, filename, etc.) and persist them in the JCR. Can you suggest a recommended location in the JCR to save such data? I believe /etc/<anynodename> should be okay, but if you see any challenges, please let me know.

 is there any way to ensure the optimization of retrieval of values from these nodes

Frankly, I have not yet seriously thought about indexing the nodes, but I think we are allowed to delete them after we have successfully submitted them to the SAP web service for reporting. In that case, I think we will not need to worry too much about optimization. But to give it serious thought: if we need it, we can save a record-date property on each record and index that property. Is that feasible?

And your recommendation to name the nodes by date, so as to reduce the number of child nodes, is great. I will definitely go with that.

Thanks,

Samir

Community Advisor

Hi Samir,

If you really want to store the data in the JCR and delete it once it has been sent to the SAP team, you can use the strategy that Adobe uses for storing users.

As we know, users under /home/users are stored in folders organized alphabetically. You can use the same strategy: create a folder structure based on some parameter, so that each folder stores up to 1000 nodes.

I think this may help you :)

Thanks 

Mani Kumar K

Employee Advisor

Hi Samir,

Don't store the data in the JCR; submit it directly to the queue instead.

Jörg

Level 4

Thanks, Mani. Yes, that is what I have been doing to save product data under /etc/: breaking the products up character by character and creating nodes down to the 5th character, thereby reducing the number of immediate child nodes.
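
The character-by-character scheme described above can be sketched as follows. The class and method names and the /etc/products root in the example are hypothetical; the code only computes the path string and assumes every character of the name is valid in a node name.

```java
/** Builds a path with one folder level per leading character of the name. */
final class PrefixShardedPath {

    private PrefixShardedPath() { }

    /**
     * Adds up to maxDepth single-character folder levels (lower-cased) below
     * root, then the full name as the leaf. Names shorter than maxDepth simply
     * get fewer levels.
     */
    static String pathFor(String root, String name, int maxDepth) {
        StringBuilder path = new StringBuilder(root);
        int levels = Math.min(maxDepth, name.length());
        for (int i = 0; i < levels; i++) {
            path.append('/').append(Character.toLowerCase(name.charAt(i)));
        }
        return path.append('/').append(name).toString();
    }
}
```

With a depth of 5, a product named Widget42 would be placed at /etc/products/w/i/d/g/e/Widget42. One caveat of prefix-based buckets is that they fill unevenly when many names share the same leading characters.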

Level 4

Hi Jorg,

Thanks for replying. I will only store the data in the JCR if I am unable to implement RabbitMQ; otherwise I will definitely try to implement RabbitMQ queuing. Do you have any documentation or references for queuing?

Thanks,

Samir