Expand my Community achievements bar.

July 31st AEM Gems Webinar: Elevate your AEM development to master the integration of private GitHub repositories within AEM Cloud Manager.
SOLVED

Best ways to query large AEM repository with query builder

Avatar

Level 2

Hello Everyone,

 

I'm looking for the best possible options to query AEM repository with query builder Query. I have a requirement to get all the assets from AEM DAM in JSON format. I have a big repository with Large assets of 400k in my AEM DAM instance.

 

AEM has a configurable limit on the number of nodes that can be visited in a single query run. By default, it is set to 100,000 nodes. I changed in System Console > OSGi Configuration > Apache Jackrabbit Query Engine Settings Service. The parameter name is queryLimitReads to 200k. I used the below query to fetch all assets under my site but the query is unable to run in AEM it throws error: "The query read or traversed more than 200000 nodes. To avoid affecting other tasks, processing was stopped."

 

p.hits=selective

p.limit=1000

p.offest=0

p.guessTotal=true

type=dam:Asset

path=/content/dam/mysite/

p.properties=jcr: content/cq:name  

orderby=@jcr:path

 

Any help or suggestion on resolving the error would be really helpful.

Any alternative options if you have please let me know.

 

 

1 Accepted Solution

Avatar

Correct answer by
Community Advisor

YES, it sounds to me that you can actually avoid the query if you really need all the assets' info. So, it is better to traverse the nodes for that, you could use one of the techniques I suggested. See below the best practice suggested in such scenarios

 

Esteban666_0-1686059055654.png

 

 

https://experienceleague.adobe.com/docs/experience-manager-65/deploying/practices/best-practices-for...



Esteban Bustamante

View solution in original post

11 Replies

Avatar

Community Advisor

Always avoid as much as you can query builder, if you need all the assets listed you could safely traverse the repository instead.

Do it in small chunks and using multithreading, if you can deliver small pieces the better otherwise you could assemble small output pieces in a separate operation.

 

For the multithreading, you could use Bulk Workflow Manager or Managed Controlled Process



Esteban Bustamante

Avatar

Level 2

Thank you @EstebanBustamante

Do you mean, instead of querying all the assets we need to implement workflow or MCP process to crawl the repository to fetch all the assets?

Avatar

Correct answer by
Community Advisor

YES, it sounds to me that you can actually avoid the query if you really need all the assets' info. So, it is better to traverse the nodes for that, you could use one of the techniques I suggested. See below the best practice suggested in such scenarios

 

Esteban666_0-1686059055654.png

 

 

https://experienceleague.adobe.com/docs/experience-manager-65/deploying/practices/best-practices-for...



Esteban Bustamante

Avatar

Level 2

@Thanks @EstebanBustamante  Yes, MCP is really cool and works the way we need. I did implement this however it is taking a long time to export results into a spreadsheet. Your messages help me to troubleshoot the issue

Avatar

Employee Advisor

Hi,

 

When you encounter the error message "The query read or traversed more than 200000 nodes. To avoid affecting other tasks, processing was stopped," it indicates that the configured query limit of 200,000 nodes has been exceeded. This limit is in place to prevent excessive resource consumption and potential performance issues.

 

Your use case requires fetching a large number of assets and you have assessed the impact on system resources, you can try increasing the query limit further. However, keep in mind that this can potentially impact system performance, so it's important to test and monitor the system behavior after making such changes.

Avatar

Community Advisor

I think above issue occurs when the query is not using any indexing and it’s a traversal query. I believe jcr:content/cq:name is not a property in damAssetLucene indexing(available OOTB). If you have to specifically asked to add cq:name in query the you need to add indexes for that property in oak indexes. Once created the indexing you need to run the indexing so that it completes it then you should be able to get the query result.

you can also try with query that uses indexes. Aem provides you the console to check how optimize your query is. You can check there if it’s using any indexes or not

Avatar

Community Advisor

Hello @ArunaS 

 

There are multiple areas where an improvement can be made:

  1. Indexes: Assure that "jcr: content/cq:name" is indexed
  2. Consider removing orderby, if you don't need it.
  3. Use tree traversal to generate results. Since, you not filtering results, a tree traversal might be more apt. 
    • Then you can avoid touching other system params and affecting system performance in long run
  4. The Json will be huge. Once you are able to compile the data, you will hit the block on the size of json returned. Consider breaking Json into multiple chunks. Then you would also be able to break query also in chunks
  5. Caching: Consider caching the json, if this query is generated frequently.

Aanchal Sikka

Avatar

Level 2

Thanks, @Jörg_Hoh for the response.

It really helps me to export data into CSV. I'm looking to automate this process instead of manual clicks.

Avatar

Employee Advisor

If you follow https://experienceleague.adobe.com/docs/experience-manager-cloud-service/content/assets/admin/metada... you should be able to capture the HTTP-Request, analyze its parameters and then call it from externally.

Or do you need the Java API to access this feature?