fetching/crawling data from AEM | Community
Level 3
June 14, 2022
Solved

fetching/crawling data from AEM

  • June 14, 2022
  • 3 replies
  • 2882 views

Hi team, 

We have a requirement to fetch/crawl all data from AEM and ingest it into our project.

What is the best approach for this? Could you also share some useful links or videos?

 

Thank you,

Sriram

 

Best answer by sunil_kumar_

Hi @sriram_1, you did not mention what you are trying to achieve with this data. If the goal is search, you should probably use a third-party search engine such as Solr. For the sake of an answer, though, there are two ways to do it.

1. Iterate over pages/assets/users and build the result yourself.

2. Use a query, either Query Builder or JCR-SQL2. Write a service that executes the query and returns the result.


I am sharing examples of both. First, though, you need a ResourceResolver, obtained as described here:

https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-to-initialize-resourceresolver-reference-variable/m-p/455538#M131047

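1. Example code for iteration. Make sure your ResourceResolver has permission to read the required content, pages, and users.
Getting pages

            // /content itself is usually a folder, not a cq:Page, so start from a site root.
            PageManager pageManager = resourceResolver.adaptTo(PageManager.class);
            Page root = pageManager != null ? pageManager.getPage("/content/we-retail") : null;
            if (root != null) {
                // deep = true walks all descendants, not only direct children
                Iterator<Page> childPages = root.listChildren(null, true);
                while (childPages.hasNext()) {
                    Page childPage = childPages.next();
                    LOG.info("Page : {}", childPage.getPath());
                }
            }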

Getting users and groups. This example prints them to the log; adapt it as needed.

            // ResolverUtil is not an AEM class; it is assumed to be a project helper that opens a service resolver.
            ResourceResolver resourceResolver = ResolverUtil.newResolver(resourceResolverFactory);
            Session session = resourceResolver.adaptTo(Session.class);
            UserManager userManager = ((JackrabbitSession) session).getUserManager();
            Iterator<Authorizable> userIterator = userManager.findAuthorizables("jcr:primaryType", "rep:User");
            LOG.info("----------GETTING USERS-------------");
            while (userIterator.hasNext()) {
                Authorizable user = userIterator.next();
                LOG.info("User : {}", user.getPath());
            }
            Iterator<Authorizable> systemUserIterator = userManager.findAuthorizables("jcr:primaryType", "rep:SystemUser");
            LOG.info("----------GETTING SYSTEM USERS-------------");
            while (systemUserIterator.hasNext()) {
                Authorizable systemUser = systemUserIterator.next();
                LOG.info("System User : {}", systemUser.getPath());
            }

            Iterator<Authorizable> groupIterator = userManager.findAuthorizables("jcr:primaryType", "rep:Group");
            LOG.info("----------GETTING GROUPS-------------");
            while (groupIterator.hasNext()) {
                Authorizable group = groupIterator.next();
                LOG.info("Group : {}", group.getPath());
            }
            resourceResolver.close(); // in real code, close the resolver in a finally block

2. Sample queries and code implementations.
A Query Builder query to get pages and assets. These are the simplest forms; extend them as needed.

/* ---To get assets----*/
path=/content/dam
type=dam:Asset
p.limit=-1

/* ---To get pages----*/
/* ---cq:PageContent matches the jcr:content node of each page; use cq:Page to match the page nodes themselves----*/
path=/content
type=cq:PageContent
p.limit=-1
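
The predicates above can also be tried over HTTP before writing any service code, because AEM exposes Query Builder through a JSON servlet at /bin/querybuilder.json. The following is a minimal sketch of turning a predicate map into such a URL. The host http://localhost:4502 and the class name QueryBuilderUrl are assumptions for illustration, and the request still needs credentials with read access to the queried paths.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds a Query Builder JSON servlet URL from predicates.
class QueryBuilderUrl {

    // Joins URL-encoded key=value pairs onto <host>/bin/querybuilder.json.
    static String build(String host, Map<String, String> predicates) {
        String query = predicates.entrySet().stream()
                .map(e -> enc(e.getKey()) + "=" + enc(e.getValue()))
                .collect(Collectors.joining("&"));
        return host + "/bin/querybuilder.json?" + query;
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // The asset predicates from the post, in insertion order.
    static Map<String, String> assetPredicates() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("path", "/content/dam");
        p.put("type", "dam:Asset");
        p.put("p.limit", "-1");
        return p;
    }
}
```

Calling QueryBuilderUrl.build("http://localhost:4502", QueryBuilderUrl.assetPredicates()) yields a URL you can open in a browser or fetch with any HTTP client.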

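How to implement it in the backend (for example, inside an OSGi service):

// Needs com.day.cq.search.* (QueryBuilder, Query, PredicateGroup),
// com.day.cq.search.result.* (SearchResult, Hit) and com.day.cq.dam.api.Asset.
@Reference
QueryBuilder queryBuilder;

Map<String, String> queryMap = new HashMap<>();
queryMap.put("path", "/content/dam/we-retail");
queryMap.put("type", "dam:Asset");
queryMap.put("p.limit", "-1"); // -1 returns every match; this can be slow on large repositories
final Session session = resourceResolver.adaptTo(Session.class);
Query query = queryBuilder.createQuery(PredicateGroup.create(queryMap), session);
SearchResult result = query.getResult();
long totalResults = result.getTotalMatches();
LOG.info("Total assets: {}", totalResults);
for (Hit hit : result.getHits()) {
    try {
        Asset asset = hit.getResource().adaptTo(Asset.class);
        if (asset != null) {
            LOG.info("Asset : {}", asset.getPath());
        }
    } catch (RepositoryException e) {
        // Hit.getResource() declares RepositoryException
        LOG.warn("Could not read search hit", e);
    }
}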
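
For a full export, fetching everything in one call with p.limit=-1 can be heavy on a large repository. One alternative is to page through the results with the Query Builder predicates p.offset and p.limit. The following is a minimal sketch of building the predicate map for each page; the class name PagedPredicates and the page size are illustrative assumptions, not something from the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: predicate maps for paging through Query Builder results.
class PagedPredicates {

    // pageIndex is zero-based; p.offset is the index of the first hit to return.
    static Map<String, String> forPage(String path, String type, long pageIndex, long pageSize) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("path", path);
        p.put("type", type);
        p.put("p.offset", Long.toString(pageIndex * pageSize));
        p.put("p.limit", Long.toString(pageSize));
        return p;
    }
}
```

Each map can be passed to PredicateGroup.create(...) in place of queryMap, increasing pageIndex until a page returns fewer hits than pageSize.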

If you use JCR-SQL2 instead:

            String searchPath = "/content/we-retail";
            // ISDESCENDANTNODE takes the selector name and a quoted path.
            String sql2Query = "SELECT * FROM [cq:PageContent] AS node WHERE ISDESCENDANTNODE(node, '" + searchPath + "') ORDER BY node.[jcr:title]";
            ResourceResolver resourceResolver = ResolverUtil.newResolver(resourceResolverFactory);
            final Session session = resourceResolver.adaptTo(Session.class);
            final javax.jcr.query.Query query = session.getWorkspace().getQueryManager().createQuery(sql2Query, javax.jcr.query.Query.JCR_SQL2);
            final QueryResult result = query.execute();
            NodeIterator pages = result.getNodes();
            JSONArray resultArray = new JSONArray();
            while (pages.hasNext()) {
                Node page = pages.nextNode();
                resultArray.put(page.getPath()); // collect whatever fields you need
            }
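
Because the JCR-SQL2 statement is assembled by string concatenation, a path containing a single quote would break the query. The following is a minimal sketch of a helper that quotes the path as a string literal, doubling any embedded single quotes. The class name Sql2 and its method names are hypothetical, and the node type is passed through unchanged, so it should come from trusted code only.

```java
// Hypothetical helper: builds a JCR-SQL2 descendant query with a quoted path.
class Sql2 {

    // Wraps a value in single quotes, doubling any embedded single quote.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Selects all nodes of nodeType below path, ordered by jcr:title.
    static String descendantsOf(String nodeType, String path) {
        return "SELECT * FROM [" + nodeType + "] AS node"
                + " WHERE ISDESCENDANTNODE(node, " + quote(path) + ")"
                + " ORDER BY node.[jcr:title]";
    }
}
```

The returned string can be handed to QueryManager.createQuery(...) with javax.jcr.query.Query.JCR_SQL2.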

These are only code samples. Make sure the ResourceResolver you use has the proper permissions.

 

3 replies

SantoshSai
Community Advisor
June 14, 2022

Hi @sriram_1 ,

It would be really helpful if you could elaborate on what kind of data you need and for what purpose.

Regards,

Santosh 

Santosh Sai
sriram_1 (Author)
Level 3
June 14, 2022

Hi @santoshsai 

 

Kind of data: users, groups, sites, assets, forms, screens, and so on.

 

Purpose: ingest the data from AEM into our project for search optimization.

 

Thanks

Sriram

SantoshSai
Community Advisor
June 14, 2022

@sriram_1 

Usually in AEM we don't share data related to users, groups, etc.

However, if you do want to expose that data, there is an HTTP API you can refer to.

In terms of search optimization, here are a few links from Adobe as well as a few from third parties.

To make sure crawlers crawl your website, you need a sitemap.xml and a robots.txt that points the crawler to that sitemap.xml. Please refer to Robot.txt

Sitemap Generator: https://adobe-consulting-services.github.io/acs-aem-commons/features/sitemap/index.html
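
To illustrate the robots.txt part: crawlers discover the sitemap through a Sitemap line in robots.txt. The following is a minimal sketch that renders such a file. The class name RobotsTxt, the example.com URL, and the disallowed path are assumptions for illustration only.

```java
// Hypothetical helper: renders a robots.txt that points crawlers to a sitemap.
class RobotsTxt {

    // One rule block for all user agents, then the Sitemap directive.
    static String render(String sitemapUrl, String... disallowedPaths) {
        StringBuilder sb = new StringBuilder("User-agent: *\n");
        for (String path : disallowedPaths) {
            sb.append("Disallow: ").append(path).append('\n');
        }
        sb.append("Sitemap: ").append(sitemapUrl).append('\n');
        return sb.toString();
    }
}
```

For example, RobotsTxt.render("https://www.example.com/sitemap.xml", "/bin/") produces a file with a User-agent line, one Disallow line, and the Sitemap line.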

Santosh Sai
Adobe Employee
June 14, 2022

Which search engine are you trying to use?

sunil_kumar_ (Accepted solution)
Level 5
June 15, 2022

SantoshSai
Community Advisor
June 15, 2022

Looking into this, I have one question, @sunil_kumar_. I believe the original request was about fetching/crawling AEM data. Considering AEM best practices, I don't think we should expose the kind of data you mention above to any search engine. I also don't understand the idea of running costly queries just to crawl. As I understand it, crawling here means search engine optimization. Correct me if I'm wrong.

Santosh Sai
sunil_kumar_
Level 5
June 15, 2022

@santoshsai As I mentioned, I added this code sample for the sake of an answer. As for exposing data, it all depends on the user's use case. If the client needs it, we have to do it. Maybe the client wants the data for an internal portal; we don't know the exact use case. In one of the replies, the user mentioned that he is looking for users, groups, sites, assets, etc.

In the question the user said "entire data", so I added this code for the sake of providing complete information. I am trying to add as much information as possible and let the user decide what he needs. This might help others as well.
We are here to help as much as possible. When requirements are not clear, which is true in most cases, we should try to help with as much information as we can.