SOLVED

Excluding properties from full-text Lucene index searches not working for deep nodes

Level 4

I want to exclude certain properties from AEM full-text search so that searching for a page author's name returns no matching results.

For example, if I search for Amit, I get a few pages as results because those pages were authored by Amit. I don't want these search results.

I am using the default OOTB cqPageLucene index.

I have already checked the documentation at https://jackrabbit.apache.org/oak/docs/query/lucene.html, which says that to exclude a property we can set index (Boolean) to false on the property.

I have set index (Boolean) to false on:


jcr:content/cq:lastRolledoutBy 

jcr:content/cq:lastModifiedBy

jcr:content/cq:lastReplicatedBy

 

The issue is that below the jcr:content node a page has many nested nodes, for example a responsive grid inside another responsive grid. Whenever an author drops a component, that component gets jcr:createdBy and jcr:lastModifiedBy properties, which contain the content author's ID or name.

So I am planning to use isRegexp, as mentioned in the documentation, to write a regex and then set index (Boolean) to false.

Has anyone else faced the same issue, and can you help with excluding these jcr:createdBy and jcr:lastModifiedBy properties in deep nodes? Am I going in the right direction with isRegexp?

If yes, what would be the right regex to exclude these properties from a certain number (n) of node levels?

I read https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/indexing-data-indexing-rul... but it offers no solution for excluding properties in nested nodes.

Can any Oak experts or Lucene indexing gurus help me with this?

Thanks in advance.

1 Accepted Solution

Correct answer by
Community Advisor

Hi @cqsapientu69896,

If a property should not be part of full-text search, set the following on its property definition:

  • nodeScopeIndex -> false (nodeScopeIndex set to true is what makes a property part of full-text search).
  • If we set "index -> false", the property is not indexed at all, so it will not be available for property constraint queries either.
  • Also set analyzed -> false if the property should not be matched by contains queries.

To restrict property names using a regex:

  • isRegexp can be used so that the property name is treated as a regular expression. The depth of nodes that get indexed is defined in the "aggregates" node of cqPageLucene -> /oak:index/cqPageLucene/aggregates/cq:PageContent
  • Each include definition defines a level relative to the cq:Page node.
  • Considering this and your content hierarchy, you can use isRegexp to frame the property name.

Note: The concern you mentioned, that the property can be at any depth under cq:Page, can be handled by using aggregates and property definitions together. In other words, the depth of nodes indexed under cq:Page is defined by the aggregates node. Even if the property exists at, say, the 10th level, that node might not have been indexed in the first place unless it is defined explicitly in an include rule.

Example :

isRegexp -> true
name -> jcr:content/*/*/*/.* (all properties of spacer node - jcr:content/root/responsivegrid/spacer)
or
name -> jcr:content/*/*/*/jcr:lastModifiedBy
analyzed -> false
nodeScopeIndex -> false

 

8 Replies

Level 4
Update: I was able to use excludeFromAggregation (Boolean) false to exclude the nested properties from full-text indexing. The issue is that one has to provide the complete property name. For example, jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy excludes the jcr:lastModifiedBy property of a component named spacer just below the root responsivegrid. Adding this for every component is neither practical nor feasible, because the content structure can be anything: one responsivegrid inside another, hundreds of components, and so on.

I also read that there is an open Oak issue, https://issues.apache.org/jira/browse/OAK-5187, for allowing nested children to be indexed using the isRegexp property. For this reason it is currently not possible to exclude deeply nested properties from being indexed. Is there any other way of achieving this? I would appreciate any help or guidance.

Level 1

Hello, I have a similar problem. I want to exclude all the "technical" properties from the full-text index (cqPageLucene). As a test, I added the following node on my local instance:

+cqPageLucene

 + indexRules

  + cq:Page

   + properties

     + techProp

      -name="myTechPropName"

      -index="{Boolean}false"

      -excludeFromAggregation="{Boolean}true"

After the reindex I still get search results based on the value of this property, so the index definition properties described in the Lucene documentation do not seem to be working.

https://jackrabbit.apache.org/oak/docs/query/lucene.html#indexing-rules

Community Advisor

Hi @MikeGforces,

Can you share the details below so we can debug further?

  • AEM version
  • Contents of the cqPageLucene index (if possible, and if you have made changes to the OOTB index)
  • The full-text query that you are using
  • Exact location of the property in the content hierarchy

 

 

Level 4

Thanks @Vijayalakshmi_S for the descriptive answer.

However, it is not correct. Yesterday I added a comment to my question noting that isRegexp does not support child nodes, as is also stated in the documentation at https://jackrabbit.apache.org/oak/docs/query/lucene.html:

 

Note that the regular expression doesn’t match intermediate nodes, so, jcr:content/.*/.* would not index all properties for all children of jcr:content. OAK-5187 is an open improvement to track supporting arbitrary intermediate child nodes.

 

I tried adding a property definition node with

isRegexp = true

analyzed = false

nodeScopeIndex = false

and name set to jcr:content/*/*/*/jcr*

It still returned results matching the author name, so the property is not being excluded.

 

The result is the same with these patterns:

 

jcr:content/*/*/*/jcr.*

 

jcr:content/*/*/*/jcr:lastModifiedBy
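
A side note on the patterns above, separate from OAK-5187: in a regular expression, `/*` means "zero or more slash characters", not "any single node name" as it would in a glob. The sketch below checks this with Python's `re` module. Oak itself is Java, so this assumes Java's regex engine treats these simple patterns the same way. The `property_path` value is just the spacer example from this thread, and the variable names are only for illustration.

```python
import re

# Example property path taken from this thread (spacer component under the root grid).
property_path = "jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy"

# Patterns tried above: "/*" repeats the slash, it does not stand for a node name.
glob_like = [
    "jcr:content/*/*/*/jcr*",
    "jcr:content/*/*/*/jcr.*",
    "jcr:content/*/*/*/jcr:lastModifiedBy",
]
glob_like_results = [re.fullmatch(p, property_path) is not None for p in glob_like]

# What the second pattern does match: a property directly under jcr:content.
matches_direct_child = (
    re.fullmatch("jcr:content/*/*/*/jcr.*", "jcr:content/jcr:title") is not None
)

# A segment-aware regex: [^/]+ stands for exactly one node name.
segment_aware = "jcr:content/[^/]+/[^/]+/[^/]+/jcr:lastModifiedBy"
segment_aware_matches = re.fullmatch(segment_aware, property_path) is not None

print(glob_like_results)
print(matches_direct_child)
print(segment_aware_matches)
```

None of the three glob-like patterns matches the nested path, while the segment-aware one does. That is only a statement about regex syntax, though. Per the Oak documentation quoted above, regex property definitions do not match intermediate nodes, so even the segment-aware form is not expected to work in an index definition while OAK-5187 is open.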

 

So it is not working, and the reason is https://issues.apache.org/jira/browse/OAK-5187

Can you please let me know if my understanding is correct? cc @kautuk_sahni

 

 

 

Community Advisor

Hi @cqsapientu69896,

Your understanding is correct. I missed the open issue.

If your requirement is critical and needs to be addressed by any means, consider the approach below (it does not involve isRegexp and provides the complete property path).

  • Per the aggregates defined in the OOTB cqPageLucene index (AEM 6.5.0), nodes are indexed up to 5 levels below the cq:Page node (e.g. jcr:content/root/responsivegrid/content_fragment/sample).
  • Properties of the testsample node at jcr:content/root/responsivegrid/content_fragment/sample/testsample will not be indexed.
  • If the component nodes to be restricted are within this hierarchy and the number of property paths to restrict (considering your project's content path patterns) is small, consider creating a property definition for each one with the full property path, for example jcr:content/root/responsivegrid/content_fragment/jcr:lastModifiedBy

Also, in my local trial I could see that nodeScopeIndex/analyzed -> false does not always restrict the property. You can try it, and if you see the same, use the property "excludeFromAggregation" -> true [Boolean] on the property definition instead.

Conclusion:

Providing full property path (without regex which works only for property names not for intermediate nodes) + "excludeFromAggregation" should work. Please try and update this thread.
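
If you go the full-path route, the list of `name` values does not have to be typed by hand. Below is a minimal sketch, assuming you can export the absolute paths of the offending properties (for example from a query result) into a list of strings. The function name, the sample paths and the `max_node_depth` default are hypothetical. The default of 5 only mirrors the aggregate depth described above for AEM 6.5.0, so check your own aggregates node before relying on it.

```python
from typing import Iterable, List

# Property names to keep out of full-text search (the two named in this thread).
EXCLUDED_NAMES = {"jcr:createdBy", "jcr:lastModifiedBy"}


def relative_property_paths(
    absolute_paths: Iterable[str], max_node_depth: int = 5
) -> List[str]:
    """Collect distinct property paths relative to the page, starting at jcr:content."""
    found = set()
    for path in absolute_paths:
        _, sep, tail = path.partition("/jcr:content/")
        if not sep:
            continue  # property is not below a jcr:content node
        segments = ["jcr:content"] + tail.split("/")
        node_depth = len(segments) - 1  # the last segment is the property name
        if segments[-1] in EXCLUDED_NAMES and node_depth <= max_node_depth:
            found.add("/".join(segments))
    return sorted(found)


# Hypothetical export of property paths; real input would come from your repository.
sample_paths = [
    "/content/site/en/home/jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy",
    "/content/site/en/about/jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy",
    "/content/site/en/about/jcr:content/root/responsivegrid/text/jcr:createdBy",
    "/content/site/en/about/jcr:content/root/responsivegrid/text/text",
    "/content/site/en/about/jcr:content/root/a/b/c/d/e/jcr:createdBy",
]

for name in relative_property_paths(sample_paths):
    print(name)
```

Each printed line is a candidate `name` for one property definition with excludeFromAggregation set. Identical relative paths on different pages collapse into one entry, which helps only when component node names repeat across pages. If node names vary from page to page, the list can still grow large.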

Level 1
I've just read that comment and I've got the same problem.

Level 4

Thanks @Vijayalakshmi_S 

 

If you check the comment I posted on the same day I asked this question, I had already tried excludeFromAggregation (Boolean) false with the complete property path and mentioned that it works.

I also mentioned that it is neither feasible nor practical to add this for thousands of nodes, since the node structure, and hence the property path, can be anything depending on how the content was authored.