Excluding properties from full-text Lucene index searches not working for deep nodes

cqsapientu69896

11-10-2020

I want to exclude certain properties from AEM full-text search so that searching for a page author's name returns no matching results.

For example, if I search for Amit, a few pages are returned because they were authored by Amit. I don't want these search results.

 

I am using the default OOTB cqPageLucene index

 

I already checked the documentation at https://jackrabbit.apache.org/oak/docs/query/lucene.html, where it is mentioned that to exclude a property we can add index (Boolean) false on the property.

I have added index (Boolean) false on:

 

jcr:content/cq:lastRolledoutBy 

jcr:content/cq:lastModifiedBy

jcr:content/cq:lastReplicatedBy

 

But the issue is that there are various nodes below the jcr:content node in a page, for example a responsive grid inside another responsive grid. When an author drops a component, that component always has jcr:createdBy and jcr:lastModifiedBy properties, which contain the content author's ID/name.

 

So I am planning to use isRegexp, as mentioned in the doc, to write a regex and then set index (Boolean) false.

 

Has anyone else faced the same issue and can help in excluding these jcr:createdBy and jcr:lastModifiedBy properties in deep nodes? Am I going in the right direction using isRegexp?

If yes, what would be the right regex to exclude these properties from a certain number (n) of node levels?

 

I read this https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/indexing-data-indexing-rul... but there is no solution for excluding properties in nested nodes

 

Can any Oak experts or Lucene indexing gurus help me with this?

 

thanks in advance

fulltext search lucene oak oak:index query string

Accepted Solutions (1)

Vijayalakshmi_S

MVP

12-10-2020

Hi @cqsapientu69896,

If a property should not be part of full-text search, set the following on its property definition:

  • nodeScopeIndex -> false (nodeScopeIndex set to true is what makes a property part of full-text search).
  • If we set "index -> false", the property is not indexed at all, so it will not be part of property constraint query results either.
  • Also set analyzed -> false if the property should not be part of contains queries.

For restricting the property names using regex:

  • isRegexp can be used to define the property name. The level of nodes up to which content is indexed is defined in the "aggregates" node of cqPageLucene -> /oak:index/cqPageLucene/aggregates/cq:PageContent
  • Each include definition defines the level with respect to the cq:Page node.
  • Considering this and your content hierarchy, you can make use of isRegexp to frame the property name.

Note: The concern you mentioned, that the property can be at any depth under cq:Page, can be handled using aggregates and property definitions together. In other words, the depth of nodes to be indexed under cq:Page is defined with the help of the aggregates node. Even if the property exists at, say, the 10th level, the respective node might not have been indexed in the first place unless we define it explicitly in an include rule.

Example :

isRegexp -> true
name -> jcr:content/*/*/*/.* (all properties of spacer node - jcr:content/root/responsivegrid/spacer)
or
name -> jcr:content/*/*/*/jcr:lastModifiedBy
analyzed -> false
nodeScopeIndex -> false

 

Answers (3)

cqsapientu69896

14-10-2020

Thanks @Vijayalakshmi_S 

 

If you check my comment from the same day I asked this question, I had already tried excludeFromAggregation (Boolean) false with the complete property path and mentioned that it works.

I also mentioned that it is not feasible or practical to add it for thousands of nodes, because the node structure, and hence the property path, can be anything depending on how the content has been authored.

Vijayalakshmi_S

MVP

13-10-2020

Hi @cqsapientu69896,

Your understanding is correct. I missed the open issue.

If your requirement is critical and needs to be addressed by any means, consider the approach below (it does not involve isRegexp and provides the complete property path).

  • Per the aggregates defined as part of the OOTB index cqPageLucene (AEM 6.5.0), it indexes nodes up to 5 levels below the cq:Page node (e.g. jcr:content/root/responsivegrid/content_fragment/sample).
  • Properties of the testsample node in this path, jcr:content/root/responsivegrid/content_fragment/sample/testsample, will not be indexed.
  • If the component nodes to be restricted are within this hierarchy and the number of property paths to restrict (considering your project's content path patterns) is small, consider creating a property definition for each one, providing the full property path, like jcr:content/root/responsivegrid/content_fragment/jcr:lastModifiedBy

Also, in a trial on my local instance, I could see that nodeScopeIndex/analyzed -> false does not always restrict the property. You can try it, and if it is the same for you, set "excludeFromAggregation" -> true [Boolean] on the property definition instead.

Conclusion:

Providing the full property path (without regex, which works only for property names and not for intermediate nodes) together with "excludeFromAggregation" should work. Please try it and update this thread.
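Since this approach needs one property definition per full property path, the definitions could be generated rather than written by hand. The following is only a minimal Python sketch under stated assumptions: the list of component paths, the helper name build_property_definitions and the node-naming scheme are hypothetical, and the output is just a JSON outline of the name and excludeFromAggregation values discussed above. The resulting definitions would still have to be created under the properties node of the relevant index rule in cqPageLucene, and a reindex is typically needed afterwards.

```python
import json

# Hypothetical inputs: component node paths relative to cq:Page, taken from
# the examples in this thread, and the author-related properties to hide.
COMPONENT_PATHS = [
    "jcr:content/root/responsivegrid/content_fragment",
    "jcr:content/root/responsivegrid/spacer",
]
AUTHOR_PROPERTIES = ["jcr:createdBy", "jcr:lastModifiedBy"]


def build_property_definitions(component_paths, property_names):
    """Return one property definition per full property path."""
    definitions = {}
    for path in component_paths:
        for prop in property_names:
            full_path = f"{path}/{prop}"
            # Hypothetical naming scheme for the definition node itself.
            node_name = full_path.replace("/", "_").replace(":", "_")
            definitions[node_name] = {
                "name": full_path,               # full path, no regex
                "excludeFromAggregation": True,  # Boolean, as suggested above
            }
    return definitions


print(json.dumps(build_property_definitions(COMPONENT_PATHS, AUTHOR_PROPERTIES), indent=2))
```

This does not remove the limitation raised in this thread: every distinct component path still has to be known up front, so it only helps when the set of path patterns is small.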

cqsapientu69896

12-10-2020

Thanks @Vijayalakshmi_S for the descriptive answer.

However, it is not correct. I also added a comment to my question yesterday which mentioned that isRegexp does not support child nodes, as is also mentioned in the documentation - https://jackrabbit.apache.org/oak/docs/query/lucene.html

 

Note that the regular expression doesn’t match intermediate nodes, so, jcr:content/.*/.* would not index all properties for all children of jcr:content. OAK-5187 is an open improvement to track supporting arbitrary intermediate child nodes.

 

I tried adding a node with 

isRegexp  true;

analyzed false;

nodeScopeIndex false

and name as jcr:content/*/*/*/jcr*

 

and it still returned the result with the author name (the property is not being excluded).

It is the same with these regexes:

 

jcr:content/*/*/*/jcr.*

 

jcr:content/*/*/*/jcr:lastModifiedBy

 

So it is not working  - and the reason for this is https://issues.apache.org/jira/browse/OAK-5187 
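As a side check on the patterns above, note that "/*" in a regular expression means "zero or more slashes", not "any node name". The sketch below uses Python's re module purely to illustrate plain regex semantics (Java's regex engine should behave the same way for these simple patterns); the regex_style pattern and the matches helper are illustrative assumptions. It says nothing about how Oak applies the pattern, and per the documentation quoted above even a well-formed regex is not expected to match intermediate nodes until OAK-5187 is addressed.

```python
import re

# Property path from the spacer example earlier in this thread.
path = "jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy"

# Pattern as tried above: each "/*" means "zero or more '/' characters".
glob_style = r"jcr:content/*/*/*/jcr.*"

# Hypothetical rewrite where each level matches one node name.
regex_style = r"jcr:content/[^/]+/[^/]+/[^/]+/jcr:.*"


def matches(pattern, candidate):
    """True if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None


print(matches(glob_style, path))                              # False
print(matches(regex_style, path))                             # True
print(matches(glob_style, "jcr:content/jcr:lastModifiedBy"))  # True
```

So the glob-style pattern would not describe the deep property path even as a plain regex; it only describes properties directly under jcr:content.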

 

Can you please let me know if my understanding is correct? cc @kautuk_sahni