Adobe Experience Manager Sites & More

cqsapientu69896 · 10/11/20

I want to exclude certain properties from AEM full-text search so that searching for a page author's name returns no matching results.

For example, if I search for Amit, a few pages come back as results because they were authored by Amit. I don't want these search results.

I am using the default OOTB cqPageLucene index

I already checked the documentation - https://jackrabbit.apache.org/oak/docs/query/lucene.html - where it is mentioned that to exclude a property we can add index(boolean) false on the property

I have added index(boolean) false on

jcr:content/cq:lastRolledoutBy

jcr:content/cq:lastModifiedBy

jcr:content/cq:lastReplicatedBy

but the issue is that below the jcr:content node a page has many nested nodes (for example, a responsive grid inside another responsive grid), and whenever an author drops a component, that component gets jcr:createdBy and jcr:lastModifiedBy properties, which contain the content author's id/name

So I am planning to use isRegexp, as mentioned in the doc, to write a regex and then set index(boolean) false

Has anyone else faced the same issue and can help in excluding these jcr:createdBy and jcr:lastModifiedBy properties in deep nodes? Am I going in the right direction using isRegexp?

If yes - what can be the right regex to exclude these properties from certain (n) level of nodes?

I read this https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/indexing-data-indexing-rul... but there is no solution for excluding properties in nested nodes

Can any Oak experts or Lucene indexing gurus help me with this?

thanks in advance

Vijayalakshmi_S · 10/12/20

Hi @cqsapientu69896,

If a property should not be part of full-text search, set the following on its property definition -

nodeScopeIndex -> false (nodeScopeIndex set to true is what makes a property part of full-text search)
If we set "index -> false", the property is not indexed at all, so it will not be matched by property constraint queries either.
Also set analyzed -> false if the property should not be matched by a contains query.

For restricting the property names using regex,

isRegexp can be used to define the property name as a pattern. The depth of nodes that gets indexed is defined in the "aggregates" node of cqPageLucene -> /oak:index/cqPageLucene/aggregates/cq:PageContent
Each include definition defines a level relative to the cq:Page node
Considering this and your content hierarchy, you can make use of isRegexp to frame the property name.

Note : The concern that you mentioned, "property name can be at any depth under cq:Page", can be handled using the aggregates and the property definition together. (In other words, the depth of nodes to be indexed under cq:Page is defined by the aggregates node. Even if the property exists at, say, the 10th level, the respective node would not have been indexed in the first place unless it is explicitly covered by an include rule.)

Example :

isRegexp -> true
name -> jcr:content/*/*/*/.* (all properties of spacer node - jcr:content/root/responsivegrid/spacer)
or
name -> jcr:content/*/*/*/jcr:lastModifiedBy
analyzed -> false
nodeScopeIndex -> false

cqsapientu69896 · 10/11/20

Update : I was able to use excludeFromAggregation(boolean) false to exclude the nested properties from full-text indexing, but the issue here is that one has to provide the complete property name. For example, jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy excludes the property jcr:lastModifiedBy in a component named spacer just below the root responsivegrid. Adding this for all components is not practical or feasible, as the content structure can be anything: one responsivegrid inside another, hundreds of components, and so on. I also read that there is an open Oak issue, https://issues.apache.org/jira/browse/OAK-5187, for allowing nested children to be indexed using the isRegexp property. So for this reason it is not possible right now to exclude deeply nested properties from getting indexed. Is there any other way of achieving this? I will appreciate any help/guidance

MikeGforces · 10/13/20

Hello, I have a similar problem. I want to exclude all the "technical" properties from the full-text index (cqPageLucene). As a test, I added the following node on my local instance:

+ cqPageLucene
  + indexRules
    + cq:Page
      + properties
        + techProp
          - name="myTechPropName"
          - index="{Boolean}false"
          - excludeFromAggregation="{Boolean}true"

After the reindex I still get search results based on the value of this property, so the index definition properties above, as described in the Lucene documentation, are not working.

https://jackrabbit.apache.org/oak/docs/query/lucene.html#indexing-rules

Vijayalakshmi_S · 10/14/20

Hi @MikeGforces,

Can you share the details below so we can debug further:

AEM version
Contents of the cqPageLucene index (if possible, and if you have made changes to the OOTB index)
Full text query that you are using
Exact property location in content hierarchy

cqsapientu69896 · 10/12/20

Thanks @Vijayalakshmi_S for the descriptive answer

however it is not correct - I also added a comment to my question yesterday which mentioned that isRegexp does not support child nodes -

as it is also mentioned in the document - https://jackrabbit.apache.org/oak/docs/query/lucene.html

Note that the regular expression doesn’t match intermediate nodes, so, jcr:content/.*/.* would not index all properties for all children of jcr:content. OAK-5187 is an open improvement to track supporting arbitrary intermediate child nodes.

I tried adding a node with isRegexp true; analyzed false; nodeScopeIndex false and name as jcr:content/*/*/*/jcr*

and it still returned the result with the author name (it is not excluding the property)

It is the same with these regexes:

jcr:content/*/*/*/jcr.*

jcr:content/*/*/*/jcr:lastModifiedBy

So it is not working - and the reason for this is https://issues.apache.org/jira/browse/OAK-5187
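
Separately from OAK-5187, the patterns tried above are written like globs, and as regular expressions they would not match this path anyway. The sketch below is only an illustration of regex semantics: it uses Python's `re` module as a stand-in, on the assumption that it behaves like the Java regex engine for patterns this simple. The `segment_style` pattern is a hypothetical rewrite, and per the documentation note quoted above Oak is not expected to honor it for intermediate nodes.

```python
import re

# Relative property path of a component below the root grid (from this thread).
path = "jcr:content/root/responsivegrid/spacer/jcr:lastModifiedBy"

# Pattern as tried in the thread. In regex syntax "/*" means
# "zero or more slash characters", not "any node name".
glob_style = r"jcr:content/*/*/*/jcr:lastModifiedBy"

# Hypothetical rewrite: "[^/]+" means "one node name" (one or more non-slash chars).
segment_style = r"jcr:content/[^/]+/[^/]+/[^/]+/jcr:lastModifiedBy"

glob_matches = re.fullmatch(glob_style, path) is not None
segment_matches = re.fullmatch(segment_style, path) is not None

# The glob-style pattern only matches paths with no intermediate node names.
glob_matches_shallow = (
    re.fullmatch(glob_style, "jcr:content/jcr:lastModifiedBy") is not None
)

print(glob_matches, segment_matches, glob_matches_shallow)
```

So even once intermediate nodes are supported, a rule would need one `[^/]+` per level rather than `*`.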

can you please let me know if my understanding is correct ? cc @kautuk_sahni

Vijayalakshmi_S · 10/13/20

Hi @cqsapientu69896,

Your understanding is correct. I missed the open issue.

If your requirement is critical and needs to be addressed by any means, consider the approach below (it does not involve isRegexp and instead provides the complete property path).

Per the aggregates defined as part of the OOTB index cqPageLucene (AEM 6.5.0), it indexes nodes up to 5 levels below the cq:Page node (Ex: jcr:content/root/responsivegrid/content_fragment/sample)
Properties of the testsample node at this path - jcr:content/root/responsivegrid/content_fragment/sample/testsample - will not be indexed.
If the component nodes to be restricted are within this hierarchy and the number of property paths to restrict (considering project content path patterns) is small, consider creating a property definition for each, providing the full property path, like jcr:content/root/responsivegrid/content_fragment/jcr:lastModifiedBy
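
As a rough way to check which component paths fall inside that range, here is a minimal sketch. It assumes the 5-level limit stated above and only counts path segments; the real aggregate include patterns may be more specific. The helper names are hypothetical and not part of any Oak or AEM API.

```python
# Assumed limit, taken from the reply above (OOTB cqPageLucene, AEM 6.5.0).
MAX_AGGREGATED_DEPTH = 5


def depth_below_page(relative_path: str) -> int:
    """Count node levels below cq:Page in a relative node path."""
    return len([segment for segment in relative_path.split("/") if segment])


def is_aggregated(relative_path: str) -> bool:
    """True if the node sits within the assumed aggregation depth."""
    return depth_below_page(relative_path) <= MAX_AGGREGATED_DEPTH


sample = "jcr:content/root/responsivegrid/content_fragment/sample"
testsample = sample + "/testsample"

print(depth_below_page(sample), is_aggregated(sample))
print(depth_below_page(testsample), is_aggregated(testsample))
```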

Also, while trying this on my local instance, I could see that nodeScopeIndex/analyzed -> false does not always restrict the property. You can try it, and if it is the same for you, use the property named "excludeFromAggregation" -> true [Boolean] on the property definition instead.

Conclusion:

Providing full property path (without regex which works only for property names not for intermediate nodes) + "excludeFromAggregation" should work. Please try and update this thread.
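
Since each full property path has to be listed individually, one option is to generate the list from the content structure rather than write it by hand. The sketch below is an assumption-heavy illustration: the page is modelled as a plain nested dict instead of being read from the repository, `exclusion_rules` is a hypothetical helper, and the output is just the name/excludeFromAggregation pairs that would still have to be created as property definitions under the index rule.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for a page's node tree: node name -> child nodes.
Tree = Dict[str, "Tree"]


def node_paths(tree: Tree, prefix: str = "") -> Iterator[str]:
    """Yield the relative path of every node in the tree, parents first."""
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path
        yield from node_paths(children, path)


def exclusion_rules(
    tree: Tree, property_names: List[str], max_depth: int = 5
) -> List[dict]:
    """Build one rule per (node, property) pair within max_depth levels."""
    rules = []
    for path in node_paths(tree):
        if path.count("/") + 1 > max_depth:
            continue  # deeper than the assumed aggregation depth
        for prop in property_names:
            rules.append(
                {"name": f"{path}/{prop}", "excludeFromAggregation": True}
            )
    return rules


page = {"jcr:content": {"root": {"responsivegrid": {"spacer": {}, "text": {}}}}}
rules = exclusion_rules(page, ["jcr:createdBy", "jcr:lastModifiedBy"])

for rule in rules:
    print(rule["name"])
```

This does not remove the underlying limitation: the rules cover only the structures present in the sampled content, so they would have to be regenerated as authors create new nesting patterns.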

MikeGforces · 10/13/20

I've just read that comment and I've got the same problem.

cqsapientu69896 · 10/14/20

Thanks @Vijayalakshmi_S

If you check the comment I posted on the same day I asked this question, I had already tried excludeFromAggregation(boolean) false with the complete property path and mentioned that it works.

I also mentioned that it is not feasible or practical to add it for thousands of nodes, as the node structure, and hence the property path, can be anything depending on how the content has been authored.

excluding properties from fulltext lucene index searches not working for deep nodes