Expand my Community achievements bar.

Issue with oak index selection and ordered properties

Avatar

Level 2

I will break this question down into two parts because I think they are probably related.

The first relates to what seems like improper index selection, the second pertains to optimal configuration of indexes.

 

In AEM most of the queries we do are based on @sling:resourceType, typically with some second property restriction. In one such case we make queries for a particular resource type as well as restrictions on a property called @date. The @date property is of type Date. The queries include some restriction on date range as well as sorting the result by the @date property.

To improve performance a Lucene index has been created under our content tree; the index includes the property @date with the “ordered” flag. This index yielded significant (huge!) performance gains for queries relating to the @date property.

However, we have noticed that in some cases the custom Lucene index is not the index that is selected even when it would much more efficient and yield results in a fraction of time. Instead, the @sling:resourceType index (default out of the box) is used. The problem with using this index is that since the query includes sorting the results must be sorted and then filtered before returning the final results; this is a costly operation. Had the custom Lucene index been used (which includes the queries sort property as an ordered index property), the performance would have been much greater.

To confirm this the “costPerEntry” property on the custom Lucene index can be manipulated in order to force the engine to select the custom index; as expected the query executes in a fraction of the time. However, that is not a real solution because it forces the cost to be so low that the index is used even when it should not be.

This behaviour is not surprising when looking at the oak documentation, which states: 

  • Each query index is expected to estimate the worst-case cost to query with the given filter. The returned value is between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse, if the cursor would be fully read, and if there could in theory be one network round-trip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory).
  • The returned value is supposed to be an estimate and doesn’t have to be very accurate. Please note this method is called on each index whenever a query is run, so the method should be reasonably fast (not read any data itself, or at least not read too much data).

So this shows that the index selection is based on the cost of having to traverse the elements (worst case).

In our situation the @date property is found on nodes with many different @resourceType values. So for a @resourceType of “A” there would be fewer elements in the resourceType index with the value “A” than there would be elements in the @date index.

What the cost calculations do not seem to account for is the cost of sorting the results. If there are two competing indexes, lets call them X and Y; if X is a smaller index, but Y is ordered, and the query includes a "sort by" of a property located (and ordered) in Y; then unless X is drastically smaller than Y, Y would be the best index because X would need sorting (which is a cost not accounted for in the index cost calculations).

I pose the question of how to rectify this situation such that the best index is used; perhaps i have missed something or have failed to understand something; it seems to me that the cost calculation should and could make an estimate of the cost of sorting the results if the query requires it?

 

The second part of this question is if Lucene indexes should be broken down into individual indexes or aggregated together into single indexes. The documentation design considerations sate:

  • ….
  • Aim for a precise index configuration which indexes just the right amount of content based on your query requirement. An index which is precise would be smaller and would perform better.
  • ….
  • its recommended to use single index if all the properties being indexed are related. This would enable Lucene index to evaluate as much property restriction as possible natively (which is faster) and also save on storage cost incurred in storing the node path.
  • ….
  • Do not have overlapping index definition indexing same nodetype but having different includedPaths andexcludedPaths

For an index, such as discussed above with the ordered @date property, should @sling:resourceType also be included in that index? There exists out of the box a root index for @sling:resourceType, so is there a draw-back to including it again in a non-root index?

To extend on that concept, if we had an index for an unrelated query which has restrictions on both @sling:resourceType and @some-other-property, should a new non-root custom index be created, to support this query, which also includes @sling:resourceType.

 

The question is really the choice between these setups

option 1: avoid overlapping

/oak:index
    /resourceType
        ->@sling:resourceType
/content

    /custom1
        ->@date (ordered)
    /custom2
        ->@some-other-property

 

option 2: include all the relevant properties that each type of query would need, two unrelated queries = two unrelated indexes.

/oak:index
    /resourceType
        ->@sling:resourceType
/content
    /custom1
        ->@date (ordered)
        ->@sling:resourceType
    /custom2
        ->@some-other-property
        ->@sling:resourceType

 

option 3: include all relevant properties each type of query would need, one index: this makes the index larger than it needs to be since @date and @some-other-property are never used together in any query, but reduces multiple indexes having the @sling:resourceType property.

/oak:index
    /resourceType
        ->@sling:resourceType
/content

    /custom1
        ->@date (ordered)->@some-other-property
        ->@sling:resourceType

3 Replies

Avatar

Level 10

Your description was good and at same time has question as well answer. Let me give a shot

Question1:-     Yes manupulate cost calculation Or use native query so that cost caluclation is skipped & you can specify directly which index to use.

Ouestion2:-     You specified all the option. What is your expectation here?

Avatar

Level 2

Question 1 -

The issue I face with manipulating the cost is that now the index (with its cost manipulated) is sometimes used when its the poorer choice; it ends up just reversing the problem: queries that were performing poorly because they weren't using this index now perform well, but other queries that were performing well, in some cases, now perform poorly because they ARE using this index where they weren't using it before.

Using native queries seems like it could be the best solution. Is there a way of making use of native queries with the "QueryBuilder"?

Question 2 -

The question is if one of those three options is superior or more "proper" than the others?

  • Is there a drawback to including sling:resourceType again in another index other than increasing the size of that index?
  • Will including sling:resourceType in the index improve performance because all of the properties used in the query are located in the same index, or will it not matter?
  • Will including all properties that might be queried by any number of un-related queries into one large index decrease performance, or should indexes be geared individually towards satisfying the needs of a small set of related queries; even if this means multiple indexes containing the same property (e.g. "sling:resourceType" would be included in most of our indexes)?

Avatar

Level 1

Very good question! I have the same problems and similar thoughts. Unfortunately I cannot give you a satisfying answer on this, but I will try to bring you to some ideas what you can try or in which direction to think in such cases.

Question 1

Sure manipulating the cost is an option, but how you said, but this forces other apps and more important AEM core services to use your index in some cases as well and this is not something that you want to change. So using native queries is possibly one of the few options you have. You can implement custom predicate evaluator in order to use them with the QueryBuilder. But using native queries is again not the best option one may choose (think of maintainability) Other option is to tune your index in a way that you include more precise the node type where the property index should match (e.g. instead nt:base: +myprop -> cq:PageContent: +myprop)

Question 2

Defining individual indexes keeps things clear and creates some kind of modularity in your projects which is great. Unfortunately this leads to issues, when you have to define for example the same type (e.g. nt:unstructured) for different individual indexes (e.g. different parts of your business logic) and different properties. The index chooser may pick (and probably in the most cases will) the wrong index where non (in the best case part of them) of your properties are indexed, just because for example the queryPaths, the property index types or the aggregation matches and this will be essentially slower than using the actual index. For me the index chooser algorithm is not so optimal. It's created to be fast, which is also very important as this should cost no time, but it doesn't consider other important things of your index definition. So in my opinion but it should analyze a little bit more than it actually does. On the one hand the documentation says "Do not have overlapping index definition indexing same nodetype...Having multiple definition for same type would cause ambiguity in index selection and may lead to unexpected results", on the other hand "Aim for a precise index configuration which indexes just the right amount of content based on your query requirement. An index which is precise would be smaller and would perform better." and "In fact its recommended to use single index if all the properties being indexed are related". I cannot imagine that this may happen in a real world scenario. So to answer to your question .. I would include sling:resourceType as well if the node type of other index is not defining it already (think about of "its recommended to use single index if all the properties being indexed are related. This would enable the Lucene index to evaluate as much property restriction as possible natively (which is faster)") and always check your related (also overlapping) queries using the analyzer.