Issue with oak index selection and ordered properties
I will break this question down into two parts because I think they are probably related.
The first relates to what seems like improper index selection, the second pertains to optimal configuration of indexes.
In AEM most of the queries we do are based on @sling:resourceType, typically with some second property restriction. In one such case we make queries for a particular resource type as well as restrictions on a property called @date. The @date property is of type Date. The queries include some restriction on date range as well as sorting the result by the @date property.
To improve performance a Lucene index has been created under our content tree; the index includes the property @date with the “ordered” flag. This index yielded significant (huge!) performance gains for queries relating to the @date property.
However, we have noticed that in some cases the custom Lucene index is not the index that is selected even when it would much more efficient and yield results in a fraction of time. Instead, the @sling:resourceType index (default out of the box) is used. The problem with using this index is that since the query includes sorting the results must be sorted and then filtered before returning the final results; this is a costly operation. Had the custom Lucene index been used (which includes the queries sort property as an ordered index property), the performance would have been much greater.
To confirm this the “costPerEntry” property on the custom Lucene index can be manipulated in order to force the engine to select the custom index; as expected the query executes in a fraction of the time. However, that is not a real solution because it forces the cost to be so low that the index is used even when it should not be.
This behaviour is not surprising when looking at the oak documentation, which states:
- Each query index is expected to estimate the worst-case cost to query with the given filter. The returned value is between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse, if the cursor would be fully read, and if there could in theory be one network round-trip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory).
- The returned value is supposed to be an estimate and doesn’t have to be very accurate. Please note this method is called on each index whenever a query is run, so the method should be reasonably fast (not read any data itself, or at least not read too much data).
So this shows that the index selection is based on the cost of having to traverse the elements (worst case).
In our situation the @date property is found on nodes with many different @resourceType values. So for a @resourceType of “A” there would be fewer elements in the resourceType index with the value “A” than there would be elements in the @date index.
What the cost calculations do not seem to account for is the cost of sorting the results. If there are two competing indexes, lets call them X and Y; if X is a smaller index, but Y is ordered, and the query includes a "sort by" of a property located (and ordered) in Y; then unless X is drastically smaller than Y, Y would be the best index because X would need sorting (which is a cost not accounted for in the index cost calculations).
I pose the question of how to rectify this situation such that the best index is used; perhaps i have missed something or have failed to understand something; it seems to me that the cost calculation should and could make an estimate of the cost of sorting the results if the query requires it?
The second part of this question is if Lucene indexes should be broken down into individual indexes or aggregated together into single indexes. The documentation design considerations sate:
- ….
- Aim for a precise index configuration which indexes just the right amount of content based on your query requirement. An index which is precise would be smaller and would perform better.
- ….
- its recommended to use single index if all the properties being indexed are related. This would enable Lucene index to evaluate as much property restriction as possible natively (which is faster) and also save on storage cost incurred in storing the node path.
- ….
- Do not have overlapping index definition indexing same nodetype but having different includedPaths andexcludedPaths
For an index, such as discussed above with the ordered @date property, should @sling:resourceType also be included in that index? There exists out of the box a root index for @sling:resourceType, so is there a draw-back to including it again in a non-root index?
To extend on that concept, if we had an index for an unrelated query which has restrictions on both @sling:resourceType and @some-other-property, should a new non-root custom index be created, to support this query, which also includes @sling:resourceType.
The question is really the choice between these setups
option 1: avoid overlapping
/oak:index
/resourceType
->@sling:resourceType
/content
/custom1
->@date (ordered)
/custom2
->@some-other-property
option 2: include all the relevant properties that each type of query would need, two unrelated queries = two unrelated indexes.
/oak:index
/resourceType
->@sling:resourceType
/content
/custom1
->@date (ordered)
->@sling:resourceType
/custom2
->@some-other-property
->@sling:resourceType
option 3: include all relevant properties each type of query would need, one index: this makes the index larger than it needs to be since @date and @some-other-property are never used together in any query, but reduces multiple indexes having the @sling:resourceType property.
/oak:index
/resourceType
->@sling:resourceType
/content
/custom1
->@date (ordered)->@some-other-property
->@sling:resourceType