Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns #18760
tarun11Mavani wants to merge 5 commits
Enable funnel analysis that tracks users through steps within a composite key (e.g., per user per device category) by accepting multiple columns in CORRELATE_BY(col1, col2, ...). The single-key path is preserved as a zero-overhead fast path with separate addSingleKey/addMultiKey abstract methods and dedicated aggregation loops, ensuring no regression for existing single-column queries. Multi-key composite ID mapping uses stride-based arithmetic when the product of dictionary sizes fits in int, with a HashMap fallback for large key spaces. Co-authored-by: Cursor <cursoragent@cursor.com>
Benchmark was used for local validation only; not needed in the PR. Co-authored-by: Cursor <cursoragent@cursor.com>
Performance Validation (JMH)
Single-key path, before (baseline) vs. after (this PR):
*theta_sketch and partitioned_sorted show large error bars, which indicate JVM warmup variance rather than a real regression. Scores overlap within error margins.
Multi-key path (new feature, this PR only):
The single-key path shows no statistically significant regression. All deltas are within error margins. The bitmap/set/partitioned strategies (which dominate real workloads) are within ±2% of baseline, which is effectively identical.
Keep the original `add(Dictionary, A, int, int)` abstract method unchanged. The new multi-key method is added as `addMultiKey(A, int, Dictionary[], int[])`. Co-authored-by: Cursor <cursoragent@cursor.com>
e1d2196 to d6bb092
…egationResult double-count
- Add DictIdsWrapperTest covering the HashMap fallback path (large-cardinality composite keys where the product of dict sizes exceeds Integer.MAX_VALUE): path selection, sequential ID assignment, same-key idempotency, key-order sensitivity, and round-trip for 2- and 3-column keys. Also covers the stride-path reverseCompositeId round-trip. Add an isHashMapPath() predicate to DictIdsWrapper for test introspection (avoids widening _strides visibility).
- Add SortedAggregationResultTest with multi-key extraction scenarios.
- Fix SortedAggregationResult.extractResult(): clear _secondaryKeySteps after flushMultiKeyGroup() so that a (defensive) second call returns zeros rather than double-counting the last open primary group.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18760 +/- ##
============================================
+ Coverage 64.67% 64.79% +0.12%
- Complexity 1309 1319 +10
============================================
Files 3381 3388 +7
Lines 209821 210439 +618
Branches 32805 32993 +188
============================================
+ Hits 135697 136353 +656
+ Misses 63230 63110 -120
- Partials 10894 10976 +82
    int[] dictIds = new int[numKeys];
    while (iterator.hasNext()) {
      wrapper.reverseCompositeId(iterator.next(), dictIds);
      valueBitmap.add(DictIdsWrapper.toCompositeString(wrapper._dictionaries, dictIds).hashCode());
`DictIdsWrapper.toCompositeString(wrapper._dictionaries, dictIds).hashCode()`
Taking `.hashCode()` here makes hash collisions possible, so the bitmap strategy is no longer exact for multi-key correlation.
We will have to either call this out as a limitation in the docs or find an alternative.
This is actually an existing limitation of the bitmap strategy, not something new from multi-key. The single-key path in convertToValueBitmap already uses .hashCode() for LONG, FLOAT, DOUBLE, and STRING types — only INT gets exact values stored directly. The multi-key path is consistent: toCompositeString itself is collision-free (length-prefix encoding is injective), but the .hashCode() mapping to 32-bit int has the same collision properties as single-key STRING at line 109.
I've updated the method Javadoc on convertCompositeToValueBitmap to call this out more explicitly, linking it to the existing single-key non-INT approximation.
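To make the distinction in the reply above concrete, here is a minimal sketch. It assumes a `"<length>:<value>"` format for the length-prefix encoding; `CompositeKeyDemo`, `encode` and `joinNaive` are hypothetical stand-ins, and the actual `toCompositeString` format in the PR may differ. It shows that the encoding step keeps distinct tuples distinct, while the subsequent `String.hashCode()` can still map two distinct encoded keys to the same 32-bit value.

```java
// Hypothetical stand-in for DictIdsWrapper.toCompositeString; the PR's real format may differ.
final class CompositeKeyDemo {
  // Length-prefix encoding: every part is written as "<length>:<value>",
  // so the boundaries between parts can always be recovered.
  static String encode(String... parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
      sb.append(part.length()).append(':').append(part);
    }
    return sb.toString();
  }

  // Naive delimiter join, shown only for contrast.
  static String joinNaive(String... parts) {
    return String.join("|", parts);
  }

  public static void main(String[] args) {
    // A delimiter join is ambiguous when values contain the delimiter.
    System.out.println(joinNaive("a|b", "c").equals(joinNaive("a", "b|c"))); // true
    // Length prefixes keep the two tuples distinct.
    System.out.println(encode("a|b", "c").equals(encode("a", "b|c"))); // false
    // The 32-bit hash can still collide for distinct encoded keys.
    System.out.println(encode("Aa").hashCode() == encode("BB").hashCode()); // true
  }
}
```

The last line relies on the well-known `"Aa"`/`"BB"` collision in `String.hashCode()`; both keys share the prefix `"2:"`, so the collision carries over to the encoded strings.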
    @Override
    void addMultiKey(UpdateSketch[] stepsSketches, int step, Dictionary[] dictionaries, int[] correlationDictIds) {
      stepsSketches[step].update(DictIdsWrapper.toCompositeString(dictionaries, correlationDictIds));
I think calling toCompositeString for each row will create a lot of new strings and add GC pressure.
The same applies in other places when the cardinality of distinct multi-key correlation values in a query is high.
Fair point: toCompositeString does allocate a new StringBuilder + String per row. A couple of things to note though:
- Theta sketch's `update()` only accepts primitives (`int`, `long`, `double`) or `String`/`byte[]`. Since a multi-key tuple has no single primitive representation, some form of serialization is unavoidable here.
- The single-key STRING path (line 75) already allocates a string per row via `dictionary.getStringValue()`, so the cost pattern is structurally similar, with slightly more overhead from the length-prefix encoding.
- JMH baseline for multi-key theta_sketch is 287 ops/s, which we can measure future optimizations against.
One option would be switching to update(byte[]) with a reusable ByteBuffer to avoid the String intermediate, but wanted to keep it simple for the initial implementation. Do you have other optimization ideas in mind?
Add method-level doc on convertCompositeToValueBitmap linking the multi-key .hashCode() usage to the existing single-key non-INT approximation in convertToValueBitmap.
Summary
Extends `FUNNEL_COUNT` to accept multiple columns in `CORRELATE_BY(col1, col2, ...)`,
enabling funnel analysis that tracks users through steps within a composite key
(e.g., per user per device category), not just a single dimension.
Design
Doc with example: https://docs.google.com/document/d/1gWQ7XBbJdQcUdZvBevFnGTVbCVJ3fN49biIsSOtRdhM/edit?tab=t.0
The single-key aggregation path is preserved as a zero-overhead fast path — structurally
identical to the original single-column implementation — so existing queries see no
regression. Multi-key support is added as a separate code path selected once per block.
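The "selected once per block" idea can be sketched as follows. This is a simplified illustration, not the PR's code: `FunnelStrategySketch`, `SetStrategySketch`, `aggregateBlock` and the array layout are assumptions made for the example, and only the method names `addSingleKey`/`addMultiKey` come from the PR. The key-count check happens once per block, and each branch then runs its own tight loop.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified shape; the real AggregationStrategy has different signatures.
abstract class FunnelStrategySketch {
  abstract void addSingleKey(int step, int dictId);

  abstract void addMultiKey(int step, int[] dictIds);

  // stepOfRow[i] is the funnel step matched by row i (or -1 for no match);
  // keyColumns[k][i] is the dictionary ID of correlation column k at row i.
  void aggregateBlock(int[] stepOfRow, int[][] keyColumns) {
    int numRows = stepOfRow.length;
    if (keyColumns.length == 1) {
      // Single-key loop: no per-row check on the number of keys.
      int[] only = keyColumns[0];
      for (int i = 0; i < numRows; i++) {
        if (stepOfRow[i] >= 0) {
          addSingleKey(stepOfRow[i], only[i]);
        }
      }
    } else {
      // Multi-key loop: one scratch array reused for every row in the block.
      int[] scratch = new int[keyColumns.length];
      for (int i = 0; i < numRows; i++) {
        if (stepOfRow[i] >= 0) {
          for (int k = 0; k < scratch.length; k++) {
            scratch[k] = keyColumns[k][i];
          }
          addMultiKey(stepOfRow[i], scratch);
        }
      }
    }
  }
}

// Toy set-based strategy that records the distinct keys seen at each step.
final class SetStrategySketch extends FunnelStrategySketch {
  final List<Set<Integer>> singleKeySteps = new ArrayList<>();
  final List<Set<List<Integer>>> multiKeySteps = new ArrayList<>();

  SetStrategySketch(int numSteps) {
    for (int s = 0; s < numSteps; s++) {
      singleKeySteps.add(new HashSet<>());
      multiKeySteps.add(new HashSet<>());
    }
  }

  @Override
  void addSingleKey(int step, int dictId) {
    singleKeySteps.get(step).add(dictId);
  }

  @Override
  void addMultiKey(int step, int[] dictIds) {
    // Copy the scratch array, since the caller reuses it for the next row.
    List<Integer> key = new ArrayList<>(dictIds.length);
    for (int id : dictIds) {
      key.add(id);
    }
    multiKeySteps.get(step).add(key);
  }
}
```

In this toy version the multi-key path boxes every key into a list, which is exactly the kind of overhead the single-key loop avoids by never entering that branch.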
- `AggregationStrategy`: Split into two abstract methods (`addSingleKey`/`addMultiKey`) with separate aggregation loops for single-key and multi-key, eliminating per-row branching on the dominant single-key path.
- `DictIdsWrapper`: Added composite-key mapping for multi-column CORRELATE_BY. Uses stride-based arithmetic when the product of dictionary sizes fits in `int`, falling back to a `HashMap<IntArrayList, Integer>` for large key spaces. Also adds `toCompositeString` for length-prefix encoded composite string keys used during result extraction.
- `SortedAggregationResult`: Updated to handle multi-key by tracking secondary keys via a `HashMap` within each primary-key group (data is sorted on the primary column only).
- `BitmapAggregationStrategy`, `SortedAggregationStrategy`, `ThetaSketchAggregationStrategy`: Implement both `addSingleKey` and `addMultiKey`.
- `SetResultExtractionStrategy`, `BitmapResultExtractionStrategy`: Updated to reverse-map composite IDs back to per-column dictionary values during result extraction.
- `FunnelCountSortedAggregationFunction`: Propagates multi-dictionary context through the sorted aggregation result extraction pipeline.
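For readers unfamiliar with stride-based composite IDs, the mapping described for `DictIdsWrapper` can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `CompositeIdMapperSketch` is a hypothetical class, it uses `List<Integer>` keys from the standard library where the PR uses `IntArrayList`, and it assumes every dictionary has at least one entry. The method names `reverseCompositeId` and `isHashMapPath` mirror those mentioned in the PR, but the signatures here are guesses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the composite-ID part of DictIdsWrapper.
final class CompositeIdMapperSketch {
  private final int[] strides; // null when the HashMap fallback is in use
  private final Map<List<Integer>, Integer> idsByKey = new HashMap<>();
  private final List<int[]> keysById = new ArrayList<>();

  CompositeIdMapperSketch(int[] dictSizes) {
    long product = 1;
    for (int size : dictSizes) {
      if (size <= 0) {
        throw new IllegalArgumentException("dictionary sizes must be positive");
      }
      product *= size;
      if (product > Integer.MAX_VALUE) {
        break; // stop early so the long product cannot overflow
      }
    }
    if (product <= Integer.MAX_VALUE) {
      // Row-major strides: the last column varies fastest.
      strides = new int[dictSizes.length];
      int stride = 1;
      for (int k = dictSizes.length - 1; k >= 0; k--) {
        strides[k] = stride;
        stride *= dictSizes[k];
      }
    } else {
      strides = null;
    }
  }

  boolean isHashMapPath() {
    return strides == null;
  }

  int toCompositeId(int[] dictIds) {
    if (strides != null) {
      int id = 0;
      for (int k = 0; k < dictIds.length; k++) {
        id += dictIds[k] * strides[k];
      }
      return id;
    }
    // Fallback: hand out sequential IDs in first-seen order.
    List<Integer> key = new ArrayList<>(dictIds.length);
    for (int d : dictIds) {
      key.add(d);
    }
    Integer existing = idsByKey.get(key);
    if (existing != null) {
      return existing;
    }
    int id = keysById.size();
    idsByKey.put(key, id);
    keysById.add(dictIds.clone());
    return id;
  }

  void reverseCompositeId(int compositeId, int[] out) {
    if (strides != null) {
      int remainder = compositeId;
      for (int k = 0; k < strides.length; k++) {
        out[k] = remainder / strides[k];
        remainder %= strides[k];
      }
      return;
    }
    System.arraycopy(keysById.get(compositeId), 0, out, 0, out.length);
  }

  public static void main(String[] args) {
    CompositeIdMapperSketch mapper = new CompositeIdMapperSketch(new int[] {3, 4});
    System.out.println(mapper.toCompositeId(new int[] {2, 3})); // 11
  }
}
```

With dictionary sizes `{3, 4}` the strides are `{4, 1}`, so the tuple `(2, 3)` maps to `2 * 4 + 3 = 11` and `reverseCompositeId(11, out)` recovers `(2, 3)`. With sizes such as `{100000, 100000}` the product exceeds `Integer.MAX_VALUE` and the sketch switches to the map-based path.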
Example Query
Test Plan