Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -3049,17 +3049,42 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,

if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
final RowSchema schema = lop.getSchema();
final List<ColumnInfo> signature = lop.getSchema().getSignature();
final int numSelColumns = lop.getConf().getNumSelColumns();

// Split schemas using subList
RowSchema selectSchema = new RowSchema(new ArrayList<>(signature.subList(0, numSelColumns)));
RowSchema udtfSchema = new RowSchema(new ArrayList<>(signature.subList(numSelColumns, signature.size())));

// Filter expression maps to avoid cross-contamination in getColStatisticsFromExprMap
Map<String, ExprNodeDesc> selectExprMap = Maps.newHashMapWithExpectedSize(numSelColumns);
Map<String, ExprNodeDesc> udtfExprMap = Maps.newHashMapWithExpectedSize(signature.size() - numSelColumns);
for (int i = 0; i < signature.size(); i++) {
String name = signature.get(i).getInternalName();
ExprNodeDesc expr = columnExprMap.get(name);

if (expr == null) {
continue;
}

if (i < numSelColumns) {
selectExprMap.put(name, expr);
} else {
udtfExprMap.put(name, expr);
}
}

// Select branch stats
joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
final List<ColStatistics> selectColStats = StatsUtils
.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
.getColStatisticsFromExprMap(conf, selectStats, selectExprMap, selectSchema);
StatsUtils.scaleColStatistics(selectColStats, factor);
joinedStats.addToColumnStats(selectColStats);

// UDTF branch stats
joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
final List<ColStatistics> udtfColStats = StatsUtils
.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
.getColStatisticsFromExprMap(conf, udtfStats, udtfExprMap, udtfSchema);
joinedStats.addToColumnStats(udtfColStats);
}

Expand Down

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions ql/src/test/queries/clientpositive/lvj_stats_isolation.q
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
create table lvj_stats (id string, f1 string);

insert into lvj_stats values
('a','v1'), ('a','v2'), ('a','v3'),
('b','v4'), ('b','v5'), ('b','v6');

analyze table lvj_stats compute statistics;
analyze table lvj_stats compute statistics for columns;

-- Test that LV columns' stats no longer inflate SELECT columns' sizes
explain
select id, f1, count(*)
from (select id, f1 from lvj_stats group by id, f1) sub
lateral view posexplode(array(f1, f1)) t1 as pos1, val1
group by id, f1;

select id, f1, count(*)
from (select id, f1 from lvj_stats group by id, f1) sub
lateral view posexplode(array(f1, f1)) t1 as pos1, val1
group by id, f1;

-- Test that LV columns' stats no longer override NDV of a base column
alter table lvj_stats update statistics for column id set('numDVs'='0','numNulls'='0');

explain
select id, count(*)
from (select id, f1 from lvj_stats group by id, f1) sub
lateral view posexplode(array(f1, f1)) t1 as pos1, val1
group by id;

select id, count(*)
from (select id, f1 from lvj_stats group by id, f1) sub
lateral view posexplode(array(f1, f1)) t1 as pos1, val1
group by id;

Large diffs are not rendered by default.

360 changes: 360 additions & 0 deletions ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

Large diffs are not rendered by default.

22 changes: 11 additions & 11 deletions ql/src/test/results/clientpositive/llap/union26.q.out
Original file line number Diff line number Diff line change
Expand Up @@ -129,20 +129,20 @@ STAGE PLANS:
Select Operator
expressions: _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a typical example of LV column stats impacting the data size estimations of SELECT columns:

` Column Naming

Context Column Name Represents avgColLen
LVJ output schema _col0 SELECT's key 2.812
LVJ output schema _col1 SELECT's value 6.812
LVJ output schema _col8 UDTF's exploded element
UDTF internal stats _col0 array expression input 56.0

The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression — colliding with SELECT's _col0.


Processing Comparison

Step Original Code Proposed Fix
Expression Map Shared: {_col0, _col1, _col8} Split: SELECT {_col0, _col1}, UDTF {_col8}
Schema Full: [_col0, _col1, _col8] Split by numSelColumns
UDTF lookup for _col0 Looks up _col0 in udtfStats → finds array's _col0 (56.0) _col0 not in udtfExprMap → skipped
UDTF lookup for _col8 _col8 → Column[col], not found in udtfStats _col8 → Column[col], not found in udtfStats
Merge _col0 MAX(2.812, 56.0) = 56.0 No collision → 2.812

Final Column Statistics

Column Original Code Proposed Fix
_col0 avgColLen 56.0 ✗ 2.812 ✓
_col1 avgColLen 6.812 6.812
Per-row total 62.812 bytes 9.624 bytes

Data Size — LVJ Debug Output (500 rows)

Original Code Proposed Fix
Calculation 62.812 × 500 9.624 × 500
Total 31,406 bytes 4,812 bytes

Data Size — EXPLAIN Output (500 rows)

Column Original Code Proposed Fix
key avgColLen 140 ✗ 87 ✓
value avgColLen 91 91
Per-row total 231 bytes 178 bytes
Original Code Proposed Fix
Calculation 231 × 500 178 × 500
Total 115,500 bytes 89,000 bytes

Group By Operator
aggregations: count()
keys: _col0 (type: string), _col1 (type: string)
minReductionHashAggr: 0.4
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col2 (type: bigint)
Select Operator
expressions: array(1,2,3) (type: array<int>)
Expand All @@ -157,20 +157,20 @@ STAGE PLANS:
Select Operator
expressions: _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count()
keys: _col0 (type: string), _col1 (type: string)
minReductionHashAggr: 0.4
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col2 (type: bigint)
Execution mode: llap
LLAP IO: all inputs
Expand All @@ -191,13 +191,13 @@ STAGE PLANS:
minReductionHashAggr: 0.4
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col2 (type: bigint)
Reducer 4
Execution mode: vectorized, llap
Expand All @@ -207,14 +207,14 @@ STAGE PLANS:
keys: KEY._col0 (type: string), KEY._col1 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col2 (type: bigint), _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 645 Data size: 154155 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 645 Data size: 119970 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Expand Down
16 changes: 8 additions & 8 deletions ql/src/test/results/clientpositive/llap/union_lateralview.q.out
Original file line number Diff line number Diff line change
Expand Up @@ -90,13 +90,13 @@ STAGE PLANS:
Select Operator
expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col1 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: int), _col2 (type: string)
Select Operator
expressions: Const array<int> [1, 2, 3] (type: array<int>)
Expand All @@ -111,13 +111,13 @@ STAGE PLANS:
Select Operator
expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col1 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: int), _col2 (type: string)
Execution mode: llap
LLAP IO: all inputs
Expand All @@ -143,13 +143,13 @@ STAGE PLANS:
Select Operator
expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col1 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: int), _col2 (type: string)
Select Operator
expressions: Const array<int> [1, 2, 3] (type: array<int>)
Expand All @@ -164,13 +164,13 @@ STAGE PLANS:
Select Operator
expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col1 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1000 Data size: 231000 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1000 Data size: 178000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: int), _col2 (type: string)
Execution mode: llap
LLAP IO: all inputs
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5579,7 +5579,7 @@ STAGE PLANS:
className: VectorSelectOperator
native: true
projectedOutputColumnNums: [0, 3, 1, 2]
Statistics: Num rows: 26 Data size: 6890 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 26 Data size: 5798 Basic stats: COMPLETE Column stats: COMPLETE
PTF Operator
Function definitions:
Input definition
Expand Down Expand Up @@ -5613,21 +5613,21 @@ STAGE PLANS:
outputTypes: [bigint, string, string, int, int]
partitionExpressions: [col 0:string]
streamingColumns: []
Statistics: Num rows: 26 Data size: 6890 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 26 Data size: 5798 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col4 (type: int), _col2 (type: int), sum_window_0 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Select Vectorization:
className: VectorSelectOperator
native: true
projectedOutputColumnNums: [0, 3, 2, 1, 4]
Statistics: Num rows: 26 Data size: 7098 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 26 Data size: 6006 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
File Sink Vectorization:
className: VectorFileSinkOperator
native: false
Statistics: Num rows: 26 Data size: 7098 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 26 Data size: 6006 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Expand Down
8 changes: 4 additions & 4 deletions ql/src/test/results/clientpositive/nonmr_fetch.q.out
Original file line number Diff line number Diff line change
Expand Up @@ -1081,11 +1081,11 @@ STAGE PLANS:
Statistics: Num rows: 500 Data size: 1003500 Basic stats: COMPLETE Column stats: COMPLETE
Limit
Number of rows: 20
Statistics: Num rows: 20 Data size: 40080 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20 Data size: 1740 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 20 Data size: 40080 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20 Data size: 1740 Basic stats: COMPLETE Column stats: COMPLETE
ListSink
Select Operator
expressions: array(_col0,_col1) (type: array<string>)
Expand All @@ -1099,11 +1099,11 @@ STAGE PLANS:
Statistics: Num rows: 500 Data size: 1003500 Basic stats: COMPLETE Column stats: COMPLETE
Limit
Number of rows: 20
Statistics: Num rows: 20 Data size: 40080 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20 Data size: 1740 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 20 Data size: 40080 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20 Data size: 1740 Basic stats: COMPLETE Column stats: COMPLETE
ListSink

PREHOOK: query: select key,X from srcpart lateral view explode(array(key,value)) L as x where (ds='2008-04-08' AND hr='11') limit 20
Expand Down
Loading