A production query, simplified, looks like this (the WHERE and GROUP BY clauses, truncated in the original text, are restored from the abstract syntax tree below):

select dt,
       count(distinct c1),
       count(distinct case when c2 > 0 and c1 = 0 then c1 end),
       count(distinct case when c2 > 0 and c1 > 0 then c1 end)
  from t
 where dt between '20131108' and '20131110'
 group by dt
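Running EXPLAIN on this query produces the following output: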
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME t))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COL dt))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t
TableScan
alias: t
Filter Operator
predicate:
expr: dt BETWEEN '20131108' AND '20131110'
type: boolean
// The Select Operator projects out the columns needed downstream (dt, c1, c2).
Select Operator
expressions:
expr: dt
type: string
expr: c1
type: int
expr: c2
type: int
outputColumnNames: dt, c1, c2
// Simple pre-aggregation on the MAP side. Pitfall 1: with N DISTINCT expressions and M rows handled by a map task, this step emits roughly N*M rows, because the map aggregates separately on (dt, keys[i]) for each distinct expression; here N = 3, so the map output triples. Filter out as much data as possible on the map side.
Group By Operator
aggregations:
expr: count(DISTINCT c1)
expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END)
expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END)
bucketGroup: false
keys:
expr: dt
type: string
expr: c1
type: int
expr: CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END
type: int
expr: CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END
type: int
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
// Pitfall 2: the Reduce Sink hashes rows to reducers by the partition columns, which here is only _col0 (dt). For a table partitioned by date, all of one day's data, already amplified N times by Pitfall 1, is shipped to a single reducer, causing long-tail stragglers or an OutOfMemoryError (see the rewrite sketched after this plan).
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: int
expr: _col2
type: int
expr: _col3
type: int
sort order: ++++
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col4
type: bigint
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col1:0._col0)
expr: count(DISTINCT KEY._col1:1._col0)
expr: count(DISTINCT KEY._col1:2._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
expr: _col2
type: bigint
expr: _col3
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
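Both pitfalls come from computing all three DISTINCTs in one MapReduce job keyed only on dt. The usual fix is to split the query so that each DISTINCT runs in its own job, pushing the CASE WHEN conditions down into WHERE clauses so the map side filters early. The sketch below assumes that standard pattern; it is not necessarily the author's exact rewrite, and the aliases cnt_all, cnt_zero, and cnt_pos are invented for illustration:

-- hypothetical rewrite: one sub-query (and one MR job) per DISTINCT,
-- joined back together on dt
select a.dt, a.cnt_all, b.cnt_zero, c.cnt_pos
  from (select dt, count(distinct c1) as cnt_all
          from t
         where dt between '20131108' and '20131110'
         group by dt) a
  join (select dt, count(distinct c1) as cnt_zero
          from t
         where dt between '20131108' and '20131110'
           and c2 > 0 and c1 = 0   -- CASE WHEN condition pushed into WHERE
         group by dt) b on a.dt = b.dt
  join (select dt, count(distinct c1) as cnt_pos
          from t
         where dt between '20131108' and '20131110'
           and c2 > 0 and c1 > 0   -- CASE WHEN condition pushed into WHERE
         group by dt) c on a.dt = c.dt;

Each sub-query now carries a single DISTINCT, so the map output is no longer multiplied by N, and the pre-filtered sub-queries ship far fewer rows to the reducers. A production version would likely use LEFT OUTER JOINs from a, since a day on which a condition matches no rows would otherwise drop out of the result instead of reporting a count of 0.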
Looking at the rewritten query's execution plan (non-essential parts omitted):
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage
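Compared with the single-job plan above, the rewritten query has three independent root stages (Stage-1, Stage-3, Stage-4), presumably one per DISTINCT aggregation, whose per-dt results Stage-2 then combines. The extra table scans are the price paid for spreading the aggregation work across jobs instead of funneling N copies of each day's data through a single reducer.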
This post is from the blog "MIKE老毕的WIKI"; please retain this attribution when reposting.