Hive tutorial 7 – Hive performance tuning explain and analyze utilities

Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command will help to see the difference between two or more queries for the same purpose.

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

EXTENDED – This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.

DEPENDENCY – This provides a JSON format output that contains a list of tables and partitions that the query depends on.

AUTHORIZATION – This lists all entities needed to be authorized including input and output to run the Hive query and authorization failures, if any.

A typical query plan contains the following three sections

Abstract syntax tree (AST) – Hive uses a pacer generator called ANTLR to automatically generate a tree of syntax for HQL. We can usually ignore this most of the time.

Stage dependencies – This lists all dependencies and number of stages used to run the query.

Stage plans – It contains important information, such as operators and sort orders, for running the job.

The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE PLANS section, Stage-1 has one map and reduce referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords as well as expressions and aggregations are listed.

EXPLAIN select distinct(element_id) from portmaps_table;


Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: portmaps_table
Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: element_id (type: string)
outputColumnNames: element_id
Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: element_id (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 4 Data size: 421 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 421 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
The ANALYZE statement

Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, on the objects in the Hive database. Statistics is a metadata of Hive data. Hive supports statistics at the table, partition, and column
level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer to pick the query plan with the lowest cost in terms of system resources required to complete the query. Once the statistics are built, we can check the statistics by the DESCRIBE EXTENDED/FORMATTED statement.


ANALYZE TABLE portmaps_table PARTITION(datestamp='2017-03-02') COMPUTE STATISTICS;

DESCRIBE EXTENDED portmaps_table PARTITION(datestamp='2017-03-02') ;

Partition(values:[2017-03-02], dbName:test, tableName:portmaps_table, createTime:1488440808, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:element_id, type:string, comment:null), FieldSchema(name:sub_element_id, type:string, comment:null), FieldSchema(name:service_id, type:string, comment:null), FieldSchema(name:update_date, type:string, comment:null), FieldSchema(name:datestamp, type:string, comment:null)], location:hdfs://nameservice/user/test/hive_warehouse/PortMaps_Table/datestamp=2017-03-02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{orc.compress=ZLIB, EXTERNAL=TRUE, numFiles=1, serialization.format=1, transient_lastDdlTime=1502351855, COLUMN_STATS_ACCURATE=true, totalSize=842, numRows=4, rawDataSize=0})

DESCRIBE FORMATTED portmaps_table PARTITION(datestamp=’2017-03-02′) ;

Which gives a formatted output.

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting

SET hive.stats.autogather=true;