Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command will help to see the difference between two or more queries for the same purpose.
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query
EXTENDED – This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
DEPENDENCY – This provides a JSON format output that contains a list of tables and partitions that the query depends on.
AUTHORIZATION – This lists all entities needed to be authorized including input and output to run the Hive query and authorization failures, if any.
A typical query plan contains the following three sections
Abstract syntax tree (AST) – Hive uses a pacer generator called ANTLR to automatically generate a tree of syntax for HQL. We can usually ignore this most of the time.
Stage dependencies – This lists all dependencies and number of stages used to run the query.
Stage plans – It contains important information, such as operators and sort orders, for running the job.
The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE PLANS section, Stage-1 has one map and reduce referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords as well as expressions and aggregations are listed.
EXPLAIN select distinct(element_id) from portmaps_table;
Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: portmaps_table Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: element_id (type: string) outputColumnNames: element_id Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: element_id (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 8 Data size: 842 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 4 Data size: 421 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 4 Data size: 421 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
The ANALYZE statement
Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, on the objects in the Hive database. Statistics is a metadata of Hive data. Hive supports statistics at the table, partition, and column
level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer to pick the query plan with the lowest cost in terms of system resources required to complete the query. Once the statistics are built, we can check the statistics by the DESCRIBE EXTENDED/FORMATTED statement.
ANALYZE TABLE portmaps_table PARTITION(datestamp='2017-03-02') COMPUTE STATISTICS; DESCRIBE EXTENDED portmaps_table PARTITION(datestamp='2017-03-02') ; Partition(values:[2017-03-02], dbName:test, tableName:portmaps_table, createTime:1488440808, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:element_id, type:string, comment:null), FieldSchema(name:sub_element_id, type:string, comment:null), FieldSchema(name:service_id, type:string, comment:null), FieldSchema(name:update_date, type:string, comment:null), FieldSchema(name:datestamp, type:string, comment:null)], location:hdfs://nameservice/user/test/hive_warehouse/PortMaps_Table/datestamp=2017-03-02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{orc.compress=ZLIB, EXTERNAL=TRUE, numFiles=1, serialization.format=1, transient_lastDdlTime=1502351855, COLUMN_STATS_ACCURATE=true, totalSize=842, numRows=4, rawDataSize=0})
DESCRIBE FORMATTED portmaps_table PARTITION(datestamp=’2017-03-02′) ;
Which gives a formatted output.
Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting
SET hive.stats.autogather=true;