Hive partitioning is one of the most effective methods to improve the query performance on larger tables. The query with partition filtering will only load the data in the specified partitions (subdirectories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of partition key is always an important factor for performance. It should always be a low cardinal attribute to avoid many subdirectories overhead.
The following are some commonly used dimensions as partition keys:
1. Partitions by date and time – Use date and time, such as year, month, and day (even hours), as partition keys when data is associated with the time dimension
2. Partitions by locations – Use country, territory, state, and city as partition keys when data is location related
3. Partitions by business logics – Use department, sales region, applications, customers, and so on as partitioned keys when data can be separated evenly by some business logic
Similar to partitioning, a bucket table organizes data into separate files in the HDFS.Bucketing can speed up the data sampling in Hive with sampling on buckets. Bucketing can also improve the join performance if the join keys are also bucket keys because bucketing ensures that the key is present in a certain bucket.
Different from partition, the bucket corresponds to segments of files in HDFS. For example, lets say for X table we are using datestamp as the toplevel partition. If there is a further request to use the neid as the second level of partition, it leads to many deep and small partitions and directories. So we can bucket the X table using neid as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same neid will always be stored in the same bucket that is segment of files. By using buckets, Hive can easily and efficiently do sampling and map side joins as the data belonging to the same key will be available in the same file.
Below is an example of creating table with buckets
CREATE EXTERNAL TABLE baseline_table( index string, vdr_id string, sys_vdr_id string, ver_num string, supplier_id string, neid string, portid string) CLUSTERED BY (neid) INTO 22 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 'hdfs://nameservice/user/queuename/hive_warehouse/baseline_table';
The index in Hive provides key-based data view and better data access for certain operations, such as WHERE, GROUP BY, and JOIN. We can use index is a cheaper alternative than full table scans. The command to create an index in Hive is
CREATE INDEX neid_index ON TABLE Test_Table_Index (Element_ID) AS 'COMPACT' WITH DEFERRED REBUILD;
In addition to the COMPACT keyword used in the preceding example, Hive also supports BITMAP indexes with less different values
CREATE INDEX neid_index ON TABLE Test_Table_Index (Element_ID) AS 'BITMAP' WITH DEFERRED REBUILD;
The WITH DEFERRED REBUILD keyword in the preceding example prevents the index from immediately being built. To build the index, we can issue ALTER REBUILD . When data in the base table changes, the ALTER…REBUILD command must be used to bring the index up to date. This is an atomic operation, so if the index rebuilt on a table that has been previously indexed failed, the state of index remains the same.
ALTER INDEX neid_index ON Test_Table_Index REBUILD
Once the index is built, Hive will create a new index table for each index.
SHOW INDEX ON Test_Table_Index;
Will display the index details with the name . For the above index created the name is queuename__test_table_index_neid_index__.
Then, this index table can be used where we need to query the indexed columns like a regular table.
select * from queuename__test_table_index_neid_index__;
To drop an index, we can use the DROP INDEX index_name ON table_name statement as follows. However, we cannot drop the index table with a DROP
DROP INDEX neid_index ON Test_Table_Index;
Other optimization that can be considered are
1. Tez is an application framework built on Yarn that can execute complex directed acyclic graphs (DAGs) for general data-processing tasks. Tez further splits map and reduce jobs into smaller tasks and combines them in a flexible and efficient way for execution. Tez is considered a flexible and powerful successor to the MapReduce framework. To configure Hive to use Tez, we need
to overwrite the following settings from the default MapReduce.
2. Vectorization optimization processes a larger batch of data at the same time rather than one row at a time,thus significantly reducing computing overhead. Each batch consists of a column vector that is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. Files must be stored in the Optimized Row Columnar (ORC) format in order to use vectorization.