Hive tutorial 4 – hive ordering, sorting data and hive transactions

ORDER AND SORT

ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs the global sort using only one reducer, so it takes a longer time to return the result. Usage with LIMIT is strongly recommended for ORDER BY. When hive.mapred.mode = strict (by default, hive.mapred.mode = nonstrict) is set and we do not specify LIMIT, there are exceptions.


SELECT name FROM service_table_external ORDER BY portid DESC;

SORT BY (ASC|DESC): This indicates which columns to sort when ordering the reducer input records. This means it completes sorting before sending data to the reducer. The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer unless we set mapred.reduce.tasks=1. In this case, it is equal to the result of ORDER BY.


SELECT name FROM service_table_external SORT BY portid DESC;

DISTRIBUTE BY – Rows with matching column values will be partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When using with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement. And, the column used to distribute must appear in the select column list.


SELECT port_id FROM service_table_external DISTRIBUTE BY device_id;

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:44 Invalid table alias or column reference ‘port_id’: (possible column names are: device_id) (state=42000,code=10004)

The below query will work with no issues


SELECT port_id,device_id FROM service_table_external DISTRIBUTE BY device_id;

Used with SORT BY


SELECT port_id,device_id FROM service_table_external DISTRIBUTE BY device_id sort by port_id;

CLUSTER BY – This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the same group of columns. And, it is sorted locally in each reducer. The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY operation is sorted in each distributed group. To fully utilize all the available reducers when doing a global sort, we can do CLUSTER BY first and then ORDER BY.


SELECT port_id,device_id FROM service_table_external CLUSTER BY device_id;

Common functions

FROM_UNIXTIME(UNIX_TIMESTAMP()) – performs the same function as SYSDATE in Oracle. It dynamically returns the current date-time in the Hive server

SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time FROM service_table_external LIMIT 1;

UNIX_TIMESTAMP() – statement can be used to compare two dates or can be used after ORDER BY to properly order the different string types of a date value, such as ORDER BY UNIX_TIMESTAMP(string_date, ‘dd-MM-yyyy’)

SELECT (UNIX_TIMESTAMP (‘2017-08-07 18:00:00’) – UNIX_TIMESTAMP(‘2017-09-10 11:00:00’))/60/60/24 AS daydiff FROM service_table_external LIMIT 1;

TO_DATE – Removes the hours, minutes, and seconds from a date.

REVERSE – used to reverse the order of each letter in a string

SPLIT – used to tokenize the string using a specified tokenizer

Virtual columns – special function type of columns in Hive. Right now, Hive offers two virtual columns: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. The INPUT__FILE__NAME function is the input file’s name for a mapper task. The BLOCK__OFFSET__INSIDE__FILE function is the current global file position or current block’s file offset if the file is compressed.

isnull – Functions to check for null values.

isnotnull – Functions to check for non null values.

assert_true – assert_true, throw an exception if ‘condition’ is not true.

SELECT assert_true(port_id IS NULL) FROM service;

elt(n, str1, str2, …) – returns the n-th string

SELECT elt(2,’New York’,’Montreal’,’Toronto’) FROM employee LIMIT 1 returns Montreal

SELECT current_database(); – Return the name of current_database

Transactions

Since Hive version 0.13.0, Hive fully supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) to Hive. For now, all the transactions are autocommuted and only support data in the Optimized Row Columnar (ORC) file (available since Hive 0.11.0) format and in bucketed tables.


The following configuration parameters must be set appropriately to turn on transaction support in Hive:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

The SHOW TRANSACTIONS command is added since Hive 0.13.0 to show currently open and aborted transactions in the system:

SHOW TRANSACTIONS;

Since Hive 0.14.0, the INSERT VALUE, UPDATE, and DELETE commands are added to operate rows with the following syntax.


INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2]...)] VALUES values_row [, values_row …];

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

DELETE FROM tablename [WHERE expression]