Hive tutorial 5 – Hive Data Aggregation GROUP BY, CASE, COALESCE, Distinct, Grouping SETS, ROLLUP, CUBE, HAVING

Hive offers several built-in aggregate functions, such as MAX, MIN, AVG, and so on. Hive also supports advanced aggregation by using GROUPING SETS, ROLLUP, CUBE, analytic functions, and windowing.

The Hive basic built-in aggregate functions are usually used with the GROUP BY clause. If there is no GROUP BY clause specified, it aggregates over the whole table by default. Besides aggregate functions, all other columns that are selected must also be included in the GROUP BY clause.

Aggregation without GROUP BY columns

SELECT count(*) AS row_cnt FROM service_table;

Aggregation with GROUP BY columns

SELECT count(*) AS row_cnt,service_type FROM service_table GROUP BY service_type;

The below query will result in a error as the port_id column in the select statement is not added in the group by clause.

SELECT count(*) AS row_cnt,service_type,port_id FROM service_table GROUP BY service_type;

One workaround for the above problem is to use the collect_set function, which returns a set of objects with duplicate elements eliminated as follows which will result in a output which will include a randomly selected port id .

SELECT count(*),service_type,collect_set(port_id)[0] AS random_port FROM service_table GROUP BY service_type;

The aggregate function can be used with other aggregate functions in the same select statement.

SELECT count(*) AS row_cnt,service_type,AVG(bandwidth_mb) FROM service_table GROUP BY service_type;
Aggregation with case

A CASE expression returns a value from the THEN portion of the clause.

SELECT sum(CASE WHEN service_type = 'FTH' THEN speed ELSE 0 END)/
count(CASE WHEN service_type = 'FTH' THEN 1
ELSE NULL END) AS average_speed FROM service_type;

Aggregation with COALESCE

COALESCE (T v1, T v2, …) Return the first parameter that is not NULL, or NULL if all parameters are NULL.

For example

If customer primary contact medium is email, if email is null then phone number, and if phone number is also null then address. It would be written using COALESCE as

coalesce(email,phone number,address)

SELECT sum(coalesce(service_type,0)) AS total_service,sum(if(service_type = 'FTH',1,0)) AS service_fth FROM service_table;

Nested aggregate functions are not supported in hive and below query will result in error

SELECT avg(count(*)) AS row_cnt,service_type FROM service_table GROUP BY service_type;
Distinct

Aggregate functions can also be used with the DISTINCT keyword to do aggregation on unique values.

select count(distinct service_type) as distinct_service_type from service_table;

When we use COUNT and DISTINCT together, Hive always ignores the setting such as mapred.reduce.tasks = 20 for the number of reducers used and uses only one reducer. In this case, the single reducer becomes the bottleneck when processing big volumes of data. The
workaround is to use the subquery as follows.

Below query triggers single reducer during the whole processing

select count(distinct service_type) as distinct_service_type from service_table;

Below subquery to select unique value before aggregations for better performance

SELECT count(*) AS distinct_service_type FROM (SELECT distinct service_type FROM service_table) a;

In this case, the first stage of the query implementing DISTINCT can use more than one reducer. In the second stage, the mapper will have less output just for the COUNT purpose since the data is already unique after implementing DISTINCT. As a result, the reducer will not be overloaded.

We may encounter a very special behavior when Hive deals with aggregation across columns with a NULL value. The entire row (if one column has NULL as a value in the row) will be ignored in the second row of the following example. To avoid this, we can use COALESCE to assign a default value when the column value is NULL. Lets take an example consider the below value table


val1 val2
10       20
30       40
null     50

SELECT sum(val1), sum(val1+val2) FROM value;

The above query will output 40,100 as (NULL, 50) is ignored when doing sum(val1+val2). This can overcome by using coalesce

SELECT sum(coalesce(val1,0)), sum(coalesce(val1,0)+val2) FROM value;

Now it outputs 40,150.

The hive.map.aggr property controls aggregations in the map task. The default value for this setting is false. If it is set to true, Hive will do the first-level aggregation directly in the map task for better performance, but consume more memory.

SET hive.map.aggr=true;

Grouping SETS

Hive has offered the GROUPING SETS keywords to implement advanced multiple GROUP BY operations against the same set of data. Actually, GROUPING SETS is a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processes in one stage of jobs, which is more efficient than GROUP BY and UNION ALL having multiple stages. The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set. All GROUPING SET clauses can be logically expressed in terms of several GROUP BY queries connected by UNION.

Lets take an example . The below query will be executed in two stages

SELECT service_id,NULL as datestamp
FROM daily_baseline group by service_id
UNION ALL
SELECT NULL as service_id,datestamp
FROM daily_baseline group by datestamp;

The same can be achieved in one stage as below

SELECT service_id,datestamp
FROM daily_baseline group by service_id,datestamp grouping sets(service_id,datestamp);

With Groupset

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )

With Group by

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

With Groupset

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)

With Group by

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

With Groupset

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )

With Group by

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b
UNION
SELECT null, null, SUM( c ) FROM tab1
Grouping__ID function

When aggregates are displayed for a column its value is null. This may conflict in case the column itself has some null values. There needs to be some way to identify NULL in column, which means aggregate and NULL in column, which means value. GROUPING__ID function is the solution to that. This function returns a bitvector corresponding to whether each column is present or not. For each column, a value of “1” is produced for a row in the result set if that column has been aggregated in that row, otherwise the value is “0”. This can be used to differentiate when there are nulls in the data.

ROLLUP

The ROLLUP statement enables a SELECT statement to calculate multiple levels of aggregations across a specified group of dimensions.The ROLLUP statement is a simple extension to the GROUP BY clause with high efficiency and minimal overhead to a query. Compared to GROUPING SETS that creates specified levels of aggregations, ROLLUP creates n+1 levels of aggregations, where n is the number of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates higher-level subtotals, moving from right to left through the list of combinations of grouping columns, as shown in the following example.

GROUP BY a,b,c WITH ROLLUP

This is equivalent to the following

GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())

The same with Group by

SELECT a, b, c FROM tab1 GROUP BY a, b,c
UNION
SELECT a, b,null FROM tab1 GROUP BY a,b,null
UNION
SELECT a, null, null FROM tab1 GROUP BY a,null,null
UNION
SELECT null, null, null FROM tab1
CUBE

The CUBE statement takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, there will be 2 to the power n combinations of aggregations returned, as shown in the following example GROUP BY a,b,c WITH CUBE

GROUP BY a,b,c WITH CUBE

This is equivalent to the following:

GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())

HAVING

HAVING is added to support the conditional filtering of GROUP BY results. By using HAVING, we can avoid using a subquery after GROUP BY.

select device_id,sub_element_id,service_type from processed_table group by device_id,sub_element_id,service_type HAVING count(*)>1;

If we do not use HAVING, we can use a subquery for instance as follows

select a.device_id,a.sub_element_id,a.service_type from (select device_id,sub_element_id,service_type from processed_table group by device_id,sub_element_id,service_type) a where count(*)>1;