Hive tutorial 6 – Analytic functions RANK, DENSE_RANK, ROW_NUMBER, CUME_DIST, PERCENT_RANK, NTILE, LEAD, LAG, FIRST_VALUE, LAST_VALUE and Sampling

Analytic functions are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification.

Standard aggregations – COUNT(), SUM(), MIN(), MAX(), and AVG() are the standard aggregation functions; with an OVER clause they can also be used analytically, computing the aggregate per window while preserving the individual rows.
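Hive is not required to experiment with these semantics: SQLite (version 3.25 or later, bundled with recent Python builds) implements the same standard SQL window functions. A minimal sketch with made-up table and column names:

```python
import sqlite3

# Window-function support needs SQLite >= 3.25.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")  # illustrative names
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 10), ("b", 20)])

# SUM() used analytically: a per-group running total is computed,
# but every input row is preserved (unlike GROUP BY).
rows = con.execute(
    "SELECT grp, val, SUM(val) OVER (PARTITION BY grp ORDER BY val) "
    "FROM t ORDER BY grp, val").fetchall()
print(rows)  # [('a', 1, 1), ('a', 2, 3), ('b', 10, 10), ('b', 20, 30)]
```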

RANK – It assigns a rank to each row within a group; rows with equal values receive the same rank, and the subsequent rank is skipped. A common use is finding the top N rows for specific conditions.

SELECT a.service_id, a.network_device_id,
       a.device_sub_element_id, a.timestamp
FROM (
    SELECT s.service_id, h.network_device_id, h.device_sub_element_id,
           h.timestamp,
           RANK() OVER (PARTITION BY s.service_id
                        ORDER BY h.timestamp DESC) AS rank
    FROM hlog_processed_table_name h
    INNER JOIN service_table_name s
        ON h.network_device_id = s.element_name
        AND h.device_sub_element_id = s.subelement_name
    WHERE h.datestamp = 'date' AND s.datestamp = 'date'
) a
WHERE a.rank = 1;

DENSE_RANK – It is similar to RANK, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using DENSE_RANK and had two players tie for second place, we would see that the two players were in second place and that the next person is ranked as third. However, the RANK function would also rank two people in second place, but the next person would be in fourth place.

SELECT a.service_id, a.network_device_id,
       a.device_sub_element_id, a.timestamp
FROM (
    SELECT s.service_id, h.network_device_id, h.device_sub_element_id,
           h.timestamp,
           DENSE_RANK() OVER (PARTITION BY s.service_id
                              ORDER BY h.timestamp DESC) AS rank
    FROM hlog_processed_table_name h
    INNER JOIN service_table_name s
        ON h.network_device_id = s.element_name
        AND h.device_sub_element_id = s.subelement_name
    WHERE h.datestamp = 'date' AND s.datestamp = 'date'
) a
WHERE a.rank = 1;
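The tie behavior described above can be checked locally with SQLite (3.25+), which follows the same standard. The player scores here are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (player TEXT, score INTEGER)")  # made-up data
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("p1", 30), ("p2", 20), ("p3", 20), ("p4", 10)])

# p2 and p3 tie for second place: RANK leaves a gap (next rank is 4),
# DENSE_RANK does not (next rank is 3).
rows = con.execute("""
    SELECT player,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
    FROM scores ORDER BY score DESC, player
""").fetchall()
print(rows)  # [('p1', 1, 1), ('p2', 2, 2), ('p3', 2, 2), ('p4', 4, 3)]
```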

ROW_NUMBER – It assigns a unique sequence number starting from 1 to each row according to the partition and order specification. ROW_NUMBER() is a windowing function so it needs to be used in conjunction with an OVER clause.

SELECT ROW_NUMBER() OVER (PARTITION BY sub_element_id), *
FROM service_table;
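A self-contained sketch of per-partition numbering (SQLite 3.25+, invented table and data); adding an ORDER BY inside the OVER clause makes the assignment deterministic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")  # illustrative names
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 5), ("a", 3), ("b", 9), ("b", 7)])

# Numbering restarts at 1 for every partition.
rows = con.execute(
    "SELECT grp, val, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) "
    "FROM t ORDER BY grp, val").fetchall()
print(rows)  # [('a', 3, 1), ('a', 5, 2), ('b', 7, 1), ('b', 9, 2)]
```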

CUME_DIST – It computes, for each row, the number of rows whose value is smaller than or equal to the current row's value, divided by the total number of rows.

CUME_DIST = (number of rows with value <= current row's value) / (total number of rows)

SELECT CUME_DIST() OVER (ORDER BY sub_element_id), *
FROM service_table;

If we have four records with distinct values in sub_element_id, the results will be 0.25, 0.5, 0.75, and 1. If sub_element_id is the same in all records, the result will be 1 for every record.
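Both cases can be verified with standard SQL window functions in SQLite (3.25+), using invented values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (sub_element_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])

# Four distinct values: the cumulative distribution is i / 4 for the i-th row.
distinct = [r[0] for r in con.execute(
    "SELECT CUME_DIST() OVER (ORDER BY sub_element_id) "
    "FROM t ORDER BY sub_element_id")]
print(distinct)  # [0.25, 0.5, 0.75, 1.0]

# Four equal values: every row has all 4 rows <= it, so 4 / 4 = 1.0 each.
con.execute("DELETE FROM t")
con.executemany("INSERT INTO t VALUES (?)", [(7,)] * 4)
equal = [r[0] for r in con.execute(
    "SELECT CUME_DIST() OVER (ORDER BY sub_element_id) FROM t")]
print(equal)  # [1.0, 1.0, 1.0, 1.0]
```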

PERCENT_RANK: It is similar to CUME_DIST, but it uses rank values rather than row counts. Therefore, it returns the percent rank of a value relative to a group of values.

PERCENT_RANK = (current rank - 1) / (total number of rows - 1)

SELECT PERCENT_RANK() OVER (ORDER BY sub_element_id), *
FROM service_table;
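The formula can be confirmed on five invented rows with SQLite (3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (sub_element_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (30,), (40,), (50,)])

# Five distinct rows: PERCENT_RANK = (rank - 1) / (5 - 1).
vals = [r[0] for r in con.execute(
    "SELECT PERCENT_RANK() OVER (ORDER BY sub_element_id) "
    "FROM t ORDER BY sub_element_id")]
print(vals)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```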

NTILE – It divides an ordered dataset into the specified number of buckets and assigns a bucket number to each row. It can be used to divide rows into roughly equal sets and assign a number to each set.

SELECT NTILE(2) OVER (ORDER BY sub_element_id), *
FROM portmaps_table;

If we have four records, they will be split into two buckets because 2 is passed to the NTILE function. The first two records will get bucket number 1 and the other two bucket number 2.
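The bucket assignment can be checked with SQLite (3.25+) on invented values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (sub_element_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])

# NTILE(2) over four ordered rows: first two land in bucket 1, rest in bucket 2.
rows = con.execute(
    "SELECT sub_element_id, NTILE(2) OVER (ORDER BY sub_element_id) "
    "FROM t ORDER BY sub_element_id").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 2), (4, 2)]
```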

LEAD – The LEAD function, lead(value_expr[, offset[, default]]), returns data from a following row. The number of rows to look ahead (offset) can optionally be specified; if it is not, the lead is one row by default. It returns default, or NULL when no default is specified, when the lead for the current row extends beyond the end of the window.

LAG – The LAG function, lag(value_expr[, offset[, default]]), accesses data from a previous row. The number of rows to look back (offset) can optionally be specified; if it is not, the lag is one row by default. It returns default, or NULL when no default is specified, when the lag for the current row extends beyond the start of the window.

Let's say the table contains the following input data:

./shelf=0/slot=4/port=6
./shelf=0/slot=5/port=0
./shelf=0/slot=5/port=1
./shelf=0/slot=5/port=24
SELECT LAG(sub_element_id, 1, 404) OVER (ORDER BY sub_element_id) AS previous,
       sub_element_id,
       LEAD(sub_element_id, 1, 404) OVER (ORDER BY sub_element_id) AS next
FROM portmaps_table;

Output will be as below

1. 404,./shelf=0/slot=4/port=6,./shelf=0/slot=5/port=0
2. ./shelf=0/slot=4/port=6,./shelf=0/slot=5/port=0,./shelf=0/slot=5/port=1
3. ./shelf=0/slot=5/port=0,./shelf=0/slot=5/port=1,./shelf=0/slot=5/port=24
4. ./shelf=0/slot=5/port=1,./shelf=0/slot=5/port=24,404
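This four-row example can be reproduced with SQLite (3.25+); the default is passed as the string '404' here so that all values share one type:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE portmaps_table (sub_element_id TEXT)")
con.executemany("INSERT INTO portmaps_table VALUES (?)", [
    ("./shelf=0/slot=4/port=6",),
    ("./shelf=0/slot=5/port=0",),
    ("./shelf=0/slot=5/port=1",),
    ("./shelf=0/slot=5/port=24",),
])

# Offset 1 and default '404': the first row has no previous value and the
# last row has no next value, so both fall back to the default.
rows = con.execute("""
    SELECT LAG(sub_element_id, 1, '404')  OVER (ORDER BY sub_element_id),
           sub_element_id,
           LEAD(sub_element_id, 1, '404') OVER (ORDER BY sub_element_id)
    FROM portmaps_table ORDER BY sub_element_id
""").fetchall()
print(rows[0])   # ('404', './shelf=0/slot=4/port=6', './shelf=0/slot=5/port=0')
print(rows[-1])  # ('./shelf=0/slot=5/port=1', './shelf=0/slot=5/port=24', '404')
```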

FIRST_VALUE – It returns the first result from an ordered set.

Let's say below is the input data:

1. Test_Device_Id,/shelf=0/slot=4/port=6,OGEAtest110,07/06/2009 13:15,2017-03-02
2. Test_Device_Id,/shelf=0/slot=5/port=0,OGEATest_Device_Id050,07/06/2009 13:15,2017-03-02
3. Test_Device_Id,/shelf=0/slot=5/port=1,OGEA199999999,12/26/2009 10:24,2017-03-02
4. Test_Device_Id,/shelf=0/slot=5/port=24,OGEATest_Device_Id0524,08/19/2009 06:44,2017-03-02
SELECT FIRST_VALUE(sub_element_id) OVER (ORDER BY sub_element_id), *
FROM dummy_table;

Output will be

1. /shelf=0/slot=4/port=6,Test_Device_Id,/shelf=0/slot=4/port=6,OGEAtest110,07/06/2009 13:15,2017-03-02
2. /shelf=0/slot=4/port=6,Test_Device_Id,/shelf=0/slot=5/port=0,OGEATest_Device_Id050,07/06/2009 13:15,2017-03-02
3. /shelf=0/slot=4/port=6,Test_Device_Id,/shelf=0/slot=5/port=1,OGEA199999999,12/26/2009 10:24,2017-03-02
4. /shelf=0/slot=4/port=6,Test_Device_Id,/shelf=0/slot=5/port=24,OGEATest_Device_Id0524,08/19/2009 06:44,2017-03-02
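The same behavior can be sketched with SQLite (3.25+); only the sampled column is modeled here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dummy_table (sub_element_id TEXT)")  # one column only
con.executemany("INSERT INTO dummy_table VALUES (?)", [
    ("/shelf=0/slot=4/port=6",),
    ("/shelf=0/slot=5/port=0",),
    ("/shelf=0/slot=5/port=1",),
    ("/shelf=0/slot=5/port=24",),
])

# Every row sees the same first value of the ordered set.
rows = con.execute(
    "SELECT FIRST_VALUE(sub_element_id) OVER (ORDER BY sub_element_id), "
    "sub_element_id FROM dummy_table ORDER BY sub_element_id").fetchall()
print(rows[0][0], rows[-1][0])
```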

LAST_VALUE – It returns the last result from an ordered set. For LAST_VALUE, using the default windowing clause, the result can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which in this example means the current row will always be the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expected.

SELECT LAST_VALUE(sub_element_id)
       OVER (ORDER BY sub_element_id
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), *
FROM portmaps_table;

Output will be

1. /shelf=0/slot=5/port=24,Test_Device_Id,/shelf=0/slot=4/port=6,OGEAtest110,07/06/2009 13:15,2017-03-02
2. /shelf=0/slot=5/port=24,Test_Device_Id,/shelf=0/slot=5/port=0,OGEATest_Device_Id050,07/06/2009 13:15,2017-03-02
3. /shelf=0/slot=5/port=24,Test_Device_Id,/shelf=0/slot=5/port=1,OGEA199999999,12/26/2009 10:24,2017-03-02
4. /shelf=0/slot=5/port=24,Test_Device_Id,/shelf=0/slot=5/port=24,OGEATest_Device_Id0524,08/19/2009 06:44,2017-03-02

If we run the query without RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

SELECT LAST_VALUE(sub_element_id) OVER (ORDER BY sub_element_id), *
FROM portmaps_table;

Output will be

1. /shelf=0/slot=4/port=6,Test_Device_Id,/shelf=0/slot=4/port=6,OGEAtest110,07/06/2009 13:15,2017-03-02
2. /shelf=0/slot=5/port=0,Test_Device_Id,/shelf=0/slot=5/port=0,OGEATest_Device_Id050,07/06/2009 13:15,2017-03-02
3. /shelf=0/slot=5/port=1,Test_Device_Id,/shelf=0/slot=5/port=1,OGEA199999999,12/26/2009 10:24,2017-03-02
4. /shelf=0/slot=5/port=24,Test_Device_Id,/shelf=0/slot=5/port=24,OGEATest_Device_Id0524,08/19/2009 06:44,2017-03-02

With the default windowing clause, the last value of each row's frame is simply the current row itself, so the first column mirrors sub_element_id.
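The two framings can be compared side by side in SQLite (3.25+), again modeling only the sampled column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE portmaps_table (sub_element_id TEXT)")
ports = ["/shelf=0/slot=4/port=6", "/shelf=0/slot=5/port=0",
         "/shelf=0/slot=5/port=1", "/shelf=0/slot=5/port=24"]
con.executemany("INSERT INTO portmaps_table VALUES (?)", [(p,) for p in ports])

# Default frame (UNBOUNDED PRECEDING .. CURRENT ROW): the "last" value
# is always the current row itself.
default_frame = [r[0] for r in con.execute(
    "SELECT LAST_VALUE(sub_element_id) OVER (ORDER BY sub_element_id) "
    "FROM portmaps_table ORDER BY sub_element_id")]
print(default_frame == ports)  # True

# Explicit UNBOUNDED FOLLOWING end point: every row sees the overall last value.
full_frame = [r[0] for r in con.execute(
    "SELECT LAST_VALUE(sub_element_id) OVER (ORDER BY sub_element_id "
    "RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) "
    "FROM portmaps_table")]
print(full_frame)  # four copies of '/shelf=0/slot=5/port=24'
```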

The PARTITION BY clause is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partitioning columns. These logical groups are known as partitions, which is not the same term as used for partitioned tables. Omitting the PARTITION BY clause applies the analytic operation to all the rows in the table. The ORDER BY expr [ASC|DESC] clause works like the regular ORDER BY clause: it makes sure the rows produced by the PARTITION BY clause are ordered by the given specification, such as ascending or descending order.

For row type windows, the definition is in terms of row numbers before or after the current row. The general syntax of the row window clause is as follows:

ROWS BETWEEN <start_expr> AND <end_expr>

The <start_expr> can be any one of: UNBOUNDED PRECEDING, CURRENT ROW, or N PRECEDING/FOLLOWING.
The <end_expr> can be any one of: UNBOUNDED FOLLOWING, CURRENT ROW, or N PRECEDING/FOLLOWING.

BETWEEN … AND: Use the BETWEEN…AND clause to specify the start point and end point for the window. The first expression (before AND) defines the start point and the second expression (after AND) defines the end point. If we omit BETWEEN…AND (for example, ROWS N PRECEDING or ROWS UNBOUNDED PRECEDING), Hive treats the specified expression as the start point, and the end point defaults to the current row.

N PRECEDING or FOLLOWING – This indicates N rows before or after the current row.

UNBOUNDED PRECEDING – This indicates that the window starts at the first row of the partition. It is a start point specification and cannot be used as an end point specification.

UNBOUNDED FOLLOWING – This indicates that the window ends at the last row of the partition. This is the end point specification and cannot be used as a start point specification.

UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING – This indicates the first and last row for every row, meaning all rows in the table.
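A concrete row-type frame: a centered moving sum with ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING, sketched in SQLite (3.25+) on invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])

# Each row sums itself with its immediate neighbors; the frame is clipped
# at the partition edges, so the first and last rows have only one neighbor.
sums = [r[0] for r in con.execute(
    "SELECT SUM(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) "
    "FROM t ORDER BY x")]
print(sums)  # [3, 6, 9, 12, 9]
```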

Sampling

When the data volume is extremely large, we may need to work on a subset of the data to speed up analysis. Sampling is a technique for selecting and analyzing a subset of data in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

Random sampling – uses the RAND() function and the LIMIT keyword to sample data. The DISTRIBUTE BY and SORT BY keywords are used to make sure the data is also randomly distributed among mappers and reducers efficiently. An ORDER BY RAND() statement can achieve the same purpose, but its performance is poor because the global ordering runs through a single reducer.

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
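DISTRIBUTE BY and SORT BY are Hive-specific, but the simpler (and, in Hive, slower) ORDER BY RAND() LIMIT pattern works in any SQL engine. A SQLite sketch with invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# Shuffle all rows, keep 5: a uniform random sample without replacement.
sample = [r[0] for r in con.execute(
    "SELECT x FROM t ORDER BY RANDOM() LIMIT 5")]
print(len(sample))  # 5
```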

Bucket table sampling – is a special sampling optimized for bucketed tables. The colname value specifies the column to sample the data on. The RAND() function can be used instead when sampling on entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more efficient.

SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <bucket number to sample> OUT OF <total number of buckets>
            ON [colname|RAND()]) table_alias;

SELECT name FROM employee_id_buckets
TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;

Block sampling – allows Hive to randomly pick up N rows of data, a percentage of the data size, or N bytes of data. The sampling granularity is the HDFS block size.

SELECT * FROM <Table_Name> 
TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

SELECT name FROM employee_id_buckets 
TABLESAMPLE(4 ROWS) a;

SELECT name FROM employee_id_buckets 
TABLESAMPLE(10 PERCENT) a;