Hive Tutorial 2 – hive dml, hive inner join, hive outer join, hive cross join, hive map join, hive left semi join, hive union all , hive union, hive intercept and hive minus

Data Querying

The select statement is used to query the data and SELECT * scans the whole table or file without triggering MapReduce jobs, so it runs faster than SELECT <column_name>. In the latest version of hive SELECT <column_name> FROM <table_name> LIMIT n, can also avoid triggering the MapReduce job if the Hive fetch task conversion is enabled by setting hive.fetch.task.conversion = more.

Nested SELECT

When using subqueries, an alias should be given for the subquery Or else, Hive will report exception.


SELECT portid,neid FROM (SELECT * FROM service WHERE id like '%port%') t1;

A subquery that uses EXIST or NOT EXIST must refer to both inner and outer expression.


SELECT * FROM table1 d WHERE d.service_id IN (SELECT service_id FROM table2);

SELECT * FROM table1 d WHERE d.service_id NOT IN (SELECT service_id FROM table2);

SELECT * FROM table1 d WHERE EXISTS (SELECT service_id FROM table2 s where d.service_id=s.service_id);

Joins

Hive JOIN is used to combine rows from two or more tables together. Hive supports common JOIN operations such as what’s in the RDBMS, for example, JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. However, Hive only supports equal JOIN instead of unequal JOIN, because unequal JOIN is difficult to be converted to MapReduce jobs.

The INNER JOIN in Hive uses JOIN keywords, which return rows meeting the JOIN conditions from both left and right tables.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

The JOIN operation can be performed among more tables


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s ON h.id =s.id AND h.name = s.name
INNER JOIN table3 k ON h.id =k.id
WHERE h.datestamp='${date}'

Self-join is a special JOIN where one table joins itself. When doing such joins, a different alias should be given to distinguish the same table


SELECT s.service_id,h.ds,h.qds,h.ads FROM table1 h INNER JOIN table1 s ON h.id =s.id WHERE h.datestamp='${date}';

Implicit join is a JOIN operation without using the JOIN keyword


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h, table2 s
where h.id =s.id AND h.name = s.name;

If the JOIN operation uses different columns in join conditions then it will create an additional MapReduce


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s ON h.id =s.id AND h.name = s.name
INNER JOIN table3 k ON h.name =k.name
WHERE h.datestamp='${date}'

If JOIN uses different columns in the join conditions, it will request additional job stages to complete the join. If the JOIN operation uses the same column in the join conditions, Hive will join on this condition using one stage. When JOIN is performed between multiple tables, the MapReduce jobs are created to process the data in the HDFS. Each of the jobs is called a stage. Usually, it is suggested for JOIN statements to put the big table right at the end for better performance as well as avoiding Out Of Memory exceptions, because the last table in the sequence is streamed through the reducers where the others are buffered in the reducer by default. Also, a hint, such as /*+STREAMTABLE (table_name)*/, can be specified to tell which table is streamed.


SELECT /*+ STREAMTABLE(table1) */ s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

Besides INNER JOIN, Hive also supports regular OUTER JOIN and FULL JOIN as explained below.

Left outer join – This returns all rows in the left table and matched rows in the right table.If there is no match in the right table, return null in the right table.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
LEFT JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

right outer join – This returns all rows in the right table and matched rows in the left table. If there is no match in the left table,return null in the left table.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
RIGHT JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

full outer join – This returns all rows in both the tables and matched rows in both the tables. If there is no match in the left or right
table, return null instead.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
FULL JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

cross join – This returns all row combinations in both the tables to produce a Cartesian product.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
CROSS JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

The same can be achieved using JOIN without condition or with the always true condition, such as 1 = 1.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
JOIN table2 s;

SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
JOIN table2 s ON 1=1;

In addition, JOIN always happens before WHERE. If possible, push conditions such as the JOIN conditions rather than WHERE conditions to filter the result set after JOIN immediately. What’s more, JOIN is NOT commutative! It is always left associative no matter whether they are LEFT JOIN or RIGHT JOIN.

Although Hive does not support unequal JOIN explicitly, there are workarounds using CROSS JOIN and WHERE conditions mentioned


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
CROSS JOIN table2 s where h.id <> s.id

MAP JOIN

The MAPJOIN statement means doing the JOIN operation only by map without the reduce job. The MAPJOIN statement reads all the data from the small table to memory and broadcasts to all maps. During the map phase, the JOIN operation is performed by comparing each row of data in the big table with small tables against the join conditions. Because there is no reduce needed, the JOIN performance is improved. When the hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN to MAPJOIN at runtime if possible instead of checking the map join hint. In addition, MAPJOIN can be used for unequal joins to improve performance since both MAPJOIN and WHERE are performed in the map phase.


SELECT /*+ MAPJOIN(table1) */ s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'

The bucket map join is a special type of MAPJOIN that uses bucket columns (the column specified by CLUSTERED BY in the CREATE table statement) as the join condition. Instead of fetching the whole table as done by the regular map join, bucket map join only fetches the required bucket data. To enable bucket map join, we need to set hive.optimize.bucketmapjoin = true and make sure the buckets number is a multiple of each other. If both tables joined are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all small tables in the memory. The following additional settings are needed to enable this behavior.


SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

LEFT SEMI JOIN

A LEFT SEMI JOIN only returns the records from the left-hand table. The restriction of using LEFT SEMI JOIN is that the right-hand
side table should only be referenced in the join condition, but not in WHERE or SELECT clauses


SELECT h.service_id,h.ds,h.qds,h.ads
FROM table1 h
LEFT SEMI JOIN table2 s
ON h.id =s.id;

UNION ALL

UNION ALL will data from both tables, including duplicates.


SELECT h.service_id FROM table1 h
UNION ALL
SELECT h.service_id FROM table2 s;

For other set operations supported by RDBMS, such as UNION, INTERCEPT, and MINUS, we can use SELECT with the WHERE condition to implement them as follows

UNION

SELECT DISTINCT service_id FROM
(SELECT h.service_id FROM table1 h
UNION ALL
SELECT h.service_id FROM table2 s) unique_service;

INTERCEPT – Can be implemented using the inner join


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
INNER JOIN table2 s
ON h.id =s.id AND h.name = s.name

MINUS – The table1 table implements MINUS on table2 using OUTER JOIN.


SELECT s.service_id,h.ds,h.qds,h.ads
FROM table1 h
LEFT JOIN table2 s
ON h.id =s.id AND h.name = s.name
WHERE h.datestamp='${date}'