The common join is also called reduce side join. It is a basic join in Hive and works for most of the time. For common joins, we need to make sure the big table is on the right-most side or specified by hit, as follows.
/*+ STREAMTABLE(stream_table_name) */.
Map join is used when one of the join tables is small enough to fit in the memory, so it is very fast but limited.Hive can convert map join automatically with the following settings.
SET hive.auto.convert.join=true; --default false SET hive.mapjoin.smalltable.filesize=600000000; --default 25M SET hive.auto.convert.join.noconditionaltask=true; --default false. Set to true so that map join hint is not needed SET hive.auto.convert.join.noconditionaltask.size=10000000; --The default value controls the size of table to fit in memory
Once autoconvert is enabled, Hive will automatically check if the smaller table file size is bigger than the value specified by hive.mapjoin.smalltable.filesize, and then Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide the map join hints in the query.
Bucket map join
Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings.
SET hive.auto.convert.join=true; --default false SET hive.optimize.bucketmapjoin=true; --default false
In bucket map join, all the join tables must be bucket tables and join on buckets columns. In addition, the buckets number in bigger tables must be a multiple of the bucket number in the small tables.
Sort merge bucket (SMB) join
SMB is the join performed on the bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB.
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat SET hive.auto.convert.sortmerge.join=true; SET hive.optimize.bucketmapjoin=true; SET hive.optimize.bucketmapjoin.sortedmerge=true; SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
Sort merge bucket map (SMBM) join
SMBM join is a special bucket join but triggers map-side join only. It can avoid caching all rows in the memory like map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings.
SET hive.auto.convert.join=true; SET hive.auto.convert.sortmerge.join=true SET hive.optimize.bucketmapjoin=true; SET hive.optimize.bucketmapjoin.sortedmerge=true; SET hive.auto.convert.sortmerge.join.noconditionaltask=true; SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
When working with data that has a highly uneven distribution, the data skew could happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following setting informs Hive to optimize properly if data skew happens.
SET hive.optimize.skewjoin=true; --If there is data skew in join, set it to true. Default is false. SET hive.skewjoin.key=100000; --This is the default value. If the number of key is bigger than this, the new keys will send to the other unused reducers.
Note : Skew data could happen on the GROUP BY data too. To optimize it, we need to do the following settings to enable skew data optimization in the GROUP BY result.
Once configured, Hive will first trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew.