Hive tutorial 8 – Hive performance tuning using Job and query optimization with local mode, jvm reuse and parallel execution

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, it is an overhead to start distributed data processing since the launching time of the fully distributed mode takes more time than the job processing time. Hive supports automatic conversion of a job to run in local mode with the following settings.


SET hive.exec.mode.local.auto=true; --default false

SET hive.exec.mode.local.auto.inputbytes.max=50000000;

SET hive.exec.mode.local.auto.input.files.max=5; --default 4

A job must satisfy the following conditions to run in the local mode

1. The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max

2. The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max

3. The total number of reduce tasks required is 1 or 0

JVM reuse

By default, Hadoop launches a new JVM for each map or reduce job and runs the map or reduce task in parallel. When the map or reduce job is a lightweight job running only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not Yarn) has an option to reuse JVM by sharing the JVM to run mapper/reducer serially instead of parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM. To enable the reuse, we can set the maximum number of tasks for a single job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1.We can also set the value to –1 to indicate that all the tasks for a job will run in the same JVM.

SET mapred.job.reuse.jvm.num.tasks=5;
Parallel execution

Hive queries commonly are translated into a number of stages that are executed by the default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save the overall job running time. Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance. We can enable this feature with the following settings


SET hive.exec.parallel=true; -- default false

SET hive.exec.parallel.thread.number=16;  -- default 8, it defines the max number for running in parallel