Default mapper, reducer, partitioner, MultithreadedMapper and split size configuration in Hadoop MapReduce

This article explains which mapper, reducer, and partitioner are used in a MapReduce program if none are specified in the driver code.

Default Mapper

Below is the mapper code used if no mapper is provided in the driver. The default mapper is just the Mapper class, which writes the input key and value unchanged to the output. Mapper is a generic type, which allows it to work with any key or value types.


import java.io.IOException;

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // Identity map: the input key and value are written to the output unchanged.
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}

The number of map tasks is equal to the number of splits that the input is turned into, which is driven by the size of the input and the file’s block size (if the file is in HDFS). The split size can be controlled with the following properties:

mapreduce.input.fileinputformat.split.minsize is the smallest valid size in bytes for a file split; the default value is 1 byte.
mapreduce.input.fileinputformat.split.maxsize is the largest valid size in bytes for a file split.
dfs.blocksize is the size of a block in HDFS; the default value is 128 MB.

Applications may impose a minimum split size. By setting this to a value larger than the block size, they can force splits to be larger than a block. There is no good reason for doing this when using HDFS, because doing so will increase the number of blocks that are not local to a map task.

The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.

The split size is calculated by the formula max(minimumSize, min(maximumSize, blockSize)), and by default minimumSize < blockSize < maximumSize, so the split size is blockSize.

Controlling the split size
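As an illustration, the minimum and maximum split sizes can be set from the driver through FileInputFormat helper methods, which populate the two split-size properties above. A minimal driver sketch (the class name, job name, and argument paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split size example");
    job.setJarByClass(SplitSizeDriver.class);

    // Sets mapreduce.input.fileinputformat.split.minsize to 256 MB; because the
    // split size is max(minimumSize, min(maximumSize, blockSize)), splits now
    // become larger than a 128 MB block.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

    // Setting mapreduce.input.fileinputformat.split.maxsize below the block
    // size would instead force splits smaller than a block, for example:
    // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}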

File split properties

When the input format derives from FileInputFormat, the InputSplit returned by the method getInputSplit() can be cast to a FileSplit to access the file information.
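For example, a mapper can examine its own split in setup(). A small sketch, assuming the input format is TextInputFormat (so the input types are LongWritable and Text); the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // getInputSplit() returns an InputSplit; with FileInputFormat-based input
    // formats it can be cast to FileSplit to access the file information.
    FileSplit split = (FileSplit) context.getInputSplit();
    System.out.println("File:   " + split.getPath());
    System.out.println("Start:  " + split.getStart());
    System.out.println("Length: " + split.getLength());
  }
}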

 

Default Partitioner

The default partitioner is HashPartitioner, which hashes a record’s key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job. The key’s hash code is turned into a nonnegative integer by bitwise ANDing it with the largest integer value. It is then reduced modulo the number of partitions to find the index of the partition that the record belongs in.

By default, there is a single reducer, and therefore a single partition; the action of the partitioner is irrelevant in this case since everything goes into one partition. However, it is important to understand the behavior of HashPartitioner when you have more than one reduce task. Assuming the key’s hash function is a good one, the records will be allocated evenly across reduce tasks, with all records that share the same key being processed by the same reduce task.


import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  // Turn the key's hash code into a nonnegative value, then take it modulo
  // the number of reduce tasks to choose the partition.
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
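To see HashPartitioner distribute records, the driver has to request more than one reduce task. A small driver fragment, assuming job is the org.apache.hadoop.mapreduce.Job being configured:

// With four reduce tasks there are four partitions, and HashPartitioner
// assigns each record to one of them by the key's hash code.
job.setNumReduceTasks(4);

// Optional: HashPartitioner is already the default partitioner.
job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);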

Default Reducer

The default reducer is Reducer, again a generic type, which simply writes all its input to its output. Records are sorted by the MapReduce system before being presented to the reducer. If the keys are numerical (as with the byte-offset keys produced by the default TextInputFormat), they are sorted numerically.


import java.io.IOException;

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // Identity reduce: every value for a key is written straight to the output.
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
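Putting these defaults together, a driver that specifies nothing but the input and output paths still runs a complete job: the Mapper, Reducer, and HashPartitioner above are used with a single reduce task. A minimal sketch (the class name and argument paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalMapReduce {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "minimal job");
    job.setJarByClass(MinimalMapReduce.class);

    // No mapper, reducer, or partitioner is set, so the job falls back to the
    // defaults described in this article.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}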

MultithreadedMapper

MultithreadedMapper is an implementation that runs mappers concurrently in a configurable number of threads (set by mapreduce.mapper.multithreadedmapper.threads). For most data processing tasks, it confers no advantage over the default implementation. However, for mappers that spend a long time processing each record, because they contact external servers, for example, it allows multiple mappers to run in one JVM with little contention. Below is a snippet of the driver configuration:


MultithreadedMapper.setMapperClass(sampleJob, TestMapper.class);
MultithreadedMapper.setNumberOfThreads(sampleJob, 8);
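For context, here is a fuller driver sketch (TestMapper stands in for your real map implementation, and the other names are placeholders). The job's mapper is MultithreadedMapper itself, and the real logic is registered as its delegate:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultithreadedDriver {
  public static void main(String[] args) throws Exception {
    Job sampleJob = Job.getInstance(new Configuration(), "multithreaded mapper example");
    sampleJob.setJarByClass(MultithreadedDriver.class);

    // The framework runs MultithreadedMapper; TestMapper (a placeholder for
    // your own mapper) is the delegate whose map() is called from 8 threads.
    sampleJob.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(sampleJob, TestMapper.class);
    MultithreadedMapper.setNumberOfThreads(sampleJob, 8);

    // Output types here assume TestMapper emits Text keys and LongWritable values.
    sampleJob.setOutputKeyClass(Text.class);
    sampleJob.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(sampleJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(sampleJob, new Path(args[1]));
    System.exit(sampleJob.waitForCompletion(true) ? 0 : 1);
  }
}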

Hadoop Streaming

Hadoop Streaming is one of the most important utilities in the Hadoop distribution. The Streaming interface allows you to write MapReduce programs in any language of your choice that can work with STDIN and STDOUT. Streaming can therefore be seen as a generic Hadoop API that allows MapReduce programs to be written in virtually any language. In this approach, the mapper receives its input on STDIN and emits its output on STDOUT.

The Apache Hadoop framework and its MapReduce programming model are the industry standard for processing large volumes of data. The MapReduce framework is used to do the actual processing and logic implementation. The Java MapReduce API is the standard option for writing MapReduce programs, but the Hadoop Streaming API makes it possible to write MapReduce jobs in other languages. This is valuable flexibility for MapReduce programmers with experience in languages other than Java. Executables can also be used with the Streaming API to work as a MapReduce job. The only condition is that the program or executable must be able to take input from STDIN and produce output on STDOUT.
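To make the contract concrete, the class below is an ordinary program, not a MapReduce class: it reads lines from STDIN and writes tab-separated key/value pairs to STDOUT, which is all a Streaming mapper has to do. The class name and the word-count logic are just for illustration; any executable that behaves this way can be passed to the streaming jar's -mapper option:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    // Streaming mappers read records from STDIN, one line per record.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          // Key and value are emitted on STDOUT separated by a tab; the
          // framework sorts by key before handing records to the reducer.
          System.out.println(word + "\t1");
        }
      }
    }
  }
}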