mapreduce partition pruning concept

Partition pruning configures the way the framework picks input splits and drops files from being loaded into MapReduce based on the name of the file.You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.Typically, all the data loaded into a MapReduce job is assigned into map tasks and read in parallel. If entire files are going to be thrown out based on the query, loading all of the files is a large waste of processing time. By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist. For example, if you are commonly analyzing data based on date ranges, partitioning your data by date will make it so you only need to load the data inside of that range.

The InputFormat is where this pattern comes to life. The getSplits method is where we pay special attention, because it determines the input splits that will be created, and thus the number of map tasks. While the configuration is typically a set of files, configuration turns into more of a query than a set of file paths. For instance, if data is stored on a file system by date, the InputFormat can accept a date range as input, then determine which folders to pull into the MapReduce job. If data is sharded in an external service by date, say 12 shards for each month, only one shard needs to be read by the MapReduce job when looking for data in March. The key here is that the input format determines where the data comes from based on a query, rather than passing in a set of files.

The RecordReader implementation depends on how the data is being stored. If it is a file-based input, something like a LineRecordReader can be used to read key/value pairs from a file. If it is an external source, you’ll have to customize something more to your needs.

Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic. The main reason for partition pruning is to reduce the overall processing time to read in data. This is done by ignoring input that will not produce any output before it even gets to a map task.