MapReduce example to build an inverted index from a sample data set

The inverted index pattern is used to generate an index from a data set, allowing faster searches or data-enrichment capabilities. It is often convenient to index large data sets on keywords, so that searches can trace terms back to the records that contain specific values. Building an inverted index does require extra processing up front, but taking the time to do so can greatly reduce the time it takes to find something. Search engines build indexes to improve search performance. Imagine entering a keyword and letting the engine crawl the Internet to build a list of pages to return to you: such a query would take an extremely long time to complete. By building an inverted index ahead of time, the search engine already knows all the web pages related to a keyword and simply displays those results to the user. These indexes are often ingested into a database for fast query responses.

Building an inverted index is a fairly straightforward application of MapReduce because the framework handles the majority of the work. Most text-search systems rely on an inverted index to find the documents that contain a given word or term.
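Before looking at the MapReduce version, here is a rough sketch of the end result in plain Java, independent of Hadoop (the class name and the data are purely illustrative): an inverted index is essentially a map from a term to the set of documents that contain it.

import java.util.*;

public class SimpleInvertedIndex {

    public static void main(String[] args) {
        // document name -> first names it contains (illustrative data)
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("employee_info_1.csv", Arrays.asList("AARON", "ABAD JR", "ABARCA"));
        docs.put("employee_info_2.csv", Arrays.asList("AARON"));

        // invert the relationship: first name -> set of files containing it
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String name : doc.getValue()) {
                index.computeIfAbsent(name, k -> new TreeSet<>()).add(doc.getKey());
            }
        }

        // prints AARON -> [employee_info_1.csv, employee_info_2.csv], and so on
        index.forEach((name, files) -> System.out.println(name + "\t" + files));
    }
}

The MapReduce job below produces the same kind of mapping, except that the inversion is distributed: the mappers emit (name, file) pairs and the reducers collect the file names for each name.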

Problem to Solve

Given a set of employee information documents, find the document(s) in which an employee's information is available based on the first name.

Build an inverted index as Name -> file name. We have three sample data files:

  • employee_info_1.csv
  • employee_info_2.csv
  • employee_info_3.csv

We will create an inverted index like the one below so that it is faster to search for employee details based on the first name.

AARON      employee_info_1.csv      employee_info_2.csv

ABAD JR    employee_info_1.csv

ABARCA    employee_info_1.csv


Below is a sample of the input data for reference.

First Name,Last Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate

dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,
edwards,tim p,lieutenant,fire,f,salary,,114846.00,
elkins,eric j,sergeant,police,f,salary,,104628.00,
estrada,luis f,police officer,police,f,salary,,96060.00,
ewing,marie a,clerk iii,police,f,salary,,53076.00,
finn,sean p,firefighter,fire,f,salary,,87006.00,
fitch,jordan m,law clerk,law,f,hourly,35,,14.51

Mapper Code

In the mapper class we split the input data using a comma as the delimiter and then check for invalid records, ignoring them in the if condition. The employee's first name is stored at index 0, so we fetch it from the 0th field. We also need the file name to store as the value against the first name, so we fetch the name of the file being processed in the mapper using

 ((FileSplit) context.getInputSplit()).getPath().getName() 

and add it to the value.


import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexNameMapper extends Mapper<Object, Text, Text, Text> {

    private Text nameKey = new Text();
    private Text fileNameValue = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String data = value.toString();
        // split with a limit of -1 so trailing empty fields (e.g. an empty hourly rate) are kept
        String[] field = data.split(",", -1);

        // ignore the header row and malformed records; the first name is in the 0th field
        if (field != null && field.length == 9 && field[0].length() > 0
                && !field[0].equalsIgnoreCase("First Name")) {
            String firstName = field[0];
            // name of the input file currently being processed by this mapper
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            nameKey.set(firstName);
            fileNameValue.set(fileName);
            context.write(nameKey, fileNameValue);
        }
    }
}
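As a quick standalone sanity check of the parsing logic (plain Java, not part of the job; the sample line is taken from the input above), note why the -1 limit passed to split matters: a salaried row has empty Typical Hours and Hourly Rate columns, and the -1 limit keeps those trailing empty fields so the record still produces exactly 9 fields.

public class SplitCheck {

    public static void main(String[] args) {
        String line = "dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,";
        // a limit of -1 keeps trailing empty strings in the result
        String[] field = line.split(",", -1);
        System.out.println(field.length); // 9
        System.out.println(field[0]);     // dubert
    }
}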
Reducer Code

The reducer iterates through the set of input values and appends each file name to a string, delimited by a space character. The input key is written out along with this concatenation, and we also check that we do not append a duplicate file name for the same first name.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexNameReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();

        for (Text value : values) {
            String fileName = value.toString();
            // append each file name only once per first name
            if (sb.indexOf(fileName) < 0) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                sb.append(fileName);
            }
        }

        result.set(sb.toString());
        context.write(key, result);
    }
}
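For example, if the grouped values for the key AARON were employee_info_1.csv, employee_info_2.csv and employee_info_1.csv again (the same first name can appear several times in one file), the duplicate is skipped and the reducer writes a single line: AARON followed by employee_info_1.csv and employee_info_2.csv, matching the sample output shown at the end.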
Driver Code

Finally, we use the driver class to verify that everything works as expected.

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverInvertedIndex {

    public static void main(String[] args) throws Exception {

        /*
         * I have used my local paths on Windows; change the paths as per your
         * local machine.
         */
        args = new String[] { "Replace this string with Input Path location",
                "Replace this string with output Path location" };

        /* delete the output directory before running the job */
        FileUtils.deleteDirectory(new File(args[1]));

        /* set the hadoop system parameter */
        System.setProperty("hadoop.home.dir", "Replace this string with hadoop home directory location");

        if (args.length != 2) {
            System.err.println("Please specify the input and output path");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DriverInvertedIndex.class);
        job.setJobName("Inverted_Index");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(InvertedIndexNameMapper.class);
        job.setReducerClass(InvertedIndexNameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
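Note that the driver deletes the output directory with FileUtils before submitting the job; Hadoop refuses to run a job whose output directory already exists, so this makes it convenient to re-run the job repeatedly while testing locally.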
Sample Output

AARON     employee_info_1.csv   employee_info_2.csv

ABAD JR   employee_info_1.csv

ABARCA   employee_info_1.csv