Inverted Index Spark Example

The inverted index pattern generates an index from a data set to allow faster searches or data enrichment. It is often convenient to index large data sets on keywords, so that searches can trace terms back to the records that contain them. Building an inverted index requires extra processing up front, but it can greatly reduce the time it takes to find something. Search engines build indexes for exactly this reason. Imagine entering a keyword and having the engine crawl the Internet to build a list of pages to return to you: such a query would take an extremely long time to complete. With an inverted index, the search engine already knows which web pages relate to a keyword, and it simply displays those results to the user. These indexes are often loaded into a database for fast query responses.
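To make the idea concrete, here is a minimal in-memory sketch in plain Java, independent of Spark. The class name InMemoryInvertedIndex and the three sample documents are invented for illustration; the point is only that a lookup becomes a single map access once the index has been built.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of the Spark job in this post.
final class InMemoryInvertedIndex {

    // Maps each keyword to the ids (array positions) of the documents that contain it.
    static Map<String, Set<Integer>> build(String[] documents) {
        Map<String, Set<Integer>> index = new LinkedHashMap<>();
        for (int docId = 0; docId < documents.length; docId++) {
            for (String word : documents[docId].toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty token for leading whitespace
                }
                index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up documents, used only to exercise the sketch.
        String[] documents = {
            "spark builds the index",
            "the index speeds up search",
            "search without an index scans everything"
        };
        Map<String, Set<Integer>> index = build(documents);
        // A lookup is one map access instead of a scan over every document.
        System.out.println(index.get("index"));  // [0, 1, 2]
        System.out.println(index.get("search")); // [1, 2]
    }
}
```

The up-front cost is the single pass in build; every later query reuses that work.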

Problem to Solve

Given a set of employee information records, create an inverted index that maps each first name to the names of the departments in which it appears.

The sample input data is in the attached file employee_info.csv.

Build an inverted index of the form Name -> Department Name.

We will create the inverted index shown below, so that the departments associated with a given name can be looked up quickly.

LETRICH POLICE
DELVALLE STREETS & SAN,POLICE
JOSEPH FIRE,HEALTH,AVIATION,GENERAL SERVICES,STREETS & SAN,OEMC,LAW
BAYLIAN POLICE
ZHEN PUBLIC LIBRARY
KUBIAK POLICE,FIRE,WATER MGMNT
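Once lines in this shape exist, they can be loaded into a map for direct lookups. The following is a rough sketch, assuming each line is a name, a single space, and a comma-separated department list, and that the name itself contains no space. IndexLookup and parse are hypothetical names introduced here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading index lines back; not part of the Spark job.
final class IndexLookup {

    // Parses lines of the form "NAME DEPT1,DEPT2" into a name -> departments map.
    // Assumption: the first space separates the name from the department list.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> index = new HashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // skip lines that do not match the assumed layout
            }
            String name = line.substring(0, space);
            List<String> departments = Arrays.asList(line.substring(space + 1).split(","));
            index.put(name, departments);
        }
        return index;
    }

    public static void main(String[] args) {
        // Two lines copied from the sample index above.
        List<String> lines = Arrays.asList(
            "DELVALLE STREETS & SAN,POLICE",
            "KUBIAK POLICE,FIRE,WATER MGMNT");
        Map<String, List<String>> index = parse(lines);
        System.out.println(index.get("DELVALLE"));               // [STREETS & SAN, POLICE]
        System.out.println(index.get("KUBIAK").contains("FIRE")); // true
    }
}
```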

Below is the sample input for reference


First Name,Last Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate

dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,
edwards,tim p,lieutenant,fire,f,salary,,114846.00,
elkins,eric j,sergeant,police,f,salary,,104628.00,
estrada,luis f,police officer,police,f,salary,,96060.00,
ewing,marie a,clerk iii,police,f,salary,,53076.00,
finn,sean p,firefighter,fire,f,salary,,87006.00,
fitch,jordan m,law clerk,law,f,hourly,35,,14.51
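Each row has nine comma-separated columns, and some rows end with an empty column. The short sketch below shows how Java's String.split treats such a row with and without a negative limit; CsvFieldDemo is a hypothetical class name. It also assumes that no field contains a quoted comma, which a plain split cannot handle.

```java
// Hypothetical demo class showing how a sample row splits into columns.
final class CsvFieldDemo {

    // A negative limit keeps trailing empty strings, so every row yields all nine columns.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // First data row from the sample input; its last column (Hourly Rate) is empty.
        String row = "dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,";

        System.out.println(row.split(",").length); // 8, the trailing empty column is dropped
        System.out.println(fields(row).length);    // 9
        System.out.println(fields(row)[0]);        // dubert
        System.out.println(fields(row)[3]);        // fire
    }
}
```

Columns 0 and 3 are present either way in this data, but keeping every column makes index-based access predictable for the later ones.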

Below is the code to build the inverted index in Spark.


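import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class InvertedIndex {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("pattern").setMaster("local");

        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Read the CSV and drop blank lines and the header row, which would
        // otherwise fail on field[3] or show up as a bogus index entry.
        JavaRDD<String> rdd = jsc.textFile("Input Path").filter(new Function<String, Boolean>() {

            @Override
            public Boolean call(String line) throws Exception {
                return !line.trim().isEmpty() && !line.startsWith("First Name,");
            }
        });

        // Emit (name, department): column 0 is the key, column 3 is the value.
        JavaPairRDD<String, String> pair = rdd.mapToPair(new PairFunction<String, String, String>() {

            @Override
            public Tuple2<String, String> call(String value) throws Exception {
                // A limit of -1 keeps trailing empty columns, such as a missing hourly rate.
                String[] field = value.split(",", -1);

                return new Tuple2<String, String>(field[0], field[3]);
            }
        });

        // Merge the departments of each name into one comma-separated list.
        JavaPairRDD<String, String> output = pair.reduceByKey(new Function2<String, String, String>() {

            @Override
            public String call(String arg0, String arg1) throws Exception {
                // Either argument may already be a merged list, so compare whole
                // department names instead of doing a substring check.
                Set<String> departments = new LinkedHashSet<String>(Arrays.asList(arg0.split(",")));
                departments.addAll(Arrays.asList(arg1.split(",")));

                return String.join(",", departments);
            }
        });

        // Print each index entry as "name departments".
        for (Tuple2<String, String> entry : output.collect()) {
            System.out.println(entry._1 + " " + entry._2);
        }

        jsc.stop();
    }
}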
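Because reduceByKey may combine partial results in any grouping, the merge function can receive two already-merged lists, and it should produce the same departments however the values are grouped. A substring test such as String.contains can also mistake one department for another when one name is part of a longer one. The plain-Java sketch below lets the merge logic be checked without starting Spark. DepartmentMerge is a hypothetical class, and the department "WATER" is invented purely to show the substring case.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-alone version of a merge function for the reduce step.
final class DepartmentMerge {

    // Unions two comma-separated department lists, comparing whole names
    // and keeping the order in which departments were first seen.
    static String merge(String left, String right) {
        Set<String> departments = new LinkedHashSet<>(Arrays.asList(left.split(",")));
        departments.addAll(Arrays.asList(right.split(",")));
        return String.join(",", departments);
    }

    public static void main(String[] args) {
        System.out.println(merge("POLICE", "FIRE"));          // POLICE,FIRE
        System.out.println(merge("POLICE,FIRE", "FIRE,LAW")); // POLICE,FIRE,LAW

        // "WATER" is an invented department: a substring check against
        // "WATER MGMNT" would wrongly treat it as already present.
        System.out.println(merge("WATER MGMNT", "WATER"));    // WATER MGMNT,WATER

        // Grouping does not change the result.
        String a = merge(merge("POLICE", "FIRE"), "POLICE");
        String b = merge("POLICE", merge("FIRE", "POLICE"));
        System.out.println(a.equals(b));                      // true
    }
}
```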