Hive tutorial 10 – Hive example for writing a custom user-defined function

Regular user-defined functions (UDFs) operate row-wise and produce one result for each input row. Let's say we have input data like the following:

1920,1920./shelf=0/slot=5/port=1,BBG199999999,12/26/2009 10:24,
1920,1920./shelf=0/slot=4/port=6,BBGtest110,07/06/2009 13:15,
1920,1920./shelf=0/slot=5/port=24,BBG19200524,08/19/2009 06:44,
1920,1920./shelf=0/slot=5/port=0,BBG1920050,07/06/2009 13:15,

We need a custom Hive function that converts the string 1920./shelf=0/slot=5/port=1 to /shelf=0/slot=5/port=1. To define the user-defined function we write a custom Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods.

Hive Custom UDF Code

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

@Description(name = "udf_name", value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
        extended = "This is more detail about the function, such as syntax and examples.")
@UDFType(deterministic = true, stateful = false)
public class CustomStripUdf extends UDF {

    private Text result = new Text();

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }

        // Split on the literal dot; an unescaped "." is a regex that matches every character.
        // "1920./shelf=0/slot=5/port=1" -> ["1920", "/shelf=0/slot=5/port=1"]
        String[] sub_id = str.toString().split("\\.", 2);

        // Keep the part after the first dot, e.g. "/shelf=0/slot=5/port=1"
        result.set(sub_id.length > 1 ? sub_id[1] : sub_id[0]);

        return result;
    }

}

The @Description annotation is a useful Hive-specific annotation that provides usage information for the UDF in the Hive console. The text defined in the value property is shown by the HQL DESCRIBE FUNCTION command, and the text defined in the extended property is shown by the HQL DESCRIBE FUNCTION EXTENDED command.

The @UDFType annotation tells Hive what behaviour to expect from the function. A deterministic UDF (deterministic = true) always gives the same result when passed the same arguments, such as LENGTH(string input), MAX(), and so on. A non-deterministic UDF (deterministic = false) can return a different result for the same set of arguments, for example UNIX_TIMESTAMP(), which returns the current timestamp in the default time zone. The stateful property (stateful = true) allows a function to keep static variables across rows, such as ROW_NUMBER(), which assigns sequential numbers to all rows in a table.
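For example, once the function has been registered (as shown further below, under the name stripPorts), the annotation text can be checked from the Hive console; DESCRIBE FUNCTION should print the value text with _FUNC_ replaced by the function name, and the EXTENDED form should also print the extended text:

hive> DESCRIBE FUNCTION stripPorts;

hive> DESCRIBE FUNCTION EXTENDED stripPorts;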

All simple UDFs extend the Hive UDF class, and the subclass must implement at least one evaluate() method, which Hive finds and calls by reflection. The evaluate() method can be overloaded to accept different argument types or numbers of arguments. In this method we can implement whatever logic and exception handling the function's design requires, using the Hadoop data types for MapReduce serialization, such as Text, DoubleWritable, IntWritable, and so on.
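As a hedged illustration of overloading, a second evaluate() method could be added to the CustomStripUdf class alongside the existing one, reusing the same result field; the extra separator parameter is an assumption for this sketch and is not part of the tutorial's function:

// Hypothetical overload: strips everything up to and including the first
// occurrence of a caller-supplied separator, e.g. stripPorts(subelement_id, '.')
public Text evaluate(Text str, Text separator) {
    if (str == null || separator == null) {
        return null;
    }
    String s = str.toString();
    String sep = separator.toString();
    int idx = s.indexOf(sep);
    // If the separator is not found, the input is returned unchanged
    result.set(idx >= 0 ? s.substring(idx + sep.length()) : s);
    return result;
}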

Once the code is ready, package your Java class into a JAR file, add it to the Hive classpath, and register a temporary function:


hive> ADD JAR HiveUdf.jar;

hive> CREATE TEMPORARY FUNCTION stripPorts as 'com.hadoop.hive.custom.CustomStripUdf';

hive> SELECT stripPorts(subelement_id) FROM service_table;

The service_table contains the following data:


1920,1920./shelf=0/slot=5/port=1,BBG199999999,12/26/2009 10:24,
1920,1920./shelf=0/slot=4/port=6,BBGtest110,07/06/2009 13:15,
1920,1920./shelf=0/slot=5/port=24,BBG19200524,08/19/2009 06:44,
1920,1920./shelf=0/slot=5/port=0,BBG1920050,07/06/2009 13:15,

The output would be

/shelf=0/slot=5/port=1
/shelf=0/slot=4/port=6
/shelf=0/slot=5/port=24
/shelf=0/slot=5/port=0