Pig Tutorial 2 – Pig Data Types, Relations, Bags, Tuples, Fields and Parameter Substitution

Relations, Bags, Tuples, Fields

Pig Latin statements work with relations. A relation is a bag, a bag is a collection of tuples, a tuple is an ordered set of fields, and a field is a piece of data. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.

Relations are referred to by name or alias. Names are assigned by you as part of the Pig Latin statement.

A = LOAD 'hdfs://nameservice/user/queuename/pigtest/input' using PigStorage(',') AS (element_name:chararray, subelement_name:chararray, service_id:chararray);

Fields are referred to by positional notation or by name.

1. Positional notation is generated by the system. It is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1 and $2 refer to the fields element_name, subelement_name and service_id in the relation above.

Below is an example that uses positional notation in a FOREACH block to concatenate the three fields.


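input_with_filename = FOREACH A {
    -- Pick up each field by position and give it a local name.
    element_name = (chararray)$0;
    subelement_name = (chararray)$1;
    service_id = (chararray)$2;
    -- Emit a single field holding the three values joined together.
    GENERATE CONCAT(element_name, subelement_name, service_id);
}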

2. Names are assigned by you using schemas. You can use any name that is not a Pig keyword, as below.


A_valid_data = FOREACH A GENERATE element_name, subelement_name, service_id,
    ((element_name is not null and
      subelement_name is not null and
      service_id is not null) ? 'valid' : 'invalid') AS (type:chararray);
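
As a minimal sketch of the two notations side by side, the statements below project the same two fields of relation A once by position and once by name. The aliases by_position and by_name are made up for illustration, and the LOAD statement repeats the one shown earlier.

```pig
-- Same LOAD as earlier; the schema gives each field a name.
A = LOAD 'hdfs://nameservice/user/queuename/pigtest/input' USING PigStorage(',')
    AS (element_name:chararray, subelement_name:chararray, service_id:chararray);

-- Positional notation: $0 is the first field, $2 is the third.
by_position = FOREACH A GENERATE $0, $2;

-- Name notation: the same two fields, referenced through the schema.
by_name = FOREACH A GENERATE element_name, service_id;
```

Both relations should contain the same tuples. Names are usually easier to read, while positions still work when no schema was declared.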

Referencing Fields that are Complex Data Types

In this example the data file contains tuples. A schema for complex data types (in this case, tuples) is used to load the data. Then, dereference operators (the dot in t1.t1a and t2.$0) are used to access the fields in the tuples.


cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

X = FOREACH A GENERATE t1.t1a,t2.$0;

DUMP X;
(3,4)
(1,3)
(2,9)
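
Maps are dereferenced with the # operator rather than the dot. The sketch below assumes a hypothetical file named mapdata whose single column holds a map written in the [key#value] form; the file name, the aliases M and V, and the key open are illustrative only.

```pig
-- 'mapdata' is a hypothetical input file with one map per line,
-- for example: [open#apache]
M = LOAD 'mapdata' AS (m:map[]);

-- Look up the value stored under the key 'open' in each map.
-- A key that is absent from a map yields null.
V = FOREACH M GENERATE m#'open';
```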

Data Types

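The supported simple data types are int, long, float, double, chararray and bytearray (chararray and bytearray are the two array types). Newer Pig releases also add boolean, datetime, biginteger and bigdecimal.

The supported complex data types are tuple, an ordered set of fields, for example (1,2); bag, a collection of tuples, for example {(1,2), (3,4)}; and map, a set of key value pairs, for example [open#apache].

tuple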

You can think of a tuple as a row with one or more fields, where each field can be any data type and any field may or may not have data. If a field has no data, then the following happens.

1. In a load statement, the loader will inject null into the tuple. The actual value that is substituted for null is loader specific; for example, PigStorage substitutes an empty field for null.
2. In a non-load statement, if a requested field is missing from a tuple, Pig will inject null.
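
One way to see the injected nulls is to test for them after loading. This sketch assumes a hypothetical comma-separated file named sparse in which some lines have an empty second field, such as 1,,3. The aliases S, missing_f2 and filled are illustrative.

```pig
-- 'sparse' is a hypothetical comma-separated file; a line such as 1,,3
-- has no data in its second field.
S = LOAD 'sparse' USING PigStorage(',') AS (f1:int, f2:int, f3:int);

-- Keep only the tuples whose second field was loaded as null.
missing_f2 = FILTER S BY f2 IS NULL;

-- Replace the null with a default value using the bincond operator.
filled = FOREACH S GENERATE f1, ((f2 IS NULL) ? 0 : f2) AS f2, f3;
```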

bag

A bag is a collection of tuples. Below are some points to consider:

1. A bag can have duplicate tuples.
2. A bag can have tuples with differing numbers of fields. However, if Pig tries to access a field that does not exist, a null value is substituted.
3. A bag can have tuples with fields that have different data types. However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same. For example, if half of the tuples include chararray fields while the other half include float fields, only half of the tuples will participate in any kind of computation because the chararray fields will be converted to null.

Bags have two forms: outer bag (or relation) and inner bag.

Outer Bag – In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.


A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)

Inner Bag – Now, suppose we group relation A by the first field to form relation X.

In this example X is a relation or bag of tuples. The tuples in relation X have two fields. The first field is type int. The second field is type bag; you can think of this bag as an inner bag.


X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
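
Inner bags are typically consumed by aggregate functions. As a sketch that continues the example, the statement below counts the tuples in each inner bag; the alias Y is an assumption for illustration. In the output of GROUP, the first field is named group and the inner bag takes the name of the grouped relation, here A.

```pig
-- X comes from the GROUP statement above: (group, {bag of A tuples}).
Y = FOREACH X GENERATE group, COUNT(A);
DUMP Y;
-- For the sample data above this should print (order not guaranteed):
-- (1,1)
-- (4,2)
-- (8,1)
```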

Use schemas to assign types to fields

If you don’t assign types, fields default to type bytearray and implicit conversions are applied to the data depending on the context in which that data is used. For example, in relation B, f1 is converted to integer because 5 is an integer. In relation C, f1 and f2 are converted to double because we don’t know the type of either f1 or f2.


A = LOAD 'data' AS (f1,f2,f3);
B = FOREACH A GENERATE f1 + 5;
C = FOREACH A generate f1 + f2;
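
To avoid relying on implicit conversion, the types can be declared in the schema or cast explicitly where the fields are used. This is a sketch over the same 'data' file; the aliases A_typed, B_typed and C_cast are illustrative, and int is only an assumed choice of type.

```pig
-- Option 1: declare the types in the schema so no conversion is guessed.
A_typed = LOAD 'data' AS (f1:int, f2:int, f3:int);
B_typed = FOREACH A_typed GENERATE f1 + 5;

-- Option 2: keep the untyped load and cast where the fields are used.
A = LOAD 'data' AS (f1,f2,f3);
C_cast = FOREACH A GENERATE (int)f1 + (int)f2;
```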

If a schema is defined as part of a load statement, the load function will attempt to enforce the schema. If the data does not conform to the schema, the loader will generate a null value or an error.


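A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

If an explicit cast is not supported, an error will occur. For example, you cannot cast a chararray to int. (The cast table in more recent Pig documentation lists chararray to int as a supported cast; on such versions this statement may return null for non-numeric names rather than fail.)

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE (int)name;

This will cause an error …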

If Pig cannot resolve incompatible types through implicit casts, an error will occur. For example, you cannot add chararray and float (see the Types Table for addition and subtraction).

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name + gpa;

This will cause an error …
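
If the intent was to join the two values as text, one possible fix is to cast gpa to chararray and use CONCAT instead of the + operator. This is only a sketch; the alias C is an assumption.

```pig
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

-- Cast the float to chararray explicitly, then concatenate the two strings.
C = FOREACH A GENERATE CONCAT(name, (chararray)gpa);
```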

Parameter Substitution

Substitute values for parameters at run time. The exec, run, and explain commands also support parameter substitution.

pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script

debug – With this option, the script is run and a fully substituted Pig script is produced in the current working directory, named original_script_name.substituted.

dryrun – With this option, the script is not run and a fully substituted Pig script is produced in the current working directory, named original_script_name.substituted.

script – The Pig script must be the last element in the Pig command line.

1. If parameters are specified in the Pig command line or in a parameter file, the script should include a $param_name for each param_name included in the command line or parameter file.

2. If parameters are specified using the preprocessor statements, the script should include either %declare or %default. In the script, parameter names can be escaped with the backslash character ( \ ) in which case substitution does not take place.

%declare – Preprocessor statement included in a Pig script. Use it to describe one parameter in terms of other parameters. The declare statement is processed prior to running the Pig script. The scope of a parameter value defined using declare is all the lines following the declare statement until the next declare statement that defines the same parameter is encountered.

%default – Preprocessor statement included in a Pig script. Use it to provide a default value for a parameter. The default value has the lowest priority and is used if a parameter value has not been defined by other means. The default statement is processed prior to running the Pig script. The scope is the same as for %declare.

Precedence

Precedence for parameters is as follows, from highest to lowest:

1. parameters defined using the declare statement.

2. parameters defined in the command line or file.

3. parameters defined using default.
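
Putting the pieces together, here is a minimal sketch of a parameterized script. The script name myscript.pig, the parameter names input_dir and output_dir, and all paths are hypothetical; the command lines are shown as comments so that the whole example stays in one block.

```pig
-- myscript.pig (hypothetical)

-- Lowest priority: used only when input_dir is not supplied any other way.
%default input_dir '/tmp/pigtest/input';

-- Describe one parameter in terms of another.
%declare output_dir '$input_dir/valid';

A = LOAD '$input_dir' USING PigStorage(',')
    AS (element_name:chararray, subelement_name:chararray, service_id:chararray);
B = FILTER A BY service_id IS NOT NULL;
STORE B INTO '$output_dir';

-- Run with the default input directory:
--   pig myscript.pig
-- Override the default from the command line:
--   pig -param input_dir=/data/pigtest/input myscript.pig
-- Only write myscript.pig.substituted, without running the script:
--   pig -dryrun -param input_dir=/data/pigtest/input myscript.pig
```

Because a value passed with -param outranks %default, the second command should load from /data/pigtest/input, and output_dir follows it since it is declared in terms of input_dir.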