Pig Tutorial 7 – Pig Load and Store Functions with Compression and Shell, File and Utility Commands

Load/store functions determine how data goes into and comes out of Pig. Pig provides a set of built-in load/store functions, and you can write your own custom load/store functions if required.

Handling Compression

Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.

To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.

A = load 'myinput.gz';
store A into 'myoutput.gz';

To work with bzip compressed files, the input/output files need to have a .bz or .bz2 extension. Because the compression is block-oriented, bzipped files can be split across multiple maps.

A = load 'myinput.bz';
store A into 'myoutput.bz';

Note: PigStorage and TextLoader read compressed files correctly as long as they are NOT CONCATENATED FILES generated in this manner:

cat *.gz > text/concat.gz
cat *.bz > text/concat.bz
cat *.bz2 > text/concat.bz2

If you use concatenated gzip or bzip files with your Pig jobs, you will NOT see a failure but the results will be INCORRECT.

BinStorage

Loads and stores data in machine-readable format. BinStorage works with data that is represented on disk in machine-readable format and does NOT support compression. Pig uses BinStorage internally to store the temporary data that is created between multiple map/reduce jobs.

A = LOAD 'data' USING BinStorage();

STORE X INTO 'output' USING BinStorage();
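
As a minimal sketch of a BinStorage round trip, the script below stores a relation in binary form and then reads it back. The paths 'data' and 'tmp_bin' and the field names are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input: tab-delimited text with two fields.
A = LOAD 'data' AS (name:chararray, age:int);

-- Write the relation in Pig's machine-readable binary format.
STORE A INTO 'tmp_bin' USING BinStorage();

-- Read the binary data back; the field names are supplied again here.
B = LOAD 'tmp_bin' USING BinStorage() AS (name, age);
DUMP B;
```

The files under 'tmp_bin' are not human-readable, so use PigStorage instead if you need to inspect the output with a text tool.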

PigStorage

PigStorage is the default function for the LOAD and STORE operators and works with both simple and complex data types. PigStorage supports structured text files (in human-readable UTF-8 format) and also supports compression.

Load statements – PigStorage expects data to be formatted using field delimiters, either the tab character ('\t') or another specified character.

Store statements – PigStorage outputs data using field delimiters, either the tab character ('\t') or another specified character, and the line feed record delimiter ('\n').

Field Delimiters – For load and store statements the default field delimiter is the tab character ('\t'). You can use other characters as field delimiters, but separators such as ^A or Ctrl-A should be represented in Unicode (\u0001) using UTF-16 encoding (see Wikipedia ASCII, Unicode, and UTF-16).
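
To illustrate the Ctrl-A case, here is a small sketch that loads Ctrl-A-delimited data and writes it back out as comma-delimited text. The paths and field names are hypothetical.

```pig
-- Hypothetical input whose fields are separated by Ctrl-A (Unicode \u0001).
A = LOAD 'ctrl_a_data' USING PigStorage('\u0001') AS (id:chararray, value:chararray);

-- Write the same relation back out with a comma as the field delimiter.
STORE A INTO 'comma_data' USING PigStorage(',');
```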

Record Delimiters – For load statements Pig interprets the line feed ('\n'), carriage return ('\r' or CTRL-M) and combined CR + LF ('\r\n') characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.

A = LOAD '$input_path' using PigStorage(',') AS (service_id:chararray, neid_portid:chararray);

store A into '$wf_output_path' USING PigStorage(',');

PigDump

Stores data in human-readable UTF-8 format.

STORE X INTO 'output' USING PigDump();

TextLoader

Loads unstructured data in UTF-8 format. Each resulting tuple contains a single field with one line of input text. TextLoader also supports compression, although this support is currently limited. TextLoader cannot be used to store data.
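
A short sketch of typical TextLoader usage follows. The file name 'app.log', the output path and the alias names are assumptions made for the example.

```pig
-- Each line of the hypothetical file 'app.log' becomes a one-field tuple.
lines = LOAD 'app.log' USING TextLoader() AS (line:chararray);

-- Keep only the lines that contain the word ERROR.
errors = FILTER lines BY line MATCHES '.*ERROR.*';

-- TextLoader cannot store data, so the result is written with PigStorage.
STORE errors INTO 'error_lines' USING PigStorage();
```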

Shell, File and Utility Commands

Pig's Grunt shell also supports the fs shell command and file commands such as cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm and rmf.
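
The session below is a rough sketch of how these commands might be used from the Grunt shell. All paths and file names are hypothetical.

```pig
-- Invoke a Hadoop file system shell command through fs.
fs -ls /tmp

-- Show the current working directory, then create and enter a new one.
pwd
mkdir demo_dir
cd demo_dir

-- Copy a local file into the file system, then list and print it.
copyFromLocal /tmp/local_input.txt input.txt
ls
cat input.txt

-- Copy and rename within the file system, then copy back to local disk.
cp input.txt input_copy.txt
mv input_copy.txt input_renamed.txt
copyToLocal input_renamed.txt /tmp/local_output.txt

-- Remove files; rmf is the forcible variant of rm.
rm input_renamed.txt
rmf input.txt
```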

The utility command exec runs a Pig script with no interaction between the script and the Grunt shell. Other supported utility commands include help, kill, quit, run and set.
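
As a hedged illustration, the Grunt session below uses several of the utility commands. The script name myscript.pig and the job name are hypothetical, and the note on run reflects the Pig documentation's description of it as the interactive counterpart to exec.

```pig
-- Name the job and set the default number of reducers.
set job.name 'hypothetical load-store demo'
set default_parallel 10

-- exec runs the script in batch mode: aliases defined in myscript.pig
-- are not visible in the Grunt shell afterwards.
exec myscript.pig

-- run executes the script in the current Grunt context, so its aliases
-- remain available for further interactive use.
run myscript.pig

-- List the available commands, then leave the shell.
help
quit
```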