Pig Tutorial 5 – Debugging Pig with Diagnostic Operators: DESCRIBE, DUMP, EXPLAIN, and ILLUSTRATE

DESCRIBE

Use the DESCRIBE operator to review the schema of a particular alias.

Input Service Data

1,NDATEST,/shelf=0/slot/port=1
2,NDATEST,/shelf=0/slot/port=2
3,NDATEST,/shelf=0/slot/port=3
4,NDATEST,/shelf=0/slot/port=4
4,NDATEST,/shelf=0/slot/port=5
6,NDATEST,/shelf=0/slot/port=6


A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);

B = group A by service_id;

describe B;

B: {group: chararray,A: {(service_id: chararray,neid: chararray,portid: chararray)}}

dump B;

(1,{(1,NDATEST,/shelf=0/slot/port=1)})
(2,{(2,NDATEST,/shelf=0/slot/port=2)})
(3,{(3,NDATEST,/shelf=0/slot/port=3)})
(4,{(4,NDATEST,/shelf=0/slot/port=5),(4,NDATEST,/shelf=0/slot/port=4)})
(6,{(6,NDATEST,/shelf=0/slot/port=6)})

DUMP

Use the DUMP operator to run Pig Latin statements and display the results to your screen. DUMP is meant for interactive mode; statements are executed immediately and the results are not saved. You can use DUMP as a debugging device to make sure that the results you are expecting are actually generated.

DUMP alias;
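As a sketch of the distinction (reusing the service.txt data above; the output path is illustrative), DUMP only prints to the console, while STORE persists results:

```pig
-- DUMP triggers execution and prints tuples to the console; nothing is saved.
A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
DUMP A;

-- STORE also triggers execution, but writes the results to the filesystem instead.
STORE A INTO 'service_out' USING PigStorage(',');
```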

EXPLAIN

Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relation.

If no script is given, the logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations, such as applying filters early, are also applied. The physical plan shows how the logical operators are translated to backend-specific physical operators; some backend optimizations are also applied. The map reduce plan shows how the physical operators are grouped into map reduce jobs.

If a script without an alias is specified, EXPLAIN outputs the entire execution graph (logical, physical, and map reduce). If a script with an alias is specified, it outputs the plan for the given alias only.


A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);

B = group A by service_id;

EXPLAIN B;

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
B: (Name: LOStore Schema: group#217:chararray,A#225:bag{#230:tuple(service_id#217:chararray,neid#218:chararray,portid#219:chararray)})
|
|---B: (Name: LOCogroup Schema: group#217:chararray,A#225:bag{#230:tuple(service_id#217:chararray,neid#218:chararray,portid#219:chararray)})
| |
| service_id:(Name: Project Type: chararray Uid: 217 Input: 0 Column: 0)
|
|---A: (Name: LOForEach Schema: service_id#217:chararray,neid#218:chararray,portid#219:chararray)
| |
| (Name: LOGenerate[false,false,false] Schema: service_id#217:chararray,neid#218:chararray,portid#219:chararray)ColumnPrune:InputUids=[217, 218, 219]ColumnPrune:OutputUids=[217, 218, 219]
| | |
| | (Name: Cast Type: chararray Uid: 217)
| | |
| | |---service_id:(Name: Project Type: bytearray Uid: 217 Input: 0 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 218)
| | |
| | |---neid:(Name: Project Type: bytearray Uid: 218 Input: 1 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 219)
| | |
| | |---portid:(Name: Project Type: bytearray Uid: 219 Input: 2 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: service_id#217:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: neid#218:bytearray)
| |
| |---(Name: LOInnerLoad[2] Schema: portid#219:bytearray)
|
|---A: (Name: LOLoad Schema: service_id#217:bytearray,neid#218:bytearray,portid#219:bytearray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-59
|
|---B: Package[tuple]{chararray} - scope-56
|
|---B: Global Rearrange[tuple] - scope-55
|
|---B: Local Rearrange[tuple]{chararray}(false) - scope-57
| |
| Project[chararray][0] - scope-58
|
|---A: New For Each(false,false,false)[bag] - scope-54
| |
| Cast[chararray] - scope-46
| |
| |---Project[bytearray][0] - scope-45
| |
| Cast[chararray] - scope-49
| |
| |---Project[bytearray][1] - scope-48
| |
| Cast[chararray] - scope-52
| |
| |---Project[bytearray][2] - scope-51
|
|---A: Load(service.txt:PigStorage(',')) - scope-44

 

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-60
Map Plan
B: Local Rearrange[tuple]{chararray}(false) - scope-57
| |
| Project[chararray][0] - scope-58
|
|---A: New For Each(false,false,false)[bag] - scope-54
| |
| Cast[chararray] - scope-46
| |
| |---Project[bytearray][0] - scope-45
| |
| Cast[chararray] - scope-49
| |
| |---Project[bytearray][1] - scope-48
| |
| Cast[chararray] - scope-52
| |
| |---Project[bytearray][2] - scope-51
|
|---A: Load(service.txt:PigStorage(',')) - scope-44--------
Reduce Plan
B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-59
|
|---B: Package[tuple]{chararray} - scope-56--------
Global sort: false
----------------

ILLUSTRATE

Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. The data load statement must include a schema. The Pig Latin statement used to form the relation that is used with the ILLUSTRATE command cannot include the map data type, the LIMIT and SPLIT operators, or nested FOREACH statements.

ILLUSTRATE uses the ExampleGenerator algorithm, which can automatically select an appropriate and concise set of example data. It does a better job than random sampling would: for example, random sampling suffers from the drawback that selective operations, such as filters or joins, can eliminate all the sampled data, leaving you with empty results that do not help with debugging. With the ILLUSTRATE operator you can test your programs on small datasets and get faster turnaround times. The ExampleGenerator algorithm uses Pig's local mode (rather than Hadoop mode), which means that illustrative example data is generated in near real-time.

Input Service Data

1,NDATEST,/shelf=0/slot/port=1
2,NDATEST,/shelf=0/slot/port=2
3,NDATEST,/shelf=0/slot/port=3
4,NDATEST,/shelf=0/slot/port=4
4,NDATEST,/shelf=0/slot/port=5
6,NDATEST,/shelf=0/slot/port=6


A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);

B = group A by service_id;

illustrate B;

--------------------------------------------------------------------------------
| A | service_id:chararray | neid:chararray | portid:chararray |
--------------------------------------------------------------------------------
| | 4 | NDATEST | /shelf=0/slot/port=4 |
| | 4 | NDATEST | /shelf=0/slot/port=5 |
--------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
| B | group:chararray | A:bag{:tuple(service_id:chararray,neid:chararray,portid:chararray)} |
-------------------------------------------------------------------------------------------------------------------------
| | 4 | {(4, NDATEST, /shelf=0/slot/port=4), (4, NDATEST, /shelf=0/slot/port=5)} |
-------------------------------------------------------------------------------------------------------------------------

DEFINE

Assigns an alias to a UDF function or a streaming command.

DEFINE alias {function | ['command' [input] [output] [ship] [cache]] };

Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming command.

Use DEFINE to specify a UDF when the function has a long package name that you don't want to repeat throughout a script, especially if you call the function several times. The constructor for the function takes string parameters; if you need different constructor parameters for different calls, you will need to create multiple DEFINEs, one for each parameter set.

Use DEFINE to specify a streaming command when the command specification is complex, or when it requires additional parameters (input, output, and so on).
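A minimal sketch of both uses (the UDF package name, its constructor parameter, and the streaming script here are hypothetical, not part of this tutorial):

```pig
-- Hypothetical UDF with a long package name; the alias shortens repeated calls.
DEFINE ExtractPort com.example.pig.udf.ExtractPort('/');

-- Hypothetical streaming command, shipping the script to the cluster nodes.
DEFINE cleanup `perl cleanup.pl` SHIP('cleanup.pl');

A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
B = FOREACH A GENERATE service_id, ExtractPort(portid);
C = STREAM A THROUGH cleanup;
```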

REGISTER

Registers a JAR file so that the UDFs in the file can be used. You can also register additional JARs (to use with your Pig script) via the command line, using the -Dpig.additional.jars option.

register mnr-multistorage-store-v1.0.0.jar;
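Equivalently, the same JAR can be supplied at launch time instead of inside the script (the script name below is hypothetical):

```
# Register the JAR on the command line rather than in the script.
pig -Dpig.additional.jars=mnr-multistorage-store-v1.0.0.jar myscript.pig
```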