AVG
Use the AVG function to compute the average of the numeric values in a single-column bag. AVG requires a preceding GROUP ALL statement for global averages and a GROUP BY statement for group averages.
Input Data
1,/shelf=0/slot/port=1,10
2,/shelf=0/slot/port=2,20
3,/shelf=0/slot/port=3,30
4,/shelf=0/slot/port=4,40
5,/shelf=0/slot/port=2,50
6,/shelf=0/slot/port=3,60
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
C = FOREACH B GENERATE A.neid, AVG(A.throughput);
DUMP C;
({(/shelf=0/slot/port=1)},10.0)
({(/shelf=0/slot/port=2),(/shelf=0/slot/port=2)},35.0)
({(/shelf=0/slot/port=3),(/shelf=0/slot/port=3)},45.0)
({(/shelf=0/slot/port=4)},40.0)
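The grouped example above covers per-group averages. For the global average mentioned in the description, a minimal sketch using GROUP ALL might look like the following. It assumes the same 'service_perf.txt' input; the aliases A, B and C are illustrative only.

```pig
-- Load the same performance data used in the AVG example
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
-- GROUP ALL puts every row into a single group, so AVG runs over the whole relation
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.throughput);
DUMP C;
```

For the six input rows shown, this should produce a single tuple holding the overall average, (10+20+30+40+50+60)/6 = 35.0.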
CONCAT
Use the CONCAT function to concatenate two elements. The data type of the two elements must be the same, either chararray or bytearray.
Input Data
1,NDATEST,/shelf=0/slot/port=1
2,NDATEST,/shelf=0/slot/port=2
3,NDATEST,/shelf=0/slot/port=3
4,NDATEST,/shelf=0/slot/port=4
4,NDATEST,/shelf=0/slot/port=5
6,NDATEST,/shelf=0/slot/port=6
A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
-- We want to concatenate neid and portid here
B = FOREACH A GENERATE $0, CONCAT($1,$2);
DUMP B;
(1,NDATEST/shelf=0/slot/port=1)
(2,NDATEST/shelf=0/slot/port=2)
(3,NDATEST/shelf=0/slot/port=3)
(4,NDATEST/shelf=0/slot/port=4)
(4,NDATEST/shelf=0/slot/port=5)
(6,NDATEST/shelf=0/slot/port=6)
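Since CONCAT is described here as joining two elements, one way to put a separator between the fields is to nest two calls. This is only a sketch on the same 'service.txt' input; the ':' separator and the output name full_id are assumptions chosen for illustration.

```pig
A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
-- The inner CONCAT prefixes portid with ':'; the outer CONCAT prepends neid.
-- All three operands are chararray, so the types match.
B = FOREACH A GENERATE service_id, CONCAT(neid, CONCAT(':', portid)) AS full_id;
DUMP B;
```

With the input above, the first row should come out as (1,NDATEST:/shelf=0/slot/port=1).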
COUNT and COUNT_STAR
Use the COUNT function to compute the number of elements in a bag. COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts. COUNT ignores NULL values: a tuple in the bag is not counted if its first field is NULL. To include such tuples in the count, use COUNT_STAR. You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not work.
Input Data
1,NDATEST,/shelf=0/slot/port=1
2,NDATEST,/shelf=0/slot/port=2
3,NDATEST,/shelf=0/slot/port=3
4,NDATEST,/shelf=0/slot/port=4
4,NDATEST,/shelf=0/slot/port=5
6,NDATEST,/shelf=0/slot/port=6
A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
B = GROUP A BY service_id;
C = FOREACH B GENERATE COUNT(A);
DUMP C;
(1)
(1)
(1)
(2)
(1)
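To see COUNT and COUNT_STAR side by side, a global count over the same file could be sketched as follows. The aliases are illustrative, and GROUP ALL is used so that both functions run over every row.

```pig
A = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
-- One group containing all rows
B = GROUP A ALL;
-- COUNT skips tuples whose first field is NULL; COUNT_STAR counts every tuple
C = FOREACH B GENERATE COUNT(A), COUNT_STAR(A);
DUMP C;
```

The sample input has no NULL values, so both columns should be 6. The two results would differ only if some rows had an empty service_id.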
DIFF
The DIFF function compares two fields in a tuple. When the fields are bags, as in the example below, the tuples that appear in one bag but not in the other are returned in a bag. If the two bags match, an empty bag is returned.
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int,t2:int)}, B2:bag{T2:tuple(f1:int,f2:int)});
DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
X = FOREACH A GENERATE DIFF(B1,B2);
DUMP X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
IsEmpty
The IsEmpty function checks if a bag or map is empty (has no data). The function can be used to filter data.
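As an illustration of filtering with IsEmpty, the sketch below finds ports listed in 'service.txt' that have no rows in 'service_perf.txt'. It assumes the two sample files used elsewhere in this post, and that portid in the first file corresponds to neid in the second. The aliases S, P, G, M and R are hypothetical.

```pig
S = LOAD 'service.txt' USING PigStorage(',') AS (service_id:chararray, neid:chararray, portid:chararray);
P = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
-- COGROUP yields one tuple per key with a bag of matching rows from each input
G = COGROUP S BY portid, P BY neid;
-- Keep only keys whose bag of performance rows is empty
M = FILTER G BY IsEmpty(P);
R = FOREACH M GENERATE group;
DUMP R;
```

With the sample data, only /shelf=0/slot/port=5 and /shelf=0/slot/port=6 lack performance rows, so those two keys should be the ones that remain.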
MAX
Computes the maximum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums.
Input Data
1,/shelf=0/slot/port=1,10
2,/shelf=0/slot/port=2,20
3,/shelf=0/slot/port=3,30
4,/shelf=0/slot/port=4,40
5,/shelf=0/slot/port=2,50
6,/shelf=0/slot/port=3,60
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
C = FOREACH B GENERATE group, MAX(A.throughput);
MIN
Computes the minimum of the numeric values or chararrays in a single-column bag. MIN requires a preceding GROUP ALL statement for global minimums and a GROUP BY statement for group minimums.
Input Data
1,/shelf=0/slot/port=1,10
2,/shelf=0/slot/port=2,20
3,/shelf=0/slot/port=3,30
4,/shelf=0/slot/port=4,40
5,/shelf=0/slot/port=2,50
6,/shelf=0/slot/port=3,60
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
C = FOREACH B GENERATE group, MIN(A.throughput);
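The MAX and MIN examples stop before dumping a result. A combined sketch on the same input, with illustrative output names min_tp and max_tp, might look like this:

```pig
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
-- One row per neid with its smallest and largest throughput
C = FOREACH B GENERATE group, MIN(A.throughput) AS min_tp, MAX(A.throughput) AS max_tp;
DUMP C;
```

For the sample rows, port=2 should yield a minimum of 20.0 and a maximum of 50.0, and port=3 should yield 30.0 and 60.0. Ports 1 and 4 each have a single row, so their minimum and maximum coincide.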
SIZE
Use the SIZE function to compute the number of elements based on the data type. SIZE includes NULL values in the size computation. SIZE is not algebraic. The return value depends on the data type: for int, long, float and double it returns 1. For a chararray it returns the number of characters, and for a bytearray the number of bytes. For a tuple, bag and map it returns the number of fields in the tuple, the number of tuples in the bag and the number of key/value pairs in the map, respectively.
Input Data
1,/shelf=0/slot/port=1,10
2,/shelf=0/slot/port=2,20
3,/shelf=0/slot/port=3,30
4,/shelf=0/slot/port=4,40
5,/shelf=0/slot/port=2,50
6,/shelf=0/slot/port=3,60
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = FOREACH A GENERATE SIZE(neid);
DUMP B;
(20)
(20)
(20)
(20)
(20)
(20)
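The example above applies SIZE to a chararray. To illustrate the bag case from the description, here is a sketch that groups the same data and measures each group's bag. The aliases are illustrative only.

```pig
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
-- In B, the field named A is a bag, so SIZE(A) is the number of tuples in it
C = FOREACH B GENERATE group, SIZE(A);
DUMP C;
```

With the sample input, ports 2 and 3 each appear twice and ports 1 and 4 once, so the sizes should be 2, 2, 1 and 1 for those keys.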
SUM
Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.
Input Data
1,/shelf=0/slot/port=1,10
2,/shelf=0/slot/port=2,20
3,/shelf=0/slot/port=3,30
4,/shelf=0/slot/port=4,40
5,/shelf=0/slot/port=2,50
6,/shelf=0/slot/port=3,60
A = LOAD 'service_perf.txt' USING PigStorage(',') AS (rownum:chararray, neid:chararray, throughput:float);
B = GROUP A BY neid;
C = FOREACH B GENERATE group, SUM(A.throughput);
DUMP C;
(/shelf=0/slot/port=1,10.0)
(/shelf=0/slot/port=2,70.0)
(/shelf=0/slot/port=3,90.0)
(/shelf=0/slot/port=4,40.0)
TOKENIZE
Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parentheses (( and )), and star (*).
A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)
X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
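A common follow-up to TOKENIZE is to flatten the bag so that each word becomes its own row, which can then be grouped and counted. The sketch below assumes the same 'data' file. FLATTEN is not covered elsewhere in this post, and the aliases W, G and C and the field name word are illustrative.

```pig
A = LOAD 'data' AS (f1:chararray);
-- FLATTEN un-nests the bag returned by TOKENIZE: one output row per word
W = FOREACH A GENERATE FLATTEN(TOKENIZE(f1)) AS word;
G = GROUP W BY word;
-- Count how many times each word occurs across all lines
C = FOREACH G GENERATE group, COUNT(W);
DUMP C;
```

For the three sample lines, words such as "Here" and "is" should each be counted 3 times, while "first", "second" and "third" should each be counted once.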