step 1: create a dataset
cat > dataset1
My Name is Manohar
I Love Teaching
creating and loading data into a bag
------------------------------------
bagname =load 'filename' as (line:chararray);
Examine data
------------
dump bagname;
Running pig in local
--------------------
pig -x local
grunt> dset1 =load 'dataset1'
>> as (line:chararray);
grunt>dump dset1;
creating a bag with multiple cols
---------------------------------
cat > data.txt
1\t2\t3
4\t5\t6
7\t8\t9
Note: The default delimiter in pig is tab(\t)
grunt> data = load 'data.txt' as (a:int,b:int,c:int);
grunt>dump data;
cat > file1
1,2,3
4,5,6
7,8,8
Load into pig
-------------
grunt> data = load 'file1' as (a:int,b:int,c:int);
grunt>dump data;
output
------
(,,)
(,,)
(,,)
Here the loader expected the default tab(\t) delimiter, but the file is comma(,) separated, so the fields cannot be parsed and every column loads as null.
File with a customized delimiter
--------------------------------
we can load a file into a bag with a customized delimiter by using
PigStorage()
syntax
------
varname = load 'filename'
using PigStorage('symbol')
as (col1:type,col2:type,....);
example
--------
data =load 'file1'
using PigStorage(',')
as(a:int,b:int,c:int);
describe data;
dump data;
Storing data from a bag/tuple in pig into lfs/hdfs
-----------------------------------------------------
store
-----
we use the store command to persist the data from a bag or a tuple in pig to the lfs/hdfs, depending on the mode we are running in.
syntax
------
store bagname into 'dirname'
using PigStorage('symbol');
example
-------
grunt> store data into 'numbers'
using PigStorage(',');
Now we find a file created under the directory 'numbers' in either lfs or hdfs, depending on the mode we are running in
In lfs
------
/home/cloudera/numbers/part-m-00000
In hdfs
-------
/user/cloudera/numbers/part-m-00000
$ pig
Running in hdfs mode
Here pig has access to hdfs
[cloudera@quickstart ~]$ cat > emp.txt
1,mano,1000.00,m,11
2,venkat,2000.00,m,12
3,subbu,3000.00,m,13
4,pra,4000.00,m,14
5,bond,5000.00,m,15
^C
load file into hdfs
-------------------
[cloudera@quickstart ~]$ hadoop fs -put /home/cloudera/emp.txt /user/cloudera/
create a bag and load data
--------------------------
grunt> emp =load 'emp.txt'
using PigStorage(',')
as (eno:int,name:chararray,sal:double,gender:chararray,dno:int);
grunt>describe emp;
grunt>dump emp;
describe
--------
It is used to verify the schema of a bag.
syntax
------
describe bagname;
ex:describe emp;
illustrate
----------
It is used to display the schema and a sample row of a bag
example
-------
illustrate bagname;
illustrate emp;
Filtering data
--------------
Fetch eno,name,sal from emp
we can filter columns (i.e., project them) by using foreach
syntax
------
bagname = foreach bagname1 generate col1,col2,....;
-- select eno,name,sal from emp
grunt> employee =foreach emp generate eno,name,sal;
grunt>describe employee;
grunt>dump employee;
(1,mano,1000.00)
(2,............)
...
...
Adding additional fields to a bag
---------------------------------
varname =foreach bagname generate *,
val1 as col,val2 as col,....;
example
-------
grunt> emp1 =foreach emp generate*
>> , sal*0.1 as tax,sal*0.2 as hra;
Note:
we cannot reuse the alias columns in the same query; it leads to an error.
grunt> emp1 =foreach emp generate*,
>> sal*0.1 as tax,sal*0.2 as hra;
grunt> emp2 =foreach emp1 generate*,
>>sal-tax+hra as net;
Changing datatypes
-------------------
we can change the datatype of a bag's column by applying a cast to the column
syntax
------
(type)colname
grunt> emp3 =foreach emp2 generate eno,name,sal,(int)tax,(int)hra,(int)net;
grunt> describe emp3;
emp3: {eno: int,name: chararray,sal: double,tax: int,hra: int,net: int}
renaming columns
----------------
bagname =foreach bagname1 generate col1 as newname1,......;
grunt> emp =foreach emp generate eno as ecode,name,sal,gender,dno;
conditional transformations
---------------------------
greatest of 2 numbers
greatest of 3 numbers
grunt> big =foreach file1 generate*,
>> (a>b?(a>c?a:c):(b>c?b:c)) as big;
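The nested bincond above reads like this plain-Python sketch (an analogy with made-up rows, not Pig code):

```python
# Python analogue of Pig's nested bincond: (a>b ? (a>c?a:c) : (b>c?b:c))
def greatest(a, b, c):
    # if a beats b, compare a with c; otherwise compare b with c
    return (a if a > c else c) if a > b else (b if b > c else c)

rows = [(1, 2, 3), (6, 5, 4), (7, 9, 8)]
big = [(a, b, c, greatest(a, b, c)) for (a, b, c) in rows]
print(big)  # each tuple gains a 'big' column holding the greatest value
```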
copying relation to relation
----------------------------
grunt>bagname =foreach bagname1 generate*;
grunt> e =foreach emp generate*;
filter :-
to create subsets, based on given criteria.
(row filter, equivalent to the 'where' clause of a sql select statement)
syntax
------
bagname =filter bagname1 by (col1==value and/or ...);
grunt> erec = filter emp by (sex=='m');
grunt> dump erec
grunt> erec = filter emp by (sex=='m' and dno==12);
grunt> dump erec
Limit :-
to fetch top n number of tuples.
syntax
------
varname =limit bagname n;
grunt> top3 = limit emp 3;
grunt> dump top3;
sample:-
to create subsets, in random sample style.
syntax
------
varname =sample bagname percentage_num;
esample =sample emp 1;
displays all records from bag emp
esample =sample emp 0.5;
displays roughly 50% of the records (sampling is random, so the exact count varies).
Aggregated functions in PigLatin.
SUM(), AVG(), MAX(), MIN(), COUNT()
grunt> emp2 = foreach emp1 generate SUM(sal) as tot;
grunt> dump emp2;
The above statement will fail during execution,
because aggregate functions are applied only on inner bags.
When you group data, inner bags are produced.
group: -
to get an inner bag for each data group,
based on the grouping field.
syntax
------
varname =group bagname by fieldname;
gbag =group emp1 by gender;
(f,{(9,,3000.0,f,12,600.0,120.0),(101,janaki,10000.0,f,12,2000.0,400.0),(104,lokitha,8000.0,f,12,1600.0,320.0),(102,Sri,25000.0,f,11,5000.0,1000.0)})
(m,{(1,,5000.0,m,12,1000.0,200.0),(105,naga,6000.0,m,13,1200.0,240.0),(103,mohan,13000.0,m,13,2600.0,520.0),(101,vino,26000.0,m,11,5200.0,1040.0)})
grunt> describe emp1;
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}
grunt>erec =foreach emp generate name,sal,gender,dno;
grunt>egrp =group erec by gender;
grunt> describe egrp;
egrp: {group: chararray,erec: {(name: chararray,sal: double,gender: chararray,dno: int)}}
grunt>tot =foreach egrp generate SUM(erec.sal) as total;
grunt>dump tot;
grunt> -- select sex, sum(sal) from emp group by sex
grunt> e = foreach emp generate sex, sal;
grunt> bySex = group e by sex;
grunt> describe bySex
bySex: {group: chararray,e: {sex: chararray,sal: int}}
grunt> dump bySex
grunt> res = foreach bySex generate
>> group as sex, SUM(e.sal) as tot;
grunt> describe res
res: {sex: chararray,tot: long}
grunt> store res into 'myhdfs1';
grunt> cat myhdfs1/part-r-00000
f 103000
m 125000
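The group-then-aggregate pattern above can be sketched in plain Python (an analogy, not Pig; the salary rows are invented):

```python
# group e by sex -> {group key: inner bag}; SUM then runs over each inner bag
rows = [('m', 26000), ('f', 25000), ('m', 13000), ('f', 8000)]
groups = {}
for sex, sal in rows:
    groups.setdefault(sex, []).append(sal)   # build the inner bags
totals = {sex: sum(bag) for sex, bag in groups.items()}
print(totals)  # {'m': 39000, 'f': 33000}
```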
____________________________________
grunt> describe emp
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}
grunt> ee = foreach emp generate dno, sal;
grunt> byDno = group ee by dno;
grunt> res = foreach byDno generate
>> group as dno, SUM(ee.sal) as tot;
grunt> store res into 'pdemo/res1';
grunt> ls pdemo/res1
hdfs://localhost/user/training/pdemo/res1/_logs <dir>
hdfs://localhost/user/training/pdemo/res1/part-r-00000<r 1> 28
grunt> cat pdemo/res1/part-r-00000
11 51000
12 48000
13 129000
grunt>
grunt> -- single grouping and multiple aggregations
[cloudera@quickstart ~]$ cat emp.txt
1,aaaa,1000.00,f,11
2,bbbb,2000.00,m,12
3,cccc,3000.00,m,13
4,dddd,4000.00,f,14
5,eeee,5000.00,f,11
6,ffff,2000.00,m,15
7,gggg,3000.00,f,13
8,hhhh,4000.00,m,16
grunt> emp =load 'emp.txt'
>> using PigStorage(',')
>> as (eno:int,name:chararray,sal:double,gender:chararray,dno:int);
grunt> empgrp =group emp by gender;
grunt> describe empgrp;
empgrp: {group: chararray,emp: {(eno: int,name: chararray,sal: double,gender: chararray,dno: int)}}
grunt> empaggr =foreach empgrp generate
>>group as gender,
>> SUM(emp.sal) as sal
>> ,AVG(emp.sal) as avgsal,
>> MAX(emp.sal) as maxsal,
>> MIN(emp.sal) as minsal,
>> COUNT(emp.eno) as cnt;
grouping using multiple cols
---------------------------
syntax
------
varname =group bagname by (col1,col2,....);
grunt> mulgrp =group emp by (gender,dno);
grunt> describe mulgrp;
mulgrp: {group: (gender: chararray,dno: int),emp: {(eno: int,name: chararray,sal: double,gender: chararray,dno: int)}}
grunt>dump mulgrp;
((f,11),{(5,eeee,5000.0,f,11),(1,aaaa,1000.0,f,11)})
((f,13),{(7,gggg,3000.0,f,13)})
((f,14),{(4,dddd,4000.0,f,14)})
((m,12),{(2,bbbb,2000.0,m,12)})
((m,13),{(3,cccc,3000.0,m,13)})
((m,15),{(6,ffff,2000.0,m,15)})
((m,16),{(8,hhhh,4000.0,m,16)})
grunt> res1 = foreach bySex generate
>> group as gender,
>> SUM(e.sal) as tot,
>> AVG(e.sal) as avg,
>> MAX(e.sal) as max,
>> MIN(e.sal) as min,
>> COUNT(e) as cnt;
grunt> dump res1
(f,103000,20600.0,50000,8000,5)
(m,125000,25000.0,50000,6000,5)
_________________________________
grunt> -- multi grouping..
grunt> e = foreach emp generate dno, sex, sal;
grunt> grp = group e by dno, sex;
2016-06-09 19:28:35,893 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Unrecognized alias sex
Details at logfile: /home/training/pig_1465437040017.log
grunt>
the above statement is invalid:
pig does not allow a bare comma-separated list of group fields.
solution:
make the multiple fields a tuple, and group by the tuple.
grunt> grp = group e by (dno, sex);
grunt> describe grp
grp: {group: (dno: int,sex: chararray),e: {dno: int,sex: chararray,sal: int}}
grunt> res = foreach grp generate
>> group.dno, group.sex, SUM(e.sal) as tot;
grunt> dump res
(11,f,25000)
(11,m,26000)
(12,f,18000)
(12,m,30000)
(13,f,60000)
(13,m,69000)
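The same sketch with a tuple as the group key mirrors group e by (dno, sex) (illustrative Python with invented rows):

```python
# group e by (dno, sex) -> the tuple (dno, sex) is the group key
rows = [(11, 'f', 12000), (11, 'f', 13000), (11, 'm', 26000), (12, 'm', 30000)]
groups = {}
for dno, sex, sal in rows:
    groups.setdefault((dno, sex), []).append(sal)
res = sorted((dno, sex, sum(bag)) for (dno, sex), bag in groups.items())
print(res)  # [(11, 'f', 25000), (11, 'm', 26000), (12, 'm', 30000)]
```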
_________________________________
grunt> -- select sum(sal) from emp;
grunt> -- old one
grunt> e = foreach emp generate 'ibm' as org, sal;
grunt> dump e;
(ibm,26000)
(ibm,25000)
(ibm,13000)
(ibm,8000)
(ibm,6000)
(ibm,10000)
(ibm,30000)
(ibm,50000)
(ibm,10000)
(ibm,50000)
grunt> grp = group e by org;
grunt> res = foreach grp generate
>> SUM(e.sal) as tot;
grunt> dump res
(228000)
_____________________________________
2nd one --- whole-column aggregation, using group ... all.
grunt> describe emp;
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}
grunt> e = foreach emp generate sal;
grunt> grp = group e all;
grunt> dump grp
(all,{(26000),(25000),(13000),(8000),(6000),(10000),(30000),(50000),(10000),(50000)})
grunt> res = foreach grp generate
>> SUM(e.sal) as tot,
>> AVG(e.sal) as avg, MAX(e.sal) as max,
>> MIN(e.sal) as min, COUNT(e) as cnt;
grunt> dump res
(228000,22800.0,50000,6000,10)
Processing xml data
-------------------
cat > course.txt
<course><name>hadoop</name><price>6000</price><dur>60</dur></course>
<course><name>java</name><price>1500</price><dur>90</dur></course>
hive>create table coursedata(line string);
hive>load data local inpath 'course.txt' into table coursedata;
hive>select *from coursedata;
converting semi-structured data into structured
---------------------------------------------
hive> create table course_details(name string,price double,dur int);
loading data from cousedata by using table to table copy
--------------------------------------------------------
hive> insert overwrite table course_details
> select xpath_string(line,'course/name'),
> xpath_string(line,'course/price'),
> xpath_string(line,'course/dur')
> from coursedata;
hive>select *from course_details;
xpath_string()
--------------
xml is used to represent data in hierarchical form.
we navigate from root-->child-->subchild etc.; to perform such navigation we use the method xpath_string() to read data from a tag of an xml record.
syntax
------
xpath_string(line,'roottag/childtag/...');
selecting each property from an xml record
------------------------------------------
hive> select xpath_string(line,'course/name'),
> xpath_string(line,'course/price'),
> xpath_string(line,'course/dur')
> from coursedata;
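What xpath_string() extracts here can be mimicked in Python with the standard xml.etree module (a sketch of the navigation only, not Hive's implementation):

```python
import xml.etree.ElementTree as ET

lines = [
    '<course><name>hadoop</name><price>6000</price><dur>60</dur></course>',
    '<course><name>java</name><price>1500</price><dur>90</dur></course>',
]
rows = []
for line in lines:
    course = ET.fromstring(line)            # root tag: <course>
    rows.append((course.findtext('name'),   # like xpath_string(line,'course/name')
                 float(course.findtext('price')),
                 int(course.findtext('dur'))))
print(rows)  # [('hadoop', 6000.0, 60), ('java', 1500.0, 90)]
```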
cogroup:-
_________
To get separate inner bags (data groups) for each dataset,
so that we can perform separate aggregations on each dataset.
syntax
------
varname =cogroup bag1 by col1,bag2 by col1;
data set1
_______
(a,10)
(b,20)
(a,30)
(b,40)
_________
data set2
_________
(a,30)
(c,30)
(c,40)
(a,20)
_________
dset1 =load 'ds1' using PigStorage(',')
as (pid:chararray,price:double);
dset2 =load 'ds2' using PigStorage(',')
as (pid:chararray,price:double);
grunt>cg =cogroup dset1 by pid,dset2 by pid;
grunt> sp =foreach cg generate
>> group as id,SUM(dset1.price) as tot1,
>>SUM(dset2.price) as tot2;
cat > sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
cat > sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
[training@localhost ~]$ hadoop fs -copyFromLocal sales1 /user/cloudera/
grunt> s1 = load 'pdemo/sales1'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> s2 = load 'pdemo/sales2'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> cg = cogroup s1 by pid, s2 by pid;
grunt> describe cg
cg: {group: chararray,s1: {pid: chararray,price: int},s2: {pid: chararray,price: int}}
grunt> dump cg
(p1,{(p1,2000),(p1,4000),(p1,5000)},{(p1,6000),(p1,1000),(p1,5000),(p1,6000)})
(p2,{(p2,3000),(p2,4000)},{(p2,8000),(p2,6000),(p2,8000)})
(p3,{(p3,5000)},{})
grunt> sg =cogroup s1 by pid,s2 by pid;
grunt> res = foreach sg generate
>> group as pid, SUM(s1.price) as tot1,
>> SUM(s2.price) as tot2;
grunt> dump res
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,)
grunt> res2 =foreach res generate
>> pid,(tot1 is null?0:tot1) as tot1,
>> (tot2 is null?0:tot2) as tot2;
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,0)
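The cogroup-plus-per-dataset-SUM above, with nulls coerced to 0, behaves like this plain-Python sketch (an analogy, not Pig, using the sales1/sales2 rows):

```python
s1 = [('p1', 2000), ('p2', 3000), ('p1', 4000), ('p1', 5000), ('p2', 4000), ('p3', 5000)]
s2 = [('p1', 6000), ('p2', 8000), ('p1', 1000), ('p1', 5000), ('p1', 6000), ('p2', 6000), ('p2', 8000)]
keys = sorted({k for k, _ in s1} | {k for k, _ in s2})
res = []
for k in keys:
    bag1 = [v for p, v in s1 if p == k]   # inner bag from s1
    bag2 = [v for p, v in s2 if p == k]   # inner bag from s2 (may be empty)
    res.append((k, sum(bag1) if bag1 else 0, sum(bag2) if bag2 else 0))
print(res)  # [('p1', 11000, 18000), ('p2', 7000, 22000), ('p3', 5000, 0)]
```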
union:-
______
union is used to combine 2 or more datasets.
syntax
------
varname =union bag1,bag2,....;
Note: The number of columns and their order must be the same.
ds1
________
name, sal
__________
aaa,10000
bbbb,30000
cccc,40000
___________
ds2
____________
name,sal
___________
xxx,30000
yyy,40000
zzz,60000
______________
grunt> ds1 =load 'dataset1'
>> using PigStorage(',')
>> as (pid:chararray,price:double);
grunt> ds2 =load 'dataset2'
>> using PigStorage(',')
>> as (pid:chararray,price:double);
e = union ds1, ds2;
if files have different schemas
-------------------------------
convert the format of all files to the same format.
ds3
_________
sal, name
___________
10000,abc
20000,def
____________
ds4 = foreach ds3 generate name, sal;
ds = union ds1, ds2, ds4;
___________________________
if files have different numbers of fields:
$ cat > e1
mano,30000,m
manoja,40000,f
vidhatri,50000,m
cat > e2
vijay,60000,11
lavanya,70000,12
e1 =load 'e1'
using PigStorage(',')
as (name:chararray,sal:double,gender:chararray);
grunt> e2 =foreach e1 generate*,0 as dno;
e3 =load 'e2'
using PigStorage(',')
as (name:chararray,sal:double,dno:int);
grunt> e4 =foreach e3 generate name,sal,'*' as gender,dno;
e2_e4 =union e2,e4;
grunt> e1 = load 'pdemo/e1' using PigStorage(',')
>> as (name:chararray, sal:int, sex:chararray);
grunt> e2 = load 'pdemo/e2' using PigStorage(',')
>> as (name:chararray, sal:int, dno:int);
grunt> ee1 = foreach e1 generate *,0 as dno;
grunt> ee2 = foreach e2 generate name,
>> sal,'*' as sex, dno;
grunt> ee = union ee1, ee2;
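The padding trick above (generate *, 0 as dno and '*' as sex) amounts to this in plain Python (illustrative only, using the e1/e2 rows from the files):

```python
e1 = [('mano', 30000, 'm'), ('manoja', 40000, 'f'), ('vidhatri', 50000, 'm')]
e2 = [('vijay', 60000, 11), ('lavanya', 70000, 12)]
ee1 = [(name, sal, sex, 0) for name, sal, sex in e1]     # add a dummy dno column
ee2 = [(name, sal, '*', dno) for name, sal, dno in e2]   # add a dummy sex column
ee = ee1 + ee2                                           # schemas now line up, so union works
print(ee)
```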
distinct:- to eliminate duplicates
syntax
------
varname =distinct bagname;
ds
id,name
(101,a)
(102,b)
(103,c)
(101,a)
(102,b)
(101,a)
(102,b)
(102,x)
_____________
ds2 = distinct ds;
(101,a)
(102,b)
(103,c)
(102,x)
order:
______
sorts the records of a bag.
syntax
------
varname =order bagname by colname;
e = order emp by name;
e2 = order emp by sal desc;
e3 = order emp by sal desc,
dno , sex desc;
--- there is no limit on the number of sort fields.
Join
----
A join is used to fetch records from 2 or more bags/relations.
we have 2 types of joins
i)inner
returns matching records between 2 bags
grunt> varname =join bagname1 by colname,bagname2 by colname;
[cloudera@quickstart ~]$ cat emp.txt
101,subbu,26000,m,11
102,mano,25000,f,11
103,prathush,13000,m,13
104,vidahtri,8000,f,12
105,naga,6000,m,13
101,hari,10000,f,12
201,venkat,30000,m,12
202,bbbb,50000,f,13
203,ccc,10000,f,13
204,ddddd,50000,m,13
304,xxx,70000,m,14
305,yyy,80000,f,15
[cloudera@quickstart ~]$ cat dept.txt
11,marketing,hyd
12,hr,del
13,finance,hyd
20,prod,hyd
21,admin,chennai
[cloudera@quickstart ~]$ cat emp.txt
101,vino,26000,m,11
102,Sri,25000,f,11
103,mohan,13000,m,13
104,lokitha,8000,f,12
105,naga,6000,m,13
101,janaki,10000,f,12
201,aaa,30000,m,12
202,bbbb,50000,f,13
203,ccc,10000,f,13
204,ddddd,50000,m,13
304,xxx,70000,m,14
305,yyy,80000,f,15
emp =load 'emp.txt'
using PigStorage(',')
as(eno:int,name:chararray,sal:double,gender:chararray,dno:int);
dept =load 'dept.txt'
using PigStorage(',')
as(dno:int,dept:chararray,loc:chararray);
grunt> emp_ij =join emp by dno,dept by dno;
grunt> describe emp_ij;
emp_ij: {emp::eno: int,emp::name: chararray,emp::sal: double,emp::gender: chararray,emp::dno: int,dept::dno: int,dept::dept: chararray,dept::loc: chararray}
ii)outer
left outer
All records from left bag and matching records from right bag
syntax
------
grunt> varname =join bagname1 by colname left outer,bagname2 by colname;
ex:
grunt> emp_lj =join emp by dno left outer,dept by dno;
right outer
-----------
returns matching rows from left bag and all rows from right bag.
syntax
------
grunt> varname =join bagname1 by colname right outer,bagname2 by colname;
ex:
grunt> emp_rj =join emp by dno right outer,dept by dno;
full outer
----------
returns all rows from the left and right bags
grunt> varname =join bagname1 by colname full outer,bagname2 by colname;
ex:
grunt> emp_fj =join emp by dno full outer,dept by dno;
displaying selected fields from a join
--------------------------------------
grunt> emp_data =foreach emp_ij generate
>> emp::eno as eno,
>> emp::name as name,
>> dept::dno as dno;
grunt> emp_fil =filter emp_ij by emp::sal>=20000 and emp::sal<=500000;
grunt> emp_dept_rj =join emp by dno full outer,dept by dno;
task:
getting the top-3 salary list.
(case: a salary can be shared by multiple people).
cat > samps
aaa,10000
bbb,80000
ccc,90000
ddd,90000
eeee,90000
ffff,80000
mmmmm,80000
nnnnn,70000
nnnn,70000
nn,60000
m,65000
xx,10000
hadoop fs -copyFromLocal samps /user/cloudera/
grunt> data =load 'samps'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> saldata =foreach data generate sal;
grunt> osal =order saldata by sal desc;
grunt> top3 =limit osal;
2017-04-26 20:55:10,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 19, column 16> Syntax error, unexpected symbol at or near ';'
Details at logfile: /home/cloudera/pig_1493233521379.log
grunt> top3 =limit osal 3;
grunt> e = load 'pdemo/samps'
>> using PigStorage(',')
>> as (name:chararray, sal:int);
grunt> sals = foreach e generate sal;
grunt> sals = distinct sals;
grunt> sals2 = order sals by sal desc;
grunt> top3 = limit sals2 3;
grunt> dump top3
grunt> describe top3
top3: {sal: int}
grunt> describe e
e: {name: chararray,sal: int}
grunt> res = join e by sal , top3 by sal;
grunt> describe res;
res: {e::name: chararray,e::sal: int,top3::sal: int}
grunt> res = foreach res generate e::name as name,
>> e::sal as sal;
grunt> dump res
(nnnnn,70000)
(nnnn,70000)
(bbb,80000)
(ffff,80000)
(mmmmm,80000)
(ccc,90000)
(ddd,90000)
(eeee,90000)
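The distinct / order desc / limit 3 / join-back approach above can be sketched in plain Python (an analogy, not Pig, using the samps rows):

```python
emps = [('aaa', 10000), ('bbb', 80000), ('ccc', 90000), ('ddd', 90000),
        ('eeee', 90000), ('ffff', 80000), ('mmmmm', 80000), ('nnnnn', 70000),
        ('nnnn', 70000), ('nn', 60000), ('m', 65000), ('xx', 10000)]
top3 = sorted({sal for _, sal in emps}, reverse=True)[:3]  # distinct + order desc + limit 3
res = [(name, sal) for name, sal in emps if sal in top3]   # join back on sal keeps ties
print(top3)  # [90000, 80000, 70000]
```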
_____________________________________
Cross:
gives the cartesian product.
used for non-equi join functionality.
[training@localhost ~]$ cat > matrimony
Ravi,25,m
Rani,24,f
Ilean,23,f
trisha,27,f
Kiran,29,m
madhu,22,m
avi,26,m
srithi,21,f
[training@localhost ~]$ hadoop fs -copyFromLocal matrimony pdemo
[training@localhost ~]$
grunt> matri = load 'pdemo/matrimony'
>> using PigStorage(',')
>> as (name:chararray, age:int, sex:chararray);
grunt> males = filter matri by (sex=='m');
grunt> fems = filter matri by (sex=='f');
grunt> cr = cross males, fems;
grunt> describe cr
cr: {males::name: chararray,males::age: int,males::sex: chararray,fems::name: chararray,fems::age: int,fems::sex: chararray}
grunt> mf = foreach cr generate males::name as mname, fems::name as fname , males::age as mage,
>> fems::age as fage;
grunt>
grunt> describe mf
mf: {mname: chararray,fname: chararray,mage: int,fage: int}
grunt> mlist = filter mf by
>> (mage>fage and (mage-fage)<4);
grunt> dump mlist;
(madhu,srithi,22,21)
(avi,Rani,26,24)
(avi,Ilean,26,23)
(Kiran,trisha,29,27)
(Ravi,Rani,25,24)
(Ravi,Ilean,25,23)
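The cross-then-filter pattern above maps to a cartesian product plus a non-equi predicate in plain Python (illustrative, using the matrimony rows):

```python
from itertools import product

males = [('Ravi', 25), ('Kiran', 29), ('madhu', 22), ('avi', 26)]
fems = [('Rani', 24), ('Ilean', 23), ('trisha', 27), ('srithi', 21)]
# cross males, fems -> every (male, female) pair; then the non-equi filter
mlist = [(m, f, ma, fa) for (m, ma), (f, fa) in product(males, fems)
         if ma > fa and (ma - fa) < 4]
print(len(mlist))  # 6 pairs, as in the dump above
```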
_________________________________
to submit scripts
3 commands:
i)pig
ii)exec
iii)run
pig - to submit a script from the command prompt.
aliases will not be available in grunt.
exec - to submit a script from the grunt shell; aliases will not be available.
run - to submit a script from grunt;
aliases will be available,
so that we can reuse them.
[training@localhost ~]$ cat script1.pig
emp = load 'pdemo/emp' using PigStorage(',')
as (id:int, name:chararray, sal:int, sex:chararray, dno:int);
e = foreach emp generate sex, sal;
bySex = group e by sex;
res = foreach bySex generate group as sex, SUM(e.sal) as tot;
dump res;
$ pig script1.pig
grunt> exec script1.pig
grunt> run script1.pig
______________________
register, define.
_____________________
Displaying top3 highest sal
---------------------------
grunt> cat samps;
aaa,10000
bbb,80000
ccc,90000
ddd,90000
eeee,90000
ffff,80000
mmmmm,80000
nnnnn,70000
nnnn,70000
nn,60000
m,65000
xx,10000
grunt>
grunt> bags =load 'samps'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> salbag =foreach bags generate sal;
grunt> osb =order salbag by sal desc;
grunt> top3 =limit osb 3;
grunt> dump top3
cartesian product
-----------------
syntax
------
varname =cross bag1,bag2;
Ravi,25,m
Rani,24,f
Ilean,23,f
trisha,27,f
Kiran,29,m
madhu,22,m
avi,26,m
srithi,21,f
Divide into male and female bags and apply cartesian product
mat =load 'matrimony'
using PigStorage(',')
as (name:chararray,age:int,gender:chararray);
mat1 =filter mat by (gender=='m');
mat2 =filter mat by (gender=='f');
grunt> fm =cross mat1,mat2;
dump fm;
Pig UDF
-------
package pig.udf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// EvalFunc UDF: returns the upper-case form of the first tuple field
public class ToBigCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple t) throws IOException {
        String val = (String) t.get(0);
        return val.toUpperCase();
    }
}
Go to project -->MyUdf-->Export-->Java-->JAR file-->select the path where to store the jar (export destination)-->Finish.
/home/cloudera/Desktop/udf.jar
Registering and defining a UDF in the pig environment
---------------------------------------------------
step1:
grunt>register jarname;
grunt> register /home/cloudera/Desktop/udf.jar;
step2:
grunt>define function_name
>>packagename.ClassName();
ex:
grunt> define toUpper
>> pig.udf.ToBigCase();
consuming udf defined in pig operations
---------------------------------------
grunt> describe emp;
emp: {eno: int,name: chararray,sal: double,gender: chararray,dno: int}
grunt>e1 =foreach emp generate eno,name;
grunt> e2 =foreach e1 generate
>> eno,toUpper(name) as name;
dump e2;
output
------
(101,VINO)
(102,SRI)
(103,MOHAN)
(104,LOKITHA)
cat > file3
10 20 30
40 50 60
70 50 30
copy to hdfs
------------
hadoop fs -put file3 /user/cloudera/
package pig.udf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// EvalFunc UDF: returns the greatest of the three int fields in the tuple
public class ToMax extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple t) throws IOException {
        Integer v1 = (Integer) t.get(0);
        Integer v2 = (Integer) t.get(1);
        Integer v3 = (Integer) t.get(2);
        if (v1 > v2 && v1 > v3) {
            return v1;
        } else if (v2 > v3) {
            return v2;
        } else {
            return v3;
        }
    }
}
create jar-->udf.jar
grunt>nums =load 'file3'
.....;
step1:
grunt>register jarname;
grunt> register /home/cloudera/Desktop/udf.jar;
step2:
grunt>define function_name
>>packagename.ClassName();
ex:
grunt> define big
>> pig.udf.ToMax();
grunt> max_bag =foreach nums generate a,b,c,
>> big(a,b,c) as big1;
concat
------
It is used to combine 2 or more strings
CONCAT(string1,string2,...);
cat > sfile
mano,papasani,34
venkat,banala,24
subbu,koutarapu,30
pra,parida,24
hari,bandla,32
grunt> sbag =load 'sfile'
>> using PigStorage(',')
>> as (fname:chararray,lname:chararray,age:int);
grunt> sbag1 =foreach sbag generate *,
>> CONCAT(lname,' ',fname) as name;
STRSPLIT
--------
STRSPLIT(column/string, delimiter, max number of tokens);
cat > file1
Manohar is a good person
grunt>str =load 'file1'
as (line:chararray);
grunt> spstr =foreach str generate STRSPLIT(line,' ',5);
TOKENIZE
--------
Divide a string into tokens.
Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple).
The following characters are considered to be word separators: space, double quote("), comma(,), parentheses(()), star(*).
syntax
------
TOKENIZE(colname/stringname)
grunt> tbag =foreach str generate TOKENIZE(line);
describe tbag;
grunt> describe tbag;
tbag: {bag_of_tokenTuples_from_line: {tuple_of_tokens: (token: chararray)}}
FLATTEN
-------
It is an operator applied to un-nest bags/tuples in the output
grunt> fbag =foreach tbag generate FLATTEN(bag_of_tokenTuples_from_line);
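What TOKENIZE and FLATTEN do to the line can be sketched in plain Python (an analogy splitting on spaces only, not Pig's full separator set):

```python
line = 'Manohar is a good person'
bag_of_tokens = [(tok,) for tok in line.split(' ')]  # TOKENIZE: a bag of one-field tuples
words = [tok for (tok,) in bag_of_tokens]            # FLATTEN: un-nest into plain rows
print(words)  # ['Manohar', 'is', 'a', 'good', 'person']
```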
Diff
----
compares 2 bag fields of a tuple.
Any tuples that are in one bag but not the other are returned in a bag. If the bags match, an empty bag is returned.
cat > sales1
[cloudera@quickstart ~]$ cat sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
cat > sales2
[cloudera@quickstart ~]$ cat sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
grunt> b1 =load 'sales1'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> b2 =load 'sales2'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt>b3 =group b1 by name,b2 by name;
grunt>b4 =foreach b3 generate DIFF(b1,b2) as val1;
grunt>dump b4;
output
------
({(p1,4000.0),(p1,2000.0),(p1,1000.0),(p1,6000.0)})
({(p2,3000.0),(p2,4000.0),(p2,8000.0),(p2,6000.0)})
({(p3,5000.0)})
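For one group, DIFF behaves like a symmetric set difference of the two inner bags (plain-Python sketch using the p1 bags above, not Pig code):

```python
# the p1 group's two inner bags from the cogroup above
b1 = [('p1', 2000.0), ('p1', 4000.0), ('p1', 5000.0)]
b2 = [('p1', 6000.0), ('p1', 1000.0), ('p1', 5000.0), ('p1', 6000.0)]
diff = set(b1).symmetric_difference(b2)   # tuples in one bag but not the other
print(diff)
```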
IsEmpty
-------
checks whether a bag/map is empty or not
IsEmpty(bagname);
grunt> s = load 'file1' using PigStorage(',')
Initially each field is loaded as a bytearray
but is later converted into the given datatypes
Foreach
-------
subsetting fields
grunt> data =load 'emp.txt' using PigStorage(',')
>> as (eno:int,name:chararray,sal:double,gender:chararray,dno:int);
grunt> describe data;
grunt> illustrate data;
Filtering
---------
grunt> emp = foreach data generate eno,name,sal;
generate new fields
-------------------
grunt> e =foreach data generate*,
>> sal*0.2 as tax;
varname =group bagname by (col1,col2,....);
grunt> mulgrp =group emp by (gender,dno);
grunt> describe mulgrp;
mulgrp: {group: (gender: chararray,dno: int),emp: {(eno: int,name: chararray,sal: double,gender: chararray,dno: int)}}
grunt>dump mulgrp;
((f,11),{(5,eeee,5000.0,f,11),(1,aaaa,1000.0,f,11)})
((f,13),{(7,gggg,3000.0,f,13)})
((f,14),{(4,dddd,4000.0,f,14)})
((m,12),{(2,bbbb,2000.0,m,12)})
((m,13),{(3,cccc,3000.0,m,13)})
((m,15),{(6,ffff,2000.0,m,15)})
((m,16),{(8,hhhh,4000.0,m,16)})
grunt> res1 = foreach bySex generate
>> group as sex,
>> SUM(e.sal) as tot,
>> AVG(e.sal) as avg,
>> MAX(e.sal) as max,
>> MIN(e.sal) as min,
>> COUNT(e) as cnt;
grunt> dump res1
(f,103000,20600.0,50000,8000,5)
(m,125000,25000.0,50000,6000,5)
_________________________________
grunt> -- multi grouping..
grunt> e = foreach emp generate dno, sex, sal;
grunt> grp = group e by dno, sex;
2016-06-09 19:28:35,893 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Unrecognized alias sex
Details at logfile: /home/training/pig_1465437040017.log
grunt>
the above statement is invalid.
pig does not allow grouping by multiple fields listed with commas.
solution:
make the multiple fields a tuple field, and group by that tuple.
grunt> grp = group e by (dno, sex);
grunt> describe grp
grp: {group: (dno: int,sex: chararray),e: {dno: int,sex: chararray,sal: int}}
grunt> res = foreach grp generate
>> group.dno, group.sex, SUM(e.sal) as tot;
grunt> dump res
(11,f,25000)
(11,m,26000)
(12,f,18000)
(12,m,30000)
(13,f,60000)
(13,m,69000)
_________________________________
grunt> -- select sum(sal) from emp;
grunt> -- old one
grunt> e = foreach emp generate 'ibm' as org, sal;
grunt> dump e;
(ibm,26000)
(ibm,25000)
(ibm,13000)
(ibm,8000)
(ibm,6000)
(ibm,10000)
(ibm,30000)
(ibm,50000)
(ibm,10000)
(ibm,50000)
grunt> grp = group e by org;
grunt> res = foreach grp generate
>> SUM(e.sal) as tot;
grunt> dump res
(228000)
_____________________________________
2nd one --- for entire column aggregation.
grunt> describe emp;
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}
grunt> e = foreach emp generate sal;
grunt> grp = group e all;
grunt> dump grp
(all,{(26000),(25000),(13000),(8000),(6000),(10000),(30000),(50000),(10000),(50000)})
grunt> res = foreach grp generate
>> SUM(e.sal) as tot,
>> AVG(e.sal) as avg, MAX(e.sal) as max,
>> MIN(e.sal) as min, COUNT(e) as cnt;
grunt> dump res
(228000,22800.0,50000,6000,10)
Processing xml data
-------------------
cat > course.txt
<course><name>hadoop</name><price>6000</price><dur>60</dur></course>
<course><name>java</name><price>1500</price><dur>90</dur></course>
hive> create table coursedata(line string);
hive> load data local inpath 'course.txt' into table coursedata;
hive> select * from coursedata;
converting semistructured data into structured
---------------------------------------------
hive> create table course_details(name string,price double,dur int);
loading data from coursedata by using table-to-table copy
--------------------------------------------------------
hive> insert overwrite table course_details
> select xpath_string(line,'course/name'),
> xpath_string(line,'course/price'),
> xpath_string(line,'course/dur')
> from coursedata;
hive> select * from course_details;
xpath_string()
--------------
xml is used to represent data in hierarchical form.
we navigate from root-->child-->subchild etc.; to perform such navigation we use the xpath_string() function to read data from a tag of an xml record.
syntax
------
xpath_string(line,'roottag/childtag/...');
selecting each property from an xml record
------------------------------------------
hive> select xpath_string(line,'course/name'),
> xpath_string(line,'course/price'),
> xpath_string(line,'course/dur')
> from coursedata;
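Since price and dur are numeric in course_details, Hive's typed xpath variants (xpath_double() and xpath_int()) can parse them directly instead of going through strings; a sketch against the same coursedata table:
hive> select xpath_string(line,'course/name'),
    > xpath_double(line,'course/price'),
    > xpath_int(line,'course/dur')
    > from coursedata;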
cogroup:-
_________
To get separate inner bags (data groups) for each dataset,
so that we can perform separate aggregations on each data set.
syntax
------
varname =cogroup bag1 by col1,bag2 by col1;
data set1
_______
a,10
b,20
a,30
b,40
_________
data set2
_________
a,30
c,30
c,40
a,20
_________
dset1 =load 'ds1' using PigStorage(',')
as (pid:chararray,price:double);
dset2 =load 'ds2' using PigStorage(',')
as (pid:chararray,price:double);
grunt>cg =cogroup dset1 by pid,dset2 by pid;
grunt> sp =foreach cg generate
>> group as id,SUM(dset1.price) as tot1,
>>SUM(dset2.price) as tot2;
cat > sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
cat > sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
[training@localhost ~]$ hadoop fs -copyFromLocal sales1 /user/cloudera/
grunt> s1 = load 'pdemo/sales1'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> s2 = load 'pdemo/sales2'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> cg = cogroup s1 by pid, s2 by pid;
grunt> describe cg
cg: {group: chararray,s1: {pid: chararray,price: int},s2: {pid: chararray,price: int}}
grunt> dump cg
(p1,{(p1,2000),(p1,4000),(p1,5000)},{(p1,6000),(p1,1000),(p1,5000),(p1,6000)})
(p2,{(p2,3000),(p2,4000)},{(p2,8000),(p2,6000),(p2,8000)})
(p3,{(p3,5000)},{})
grunt> res = foreach cg generate
>> group as pid, SUM(s1.price) as tot1,
>> SUM(s2.price) as tot2;
grunt> dump res
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,)
grunt> res2 =foreach res generate
>> pid,(tot1 is null?0:tot1) as tot1,
>> (tot2 is null?0:tot2) as tot2;
grunt> dump res2;
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,0)
union:-
______
union is used to combine 2 or more datasets.
syntax
------
varname =union bag1,bag2,....;
Note: the number of columns and their order must be the same.
ds1
________
name, sal
__________
aaa,10000
bbbb,30000
cccc,40000
___________
ds2
____________
name,sal
___________
xxx,30000
yyy,40000
zzz,60000
______________
grunt> ds1 =load 'ds1'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> ds2 =load 'ds2'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> e = union ds1, ds2;
if files have a different column order
--------------------------------------
convert all files to the same format (same column order).
ds3
_________
sal, name
___________
10000,abc
20000,def
____________
ds4 = foreach ds3 generate name, sal;
ds = union ds1, ds2, ds4;
___________________________
if files have different number of fields.
$ cat > e1
mano,30000,m
manoja,40000,f
vidhatri,50000,m
cat > e2
vijay,60000,11
lavanya,70000,12
e1 =load 'e1'
using PigStorage(',')
as (name:chararray,sal:double,gender:chararray);
grunt> e2 =foreach e1 generate *,0 as dno;
e3 =load 'e2'
using PigStorage(',')
as (name:chararray,sal:double,dno:int);
grunt> e4 =foreach e3 generate name,sal,'*' as gender,dno;
e2_e4 =union e2,e4;
grunt> e1 = load 'pdemo/e1' using PigStorage(',')
>> as (name:chararray, sal:int, sex:chararray);
grunt> e2 = load 'pdemo/e2' using PigStorage(',')
>> as (name:chararray, sal:int, dno:int);
grunt> ee1 = foreach e1 generate *,0 as dno;
grunt> ee2 = foreach e2 generate name,
>> sal,'*' as sex, dno;
grunt> ee = union ee1, ee2;
distinct:- to eliminate duplicates
syntax
------
varname =distinct bagname;
ds
id,name
(101,a)
(102,b)
(103,c)
(101,a)
(102,b)
(101,a)
(102,b)
(102,x)
_____________
ds2 = distinct ds;
(101,a)
(102,b)
(103,c)
(102,x)
order:
______
sorts the records of a bag.
syntax
------
varname =order bagname by colname;
e = order emp by name;
e2 = order emp by sal desc;
e3 = order emp by sal desc,
dno , sex desc;
--- there is no limit on the number of sort fields.
Join
----
A join is used to fetch records from 2 or more bags/relations.
we have 2 types of joins:
i)inner
returns the matching records between 2 bags.
grunt> varname =join bagname1 by colname,bagname2 by colname;
[cloudera@quickstart ~]$ cat emp.txt
101,vino,26000,m,11
102,Sri,25000,f,11
103,mohan,13000,m,13
104,lokitha,8000,f,12
105,naga,6000,m,13
101,janaki,10000,f,12
201,aaa,30000,m,12
202,bbbb,50000,f,13
203,ccc,10000,f,13
204,ddddd,50000,m,13
304,xxx,70000,m,14
305,yyy,80000,f,15
[cloudera@quickstart ~]$ cat dept.txt
11,marketing,hyd
12,hr,del
13,finance,hyd
20,prod,hyd
21,admin,chennai
emp =load 'emp.txt'
using PigStorage(',')
as (eno:int,name:chararray,sal:double,gender:chararray,dno:int);
dept =load 'dept.txt'
using PigStorage(',')
as (dno:int,dept:chararray,loc:chararray);
grunt> emp_ij =join emp by dno,dept by dno;
grunt> describe emp_ij;
emp_ij: {emp::eno: int,emp::name: chararray,emp::sal: double,emp::gender: chararray,emp::dno: int,dept::dno: int,dept::dept: chararray,dept::loc: chararray}
ii)outer
left outer
All records from left bag and matching records from right bag
syntax
------
grunt> varname =join bagname1 by colname left outer,bagname2 by colname;
ex:
grunt> emp_lj =join emp by dno left outer,dept by dno;
right outer
-----------
returns matching rows from left bag and all rows from right bag.
syntax
------
grunt> varname =join bagname1 by colname right outer,bagname2 by colname;
ex:
grunt> emp_rj =join emp by dno right outer,dept by dno;
full outer
----------
returns all rows from both the left and right bags.
grunt> varname =join bagname1 by colname full outer,bagname2 by colname;
ex:
grunt> emp_rj =join emp by dno full outer,dept by dno;
displaying selected fields from a join
--------------------------------------
grunt> emp_data =foreach emp_ij generate
>> emp::eno as eno,
>> emp::name as name,
>> dept::dno as dno;
grunt> emp_fil =filter emp_ij by emp::sal>=20000 and emp::sal<=50000;
grunt> emp_dept_rj =join emp by dno full outer,dept by dno;
task:
getting top3 salaried list.
(case: a salary can be taken by multiple people).
cat > samps
aaa,10000
bbb,80000
ccc,90000
ddd,90000
eeee,90000
ffff,80000
mmmmm,80000
nnnnn,70000
nnnn,70000
nn,60000
m,65000
xx,10000
hadoop fs -copyFromLocal samps /user/cloudera/
grunt> data =load 'samps'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> saldata =foreach data generate sal;
grunt> osal =order saldata by sal desc;
grunt> top3 =limit osal;
2017-04-26 20:55:10,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 19, column 16> Syntax error, unexpected symbol at or near ';'
Details at logfile: /home/cloudera/pig_1493233521379.log
grunt> top3 =limit osal 3;
grunt> e = load 'pdemo/samps'
>> using PigStorage(',')
>> as (name:chararray, sal:int);
grunt> sals = foreach e generate sal;
grunt> sals = distinct sals;
grunt> sals2 = order sals by sal desc;
grunt> top3 = limit sals2 3;
grunt> dump top3
grunt> describe top3
top3: {sal: int}
grunt> describe e
e: {name: chararray,sal: int}
grunt> res = join e by sal , top3 by sal;
grunt> describe res;
res: {e::name: chararray,e::sal: int,top3::sal: int}
grunt> res = foreach res generate e::name as name,
>> e::sal as sal;
grunt> dump res
(nnnnn,70000)
(nnnn,70000)
(bbb,80000)
(ffff,80000)
(mmmmm,80000)
(ccc,90000)
(ddd,90000)
(eeee,90000)
_____________________________________
Cross:
gives the cartesian product.
used for non-equi join functionality.
[training@localhost ~]$ cat > matrimony
Ravi,25,m
Rani,24,f
Ilean,23,f
trisha,27,f
Kiran,29,m
madhu,22,m
avi,26,m
srithi,21,f
[training@localhost ~]$ hadoop fs -copyFromLocal matrimony pdemo
[training@localhost ~]$
grunt> matri = load 'pdemo/matrimony'
>> using PigStorage(',')
>> as (name:chararray, age:int, sex:chararray);
grunt> males = filter matri by (sex=='m');
grunt> fems = filter matri by (sex=='f');
grunt> cr = cross males, fems;
grunt> describe cr
cr: {males::name: chararray,males::age: int,males::sex: chararray,fems::name: chararray,fems::age: int,fems::sex: chararray}
grunt> mf = foreach cr generate males::name as mname, fems::name as fname , males::age as mage,
>> fems::age as fage;
grunt>
grunt> describe mf
mf: {mname: chararray,fname: chararray,mage: int,fage: int}
grunt> mlist = filter mf by
>> (mage>fage and (mage-fage)<4);
grunt> dump mlist;
(madhu,srithi,22,21)
(avi,Rani,26,24)
(avi,Ilean,26,23)
(Kiran,trisha,29,27)
(Ravi,Rani,25,24)
(Ravi,Ilean,25,23)
_________________________________
to submit scripts
3 commands:
i)Pig
ii)exec
iii)run
pig - to submit a script from the command prompt;
aliases will not be available in grunt.
exec - to submit a script from the grunt shell; aliases will not be available afterwards.
run - to submit a script from grunt;
aliases will be available afterwards,
so that we can reuse them.
[training@localhost ~]$ cat script1.pig
emp = load 'pdemo/emp' using PigStorage(',')
as (id:int, name:chararray, sal:int, sex:chararray, dno:int);
e = foreach emp generate sex, sal;
bySex = group e by sex;
res = foreach bySex generate group as sex, SUM(e.sal) as tot;
dump res
$ pig script1.pig
grunt> exec script1.pig
grunt> run script1.pig
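Because run keeps the script's aliases alive in the grunt session, they can be reused right after; a small sketch (res is the alias defined inside script1.pig above):
grunt> run script1.pig
grunt> top1 = limit res 1;  -- 'res' from the script is still visible
grunt> dump top1;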
______________________
register, define.
_____________________
Displaying top3 highest sal
---------------------------
grunt> cat samps;
aaa,10000
bbb,80000
ccc,90000
ddd,90000
eeee,90000
ffff,80000
mmmmm,80000
nnnnn,70000
nnnn,70000
nn,60000
m,65000
xx,10000
grunt>
grunt> bags =load 'samps'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> salbag =foreach bags generate sal;
grunt> osb =order salbag by sal desc;
grunt> top3 =limit osb 3;
grunt> dump top3
cartesian product
-----------------
syntax
------
varname =cross bag1,bag2;
Ravi,25,m
Rani,24,f
Ilean,23,f
trisha,27,f
Kiran,29,m
madhu,22,m
avi,26,m
srithi,21,f
Divide into male and female bags and apply cartesian product
mat =load 'matrimony'
using PigStorage(',')
as (name:chararray,age:int,gender:chararray);
mat1 =filter mat by (gender=='m');
mat2 =filter mat by (gender=='f');
grunt> fm =cross mat1,mat2;
dump fm;
Pig udf
-------
package pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToBigCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple t) throws IOException {
        // guard against empty/null input tuples
        if (t == null || t.size() == 0 || t.get(0) == null) {
            return null;
        }
        String val = (String) t.get(0);
        return val.toUpperCase();
    }
}
Go to project -->MyUdf-->Export-->Java-->JAR file-->select the path where to store the jar (export destination)-->Finish.
/home/cloudera/Desktop/udf.jar
Registering and defining a udf in the pig environment
---------------------------------------------------
step1:
grunt>register jarname;
grunt> register /home/cloudera/Desktop/udf.jar;
step2:
grunt>define function_name
>>packagename.ClassName();
ex:
grunt> define toUpper
>> pig.udf.ToBigCase();
consuming udf defined in pig operations
---------------------------------------
grunt> describe emp;
emp: {eno: int,name: chararray,sal: double,gender: chararray,dno: int}
grunt>e1 =foreach emp generate eno,name;
grunt> e2 =foreach e1 generate
>> eno,toUpper(name) as name;
dump e2;
output
------
(101,VINO)
(102,SRI)
(103,MOHAN)
(104,LOKITHA)
cat > file3
10 20 30
40 50 60
70 50 30
copy to hdfs
------------
hadoop fs -put file3 /user/cloudera/
package pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToMax extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple t) throws IOException {
        Integer v1 = (Integer) t.get(0);
        Integer v2 = (Integer) t.get(1);
        Integer v3 = (Integer) t.get(2);
        // return the greatest of the three values
        if (v1 > v2 && v1 > v3) {
            return v1;
        } else if (v2 > v3) {
            return v2;
        } else {
            return v3;
        }
    }
}
create jar-->udf.jar
grunt>nums =load 'file3'
.....;
step1:
grunt>register jarname;
grunt> register /home/cloudera/Desktop/udf.jar;
step2:
grunt>define function_name
>>packagename.ClassName();
ex:
grunt> define big
>> pig.udf.ToMax();
grunt> max_bag =foreach nums generate a,b,c,
>> big(a,b,c) as big1;
concat
------
It is used to concatenate strings.
CONCAT(string1,string2);
cat > sfile
mano,papasani,34
venkat,banala,24
subbu,koutarapu,30
pra,parida,24
hari,bandla,32
grunt> sbag =load 'sfile'
>> using PigStorage(',')
>> as (fname:chararray,lname:chararray,age:int);
grunt> sbag1 =foreach sbag generate *,
>> CONCAT(lname,' ',fname) as name;
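Note: older Pig releases accept exactly two arguments in CONCAT; on those versions the same full name can be built by nesting calls (same sbag as above):
grunt> sbag1 =foreach sbag generate *,
>> CONCAT(lname, CONCAT(' ', fname)) as name;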
STRSPLIT
--------
STRSPLIT(colname/stringname, delimiter, number of tokens);
cat > file1
Manohar is a good person
grunt>str =load 'file1'
as (line:chararray);
grunt> spstr =foreach str generate STRSPLIT(line,' ',5);
TOKENIZE
--------
Divide a string into tokens.
Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple).
The following characters are considered to be word separators: space, double quote("), comma(,), parentheses(()), star(*).
syntax
------
TOKENIZE(colname/stringname)
grunt> tbag =foreach str generate TOKENIZE(line);
describe tbag;
grunt> describe tbag;
tbag: {bag_of_tokenTuples_from_line: {tuple_of_tokens: (token: chararray)}}
FLATTEN
-------
It is an operator that un-nests a bag/tuple, promoting its contents into top-level fields of the output.
grunt> fbag =foreach tbag generate FLATTEN(bag_of_tokenTuples_from_line);
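TOKENIZE and FLATTEN together give the classic word count; a minimal sketch over the same str bag (aliases here are illustrative):
grunt> words =foreach str generate FLATTEN(TOKENIZE(line)) as word;
grunt> wgrp =group words by word;
grunt> wcount =foreach wgrp generate group as word, COUNT(words) as cnt;
grunt> dump wcount;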
Diff
----
compares 2 bag fields within a tuple.
Any tuples that are in one bag but not the other are returned in a bag. If the bags match, an empty bag is returned.
cat > sales1
[cloudera@quickstart ~]$ cat sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
cat > sales2
[cloudera@quickstart ~]$ cat sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
grunt> b1 =load 'sales1'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt> b2 =load 'sales2'
>> using PigStorage(',')
>> as (name:chararray,sal:double);
grunt>b3 =group b1 by name,b2 by name;
grunt>b4 =foreach b3 generate DIFF(b1,b2) as val1;
grunt>dump b4;
output
------
({(p1,4000.0),(p1,2000.0),(p1,1000.0),(p1,6000.0)})
({(p2,3000.0),(p2,4000.0),(p2,8000.0),(p2,6000.0)})
({(p3,5000.0)})
IsEmpty
-------
checks whether a bag/map is empty or not.
IsEmpty(bagname);
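IsEmpty() is handy for filtering cogrouped data; a sketch, assuming the earlier cg =cogroup s1 by pid,s2 by pid: products that appear only in sales1 have an empty s2 bag and can be selected like this:
grunt> only_s1 =filter cg by IsEmpty(s2);
grunt> dump only_s1;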
grunt> s = load 'file1' using PigStorage(',');
Initially each field is loaded as a bytearray,
but later converted into the given datatypes.
Foreach
-------
subsetting fields
grunt> data =load 'emp.txt' using PigStorage(',')
>> as (eno:int,name:chararray,sal:double,gender:chararray,dno:int);
grunt> describe data;
grunt> illustrate data;
Filtering
---------
grunt> emp = foreach data generate eno,name,sal;
generate new fields
-------------------
grunt> e =foreach data generate*,
>> sal*0.2 as tax;