Monday, June 19, 2017

Hive-Partition_tables

Partition table
---------------
Dividing a table with large datasets into parts called as partitions according to a column(s) of a table is known as partition tables.

Partition tables reduces the search time of a query in a table with large datasets according to a partition insted of searching all the records of a table.

Partitions are logical division of a table.

creating a partition table
--------------------------
We can create a partition table by using the commmad
"partitioned by" along with a create statement.

syntax
------
create table tablename(col1 type,col2 type,.....)
partitioned by(col1 type,col2,....)
row format delimited fields terminated by ',';

ex:
hive> create table emp_part(eno int,name string,sal double,gender string,dno int)
    > partitioned by(g string)
    > row format delimited fields terminated by ',';

describe emp_part
-----------------
eno                 int                                  
name                 string                                
sal                 double                                
gender               string                                
dno                 int                                  
g                   string                                

# Partition Information
# col_name             data_type           comment            

g                   string

In the above the firts part is physical partition and second part is logical partition.

Here col names of table and partition colname must not be same.

creating a non partition table
------------------------------
hive> create table empdata(eno int,name string,sal double,gender string,dno int)
   > row format delimited fields terminated by ',';

cat > emp
305,akanksha,80000,f,15
101,mano,26000,m,11
102,sajay,25000,f,11
103,naveen,13000,m,13
104,chakrismart,8000,f,12
105,ravi,6000,m,13
101,hari,10000,f,12
201,aaa,30000,m,12
202,bbbb,50000,f,13
203,ccc,10000,f,13
204,ddddd,50000,m,13
304,xxx,70000,m,14


hive>load data local inpath 'emp' into table empdata;

Loading data into a partition table from a non partition table
--------------------------------------------------------------
syntax
------
insert overwrite table tablename
partition(col1=val,col2=val)
select query ....;


hive> insert overwrite table emp_partition
    > partition(g='f')
    > select *from  from empnp where gender='f';


hive> insert overwrite table emp_partition
    > partition(g='m')
    > select *from empnp where gender='m';


Here emp_partition is created as directory
Every partition is created as a sub directory in emp_partition
Inside partition a file is created as 000000_0.

In hdfs
-------
/user/hive/warehouse/emp_partition/g=f/000000_0-->female partitiom
All records of female are available in this.

/user/hive/warehouse/emp_partition/g=m/000000_0-->male partitiom
All records of male are available in this.


selecting records from partition
--------------------------------
hive>select *from emp_partition;
displays all records

hive>select *from emp_partition where g='m';
displays all records from male partition.


hive>select *from emp_partition where g='f';
displays all records from female partition.

Multiple partition in hive tables
---------------------------------
create a table with partitions using multiple cols dno,gender

hive> create table emp_mpart(eno int,name string,sal double)
    > partitioned by(dno int,gender string)
    > row format delimited fields terminated by ',';

load data into partitions from a non partition table
----------------------------------------------------
hive> insert overwrite table emp_mpart
    > partition(dno=11,gender='m')
    > select eno,name,sal from empdata where dno=11 and gender='m';

hive> insert overwrite table emp_mpart
    > partition(dno=11,gender='f')
    > select eno,name,sal from empdata where dno=11 and gender='f';


selecting records
-----------------
hive>select *from emp_mpart;
displays all records

hive>select *from emp_mpart where dno=11;
displays both male and female records from dept 11

hive>select *from emp_mpart where dno=11 and gender='m';
displays all male records from dept 11


hive>desc emp_mpart
eno                 int                                  
name                 string                                
sal                 double                                
dno                 int                                  
gender               string                                

# Partition Information
# col_name             data_type           comment            

dno                 int                                  
gender               string    

In hdfs partitions are created according to dno and gender

/user/hive/warehouse/emp_mpart/dno=11---->g=m-->000000_0
                                     ---->g=f-->000000_0

Here dno is a subdirectory to table.
g=m,g=f are subdirectory to dno.
000000_0 is data partition file according to gender.


Multiple inserts into partition table
-------------------------------------
we can perform all inserts at a time

hive>from empdata
     insert overwrite table emp_mpart
     partition(dno=11,gender='m')
     select eno,name,sal where dno=11 and gender='m')
     insert overwrite table emp_mpart
     partition(dno=12,gender='m')
     select eno,name,sal where dno=12 and gender='m')
     insert overwrite table emp_mpart
     partition(dno=13,gender='m')
     select eno,name,sal where dno=13 and gender='m')
     insert overwrite table emp_mpart
     partition(dno=14,gender='m')
     select eno,name,sal where dno=14 and gender='m')

dynamic partition
-----------------
suppose if we have 200 depts,here we have create 200 partitions we should load datasets explicitly (manually) dept wise ,to avoid this we go for dynamic partition.

In dynamic partition,partition of table is implicitly done based on a column(s).

To create dynamic partition we should set 2 properties
hive>set hive.exec.dynamic.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict;

create a dynamic table
----------------------
create table emp_dpart(eno int,name string,sal double)
partitioned by(dno int,gender string)
row form ......;

copy data from a non partition table
------------------------------------
insert overwrite table emp_dpart
partition(dno,gender)
select eno,name,sal,dno,gender from empdata;

Here we are inserting values into partion without specifying dno or gender explicitly i,e static representation.


examine data
-----------
hive>select *from emp_dpart;


In hdfs

/user/hive/warehouse/emp_dpart/dno=11-->gender='m'-->000000_0
                                     -->gender='f'-->000000_0

Here dno=11 is subdirectory of emp_dpart
gender='m' and gender='f' is a sub directory of dno=11

similary for each dno partitions are created as shown above.


querying partition dno
querying partition dno and subpartition gender

hive> select *from emp_dpart where depno=11;
OK
101 AAAAA 1000.0 11 m
Time taken: 0.126 seconds, Fetched: 1 row(s)
hive> select *from emp_dpart where depno=11 and gender='m';
OK
101 AAAAA 1000.0 11 m
Time taken: 0.262 seconds, Fetched: 1 row(s)
hive> select *from emp_dpart where depno=11 and gender='f';
OK

Limitations
 -----------
As number of partitions increases,metadata size of namenode increases which becomes a burdon on namenode.

suppose we 100 dno and 100 cities and we are doing partition based on (dno,city) it creates 10000 partitions which eventually increases the size of metadata.

To overcome this we can implement bucketing.

No comments: