Hive tutorial 1 – hive internal and external table, hive ddl, hive partition, hive buckets and hive serializer and deserializer

The concept of a table in Hive is very similar to the table in the relational database. Each table associates with a directory configured in ${HIVE_HOME}/conf/hivesite.xml in HDFS. By default, it is /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created by Hive in HDFS for the employee table. All the data in the table will be kept in the directory. The Hive table is also referred to as internal or managed tables.

When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified in the LOCATION properties instead of the default warehouse directory. When keeping data in the internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake.

Example of an internal table schema and loading the data


CREATE TABLE IF NOT EXISTS lookup_table_internal(
vendor_id string COMMENT 'vendor id',
sys_vendor_id string COMMENT 'system vendor id',
version_number string ,
vendor_name string,
index string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/data/employee.txt' OVERWRITE INTO TABLE lookup_table_internal;

Example of an external table schema and loading the data


CREATE EXTERNAL TABLE lookup_table_external(
vendor_id string COMMENT 'vendor id',
sys_vendor_id string COMMENT 'system vendor id',
version_number string ,
vendor_name string,
index string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION
'hdfs://nameservice/user/hive/lookup_table_external';

LOAD DATA LOCAL INPATH '/home/data/employee.txt' OVERWRITE INTO TABLE lookup_table_external;

Note : The Hive table does not have constraints such as a database yet. If the folder in the path does not exist in the LOCATION property, Hive will create that folder. If there is another folder inside the folder specified in the LOCATION property, Hive will NOT report errors when creating the table, but will report an error when querying the table.

Create the table as select


CREATE TABLE lookup_table AS SELECT * FROM lookup_table_external;

CTAS copies the data as well as table definitions. The table created by CTAS is atomic which means that other users do not see the table until all the query results are populated. CTAS has restrictions like the table created cannot be a partitioned table,an external table or a list of bucketing table.

Drop Command

The drop table’s command removes the metadata completely and moves data to Trash or to the current directory if Trash is configured.


DROP TABLE IF EXISTS lookup_table_internal;

Truncate Command

The truncate table’s command removes all the rows from a table and it should be an internal table. It does not delete the table.


TRUNCATE TABLE lookup_table_internal.

Alter Table Name Command
ALTER TABLE lookup_table_internal RENAME TO lookup_table;
Alter Table Properties

ALTER TABLE lookup_table SET TBLPROPERTIES ('comment'='new look up table comments');

ALTER TABLE lookup_table SET FILEFORMAT RCFILE;

ALTER TABLE lookup_table SET LOCATION 'hdfs://nameservice/user/hive/lookup_table';

Alter the tables enable or disable protection to NO_DROP, which prevents a table from being dropped, or OFFLINE, which prevents data and not metadata in a table from being queried:


ALTER TABLE lookup_table ENABLE NO_DROP;
ALTER TABLE lookup_table DISABLE NO_DROP;
ALTER TABLE lookup_table ENABLE OFFLINE;
ALTER TABLE lookup_table DISABLE OFFLINE;

Alter the table’s concatenation to merge small files into larger files

1. Convert to the file format supported – ALTER TABLE lookup_table SET FILEFORMAT ORC;

2. Concatenate files – ALTER TABLE lookup_table CONCATENATE;

3. Convert to the regular file format – ALTER TABLE lookup_table SET FILEFORMAT TEXTFILE;

Changing column order

Lets say we want to change the order of vendor_id and move it after version_number;


ALTER TABLE lookup_table CHANGE vendor_id vendor_id string AFTER version_number;

ALTER TABLE lookup_table CHANGE neid element_id string FIRST;

Alter the column’s name and order

The column name vendor_id will be renamed to element_id and will be ordered first.


ALTER TABLE lookup_table CHANGE vendor_id element_id string FIRST;

Adding columns

ALTER TABLE lookup_table ADD COLUMNS (comments string);

Replace all columns other than the partition columns


ALTER TABLE lookup_table REPLACE COLUMNS (id string,name string);

Note : The ALTER command will only modify Hive metadata and not the data. Users should make sure the actual data conforms with the metadata definition manually.

 

Hive Partition

By default, a simple query in Hive scans the whole Hive table. This slows down the performance when querying a large-size table. The issue could be resolved by creating Hive partitions, which is very similar to what’s in the RDBMS. In Hive, each partition corresponds to a predefined partition column(s) and stores it as a subdirectory in the table’s directory in HDFS. When the table gets queried, only the required partitions of data in the table are queried, so the I/O and time of query is greatly reduced. It is very easy to implement Hive partitions when the table is created and check the partitions.

The partition is created with table schema and below is an example where we are partitioning the data based on the datestamp.

CREATE EXTERNAL TABLE baseline_table(
index string,
vdr_id string,
sys_vdr_id string,
ver_num string,
supplier_id string,
neid string,
portid string)
PARTITIONED BY (
datestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION
'hdfs://nameservice/user/queuename/hive_warehouse/baseline_table';

The partition is not enabled automatically. We have to use ALTER TABLE ADD PARTITION to add partitions to a table. The ADD PARTITION command changes the table’s metadata, but does not load data. If the data does not exist in the partition location, queries will not return any results. To drop the partition including both data and metadata, use the ALTER TABLE DROP PARTITION statement as below


alter table baseline_table add IF NOT EXISTS partition (datestamp='${date}');

alter table baseline_table drop IF EXISTS PARTITION (datestamp='${date}');

To load or overwrite data in partition, we can use the LOAD or INSERT OVERWRITE statements. The statement only overwrites the data in the specified partitions.


LOAD DATA LOCAL INPATH '/home/data.txt' OVERWRITE INTO TABLE baseline_table PARTITION (datestamp='${date}');

The alter table/partition statement for file format, location, protections, and concatenation has the same syntax as the alter table statements and is shown here.


ALTER TABLE baseline_table PARTITION partition_spec SET FILEFORMAT file_format;
ALTER TABLE baseline_table PARTITION partition_spec SET LOCATION 'full URI';
ALTER TABLE baseline_table PARTITION partition_spec ENABLE NO_DROP;
ALTER TABLE baseline_table PARTITION partition_spec ENABLE OFFLINE;
ALTER TABLE baseline_table PARTITION partition_spec DISABLE NO_DROP;
ALTER TABLE baseline_table PARTITION partition_spec DISABLE OFFLINE;
ALTER TABLE baseline_table PARTITION partition_spec CONCATENATE;

Hive buckets

Besides partition, bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from partition, the bucket corresponds to segments of files in HDFS. For example, the baseline_table table from the previous section uses the datestamp as the toplevel partition. If there is a further request to use the neid as the second level of partition, it leads to many deep and small partitions and directories. So we can bucket the baseline_table using neid as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same neid will always be stored in the same bucket that is segment of files. By using buckets, Hive can easily and efficiently do sampling and map side joins as the data belonging to the same key will be available in the same file.

CREATE EXTERNAL TABLE baseline_table(
index string,
vdr_id string,
sys_vdr_id string,
ver_num string,
supplier_id string,
neid string,
portid string)
CLUSTERED BY (neid) INTO 22 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION
'hdfs://nameservice/user/queuename/hive_warehouse/baseline_table';

 

Note : To define the proper number of buckets, we should avoid having too much or too little of data in each bucket. A better choice is somewhere near two blocks of data. For example, we can plan 512 MB of data in each bucket, if the Hadoop block size is 256 MB.

Bucketing has close dependency on the underlying data loaded. To properly load data to a bucket table, we need to either set the maximum number of reducers to the same number of buckets specified in the table creation in the above table it should be 22 or enable
enforce bucketing as follows.

set map.reduce.tasks = 22;

set hive.enforce.bucketing = true;

To populate the data to the bucket table, we cannot use LOAD keywords as with regular tables since LOAD does not verify the data against the metadata. Instead, INSERT should be used to populate the bucket table.

Views

In Hive, views are logical data structures that can be used to simplify queries by either hiding the complexities such as joins, subqueries, and filters or by flatting the data. Unlike some RDBMS, Hive views do not store data or get materialized. Once the Hive view is created, its schema is frozen immediately. Subsequent changes to the underlying tables for example like adding a column will not be reflected in the view’s schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail.


CREATE VIEW delta_view AS
SELECT a.column1, a.column2, a.column3, a.column4, a.column5, a.column6
FROM (
SELECT i.column1, i.column2, i.column3, i.column4, i.column5, i.column6, abs(f.column6-i.column6) as deltaValue
FROM table1 i
LEFT OUTER JOIN table2 f
ON i.column1 = f.column1
WHERE i.date ='${date}' AND (f.date='${date}')
)a
WHERE a.deltaValue is null OR a.deltaValue > 3 group by a.column1, a.column2, a.column3, a.column4, a.column5, a.column6;

When creating views, there is no MapReduce job triggered at all since this is only a metadata change. However, a proper MapReduce job will be triggered when querying the view. Use SHOW CREATE TABLE or DESC FORMATTED TABLE to display the CREATE VIEW statement that created a view. We can also ALTER VIEW to modify the properties of table and DROP VIEW to drop the view.

SerDe

SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process records and map them to column data types in Hive tables. To explain the scenario of using SerDe, we need to understand how Hive reads and writes data.

The process to read data is as follows:

1. Data is read from HDFS.

2. Data is processed by the INPUTFORMAT implementation, which defines the input data split and key/value records. In Hive, we can use CREATE TABLE… STORED AS <FILE_FORMAT> to specify which INPUTFORMAT it reads from.

3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to column and data types in a table.

For an example of reading data, we can use JSON SerDe to read the TEXTFILE format data from HDFS and translate each row of the JSON attribute and value to rows in Hive tables with the correct schema.

The process to write data is as follows

1. Data (such as using an INSERT statement) to be written is translated by the Serializer class defined in SerDe to the format that the OUTPUTFORMAT class can read.

2. Data is processed by the OUTPUTFORMAT implementation, which creates the RecordWriter object. Similar to the INPUTFORMAT implementation, the
OUTPUTFORMAT implementation is specified in the same way as a table where it writes the data.

3. The data is written to the table (data saved in the HDFS).

For example lets see how to define avro serde

This is the built-in SerDe that enables reading and writing Avro data in Hive tables. Avro is a remote procedure call and data serialization framework. Avro-backed tables can simply be created by using the CREATE TABLE …STORED AS AVRO statement, as follows:


CREATE TABLE test_serde_avro(
name string,
sex string,
age string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.
AvroContainerOutputFormat'

2 thoughts on “Hive tutorial 1 – hive internal and external table, hive ddl, hive partition, hive buckets and hive serializer and deserializer”

Leave a Reply

Your email address will not be published. Required fields are marked *