Hive tutorial 3 – Hive LOAD, INSERT, EXPORT and IMPORT

Data exchange – LOAD

The LOCAL keyword indicates that the files are located on the local filesystem of the host. If the LOCAL keyword is not specified, the files are loaded from the full Uniform Resource Identifier (URI) given after INPATH or, by default, from the filesystem named by the fs.default.name Hadoop property. The path points either to a file or to a folder (all files in the folder are loaded), but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition columns must be specified. The OVERWRITE keyword decides whether to append to or replace the existing data in the target table/partition.

Load local data to the Hive table.


LOAD DATA LOCAL INPATH '/home/service.txt' OVERWRITE INTO TABLE service_table;
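
The examples here assume that service_table already exists. Its DDL is not shown in this tutorial; a minimal sketch of what it could look like, with column names taken from the SELECT statements further below and column types assumed for illustration:

CREATE TABLE service_table (
  ele_id BIGINT,
  sub_id BIGINT,
  ser_num STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

For the partitioned examples that follow, the table would additionally need a PARTITIONED BY (datestamp STRING) clause.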

Load local data to a Hive partitioned table


LOAD DATA LOCAL INPATH '/home/service.txt' OVERWRITE INTO TABLE service_table PARTITION (datestamp='2017-11-05');

Load HDFS data to the Hive table using the default filesystem path

LOAD DATA INPATH '/home/service.txt' OVERWRITE INTO TABLE service_table;

Load HDFS data to the Hive table with full URI


LOAD DATA INPATH 'hdfs://host:8020/user/service_table.txt' OVERWRITE INTO TABLE service_table;

INSERT

To populate Hive tables/partitions with data, we can use the INSERT keyword. Like an RDBMS, Hive supports inserting data by selecting from other tables; this is a very common way to populate a table from existing data. Hive has extended its INSERT statement to support OVERWRITE, multiple INSERT, dynamic partition INSERT, as well as INSERT to files.
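
One of these extensions, multiple INSERT, scans the source table once and writes to several targets. A minimal sketch, where service_table_archive is a hypothetical second target table:

FROM service_table_old
INSERT OVERWRITE TABLE service_table
SELECT ele_id, sub_id, ser_num
INSERT INTO TABLE service_table_archive
SELECT ele_id, sub_id, ser_num WHERE ser_num IS NOT NULL;

The statements below populate a partitioned table with dynamically determined partition values; the SET statement they begin with is explained afterwards.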


-- Allow every partition value to be determined dynamically (explained below)
SET hive.exec.dynamic.partition.mode=nonstrict;

-- OVERWRITE replaces the matching partitions; '${datestamp}' is filled in by
-- Hive variable substitution, e.g. hive --hivevar datestamp=2017-11-05
INSERT OVERWRITE TABLE service_table PARTITION (datestamp)
SELECT ele_id, sub_id, ser_num, date AS datestamp FROM service_table_old WHERE date='${datestamp}';

-- INTO appends to the matching partitions instead of replacing them
INSERT INTO TABLE service_table PARTITION (datestamp)
SELECT ele_id, sub_id, ser_num, date AS datestamp FROM service_table_old WHERE date='${datestamp}';

When inserting data into partitions, we need to specify the partition columns. Instead of specifying static values for the partitions, Hive also supports supplying partition values dynamically. Dynamic partitions are useful when the data volume is large and we do not know in advance what the partition values will be. Depending on the Hive version, dynamic partitioning may not be enabled by default; it can be switched on with


SET hive.exec.dynamic.partition=true;

By default (strict mode), the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict before inserting into dynamic partitions:

SET hive.exec.dynamic.partition.mode=nonstrict;
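
In strict mode, an insert that pins at least one static partition key is still allowed; the static keys must come before the dynamic ones in the PARTITION clause. A sketch, assuming a hypothetical table service_table_by_region partitioned by (region, datestamp):

INSERT OVERWRITE TABLE service_table_by_region PARTITION (region='EU', datestamp)
SELECT ele_id, sub_id, ser_num, date AS datestamp FROM service_table_old;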

The Hive INSERT to files statement is the opposite operation of LOAD. It extracts the data returned by SELECT statements into local or HDFS files. However, it only supports the OVERWRITE keyword, not INTO, so we cannot append the extracted data to existing files.

Command with specified row separators:


INSERT OVERWRITE LOCAL DIRECTORY '/temp/location/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM service_table;
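
To extract to HDFS instead, drop the LOCAL keyword; the target directory below is a placeholder:

INSERT OVERWRITE DIRECTORY '/user/output/service_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM service_table;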

By default, an INSERT may create many partial files, one per reducer. To merge them into one, we can use the HDFS getmerge command, which takes a source directory and a destination file as input and concatenates the files in the source directory into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file.

hdfs dfs -getmerge <src> <localdst> [addnl]
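
For example, to collapse the partial files from the HDFS directory insert sketched above into a single local file (the destination path is a placeholder):

hdfs dfs -getmerge /user/output/service_export /tmp/service_output.csv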

EXPORT and IMPORT

The EXPORT and IMPORT statements are available to support the import and export of data in HDFS for data migration or backup/restore purposes. The EXPORT statement exports both the data and the metadata of a table or partition: metadata is written to a file called _metadata, and data is written to a subdirectory called data.


EXPORT TABLE service_table TO '/user/output';
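
EXPORT also accepts a partition specification, so a single partition can be exported on its own; the target path below is a placeholder:

EXPORT TABLE service_table PARTITION (datestamp='2017-11-05') TO '/user/output_partition';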

After EXPORT, we can manually copy the exported files to other Hive instances or use the Hadoop distcp command to copy them to other HDFS clusters.
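
For example, a cross-cluster copy with distcp might look like this (the host names are placeholders):

hadoop distcp hdfs://namenode1:8020/user/output hdfs://namenode2:8020/user/output

Then, we can import the data in the following ways.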

Import data to a new table


IMPORT TABLE service_table_imported FROM '/user/output';

Import data to an external table, where the LOCATION property is optional:


IMPORT EXTERNAL TABLE service_table_external
FROM '/user/output'
LOCATION '/user/imported/output';