HDFS Concepts and the Command-Line Interface

Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

HDFS works well in the following scenarios:

1. Very large files – Files that are hundreds of megabytes, gigabytes, or terabytes in size.

2. Streaming data access – HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.

3. Commodity hardware – Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware.

HDFS is not suitable for the following cases:

1. Low-latency data access – Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS.

2. Lots of small files – Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware.

3. Multiple writers, arbitrary file modifications – Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion. There is no support for multiple writers or for modifications at arbitrary offsets in the file.

Blocks

HDFS has the concept of a block, which is 128 MB by default. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. (For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.
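
The configured block size and a file’s actual block layout can be checked from the command line. A quick sketch, assuming an illustrative file path:

hdfs getconf -confKey dfs.blocksize
hdfs fsck /user/haas_queue/test/test.txt -files -blocks -locations

The first command prints the default block size in bytes; the second lists the blocks of the file and the datanodes holding each replica.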

The advantage of using the block abstraction is that a file can be larger than any single disk in the network. There’s nothing that requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Blocks also fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

Namenodes and Datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, because this information is reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
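
The block reports and the overall state of each datanode can be seen with the dfsadmin tool. A small sketch (running it usually requires HDFS superuser privileges):

hdfs dfsadmin -report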

Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first is to back up the files that make up the persistent state of the filesystem metadata: the namenode can be configured to write its namespace image and edit log to multiple filesystems, such as the local disk and a remote NFS mount.

It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.

The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling. HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.

Hadoop 2 remedied the risk of data loss due to namenode failure by adding support for HDFS high availability (HA). In this implementation, there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:

1. The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.

2. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.

3. Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.

4. The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
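
On an HA-enabled cluster, the administrator can check which namenode is active and trigger a manual failover. A small sketch, assuming the two namenodes are configured with the service IDs nn1 and nn2 (these names are illustrative):

hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2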

The Command-Line Interface

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The supported commands are listed below.

For complete documentation, please refer to FileSystemShell.html.

appendToFile

hadoop fs -appendToFile /home/testuser/test/test.txt /user/haas_queue/test/test.txt

Appends the content of the local file /home/testuser/test/test.txt to /user/haas_queue/test/test.txt in HDFS.

cat

Copies source paths to stdout.

hadoop fs -cat hdfs://nameservice1/user/haas_queue/test/test.txt

checksum

Returns the checksum information of a file.

hadoop fs -checksum hdfs://nameservice1/user/haas_queue/test/test.txt

chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI …]

Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide HdfsPermissionsGuide.html

Options

The -R option will make the change recursively through the directory structure.
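
Example (the group name and path are illustrative):

hadoop fs -chgrp -R hadoop /user/haas_queue/test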

chmod

hadoop fs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI …]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions guide HdfsPermissionsGuide.html.

Options

The -R option will make the change recursively through the directory structure.
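
Example (the mode and path are illustrative):

hadoop fs -chmod -R 755 /user/haas_queue/test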

chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide HdfsPermissionsGuide.html

Options

The -R option will make the change recursively through the directory structure.
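
Example (the owner, group, and path are illustrative):

hadoop fs -chown -R testuser:hadoop /user/haas_queue/test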

copyFromLocal

hadoop fs -copyFromLocal /home/sa081876/test/* /user/haas_queue/test

This command copies all the files inside the test folder on the edge node to the test folder in HDFS.

Similar to the put command, except that the source is restricted to a local file reference.

Options:

The -f option will overwrite the destination if it already exists.

copyToLocal

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

hadoop fs -copyToLocal /user/haas_queue/test/* /home/sa081876/test

This command copies all the files inside the test folder in HDFS to the test folder on the edge node.

Similar to the get command, except that the destination is restricted to a local file reference.

count

hadoop fs -count [-q] [-h] [-v] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The -h option shows sizes in human readable format.

The -v option displays a header line.

Example:

hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1

cp

Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI …] <dest>

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.

Options:

The -f option will overwrite the destination if it already exists.
The -p option will preserve file attributes [topax] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, then it preserves timestamps, ownership, and permission. If -pa is specified, then it preserves permission also, because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.

hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

createSnapshot

HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system. Some common use cases of snapshots are data backup, protection against user errors, and disaster recovery. For more information refer to HdfsSnapshots.html.

hdfs dfs -createSnapshot <path> [<snapshotName>]

path – The path of the snapshottable directory.

snapshotName – The snapshot name, which is an optional argument. When it is omitted, a default name is generated using a timestamp with the format “‘s’yyyyMMdd-HHmmss.SSS”, e.g. “s20130412-151029.033”
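
Example (the directory and snapshot name are illustrative; a directory must first be made snapshottable by an administrator):

hdfs dfsadmin -allowSnapshot /user/haas_queue/test
hdfs dfs -createSnapshot /user/haas_queue/test s1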

deleteSnapshot

Delete a snapshot from a snapshottable directory. This operation requires owner privilege of the snapshottable directory. For more information refer to HdfsSnapshots.html.

hdfs dfs -deleteSnapshot <path> <snapshotName>

path – The path of the snapshottable directory.
snapshotName – The snapshot name.

df

Displays free space

hadoop fs -df [-h] URI [URI …]

Options:

The -h option will format file sizes in a human-readable fashion.
Example:

hadoop fs -df /user/hadoop/dir1

du

hadoop fs -du [-s] [-h] URI [URI …]

Displays sizes of files and directories contained in the given directory, or the length of a file in case it is just a file.

Options:

The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a “human-readable” fashion (e.g 64.0m instead of 67108864)
Example:

hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

expunge

Empty the Trash. For more information refer to HdfsDesign.html.

hadoop fs -expunge

find

hadoop fs -find <path> … <expression> …

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

hadoop fs -find / -name test -print

get

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

hadoop fs -get /user/hadoop/file localfile
hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

getfacl

hadoop fs -getfacl [-R] <path>

Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.

Options:

-R: List the ACLs of all files and directories recursively.
path: File or directory to list.
Examples:

hadoop fs -getfacl /file
hadoop fs -getfacl -R /dir

getfattr

hadoop fs -getfattr [-R] -n name | -d [-e en] <path>

Displays the extended attribute names and values (if any) for a file or directory.

Options:

-R: Recursively list the attributes for all files and directories.
-n name: Dump the named extended attribute value.
-d: Dump all extended attribute values associated with pathname.
-e encoding: Encode values after retrieving them. Valid encodings are “text”, “hex”, and “base64”. Values encoded as text strings are enclosed in double quotes (“), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
path: The file or directory.
Examples:

hadoop fs -getfattr -d /file
hadoop fs -getfattr -R -n user.myAttr /dir

getmerge

hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

Examples:

hadoop fs -getmerge -nl /src /opt/output.txt

help

hadoop fs -help

Return usage output.

ls

Lists files and directories.

hadoop fs -ls [-d] [-h] [-R] <args>

-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.

lsr

Recursive version of ls.

hadoop fs -lsr <args>

mkdir

hadoop fs -mkdir [-p] <paths>

Takes path URIs as arguments and creates directories.

Options:

The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
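
Example (the path is illustrative):

hadoop fs -mkdir -p /user/haas_queue/test/dir1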

moveFromLocal

hadoop fs -moveFromLocal <localsrc> <dst>

Similar to the put command, except that the source localsrc is deleted after it is copied.

moveToLocal

hadoop fs -moveToLocal [-crc] <src> <dst>

Displays a “Not implemented yet” message.

mv

hadoop fs -mv URI [URI …] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.
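
Example (paths are illustrative):

hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir1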

put

hadoop fs -put <localsrc> … <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
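
Example (paths are illustrative; the last form reads from stdin):

hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put - /user/hadoop/hadoopfile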

renameSnapshot

Rename a snapshot. This operation requires owner privilege of the snapshottable directory.

hdfs dfs -renameSnapshot <path> <oldName> <newName>

path – The path of the snapshottable directory.
oldName – The old snapshot name.
newName – The new snapshot name.

rm

hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI …]

Delete files specified as args.

Options:

The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
The -R option deletes the directory and any content under it recursively.
The -r option is equivalent to -R.
The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
Example:

hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

rmdir

hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI …]

Delete a directory.

rmr

hadoop fs -rmr [-skipTrash] URI [URI …]

Recursive version of delete.

setfacl

hadoop fs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]

Sets Access Control Lists (ACLs) of files and directories.
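
Example (the users and paths are illustrative):

hadoop fs -setfacl -m user:hadoop:rw- /file
hadoop fs -setfacl -x user:hadoop /file
hadoop fs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file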

setfattr

hadoop fs -setfattr -n name [-v value] | -x name <path>

Sets an extended attribute name and value for a file or directory.
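
Example (the attribute name, value, and path are illustrative):

hadoop fs -setfattr -n user.myAttr -v myValue /file
hadoop fs -setfattr -x user.myAttr /file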

setrep

hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
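
Example (the replication factor and path are illustrative; the -w flag waits for the replication to complete):

hadoop fs -setrep -w 3 /user/hadoop/dir1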

stat

hadoop fs -stat [format] <path> …

Print statistics about the file/directory at <path> in the specified format.
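
Example (the format string and path are illustrative; %F prints the type, %u and %g the owner and group, %b the size in bytes, %y the modification time, and %n the name):

hadoop fs -stat "%F %u:%g %b %y %n" /file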

tail

hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

test

hadoop fs -test -[defsz] URI

Options:

-d: return 0 if the path is a directory.
-e: return 0 if the path exists.
-f: return 0 if the path is a file.
-s: return 0 if the path is not empty.
-z: return 0 if the file is zero length.
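
Example (the path is illustrative):

hadoop fs -test -e /user/haas_queue/test/test.txt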

text

hadoop fs -text <src>

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

touchz

hadoop fs -touchz URI [URI …]

Create a file of zero length.
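
Example (the path is illustrative):

hadoop fs -touchz /user/hadoop/filename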

truncate

hadoop fs -truncate [-w] <length> <paths>

Truncate all files that match the specified file pattern to the specified length.
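
Example (lengths and paths are illustrative; the -w flag waits for block recovery to complete):

hadoop fs -truncate 55 /user/hadoop/file1
hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1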

usage

hadoop fs -usage command

Return the help for an individual command.