hdfs filesystem read and write concept using distributed file system

File Read in Hdfs

1. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem .

2. DistributedFileSystem calls the namenode, using remote procedure calls, to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client . If the client is itself a datanode in the case of a MapReduce job, for instance, the client will read from the local datanode if that datanode hosts a copy of the block.

3. The DistributedFileSystem returns an FSDataInputStream an input stream that supports file seeks to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

4. The client then calls read() on the stream . DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream . When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block.

5. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.

During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block from another datanode. It also reports the corrupted block to the namenode.

One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the datanodes in the cluster. Meanwhile, the namenode merely has to service block location requests which it stores in memory, making them very efficient and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.

File Write in hdfs

1. The client creates the file by calling create() on DistributedFileSystem.DistributedFileSystem makes an RPC call to the namenode to create a
new file in the filesystem’s namespace, with no blocks associated with it.

2.The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an
IOException.

3. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

4. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we’ll assume the replication level is three.

5.The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third datanode in the pipeline.

6. The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.

7. When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.

Replica Placement

1. First replica on the same node as the client and for clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy.

2. The second replica is placed on a different rack from the first (off-rack), chosen at random.

3. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.