Row-Oriented and Column-Oriented File Formats in Hadoop

Sequence files, map files, and Avro datafiles are all row-oriented file formats, which means that the values for each row are stored contiguously in the file. In a column-oriented format, the rows in a file are broken up into row splits, and each split is then stored in column-oriented fashion: the values for each row in the first column are stored first, followed by the values for each row in the second column, and so on.
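The two layouts can be illustrated with a small Python sketch. This is only a toy model of the ordering of values, not how any real Hadoop format encodes data; the function names `row_layout` and `column_layout` and the `split_size` parameter are made up for this example.

```python
def row_layout(rows):
    """Row-oriented: the values of each row are stored contiguously."""
    return [value for row in rows for value in row]


def column_layout(rows, split_size):
    """Column-oriented: break the rows into row splits, then store
    each split one column at a time."""
    out = []
    for start in range(0, len(rows), split_size):
        split = rows[start:start + split_size]
        num_cols = len(split[0])
        for col in range(num_cols):
            # All values of this column within the split come first,
            # then all values of the next column, and so on.
            out.extend(row[col] for row in split)
    return out


rows = [(1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0), (4, "d", 4.0)]
row_order = row_layout(rows)
column_order = column_layout(rows, split_size=2)
print(row_order)
print(column_order)
```

With a split size of 2, the column-oriented output holds the first two rows column by column, followed by the last two rows column by column.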

A column-oriented layout permits columns that are not accessed in a query to be skipped. Consider a query that processes only the second column of a table. With row-oriented storage, like a sequence file, the whole row is loaded into memory, even though only the second column is actually read. With column-oriented storage, only the queried column needs to be read into memory. In general, column-oriented formats work well when queries access only a small number of columns in the table. Conversely, row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time.
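To make the difference concrete, here is a rough sketch that sums the second column of a small table under each layout and counts how many values had to be loaded to do so. The helper names and the "values touched" counter are assumptions for illustration; real formats read bytes and blocks, not Python objects.

```python
def sum_column_row_oriented(rows, col):
    """Each whole row is loaded, even though only one value is used."""
    total = 0
    touched = 0
    for row in rows:
        touched += len(row)  # every column of the row is loaded
        total += row[col]
    return total, touched


def sum_column_column_oriented(columns, col):
    """Only the queried column is loaded; the others are skipped."""
    values = columns[col]
    return sum(values), len(values)


rows = [(1, 10, 100), (2, 20, 200), (3, 30, 300)]
columns = [list(c) for c in zip(*rows)]  # same table, stored by column

row_result = sum_column_row_oriented(rows, col=1)
column_result = sum_column_column_oriented(columns, col=1)
print(row_result, column_result)
```

Both calls return the same sum, but the row-oriented version touches every value in the table, while the column-oriented version touches one value per row.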

Column-oriented formats need more memory for reading and writing, since they have to buffer a row split in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume uses row-oriented formats.
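The recovery behaviour described above can be modelled with a toy simulation. The classes `RowWriter` and `ColumnWriter` below are hypothetical and do not correspond to any Hadoop or Flume API. As a simplifying assumption, the column writer's output is treated as unreadable unless `close()` completes, while the row writer's output is readable up to its last sync point.

```python
class RowWriter:
    """Toy row-oriented writer that syncs every `sync_interval` rows."""

    def __init__(self, sync_interval):
        self.sync_interval = sync_interval
        self.pending = []   # rows written since the last sync point
        self.durable = []   # rows safely behind a sync point

    def append(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.sync_interval:
            self.sync()

    def sync(self):
        self.durable.extend(self.pending)
        self.pending.clear()

    def recoverable_rows(self):
        # After a crash, everything up to the last sync point survives.
        return list(self.durable)


class ColumnWriter:
    """Toy column-oriented writer that buffers rows until close()."""

    def __init__(self):
        self.buffer = []
        self.closed = False

    def append(self, row):
        self.buffer.append(row)

    def close(self):
        self.closed = True

    def recoverable_rows(self):
        # Assumption: an unclosed file cannot be recovered at all.
        return list(self.buffer) if self.closed else []


row_writer = RowWriter(sync_interval=3)
column_writer = ColumnWriter()
for i in range(7):  # the writer process "fails" after 7 rows, before close
    row_writer.append((i, i * 10))
    column_writer.append((i, i * 10))

print(len(row_writer.recoverable_rows()))
print(len(column_writer.recoverable_rows()))
```

In this model the row writer loses only the rows written after its last sync point, whereas the column writer loses the whole file.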

 
