Saturday, October 25, 2014

Hadoop Definitive Guide Byte Sized Notes - 3

Chapter 4: Hadoop I/O
 
Data Integrity in Hadoop
  • For data integrity CRC-32 checksums are usually used; note that with low-end hardware it can be the checksum that is corrupt, not the data
  • Datanodes are responsible for verifying the data they receive before storing the data and its checksum
  • Applies to data that they receive from clients and from other datanodes during replication
  • Each datanode keeps a persistent log of checksum verifications
LocalFileSystem
  • Hadoop's LocalFileSystem performs client-side checksumming
  • It creates a hidden .filename.crc file alongside each file
  • To disable checksum use RawLocalFileSystem
  • FileSystem fs = new RawLocalFileSystem();
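  • A minimal sketch of using RawLocalFileSystem directly so that no .crc files are written and no checksums are verified (the path is purely illustrative):
      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.RawLocalFileSystem;

      public class NoChecksumExample {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Use the raw local filesystem directly, bypassing checksums
          FileSystem fs = new RawLocalFileSystem();
          fs.initialize(URI.create("file:///"), conf);
          Path p = new Path("/tmp/example.txt");   // illustrative path
          fs.create(p).close();                    // no .crc file is written
          fs.open(p).close();                      // no checksum verification on read
        }
      }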
Compression
  • Less space needed to store files, faster data transfer across the network
  • Splittable shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
  • Compression algorithms have a space/time trade-off
    • gzip sits in the middle
    • bzip2 compresses better than gzip but is slower
    • LZO, LZ4, Snappy are faster
    • Of these formats only bzip2 is splittable
Codecs
  • A codec is the implementation of a compression-decompression algorithm
  • We can compress a file before writing it; for reading a compressed file, CompressionCodecFactory finds the codec from the file name extension and the appropriate algorithm is used (see the sketch below)
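  • A minimal sketch of decompressing a file by inferring the codec from its extension (the input path and error handling are illustrative):
      import java.io.InputStream;
      import java.io.OutputStream;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;

      public class FileDecompressor {
        public static void main(String[] args) throws Exception {
          String uri = args[0];                     // e.g. a path ending in .gz
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path inputPath = new Path(uri);

          // Map the file extension to a codec, e.g. .gz -> GzipCodec
          CompressionCodecFactory factory = new CompressionCodecFactory(conf);
          CompressionCodec codec = factory.getCodec(inputPath);
          if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
          }

          // Output file name: input name with the compression suffix removed
          String outputUri =
              CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

          InputStream in = codec.createInputStream(fs.open(inputPath));
          OutputStream out = fs.create(new Path(outputUri));
          IOUtils.copyBytes(in, out, conf);
          IOUtils.closeStream(in);
          IOUtils.closeStream(out);
        }
      }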
Compression and Input Splits
  • Use a container file format such as SequenceFile, RCFile, or Avro
  • Use a compression algorithm that supports splitting (e.g. bzip2)
Using Compression in MapReduce
  • To compress the output of a MapReduce job, in the job configuration, set the mapred.output.compress property to true
  • The intermediate output of the map phase can also be compressed (see the sketch below)
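  • A minimal sketch of enabling gzip compression for both the job output and the intermediate map output, using the old-API property names mentioned above:
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.mapred.JobConf;

      public class CompressionJobConf {
        public static JobConf withGzipOutput(Configuration base) {
          JobConf conf = new JobConf(base);
          // Compress the final job output with gzip
          conf.setBoolean("mapred.output.compress", true);
          conf.setClass("mapred.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
          // Compress intermediate map output as well
          conf.setBoolean("mapred.compress.map.output", true);
          conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
          return conf;
        }
      }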
Serialization
  • Serialization is the process of converting objects to a byte stream for transmission over the network; deserialization is the opposite
  • In Hadoop, inter-process communication between nodes is implemented using remote procedure calls (RPCs); RPC serialization should be compact, fast, extensible, and interoperable
  • Hadoop uses its own serialization format, Writables; Avro is used to overcome the limitations of the Writable interface
  • The Writable Interface 
            Interface for serialization
            IntWritable is the Writable wrapper for a Java int; use set() to assign a value
            IntWritable writable = new IntWritable();
            writable.set(163);
            or 
            IntWritable writable = new IntWritable(163);
  • For writing we use a DataOutput binary stream, and for reading we use a DataInput binary stream
  • DataOutput serializes 163 into a byte array; to deserialize, the byte array is converted back into an IntWritable (see the sketch after this list)
  • WritableComparable and comparators
    • Comparison of types is crucial for MapReduce
    • IntWritable implements the WritableComparable interface
    • A RawComparator permits comparing records read from a stream without deserializing them into objects (no overhead of object creation)
    • Keys must be WritableComparable because they are passed to the reducer in sorted order
    • There are Writable wrappers for most Java primitives: boolean, byte, short, int, float, long, double; Text is the Writable for UTF-8 sequences
    • BytesWritable is a wrapper for an array of binary data
    • NullWritable is a special type of Writable with a zero-length serialization: no bytes are written to or read from the stream
    • ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types
  • Writable collections
    • There are six Writable collection types; for example, for a 2D array use TwoDArrayWritable
    • We can create a custom Writable class that implements WritableComparable (Lab)
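  • A minimal sketch of round-tripping an IntWritable through a byte array with DataOutput/DataInput, then comparing the raw serialized forms without deserializing (class and variable names are just for illustration):
      import java.io.ByteArrayInputStream;
      import java.io.ByteArrayOutputStream;
      import java.io.DataInputStream;
      import java.io.DataOutputStream;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.io.WritableComparator;

      public class WritableRoundTrip {

        // Serialize any Writable into a byte array via a DataOutput stream
        public static byte[] serialize(Writable writable) throws Exception {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          DataOutputStream dataOut = new DataOutputStream(out);
          writable.write(dataOut);
          dataOut.close();
          return out.toByteArray();
        }

        // Populate a Writable from a byte array via a DataInput stream
        public static void deserialize(Writable writable, byte[] bytes) throws Exception {
          ByteArrayInputStream in = new ByteArrayInputStream(bytes);
          DataInputStream dataIn = new DataInputStream(in);
          writable.readFields(dataIn);
          dataIn.close();
        }

        public static void main(String[] args) throws Exception {
          byte[] a = serialize(new IntWritable(163));   // 4 bytes for an int
          byte[] b = serialize(new IntWritable(67));

          IntWritable restored = new IntWritable();
          deserialize(restored, a);                     // restored.get() == 163

          // Raw comparison on the serialized bytes, no object creation needed
          WritableComparator comparator = WritableComparator.get(IntWritable.class);
          int cmp = comparator.compare(a, 0, a.length, b, 0, b.length);
          System.out.println(restored.get() + " compare: " + cmp); // cmp > 0
        }
      }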
Serialization Frameworks
  • Hadoop has an API for pluggable serialization frameworks
Avro
  • Language-neutral data serialization system
Why Avro?
  • Schemas are written in JSON; data is encoded in binary
  • The language-independent schema has rich schema resolution capabilities (the schema used to read data need not be identical to the one used to write it)
  • .avsc is the conventional extension for an Avro schema, .avro for the data
  • Avro's language interoperability lets programs written in different languages read and write the same data (see the schema sketch below)
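  • For illustration, a small Avro schema in the .avsc JSON format (the record and field names are just an example):
      {
        "type": "record",
        "name": "StringPair",
        "doc": "A pair of strings.",
        "fields": [
          {"name": "left",  "type": "string"},
          {"name": "right", "type": "string"}
        ]
      }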
File-Based Data Structures
  • For some applications a specialized data structure is needed to hold the data; Hadoop provides SequenceFile and MapFile
SequenceFile
  • Binary-encoded key-value pairs
  • Plain text is not suitable for saving log records whose values are binary types; SequenceFile is a better fit
  • Useful for storing binary key-value pairs; conventional extension is .seq (see the writer sketch below)
  • SequenceFileInputFormat reads the binary (k, v) pairs
  • SequenceFileAsTextInputFormat presents them as (key.toString(), value.toString())
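  • A minimal sketch of writing key-value pairs to a SequenceFile (the output path and data are illustrative):
      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
          String uri = "numbers.seq";               // illustrative output path
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(URI.create(uri), conf);
          Path path = new Path(uri);

          IntWritable key = new IntWritable();
          Text value = new Text();
          SequenceFile.Writer writer = null;
          try {
            // Old-style factory method: filesystem, conf, path, key class, value class
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
              key.set(100 - i);
              value.set("line-" + i);
              writer.append(key, value);
            }
          } finally {
            IOUtils.closeStream(writer);
          }
        }
      }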
MapFile
  • A sorted SequenceFile with an index to permit lookups by key
  • MapFile.Writer to write; conventional extension is .map
  • A MapFile is a directory, e.g. numbers.map, holding a data file and an index file; the index contains a fraction of the keys (every 128th entry by default) with their offsets
  • Variations of MapFile
    • SetFile for storing a set of Writable keys
    • ArrayFile, where the key is an int index
    • BloomMapFile for a fast get()
  • The MapFile.fix() method is usually used for re-creating corrupted indexes
  • Sequence files can be converted to MapFiles (sort the sequence file, then use fix() to re-create the index)
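  • A minimal sketch of writing a MapFile and doing a random lookup by key (directory name and data are illustrative):
      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.MapFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;

      public class MapFileDemo {
        public static void main(String[] args) throws Exception {
          String uri = "numbers.map";               // illustrative directory name
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(URI.create(uri), conf);

          IntWritable key = new IntWritable();
          Text value = new Text();

          // Keys must be appended in sorted order
          MapFile.Writer writer =
              new MapFile.Writer(conf, fs, uri, IntWritable.class, Text.class);
          try {
            for (int i = 0; i < 1000; i++) {
              key.set(i);
              value.set("entry-" + i);
              writer.append(key, value);
            }
          } finally {
            writer.close();
          }

          // Random lookup by key using the index
          MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
          try {
            key.set(496);
            Writable found = reader.get(key, value); // value now holds "entry-496"
            System.out.println(found != null ? value : "not found");
          } finally {
            reader.close();
          }
        }
      }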
