Chapter 4 Hadoop I/O
Data Integrity in Hadoop
- For data integrity Hadoop uses CRC-32 checksums; with low-end hardware it can be the checksum itself that is corrupt rather than the data (a read-side sketch follows this list)
- Datanodes are responsible for verifying the data they receive before storing the data and its checksum
- Applies to data that they receive from clients and from other datanodes during replication
- Each datanode keeps a persistent log of checksum verifications
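A related read-side sketch (path illustrative): a client can switch off checksum verification, e.g., to salvage what remains of a corrupt file; the -ignoreCrc option to hadoop fs -get does the same from the shell:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.setVerifyChecksum(false);                        // skip verification on open/read
    FSDataInputStream in = fs.open(new Path("/logs/part-00000"));  // path illustrative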
LocalFileSystem
- Hadoop LocalFileSystem performs client-side checksumming
- It creates a hidden .filename.crc file in the same directory as each file filename
- To disable checksums, use RawLocalFileSystem in place of LocalFileSystem (see the sketch below)
- FileSystem fs = new RawLocalFileSystem();
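The raw instance must be initialized before use (a sketch; alternatively, set fs.file.impl to org.apache.hadoop.fs.RawLocalFileSystem to disable checksums globally):

    Configuration conf = new Configuration();
    FileSystem fs = new RawLocalFileSystem();
    fs.initialize(null, conf);    // a null URI selects the default local file system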
Compression
- Compression reduces the space needed to store files and speeds up data transfer across the network
- Splittable shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
- Compression algorithms exhibit a space/time trade-off
- gzip sits in the middle of that trade-off
- bzip2 compresses better than gzip but is slower
- LZO, LZ4, and Snappy are faster but compress less effectively
- Of these formats, only bzip2 is splittable
Codecs
- A codec is the implementation of a compression-decompression algorithm
- We can compress a file before writing it; when reading a compressed file, CompressionCodecFactory infers the correct codec from the file name extension (see the sketch below)
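A decompression sketch in the spirit of the book's FileDecompressor (file name illustrative; uses org.apache.hadoop.io.compress.* and org.apache.hadoop.io.IOUtils):

    String uri = "file.txt.gz";                            // illustrative input
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path inputPath = new Path(uri);

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);  // picks GzipCodec from ".gz"
    if (codec == null) {
        System.err.println("No codec found for " + uri);
        System.exit(1);
    }

    // Strip the codec's suffix to name the decompressed output, then copy
    String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
    InputStream in = null;
    OutputStream out = null;
    try {
        in = codec.createInputStream(fs.open(inputPath));
        out = fs.create(new Path(outputUri));
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }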
Compression and Input Splits
- Use a container file format such as SequenceFile, RCFile, or Avro data files
- Or use a compression format that supports splitting (such as bzip2)
Using Compression in MapReduce
- To compress the output of a MapReduce job, set the mapred.output.compress property to true and mapred.output.compression.codec to the codec class in the job configuration
- The intermediate map output can also be compressed, which reduces shuffle traffic (see the sketch below)
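A configuration sketch using the classic property names from the book (newer releases use the mapreduce.* equivalents, e.g. mapreduce.output.fileoutputformat.compress):

    Configuration conf = new Configuration();
    // Compress intermediate map output to cut shuffle traffic
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);

    Job job = new Job(conf);
    // Compress the final job output with gzip
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);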
Serialization
- Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage; deserialization is the reverse
- In Hadoop, inter-process communication between nodes is implemented using remote procedure calls (RPCs); the RPC serialization format should be compact, fast, extensible, and interoperable
- Hadoop uses its own serialization format, Writables; Avro was designed to overcome the limitations of Writables (chiefly the lack of language portability)
- The Writable Interface
Writable is the interface that Hadoop types implement for serialization
IntWritable is the Writable wrapper for a Java int; use set() to assign a value
IntWritable writable = new IntWritable();
writable.set(163);
or
IntWritable writable = new IntWritable(163);
- Writable defines two methods: write(DataOutput out) for serialization and readFields(DataInput in) for deserialization
- Serializing an IntWritable holding 163 writes four bytes; deserializing turns the byte array back into an IntWritable (see the helpers below)
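A round-trip sketch using the helper methods from the book's examples (uses java.io.* and org.apache.hadoop.io.*):

    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);            // the Writable writes itself
        dataOut.close();
        return out.toByteArray();
    }

    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);        // populate the Writable from the bytes
        dataIn.close();
        return bytes;
    }

    // Round trip: 163 serializes to exactly 4 bytes
    byte[] bytes = serialize(new IntWritable(163));   // bytes.length == 4
    IntWritable restored = new IntWritable();
    deserialize(restored, bytes);                     // restored.get() == 163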
- WritableComparable and comparators
- Comparison of types is crucial for MapReduce
- IntWritable implements the WritableComparable interface
- RawComparator (an extension of java.util.Comparator) permits comparing records read from a stream without deserializing them into objects, avoiding the overhead of object creation; WritableComparator is its general-purpose implementation (see the sketch below)
- MapReduce keys must be WritableComparable because they are sorted before being passed to the reducer
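A comparator sketch following the book's pattern (serialize() is the helper sketched above):

    RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

    IntWritable w1 = new IntWritable(163);
    IntWritable w2 = new IntWritable(67);
    int cmp = comparator.compare(w1, w2);    // > 0: compares deserialized objects

    // Raw comparison straight from the serialized bytes, no objects created
    byte[] b1 = serialize(w1);
    byte[] b2 = serialize(w2);
    int rawCmp = comparator.compare(b1, 0, b1.length, b2, 0, b2.length);  // > 0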
- There are Writable wrappers for all the Java primitive types: boolean, byte, short, int, float, long, and double; Text is the Writable for UTF-8 sequences (roughly the Writable equivalent of java.lang.String)
- BytesWritable is a wrapper for an array of binary data
- NullWritable is a special type of Writable with a zero-length serialization: no bytes are written to or read from the stream
- ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types
- Writable collections
- There are six Writable collection types: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable (e.g., use TwoDArrayWritable for a 2-D array)
- We can create a custom Writable class that implements WritableComparable (Lab); a sketch follows this list
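A minimal custom WritableComparable sketch, modeled on the book's TextPair example (class and field names are from that example):

    import java.io.*;
    import org.apache.hadoop.io.*;

    public class TextPair implements WritableComparable<TextPair> {
        private Text first = new Text();
        private Text second = new Text();

        public void set(Text first, Text second) {
            this.first.set(first);
            this.second.set(second);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);          // delegate to the wrapped Writables
            second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);      // read back in the same order
            second.readFields(in);
        }

        @Override
        public int compareTo(TextPair tp) {
            int cmp = first.compareTo(tp.first);
            return cmp != 0 ? cmp : second.compareTo(tp.second);
        }

        @Override
        public int hashCode() {        // used by HashPartitioner
            return first.hashCode() * 163 + second.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof TextPair) {
                TextPair tp = (TextPair) o;
                return first.equals(tp.first) && second.equals(tp.second);
            }
            return false;
        }
    }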
Serialization Frameworks
- Hadoop has an API for pluggable serialization frameworks; frameworks are registered via the io.serializations configuration property (see the sketch below)
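An illustrative registration (both serializer class names are real Hadoop classes); this lets plain java.io.Serializable objects be serialized alongside Writables:

    Configuration conf = new Configuration();
    conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization,"
      + "org.apache.hadoop.io.serializer.JavaSerialization");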
Avro
- Language-neutral data serialization system
Why Avro?
- Schemas are written in JSON; data is encoded in a compact binary format
- Language-independent schemas with rich schema resolution: the schema used to read data need not be identical to the schema used to write it
- .avsc is the conventional extension for an Avro schema; .avro for Avro data files
- Avro’s language interoperability makes it a good fit for data shared between programs written in different languages (see the sketch below)
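A minimal sketch of Avro's generic API (StringPair is the book's example record; uses org.apache.avro.*):

    String schemaJson =
          "{\"type\": \"record\", \"name\": \"StringPair\","
        + " \"doc\": \"A pair of strings.\","
        + " \"fields\": ["
        + "   {\"name\": \"left\", \"type\": \"string\"},"
        + "   {\"name\": \"right\", \"type\": \"string\"}"
        + " ]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord datum = new GenericData.Record(schema);
    datum.put("left", "L");
    datum.put("right", "R");

    // Encode the record into Avro's compact binary format
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
    Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(datum, encoder);
    encoder.flush();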
File-Based Data Structures
- For some applications, a special data structure is needed to hold the data; Hadoop provides several file-based structures
SequenceFile
- Stores binary-encoded key-value pairs
- Plain text is unsuitable for binary log records; a SequenceFile suits this case (e.g., key = timestamp as LongWritable, value = log line as Text)
- Useful for storing binary key-value pairs; .seq is the conventional extension (a write sketch follows this list)
- SequenceFileInputFormat reads the binary (key, value) pairs directly
- SequenceFileAsTextInputFormat presents them as (key.toString(), value.toString())
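A write sketch in the spirit of the book's SequenceFileWriteDemo (file name and data illustrative):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
        writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
        for (int i = 0; i < 100; i++) {
            key.set(100 - i);
            value.set("line-" + i);
            writer.append(key, value);     // pairs are stored in append order
        }
    } finally {
        IOUtils.closeStream(writer);
    }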
MapFile
- A sorted SequenceFile with an index to permit lookups by key
- Use MapFile.Writer to write (keys must be added in order); .map is the conventional extension
- A MapFile is really a directory (e.g., numbers.map) holding two SequenceFiles: data with all the entries and index with a sample of the keys
- Variations of MapFile:
- SetFile stores a set of Writable keys
- ArrayFile uses an int index as the key (array-like access)
- BloomMapFile offers a fast get() using dynamic Bloom filters (good for sparsely populated files)
- The static MapFile.fix() method is used to re-create a missing or corrupted index
- A SequenceFile can be converted to a MapFile by sorting it and then rebuilding the index with MapFile.fix() (see the sketch below)
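A conversion sketch based on the book's MapFileFixer; it assumes the SequenceFile has already been sorted and moved into numbers.map/data (names illustrative):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path map = new Path("numbers.map");
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);

    // Read the key and value types from the data file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();

    // Re-create the index file next to the data
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);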