Saturday, October 25, 2014

Hadoop Definitive Guide Byte Sized Notes - 3

Chapter 4 Hadoop I/O
 
Data Integrity in Hadoop
  • For data integrity, CRC-32 checksums are typically used; with low-end hardware it may be the checksum that is corrupt, not the data
  • Datanodes are responsible for verifying the data they receive before storing the data and its checksum
  • Applies to data that they receive from clients and from other datanodes during replication
  • Each datanode keeps a persistent log of checksum verifications
LocalFileSystem
  • Hadoop's LocalFileSystem performs client-side checksumming
  • It creates a hidden .filename.crc file for each file written
  • To disable checksums, use RawLocalFileSystem, as shown below
  • FileSystem fs = new RawLocalFileSystem();
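A fuller sketch, following the book's usage (initialize() must be called before the file system is usable; passing null for the URI keeps the default):
    Configuration conf = new Configuration();
    RawLocalFileSystem fs = new RawLocalFileSystem();
    fs.initialize(null, conf); // no .crc checksum files are read or written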
Compression
  • Compression saves space for stored files and speeds data transfer across the network
  • "Splittable" shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
  • Compression algorithms have a space/time trade-off:
    • gzip sits in the middle
    • bzip2 compresses better than gzip but is slower
    • LZO, LZ4, and Snappy are faster
    • bzip2 is the only one of these that is splittable
Codecs
  • A codec is the implementation of a compression-decompression algorithm
  • We can compress a file before writing it; when reading a compressed file, CompressionCodecFactory infers the appropriate codec from the file name extension (see the sketch below)
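A minimal decompression sketch using CompressionCodecFactory (file URI taken from the command line; error handling kept short):
    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class FileDecompressor {
      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath); // inferred from extension
        if (codec == null) {
          System.err.println("No codec found for " + uri);
          System.exit(1);
        }
        InputStream in = null;
        try {
          in = codec.createInputStream(fs.open(inputPath));
          IOUtils.copyBytes(in, System.out, conf, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }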
Compression and Input Splits
  • Use a container file format such as SequenceFile, RCFile, or Avro
  • Use a compression algorithm that supports splitting (e.g. bzip2)
Using Compression in MapReduce
  • To compress the output of a MapReduce job, set the mapred.output.compress property to true in the job configuration
  • The output of the map phase can also be compressed, which cuts shuffle traffic (see the sketch below)
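A minimal configuration sketch using the old-API property names quoted above (GzipCodec is an illustrative choice):
    Configuration conf = new Configuration();
    // Compress the final job output
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
    // Compress map output as well, to cut shuffle traffic
    conf.setBoolean("mapred.compress.map.output", true);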
Serialization
  • Serialization is the process of converting objects to a byte stream for transmission over the network; deserialization is the reverse
  • In Hadoop, inter-process communication between nodes is implemented using remote procedure calls (RPCs); RPC serialization should be compact, fast, extensible, and interoperable
  • Hadoop uses its own serialization format, Writables; Avro is used to overcome the limitations of the Writable interface
  • The Writable Interface 
            The interface that Hadoop types implement for serialization
            IntWritable is the wrapper for a Java int; use set() to assign a value
            IntWritable writable = new IntWritable();
            writable.set(163);
            or 
            IntWritable writable = new IntWritable(163);
  • For writing, Writables serialize to a DataOutput binary stream; for reading, they deserialize from a DataInput binary stream
  • Serializing the IntWritable above turns 163 into a byte array; to deserialize, the byte array is converted back into an IntWritable (helper sketches below)
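Small helper sketches that capture a Writable's serialized form in a byte array (the method names serialize/deserialize are illustrative):
    public static byte[] serialize(Writable writable) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      writable.write(dataOut);     // the Writable writes itself to the DataOutput
      dataOut.close();
      return out.toByteArray();
    }
    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
      ByteArrayInputStream in = new ByteArrayInputStream(bytes);
      DataInputStream dataIn = new DataInputStream(in);
      writable.readFields(dataIn); // the Writable repopulates itself from the DataInput
      dataIn.close();
      return bytes;
    }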
  • WritableComparable and comparators
    • Comparison of types is crucial for MapReduce
    • IntWritable implements the WritableComparable interface
    • Hadoop also provides RawComparator, which permits comparing records read from a stream without deserializing them into objects (no overhead of object creation)
    • Keys must be WritableComparable because they are passed to the reducer in sorted order
    • There are Writable wrappers for most Java primitive types (boolean, byte, short, int, float, long, double); Text is the Writable for UTF-8 sequences
    • BytesWritable is a wrapper for an array of binary data
    • NullWritable is a special type of Writable with a zero-length serialization; no bytes are written to or read from the stream
    • ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types
  • Writable collections
    • There are six Writable collection types; for example, use TwoDArrayWritable for a 2D array
    • We can create a custom Writable class that implements WritableComparable (Lab); a sketch follows this list
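A minimal custom WritableComparable sketch (the class IntPair is illustrative, not from the book):
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class IntPair implements WritableComparable<IntPair> {
      private int first;
      private int second;

      public IntPair() {}  // required no-arg constructor for the framework
      public IntPair(int first, int second) { this.first = first; this.second = second; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
      }

      @Override
      public int compareTo(IntPair other) {  // defines the sort order of keys
        int cmp = Integer.compare(first, other.first);
        return cmp != 0 ? cmp : Integer.compare(second, other.second);
      }

      @Override
      public int hashCode() { return first * 163 + second; }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof IntPair)) return false;
        IntPair p = (IntPair) o;
        return first == p.first && second == p.second;
      }
    }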
Serialization Frameworks
  • Hadoop has an API for pluggable serialization frameworks
Avro
  • Language-neutral data serialization system
Why Avro?
  • Schemas are written in JSON; data is encoded in binary
  • Its language-independent schemas have rich schema-resolution capabilities
  • .avsc is the conventional extension for an Avro schema; .avro is for the data (example schema below)
  • Avro's language interoperability lets data written in one language be read in another
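A minimal .avsc example (StringPair echoes the book's running example):
    {
      "type": "record",
      "name": "StringPair",
      "doc": "A pair of strings.",
      "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"}
      ]
    }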
File-Based Data Structures
  • Hadoop provides special file-based data structures for holding data
SequenceFile
  • Stores binary-encoded key-value pairs; .seq is the conventional extension
  • Plain text is not suitable for storing binary types such as log records; SequenceFile is (see the write sketch after this list)
  • Useful for storing binary key-value pairs
  • SequenceFileInputFormat reads the binary (k,v) pairs
  • SequenceFileAsTextInputFormat presents them as (key.toString(), value.toString())
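A minimal write sketch using the classic SequenceFile.createWriter() signature (the path numbers.seq and the record contents are illustrative):
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        key.getClass(), value.getClass());
    try {
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("record " + i);
        writer.append(key, value); // appends one binary key-value pair
      }
    } finally {
      IOUtils.closeStream(writer);
    }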
MapFile
  • A sorted SequenceFile with an index to permit lookups by key
  • Written with MapFile.Writer; .map is the conventional extension
  • A MapFile is a directory (e.g. numbers.map/) holding a data file and an index file; the index holds a subset of the keys
  • Variations of MapFile
    • SetFile, for storing a set of Writable keys
    • ArrayFile, where the key is an int index
    • BloomMapFile, which offers a fast get() backed by Bloom filters
  • The MapFile.fix() method is usually used for re-creating corrupted indexes
  • SequenceFiles can be converted to MapFiles (see the sketch below)
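A sketch of re-creating a MapFile index with MapFile.fix(), assuming a sorted SequenceFile is already in place at numbers.map/data (paths illustrative):
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path map = new Path("numbers.map");
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);
    // Read the key and value types from the (sorted) data file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();
    // Build the index file alongside the data file
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);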

Monday, October 20, 2014

Hadoop Definitive Guide Byte Sized Notes - 2

Chapter 3 HDFS
  • Hadoop can be integrated with the local file system, S3, etc.
  • Writes always happen at the end of a file; data cannot be inserted in the middle
  • Default block size is 64 MB
Name Node
  • The name node holds the file system metadata, namespace, and tree in memory; without this metadata there is no way to access files in the cluster
  • It stores this info on local disk in two files: the namespace image and the edit log
  • The NN keeps track of which blocks make up a file
  • Block locations are updated dynamically (reported by datanodes)
  • The NN must be running at all times
  • When a client reads a file, the NN only supplies block locations, so it does not become a bottleneck
Secondary Name Node 
  • Periodically merges the name node's namespace image with its edit log and keeps a copy of the merged image
  • In case of name node failure, this copy (which slightly lags the primary's state) can be used for recovery
HDFS Federation
  • Clusters can scale by adding name nodes; each NN manages a portion of the file system namespace
  • Namespace volumes are independent: name nodes don't communicate with each other, and the failure of one doesn't affect the availability of namespaces managed by others
High Availability 
  • A pair of NNs in an active-standby configuration
  • The NNs use highly available shared storage to share the edit log
  • DNs send block reports to both NNs
Fail Over
  • Each NN runs a lightweight failover-controller process that monitors it for failure; when the active NN fails, failover to the standby is triggered (a graceful, manually initiated failover is also possible)
Fencing
  • The previously active name node must be prevented from doing damage; the following fencing methods help avoid corruption:
    • Disable its network port
    • Kill its process
    • Revoke its access to the shared storage
File System
  • Files in HDFS cannot be executed
  • Replication does not apply to directories
  • Globbing uses wildcards to match multiple files in a directory; sample file selection: input/ncdc/all/190{1,2}.gz

Reading Data 
  • The FileSystem API contains static factory methods to open an input stream
  • A file in HDFS is represented by a Path object
  • Default file system properties are stored in conf/core-site.xml
    • FSDataInputStream
      • FileSystem's open() returns an FSDataInputStream object
      • Positioned-readable; seek() is an expensive operation
  • If a read from an FSDataInputStream fails, the error is reported to the NN and the client reads from the next closest replica of the block (read sketch below)
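A minimal read sketch with FSDataInputStream (the URI is illustrative):
    String uri = "hdfs://localhost/user/tom/sample.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // seek back to the start and stream the file again
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }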
Writing Data
  • FSDataOutputStream is the main class for writing (a short sketch follows this list)
  • If a write fails, the pipeline is closed, partial blocks are deleted, the bad datanode is removed from the pipeline, and the NN arranges re-replication of under-replicated blocks
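A minimal write sketch (fs set up as in the reading sketch above; the path is illustrative):
    FSDataOutputStream out = fs.create(new Path("/user/tom/output.txt"));
    out.writeUTF("hello, HDFS");
    out.close();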
Replica Placement
  • The 1st replica goes to a random node (or the client's node, if the client runs inside the cluster), the 2nd replica off-rack to a random node, and the 3rd replica to the same rack as the 2nd but a different node; this applies to reducer output as well
Coherency 
  • The coherency model concerns seeing the same data across all replicated datanodes; sync() forces buffered data to be flushed to the datanodes so it becomes visible to readers
HAR Hadoop Archives
  • Typically used for packing many small files
  • Packs small files into HDFS blocks efficiently
  • Created with the archive tool (see the example after this list)
  • Limitations
    • No archive compression
    • Immutable: to add or remove files, the HAR must be recreated
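An illustrative invocation of the archive tool (paths hypothetical; -p sets the parent path of the files being archived):
    % hadoop archive -archiveName files.har -p /user/tom/files /user/tom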

Friday, October 17, 2014

Hadoop Definitive Guide Byte Sized Notes - 1

Chapter 1

A common way of avoiding data loss is through replication; storage is handled by HDFS, analysis by MapReduce.

Disk Drive

  • Seek Time: the time taken to move the disk's head to a particular location on the disk to read or write data; a typical RDBMS uses B-Trees, which have good seek performance
  • Transfer Rate: the rate at which data streams off the disk (disk bandwidth); Big Data apps need a high transfer rate because the access pattern is write once, read many times

New in Hadoop 2.X
  • HDFS HA
  • HDFS Federation
  • YARN MR2
  • Secure Authentication
Chapter 2 MapReduce

i/p -> map -> shuffle -> sort -> reduce -> o/p

Some differences between the old (MR1) and new (MR2) APIs:
  1. Package mapred -> mapreduce
  2. A Context object unifies JobConf, OutputCollector, and Reporter (see the mapper sketch after this list)
  3. JobClient is deprecated
  4. Map output files are named part-m-nnnnn; reduce output files are named part-r-nnnnn
  5. Reduce values changed from Iterator to Iterable
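A minimal new-API mapper sketch showing the Context object (the class LineCountMapper and its types are illustrative):
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Context replaces the old OutputCollector and Reporter
        context.write(value, ONE); // emit each input line with a count of 1
      }
    }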
Data Locality
  • Map tasks run on the data node where the data resides, whenever possible
  • A map task writes its output to local disk, not HDFS
  • There is no data locality for reduce: the output of all mappers is transferred to the reduce node
  • The number of reduce tasks is not governed by the size of the input
  • Zero reduce tasks are possible
Combiner
  • Performed on the output of a map task
  • Relies on an associative property: max(0,20,10,15,25) = max(max(0,20,10), max(25,15))
    • Boosts performance by pre-filtering data before it is sent across the wire, so the reducer has a smaller data set to work with (see the job setup sketch below)
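A minimal driver sketch wiring in a combiner (the MaxTemperature* class names follow the book's running example; reusing the reducer as the combiner works because max is associative):
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperature.class);
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class); // runs on map output before the shuffle
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);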