Friday, October 17, 2014

Hadoop Definitive Guide Byte Sized Notes - 1

Chapter 1

A common way of avoiding data loss is replication. Hadoop provides storage through HDFS and analysis through MapReduce.

Disk Drive

  • Seek Time: the time taken to move the disk’s head to a particular location on the disk to read or write data. A typical RDBMS uses a B-Tree, which works well when updating a small proportion of records but is limited by how fast the disk can seek
  • Transfer Rate: the disk’s bandwidth when streaming data. Big Data apps need a high transfer rate because the data is written once and read many times
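
The trade-off between the two can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only illustrative: the seek time, transfer rate, dataset size, and record size are assumed round numbers, not measurements from any particular drive.

```python
# Rough comparison of streaming a dataset versus seeking to individual records.
# Every figure below is an illustrative assumption, not a measurement.

SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6    # assumed transfer rate: 100 MB/s
DATASET_BYTES = 1e12         # assumed dataset size: 1 TB
RECORD_BYTES = 100           # assumed record size: 100 bytes


def streaming_time(total_bytes: float) -> float:
    """Time to read the data sequentially: one seek, then pure transfer."""
    return SEEK_TIME_S + total_bytes / TRANSFER_RATE_BPS


def seeking_time(total_bytes: float, fraction: float) -> float:
    """Time to touch a fraction of the records, paying one seek per record."""
    records = (total_bytes / RECORD_BYTES) * fraction
    return records * (SEEK_TIME_S + RECORD_BYTES / TRANSFER_RATE_BPS)


full_scan = streaming_time(DATASET_BYTES)
one_percent = seeking_time(DATASET_BYTES, 0.01)
print(f"full sequential scan: {full_scan / 3600:.1f} hours")
print(f"seeking to 1% of records: {one_percent / 3600:.1f} hours")
```

Under these assumptions, streaming the whole dataset takes a few hours, while seeking to just 1% of the records takes far longer, which is why a scan-oriented design favours transfer rate over seek time.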

New in Hadoop 2.X
  • HDFS HA
  • HDFS Federation
  • YARN MR2
  • Secure Authentication
Chapter 2 MapReduce

i/p -> map -> shuffle -> sort -> reduce -> o/p
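
The flow above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names (map_fn, shuffle_and_sort, reduce_fn, run_job) and the "year,temperature" records are hypothetical, chosen only to show what each stage hands to the next.

```python
from collections import defaultdict

# Hypothetical "year,temperature" input records, made up for illustration.
records = ["1950,0", "1950,22", "1949,111", "1950,-11", "1949,78"]


def map_fn(line):
    """Map: parse one input line into a (key, value) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)


def run_job(lines):
    """Chain the stages: i/p -> map -> shuffle -> sort -> reduce -> o/p."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle_and_sort(mapped)
    return [reduce_fn(key, values) for key, values in grouped]


result = run_job(records)
print(result)  # [('1949', 111), ('1950', 22)]
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything happens in one list so the stage boundaries are easy to see.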

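Some differences between the old and new MapReduce APIs (MR1/MR2)
  1. Package org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce
  2. The Context object unifies JobConf, OutputCollector, and Reporter
  3. JobClient is gone; job control goes through the Job class
  4. Output files: map o/p is part-m-nnnnn, reduce o/p is part-r-nnnnn (the old API uses part-nnnnn for both)
  5. reduce() receives its values as a java.lang.Iterable instead of a java.lang.Iterator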
Data Locality
  • Hadoop tries to run each map task on the data node where its input data resides
  • A map task writes its output to local disk, not HDFS
  • Reduce tasks get no data locality: the o/p of all mappers is transferred to the reduce node
  • The # of reduce tasks is not governed by the size of the i/p; it is set independently
  • 0 reduce tasks are possible (a map-only job)
Combiner
  • Runs on the output of the map task
  • Requires an associative (and commutative) function, e.g. max(0,20,10,15,25) = max( max(0,20,10), max(25,15))
    • Boosts performance by pre-filtering data before it is sent across the wire, so the reduce has a smaller data set to work with
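
The max example can be checked with a short Python sketch. It assumes two hypothetical map tasks that each hold part of the values for a single key; the variable names are made up for illustration and nothing here is Hadoop API.

```python
# Two hypothetical map tasks, each holding part of the values for one key.
map_outputs = [[0, 20, 10], [25, 15]]


def reduce_max(values):
    """The reduce function; because max is associative it can double as the combiner."""
    return max(values)


# Without a combiner, every map output value crosses the network.
without_combiner = [value for part in map_outputs for value in part]

# With a combiner, each map task sends only its local maximum.
with_combiner = [reduce_max(part) for part in map_outputs]

# The final answer is the same either way.
assert reduce_max(without_combiner) == reduce_max(with_combiner) == 25

print(len(without_combiner), "values shuffled without a combiner")
print(len(with_combiner), "values shuffled with a combiner")
```

The reducer sees two values instead of five, and the result is unchanged, which is exactly what the associativity requirement guarantees.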
