Elegant Data: Hadoop Definitive Guide Byte Sized Notes - 1

Friday, October 17, 2014

Chapter 1

A common way of avoiding data loss is replication. Hadoop provides storage through HDFS and analysis through MapReduce.

Disk Drive

Seek Time: the time taken to move the disk’s head to a particular location on the disk to read or write data. A typical RDBMS uses a B-Tree, which works well for reading or updating a small number of records but is limited by seek time.

Transfer Rate: the disk's bandwidth when streaming data. Big Data apps need a high transfer rate because the access pattern is write once, read many times.
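A rough worked example of why seek time dominates once you touch many records at random. This is only a sketch: the drive figures (10 ms seek, 100 MB/s transfer), the 1 TB data set and the 100-byte record size are all assumed round numbers for illustration, not measurements.

```python
# Assumed, illustrative figures (not measurements).
SEEK_TIME_S = 0.010            # 10 ms per seek
TRANSFER_RATE_BPS = 100e6      # 100 MB/s sustained transfer
DATASET_BYTES = 10**12         # 1 TB data set
RECORD_BYTES = 100             # size of one record

def streaming_read_seconds(total_bytes, rate_bps):
    """Time to read everything sequentially: one long transfer."""
    return total_bytes / rate_bps

def seek_bound_seconds(n_records, record_bytes, seek_s, rate_bps):
    """Time to touch n_records at random: one seek plus a tiny transfer each."""
    return n_records * (seek_s + record_bytes / rate_bps)

total_records = DATASET_BYTES // RECORD_BYTES
one_percent = total_records // 100

full_scan = streaming_read_seconds(DATASET_BYTES, TRANSFER_RATE_BPS)
random_access = seek_bound_seconds(
    one_percent, RECORD_BYTES, SEEK_TIME_S, TRANSFER_RATE_BPS
)

print(f"stream the whole data set : {full_scan:,.0f} s")
print(f"seek to 1% of the records : {random_access:,.0f} s")
```

Under these assumptions, seeking to just 1% of the records takes far longer than streaming the entire data set, which is why a streaming (high transfer rate) design suits write once, read many workloads.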

New in Hadoop 2.X

HDFS HA
HDFS Federation
YARN (MapReduce 2)
Secure Authentication

Chapter 2 MapReduce

i/p -> map -> shuffle -> sort -> reduce -> o/p
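The flow above can be mimicked in a few lines of plain Python. This is a single-process toy, not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are made-up names for illustration, and word count is just a convenient example job.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: sum all counts that share a key."""
    return (key, sum(values))

def run_job(lines):
    # map: apply map_fn to each input record
    mapped = [pair for line in lines for pair in map_fn(line)]
    # shuffle + sort: bring identical keys together, in key order
    mapped.sort(key=itemgetter(0))
    # reduce: one call per distinct key, with all of that key's values
    return [
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]

result = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(result)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, but the contract is the same: reduce sees each key once, with all its values grouped.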

Some differences between the old and new MapReduce APIs

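Package: org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce
Context object unifies JobConf, OutputCollector, Reporter
Job control moves from JobClient to the Job class
Output files: part-nnnnn in the old API; map o/p: part-m-nnnnn, reduce o/p: part-r-nnnnn in the new API
reduce() receives its values as an Iterable instead of an Iterator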
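The output file naming can be sketched like this. `part_file_name` is a hypothetical helper written for these notes, not a Hadoop function; it only illustrates the five-digit, zero-padded task index and the m / r marker for map and reduce output.

```python
def part_file_name(task_index, kind=None):
    """Build an output file name for a task (illustrative helper only).

    kind=None     -> name without a map/reduce marker (part-00000)
    kind="m"/"r"  -> name with a map / reduce marker (part-m-00000, part-r-00000)
    """
    if kind is None:
        return f"part-{task_index:05d}"
    if kind not in ("m", "r"):
        raise ValueError("kind must be 'm', 'r', or None")
    return f"part-{kind}-{task_index:05d}"

print(part_file_name(0, "r"))   # first reduce task
print(part_file_name(3, "m"))   # fourth map task
```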

Data Locality

Hadoop tries to run a map task on the data node where its input data resides
A map task writes its output to local disk, not HDFS
No data locality for reduce: the output of all mappers gets transferred to the reduce node
The number of reduce tasks is not governed by the size of the input; it is specified independently
0 reduce tasks are possible (a map-only job)
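A small sketch of why reduce has no data locality: every mapper's output is split by key across however many reducers were requested. Hadoop's default partitioner takes the key's hash modulo the number of reduce tasks; here `zlib.crc32` stands in for that hash (an assumption, chosen only because it is deterministic in Python), and `partition` / `shuffle` are illustrative names.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer for a key: non-negative hash modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to a reducer bucket."""
    if num_reducers == 0:
        # Map-only job: there is nothing to shuffle, map output is the job output.
        return map_outputs
    buckets = [[] for _ in range(num_reducers)]
    for mapper_output in map_outputs:
        for key, value in mapper_output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets

# Two mappers, each holding part of the data for key "a".
map_outputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
buckets = shuffle(map_outputs, 2)
print(buckets)
```

The reducer count is just a parameter here, unrelated to how much input there is, and all values for a given key end up in the same bucket no matter which mapper produced them.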

Combiner

Runs on the output of the map task
Requires an associative (and commutative) function: max(0,20,10,15,25) = max( max(0,20,10), max(25,15))
Boosts performance by pre-aggregating data before it is sent across the wire, so reduce has a smaller data set to work with
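The max example above, checked in Python, plus a counterexample. The split of the five values across two mappers is assumed for illustration, and `mean` is included only to show a function that cannot be used as a combiner as-is.

```python
mapper_a = [0, 20, 10]   # values emitted by one map task (assumed split)
mapper_b = [25, 15]      # values emitted by another map task

# max: combining per mapper first gives the same answer as one big max.
without_combiner = max(mapper_a + mapper_b)
with_combiner = max(max(mapper_a), max(mapper_b))

# Values crossing the wire: all five without a combiner, one per mapper with it.
sent_without = len(mapper_a) + len(mapper_b)
sent_with = 2

def mean(xs):
    return sum(xs) / len(xs)

# mean: a mean of per-mapper means is NOT the overall mean.
true_mean = mean(mapper_a + mapper_b)
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])

print(without_combiner, with_combiner)
print(true_mean, mean_of_means)
```

So a combiner is an optimization that only applies when partial results can be merged without changing the final answer.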
