A common way of avoiding data loss is replication; Hadoop provides storage via HDFS and analysis via MapReduce.
Disk Drive
- Seek time: the time taken to move the disk's head to a particular location on the disk to read or write data; a typical RDBMS uses a B-Tree index, which depends on fast seeks
 
- Transfer rate: the disk bandwidth when streaming data; Big Data apps need a high transfer rate because the access pattern is write once, read many times
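A back-of-the-envelope calculation shows why transfer rate matters more than seek time for this access pattern. The drive figures below (10 ms average seek, 100 MB/s transfer, 1 MB per block) are illustrative assumptions, not from the text:

```java
// Rough comparison: one long sequential read vs a seek before every block.
// Assumed (illustrative) drive figures: 10 ms seek, 100 MB/s transfer, 1 MB blocks.
public class SeekVsStream {
    // Streaming: a single seek, then one sequential transfer.
    static double streamSeconds(double fileMB, double seekMs, double rateMBs) {
        return seekMs / 1000.0 + fileMB / rateMBs;
    }

    // Seek-heavy access: one seek per 1 MB block, plus the same transfer time.
    static double seekPerBlockSeconds(double fileMB, double seekMs, double rateMBs) {
        return fileMB * (seekMs / 1000.0) + fileMB / rateMBs;
    }

    public static void main(String[] args) {
        System.out.printf("streaming: %.2f s, seek-per-block: %.2f s%n",
                streamSeconds(1024, 10, 100),       // ~10.25 s for 1 GB
                seekPerBlockSeconds(1024, 10, 100)); // ~20.48 s for the same 1 GB
    }
}
```

With these numbers, seeking before every block roughly doubles the read time for a 1 GB scan, and the gap widens as seeks get smaller and more frequent.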
 
New in Hadoop 2.X
- HDFS High Availability (HA)
- HDFS Federation
- YARN (MapReduce 2)
- Secure authentication
 
Chapter 2 MapReduce
i/p -> map -> shuffle -> sort -> reduce -> o/p
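The pipeline above can be sketched in plain Java for a word count — no Hadoop dependency, just an in-memory imitation of the map, shuffle/sort, and reduce phases:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // map: one input line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // shuffle + sort: group values by key; TreeMap keeps keys sorted
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // reduce: (word, [1, 1, ...]) -> (word, count)
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line)); // i/p -> map
        return reduce(shuffle(pairs));                     // shuffle -> sort -> reduce -> o/p
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop the same roles are played by a Mapper class, the framework's shuffle/sort, and a Reducer class, with the phases running distributed across nodes.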
Some differences between the old and new MapReduce APIs (MR1 vs MR2)
- package org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce
- a Context object unifies JobConf, OutputCollector, and Reporter
- JobClient is deprecated
- map output files are named part-m-nnnnn, reduce output files part-r-nnnnn
- the values passed to reduce() changed from Iterator to Iterable
 
Data Locality
- A map task runs on the data node where its input data resides.
- Map tasks write their output to local disk, not HDFS: it is intermediate data, discarded once the job completes.
- There is no data locality for reduce; the output of all mappers is transferred across the network to the reduce node.
- The number of reduce tasks is not governed by the input size; it is set independently.
- Zero reduce tasks are possible (a map-only job).
 
Combiner
- Runs on the output of each map task.
- Requires an associative (and commutative) function: max(0, 20, 10, 15, 25) = max(max(0, 20, 10), max(25, 15)).
- Boosts performance by pre-aggregating map output before it is sent across the wire, so the reducer works on a smaller data set.
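The associativity check above can be verified in plain Java — a sketch simulating two map tasks that each run max as a local combiner before the reducer combines the partial results (illustrative only, no Hadoop dependency):

```java
import java.util.*;

public class CombinerDemo {
    static int max(List<Integer> xs) {
        return Collections.max(xs);
    }

    // Simulate two map tasks each running the combiner (max) locally,
    // then the reducer combining only the two partial maxima.
    static int combinedMax(List<Integer> mapper1Out, List<Integer> mapper2Out) {
        int partial1 = max(mapper1Out); // combiner on mapper 1's output
        int partial2 = max(mapper2Out); // combiner on mapper 2's output
        return Math.max(partial1, partial2); // reducer sees 2 values instead of 5
    }

    public static void main(String[] args) {
        // Global max over all values vs max of per-mapper maxima: same answer.
        System.out.println(max(List.of(0, 20, 10, 15, 25)));                 // 25
        System.out.println(combinedMax(List.of(0, 20, 10), List.of(25, 15))); // 25
    }
}
```

Note that a non-associative function such as mean cannot be used as a combiner this way: the mean of per-mapper means is not, in general, the global mean.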
 