A common way of avoiding data loss is replication. Hadoop addresses both halves of the problem: storage with HDFS, analysis with MapReduce.
Disk Drive
- Seek Time: the time taken to move the disk's head to a particular location on the disk to read or write data; a typical RDBMS uses B-Trees, which depend on fast seeks
- Transfer Rate: the disk's bandwidth when streaming data; Big Data apps need a high transfer rate because the access pattern is write once, read many times
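To see why transfer rate matters more than seek time for this access pattern, here is a back-of-the-envelope sketch. The drive numbers (10 ms average seek, 100 MB/s sustained transfer) are illustrative assumptions, not figures from the text:

```python
# Assumed (illustrative) drive characteristics.
SEEK_MS = 10            # average seek time, ms
TRANSFER_MB_PER_S = 100 # sustained transfer rate, MB/s

def streaming_read_ms(total_mb):
    """Read total_mb sequentially: one seek, then stream the rest."""
    return SEEK_MS + total_mb / TRANSFER_MB_PER_S * 1000

def random_read_ms(total_mb, chunk_mb):
    """Read total_mb in random chunk_mb pieces: one seek per chunk."""
    chunks = total_mb / chunk_mb
    return chunks * (SEEK_MS + chunk_mb / TRANSFER_MB_PER_S * 1000)

# 1 GB streamed vs. 1 GB read as random 64 KB chunks
print(streaming_read_ms(1024))       # ~10.25 seconds
print(random_read_ms(1024, 0.064))   # ~170 seconds: seeks dominate
```

Under these assumed numbers the random-access read is over 16x slower, which is why HDFS optimizes for large sequential reads.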
New in Hadoop 2.X
- HDFS HA
- HDFS Federation
- YARN (MapReduce 2)
- Secure Authentication
Chapter 2 MapReduce
i/p -> map -> shuffle -> sort -> reduce -> o/p
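The pipeline above can be simulated in a few lines. This is a conceptual sketch (a word-count job) of what the framework does between map and reduce, not Hadoop API code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # emit (word, 1) for each word, like a WordCount mapper
    return [(word, 1) for word in line.split()]

def shuffle_sort(pairs):
    # sort by key, then group values per key, like the framework's shuffle
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(key, values):
    # sum all counts for one key
    return (key, sum(values))

lines = ["hello world", "hello hadoop"]                      # i/p
intermediate = [kv for line in lines for kv in map_phase(line)]  # map
grouped = shuffle_sort(intermediate)                         # shuffle + sort
result = dict(reduce_phase(k, vs) for k, vs in grouped)      # reduce
print(result)  # {'hadoop': 1, 'hello': 2, 'world': 1}       # o/p
```

The key point the diagram encodes: mappers emit key/value pairs, the framework sorts and groups them by key, and each reducer sees one key with all its values.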
Some differences between the old (MR1) and new (MR2) APIs
- package org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce
- Context object unifies JobConf, OutputCollector, Reporter
- JobClient deprecated
- map o/p files: part-m-nnnnn, reduce o/p files: part-r-nnnnn
- reduce values changed from Iterator to Iterable
Data Locality
- To run a map task on the data node where the data resides
- map task writes output to local disk not HDFS
- No data locality for reduce: the o/p of all mappers gets transferred over the network to the reduce node
- the # of reduce tasks is not governed by the size of the i/p; it is set independently
- 0 reduce tasks are possible
Combiner
- performed on the output of the map task
- relies on the associative (and commutative) property of the function: max(0,20,10,15,25) = max( max(0,20,10), max(25,15) )
- boosts performance by pre-aggregating map output before it is sent across the wire, so the reducer has a smaller data set to work with
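The max() example above can be sketched directly. Assume two mappers whose raw outputs are [0, 20, 10] and [25, 15]; the combiner runs locally on each, so only one value per mapper crosses the network:

```python
def combiner(values):
    # runs locally on each mapper's output; valid here because
    # max is associative and commutative
    return max(values)

# assumed raw outputs of two mappers (illustrative data)
mapper_outputs = [[0, 20, 10], [25, 15]]

# only the combined values cross the wire: 2 values instead of 5
combined = [combiner(vals) for vals in mapper_outputs]
print(combined)  # [20, 25]

# the reducer applies the same function to the combined values
final = max(combined)
print(final)     # 25, same as max(0, 20, 10, 15, 25)
```

The result is identical with or without the combiner; the combiner only reduces the volume of intermediate data shuffled to the reducer.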