Elegant Data: Hadoop Definitive Guide Byte Sized Notes

Chapter 7 MapReduce Types and Formats

General form of map & reduce

map: (k1, v1) -> list(k2, v2)
reduce: (k2, list(v2)) -> list(k3, v3)
Keys emitted during the map phase must implement WritableComparable so that they can be sorted during the shuffle and sort phase.
The reduce input types must match the map output types.
The number of map tasks is not set directly; it equals the number of splits that the input is turned into.
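
The general form above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class MapReduceShape and its run method are hypothetical stand-ins for the framework, word count is chosen as the example job, and a TreeMap plays the role of the shuffle and sort phase.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceShape {
    // map: (k1, v1) -> list(k2, v2); here (line offset, line) -> list(word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (k2, list(v2)) -> list(k3, v3); here (word, counts) -> list(word, total)
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(word, sum));
        return out;
    }

    // Hypothetical stand-in for the framework: map every record,
    // group values by sorted key, then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(e.getKey(), e.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }
}
```

Note how the value type that map emits (Integer) is the same type that reduce receives as a list, which is the type constraint stated above.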

Input formats

Data is passed to the mapper by an InputFormat. The InputFormat is a factory for RecordReader objects, which extract the (key, value) pairs from the input source
Each mapper deals with a single input split, no matter the file size
An input split is the chunk of the input processed by a single map, represented by InputSplit
InputSplits are created by the InputFormat; FileInputFormat is the base class for file-based input formats
Input formats can process text, XML, and binary data

RecordReader

Ensures every (key, value) pair is processed
Ensures no (key, value) pair is processed more than once
Handles (key, value) pairs that are cut in two by a split boundary
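
One way to meet all three guarantees for line-oriented input is sketched below. It is a simplified illustration, not Hadoop's actual RecordReader code: the class SplitLineReader is hypothetical, it works on an in-memory String with character offsets rather than a byte stream, and a split is assumed to be the half-open range [start, end).

```java
import java.util.ArrayList;
import java.util.List;

class SplitLineReader {
    // Returns the lines owned by the split [start, end) of data.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the reader of the previous split finishes
            // this line, so skip to just after the first newline.
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before `end` belongs to this split,
        // even when it runs past the split boundary.
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return lines;
    }
}
```

Because every reader except the first always discards its first line, and every reader reads one line past its own end, each line is claimed by exactly one split wherever the boundary falls.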

Output formats: Text output, binary output

Chapter 8 MapReduce Features

Counters

Useful for quality control and problem diagnostics
Two built-in counter groups: task and job

Task Counter

Gather task-level statistics over the course of a task's execution
Count things like input records, skipped records, and output bytes
Are sent across the network periodically so that they can be aggregated for the whole job

Job Counter

Measure job-level statistics

User defined counters

Counters are defined by a Java enum
Counters are global: the framework aggregates them across all map and reduce tasks
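
The idea of enum-defined counters that end up as job-wide totals can be sketched without Hadoop. Everything here is hypothetical: the enum RecordQuality, the validity rule (a record is valid if it is an integer), and the aggregate method, which only stands in for the summing that the framework does for you.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class CounterSketch {
    // Hypothetical user-defined counter group, declared as an enum
    enum RecordQuality { VALID, MALFORMED }

    // Each "task" keeps its own local counts while processing its records
    static Map<RecordQuality, Long> runTask(List<String> records) {
        Map<RecordQuality, Long> local = new EnumMap<>(RecordQuality.class);
        for (String r : records) {
            RecordQuality c = r.matches("-?\\d+") ? RecordQuality.VALID : RecordQuality.MALFORMED;
            local.merge(c, 1L, Long::sum);
        }
        return local;
    }

    // Stand-in for the framework summing per-task counts into job-wide totals
    static Map<RecordQuality, Long> aggregate(List<Map<RecordQuality, Long>> perTask) {
        Map<RecordQuality, Long> total = new EnumMap<>(RecordQuality.class);
        for (Map<RecordQuality, Long> t : perTask) {
            t.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }
}
```

In a real job the task code would only increment the counter; the per-task bookkeeping and the aggregation shown here are handled by Hadoop.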

Sorting

Sort order is controlled by a RawComparator
Partial sort, secondary sort (Lab)
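
The point of a raw comparator is that keys can be ordered from their serialized bytes. The sketch below illustrates that idea in plain Java and does not implement Hadoop's RawComparator interface: the class RawIntOrder is hypothetical, and keys are assumed to be ints serialized as 4 big-endian bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

class RawIntOrder {
    // Serialize an int key as 4 big-endian bytes (assumed binary form)
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    // Compare two serialized keys directly from their bytes,
    // without building key objects first
    static int compareRaw(byte[] a, byte[] b) {
        int left = ByteBuffer.wrap(a).getInt();
        int right = ByteBuffer.wrap(b).getInt();
        return Integer.compare(left, right);
    }

    // Sort serialized keys; swapping in a different comparator
    // changes the sort order
    static byte[][] sort(byte[][] keys, Comparator<byte[]> order) {
        byte[][] copy = keys.clone();
        Arrays.sort(copy, order);
        return copy;
    }
}
```

Passing RawIntOrder::compareRaw gives ascending order, and passing its reversed() form gives descending order, which is the sense in which the comparator controls the sort.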

Joins

To run a map-side join, use CompositeInputFormat from the org.apache.hadoop.mapreduce.join package

Side Data: Read-only data required to process the main data set, for example the skip words for word count

Distributed Cache: To save network bandwidth, side data files are normally copied to any particular node once per job.

Map-only Jobs: Image processing, ETL, Select * from table, distcp, input data sampling

Partitioning

A MapReduce job will, most of the time, have more than one reducer. So when a mapper emits a key-value pair, it has to be sent to one of the reducers. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning (the space of key-value pairs is partitioned among the reducers). A Partitioner is responsible for performing the partitioning.

For more information, this article has a detailed explanation:

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
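
As a rough illustration of what a partitioner computes, the sketch below picks a reducer from the key's hash. The class HashPartitionSketch is hypothetical and uses String keys for simplicity; the formula follows the approach of Hadoop's default hash partitioning, but treat it as a sketch rather than the library code.

```java
class HashPartitionSketch {
    // Pick a reducer index in [0, numReducers) from the key's hash.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the
    // result is never negative even when hashCode() is.
    static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

The same key always maps to the same index, which is what guarantees that all values for a key reach a single reducer. A custom partitioner replaces this function when keys need to be grouped differently.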

Zero reducers: only map tasks run

One reducer: when the amount of data is small enough to be processed comfortably by one reducer

Combiners

Used to reduce the amount of intermediate data sent to the reducers
Runs locally on a single mapper’s output
Best suited to commutative and associative operations
Code is often identical to the Reducer
conf.setCombinerClass(Reducer.class);
The framework may or may not run it on the output from some or all mappers, so the job's result must not depend on it
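
Why the operation should be commutative and associative can be seen with a small simulation. This is a sketch with hypothetical names (CombinerSketch, withCombiner) that uses a sum as the reduce function and lets a boolean flag per mapper decide whether the combiner ran on that mapper's output.

```java
import java.util.ArrayList;
import java.util.List;

class CombinerSketch {
    // The reduce function; the combiner reuses it unchanged
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Reduce over every mapper's raw output, with no combiner
    static int withoutCombiner(List<List<Integer>> mapperOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> out : mapperOutputs) {
            all.addAll(out);
        }
        return sum(all);
    }

    // Run the combiner on some mappers' outputs only, then reduce the mix
    static int withCombiner(List<List<Integer>> mapperOutputs, boolean[] combine) {
        List<Integer> shuffled = new ArrayList<>();
        for (int i = 0; i < mapperOutputs.size(); i++) {
            if (combine[i]) {
                shuffled.add(sum(mapperOutputs.get(i)));  // one partial sum is shuffled
            } else {
                shuffled.addAll(mapperOutputs.get(i));    // raw values are shuffled
            }
        }
        return sum(shuffled);
    }
}
```

For a sum, the final result is the same whichever mappers had the combiner applied, while fewer values cross the network. An operation such as a plain mean does not have this property, because the mean of partial means is generally not the overall mean.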

Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 6