Chapter 7 MapReduce Types and Formats
General form of map & reduce
- map: (k1, v1) -> list(k2, v2)
- reduce: (k2, list(v2)) -> list(k3, v3)
- Keys emitted during the map phase must implement WritableComparable so that they can be sorted during the shuffle-and-sort phase.
- Reduce input must have the same types as the map output
- The number of map tasks is not set directly; it equals the number of splits that the input is turned into.
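
As a concrete illustration, here is a minimal word-count sketch using the new org.apache.hadoop.mapreduce API (class names are examples); the general form above corresponds directly to the generic type parameters of Mapper and Reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (k1, v1) -> list(k2, v2)
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), ONE); // emit (k2, v2)
            }
        }
    }

    // reduce: (k2, list(v2)) -> list(k3, v3); input types match the map output
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (k3, v3)
        }
    }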
Input formats
- Data is passed to the mapper by an InputFormat. An InputFormat is a factory for RecordReader objects, which extract the (key, value) pairs from the input source.
- Each mapper deals with a single input split, no matter how large the file is.
- An input split is the chunk of the input processed by a single map, represented by the InputSplit class.
- InputSplits are created by the InputFormat; FileInputFormat is the base class for file-based input formats.
- Can process text, XML, and binary data.
RecordReader
- Ensures every (key, value) pair is processed
- Ensures no (key, value) pair is processed more than once
- Handles (key, value) pairs that span split boundaries
Output formats: text output (TextOutputFormat, the default) and binary output (e.g., SequenceFileOutputFormat)
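
A minimal driver sketch (new API; the job name and path arguments are placeholders) showing where the input and output formats are wired up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class FormatDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "format-demo");
            job.setJarByClass(FormatDemo.class);
            // Mapper/reducer classes would also be set here.
            // TextInputFormat (the default): keys are byte offsets, values are lines.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // SequenceFileOutputFormat writes binary (key, value) pairs.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }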
Chapter 8 MapReduce Features
Counters
- Useful for quality control and problem diagnostics
- Two counter groups: task counters and job counters
- Task Counters
- gather information about tasks as they execute
- count things like input records, skipped records, output bytes, etc.
- their values are periodically sent across the network to be aggregated for the job
- Job Counters
- measure job-level statistics, e.g., the number of map and reduce tasks launched
- User-defined counters
- Counters are defined by a Java enum
- Counters are global: the framework aggregates them across all map and reduce tasks
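
A minimal sketch of a user-defined counter (the enum and mapper names are examples); the enum's fully qualified name becomes the counter group, and each constant is a counter within it:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // User-defined counters: one counter per enum constant.
        enum RecordQuality { WELL_FORMED, MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().trim().isEmpty()) {
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return; // skip malformed records
            }
            context.getCounter(RecordQuality.WELL_FORMED).increment(1);
            context.write(new Text(value.toString()), new IntWritable(1));
        }
    }

After the job finishes, the aggregated (global) values can be read in the driver with job.getCounters().findCounter(RecordQuality.MALFORMED).getValue().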
Sorting
- Sort order is controlled by a RawComparator
- Partial sort, secondary sort (Lab)
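
A sketch of overriding the sort order (DescendingTextComparator is a made-up example, not a Hadoop class); WritableComparator implements RawComparator, so it can be registered as the job's sort comparator:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class DescendingTextComparator extends WritableComparator {
        public DescendingTextComparator() {
            super(Text.class, true); // true: create instances for deserialized comparison
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b); // invert the natural (ascending) order
        }
    }

    // In the driver:
    // job.setSortComparatorClass(DescendingTextComparator.class);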
Joins
To run a map-side join, use CompositeInputFormat from the org.apache.hadoop.mapreduce.join package
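
A driver-side sketch, assuming both inputs are already sorted by key and identically partitioned (paths are placeholders; in recent Hadoop releases the new-API classes live under org.apache.hadoop.mapreduce.lib.join):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    class JoinDriver {
        static void configureMapSideJoin(Job job) {
            job.setInputFormatClass(CompositeInputFormat.class);
            // "inner" joins records that share a key in both inputs.
            job.getConfiguration().set("mapreduce.join.expr",
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                    new Path("/data/left"), new Path("/data/right")));
            // Each map() call then receives a TupleWritable holding one value per source.
        }
    }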
Side data: read-only data required to process the main data set, e.g., the skip words for word count
Distributed cache: to save network bandwidth, side-data files are normally copied to any particular node once per job rather than once per task.
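
A sketch of using the distributed cache for the skip-word example (file name and paths are made up): the driver registers the side file once, and each task reads its local copy in setup():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver: job.addCacheFile(new java.net.URI("/side/skip-words.txt#skip-words.txt"));
    // The '#skip-words.txt' fragment creates a symlink of that name in the
    // task's working directory.
    class SkipWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Set<String> skipWords = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("skip-words.txt"))) {
                String line;
                while ((line = in.readLine()) != null) skipWords.add(line.trim());
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!skipWords.contains(word)) {
                    context.write(new Text(word), new IntWritable(1));
                }
            }
        }
    }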
Map-only jobs: image processing, ETL, SELECT * FROM table, distcp, input data sampling
Partitioning
A MapReduce job will most of the time have more than one reducer, so when a mapper emits a key-value pair, it has to be sent to one of them. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers), and a Partitioner is responsible for performing it.
For more information, this article has a detailed explanation:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
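
A sketch of a custom Partitioner (the first-letter policy is just a toy example; Hadoop's default is HashPartitioner, which buckets by key.hashCode() modulo the reducer count):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) return 0;
            // Mask to keep the bucket non-negative, then map into [0, numPartitions).
            return (Character.toLowerCase(key.charAt(0)) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Registered in the driver with:
    // job.setPartitionerClass(FirstLetterPartitioner.class);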
- Zero reducers: only map tasks run
- One reducer: when the amount of data is small enough to be processed comfortably by one reducer
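
Sketch, assuming a Job instance named job:

    // Map-only job: map output goes straight to the output format (no shuffle/sort).
    job.setNumReduceTasks(0);
    // Single reducer, for output small enough to be handled by one task:
    // job.setNumReduceTasks(1);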
Combiners
- Used to reduce the volume of intermediate data
- Runs locally on a single mapper's output
- Best suited for commutative and associative operations
- Code is often identical to that of the Reducer
- conf.setCombinerClass(Reducer.class);
- May or may not run on the output from some or all of the mappers, so the job must not depend on it running
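
A driver sketch reusing the word-count classes from the Chapter 7 example above (the new-API equivalent of the JobConf call in these notes); since summation is commutative and associative, the reducer class doubles as the combiner:

    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // runs locally on each map task's output
    job.setReducerClass(SumReducer.class);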