Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 6

Chapter 7 MapReduce Types and Formats

General form of map & reduce 
  • map: (k1, v1) -> list(k2, v2)
  • reduce: (k2, list(v2)) -> list(k3, v3)
  • Keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle and sort phase.
  • Reduce input must have the same types as the map output
  • The number of map tasks is not set directly; it equals the number of splits the input is divided into
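
To make the type algebra concrete, here is a minimal word count sketch (class names are mine, not the book's); the generic parameters are exactly the (k1,v1)/(k2,v2)/(k3,v3) pairs above:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (k1,v1) -> list(k2,v2), with (k1,v1) = (LongWritable, Text)
    // and (k2,v2) = (Text, IntWritable)
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE); // emits list(k2,v2)
          }
        }
      }
    }

    // reduce: (k2, list(v2)) -> list(k3,v3); input types match the map output
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emits (k3,v3)
      }
    }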
Input formats 
  • Data is passed to the mapper by an InputFormat; the InputFormat is a factory of RecordReader objects that extract the (key, value) pairs from the input source
  • Each mapper deals with a single input split, no matter the file size
  • An input split is the chunk of the input processed by a single map, represented by InputSplit
  • InputSplits are created by the InputFormat; FileInputFormat is the base class for file-based formats (TextInputFormat is the default)
  • Can process Text, XML, Binary 
RecordReader 
  • Ensures each (key, value) pair is processed
  • Ensures no (k, v) pair is processed more than once
  • Handles (k, v) records that span split boundaries
Output formats: Text output, binary output 
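
A hedged driver sketch in the new (org.apache.hadoop.mapreduce) API, reusing the word count classes from the sketch above, showing where the input and output formats get wired in:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);   // the default, shown explicitly
        job.setOutputFormatClass(TextOutputFormat.class); // plain text output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }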


Chapter 8 MapReduce Features

Counters
  • Useful for QC, problem diagnostics
  • Built-in counter groups: task & job (plus user-defined counters)
    • Task Counter
      • gather task-level statistics over the task's execution
      • count things like input records, skipped records, output bytes, etc.
      • values are periodically sent across the network to be globally aggregated
    • Job Counter
      • measure job level statistics
    • User defined counters 
      • Counters are defined by a Java enum
      • Counters are global: the framework aggregates them across all mappers and reducers
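
A small sketch of a user-defined counter; the enum, class name, and "malformed record" condition are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // the enum defines a counter group; each constant is one counter
      enum Quality { MALFORMED }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.toString().isEmpty()) {
          context.getCounter(Quality.MALFORMED).increment(1); // aggregated globally
          return;
        }
        // ... normal record processing ...
      }
    }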
Sorting
  • Sort order is controlled by a RawComparator
  • Partial sort, secondary sort (Lab)
Joins

To run a map-side join, use CompositeInputFormat from the org.apache.hadoop.mapreduce.join package (the inputs must be sorted and partitioned in the same way)

Side Data: Read-only data required to process the main data set, e.g., a list of words to skip for wordcount

Distributed Cache: To save network bandwidth, side-data files are copied to each node once per job, not once per task.
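
A sketch of consuming side data in a mapper, assuming the file was shipped with the job as -files stopwords.txt (file name and mapper types are mine); the cache localizes it on each node, where it appears in the task's working directory:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StopWordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Set<String> stopWords = new HashSet<String>();

      @Override
      protected void setup(Context context) throws IOException {
        // localized on this node once per job; read from the working directory
        BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            stopWords.add(line.trim());
          }
        } finally {
          in.close();
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty() && !stopWords.contains(token)) {
            context.write(new Text(token), new IntWritable(1));
          }
        }
      }
    }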

Map-only Jobs: Image processing, ETL, select * from table, distcp, input data sampling

Partitioning 

A MapReduce job will most of the time have more than one reducer, so when a mapper emits a key-value pair it has to be sent to one of the reducers. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers), and a Partitioner is responsible for performing it.

For more info, this article has a detailed explanation:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
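
A minimal custom Partitioner sketch; the routing policy here (first character of the key) is purely illustrative, while the default HashPartitioner applies the same masking-and-modulo idea to the whole key's hashCode:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
          return 0;
        }
        // mask the sign bit so the modulo result is never negative
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }

It is enabled in the driver with job.setPartitionerClass(FirstCharPartitioner.class); any deterministic function of the key works, since equal keys must always land on the same reducer.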

  • Zero reducers: only map tasks
  • One reducer: when the amount of data is small enough to be processed comfortably by one reducer
Combiners
  • Used for reducing intermediate data
  • Runs locally on single mapper’s  o/p
  • Best suited for commutative and associative operations
  • Code identical to Reducer 
  • conf.setCombinerClass(Reducer.class);
  • May or may not run on the output from some or all mappers
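
For reference, the new-API equivalent, reusing the word count reducer from the earlier sketch; summing is safe to combine because partial sums compose: sum(1,1,1,1) = sum(sum(1,1), sum(1,1)):

    job.setCombinerClass(WordCountReducer.class);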

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5 Map Reduce Process Flow

General dev process
  • Write & test map & reduce functions in dev/local
  • Deploy job to cluster
  • Tune job later
The Configuration API
  • A collection of configuration properties and their values is an instance of the Configuration class (org.apache.hadoop.conf package)
  • Properties are stored in xml files
  • Default properties are stored in core-default.xml; there are multiple conf files for different subsystems
  • If mapred.job.tracker is set to local, execution uses the local job runner mode
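
A minimal sketch of the Configuration API; configuration-1.xml and the property names are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("configuration-1.xml"); // later resources override earlier ones
        System.out.println(conf.get("color"));       // String property, null if unset
        System.out.println(conf.getInt("size", 0));  // typed accessor with a default
      }
    }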
Setting up Dev
  • Pseudo-distributed cluster is one whose daemons all run on the local machine
  • Use MRUnit for testing Map/Reduce
  • Write separate classes for cleaning the data 
  • Tool and ToolRunner are helpful for debugging and for parsing CLI options
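
A Tool/ToolRunner skeleton as I understand it (the driver name is mine); ToolRunner runs GenericOptionsParser underneath, so the standard -D, -conf, -fs, and -jt flags all work:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // getConf() already reflects any -D/-conf overrides from the command line
        System.out.println(getConf().get("mapred.job.tracker"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
      }
    }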
Running on Cluster
  • Job's classes must be packaged to a jar file
  • setJarByClass() tells Hadoop to find the jar file for your program by locating the jar containing the given class
  • Use Ant or Maven to create jar file
  • On a cluster, map and reduce tasks run in separate JVMs
  • waitForCompletion() launches the job and polls for progress
  • Ex: 275 GB of input data read from 34 GB of compressed files broken into 101 gzip files
  • Typical job ID format: job_YYYYMMDDTTmm_XXXX
  • Tasks belong to jobs; typical task ID format: task_<jobID>_m_XXXXXX (m -> map, r -> reduce)
  • Use the MR WebUI to track the jobtracker page, tasks page, and task details page
Debugging Job
  • Use the WebUI to analyse the tasks and task details pages
  • Use custom counters
  • Hadoop Logs: Common logs for system daemons, HDFS audit, MapRed Job hist, MapRed task hist
Tuning Jobs

Can I make it run faster? This analysis is done after the job has run
  • If mappers are taking only a few seconds each, use fewer mappers and let each one run longer
  • # of reducers should be slightly less than the number of reduce slots
  • Combiners (for commutative and associative operations) cut the amount of map output shuffled
  • Map output compression (sketch after this list)
  • Consider serialization frameworks other than Writable
  • Shuffle tweaks
  • Profiling Tasks: HPROF profiling tool comes with JDK
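
A hedged sketch of the map output compression tweak; the property names are the Hadoop 2 ones (older releases use mapred.compress.map.output), and Snappy is just one codec choice:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class CompressDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // compress intermediate map output to cut shuffle network traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
      }
    }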
More than one MapReduce & workflow management

  • ChainMapper/ChainReducer to run multiple mappers in a single task
  • Linear chain job flow 
    • JobClient.runJob(conf1);
    • JobClient.runJob(conf2);
  • Apache Oozie is used for running dependent jobs; it is composed of a workflow engine and a coordinator engine (Lab)
    • unlike JobControl, Oozie runs as a service
    • a workflow is a DAG of action nodes and control-flow nodes
    • Action nodes do the work: moving files in HDFS, running MR, Pig, or Hive jobs
    • Control-flow nodes coordinate the action nodes
    • Workflows are written in XML; a minimal one has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node (see the sketch below)
    • All workflows must have one start and one end node
    • Packaging, deploying, and running an Oozie workflow job
    • Coordinator jobs are triggered by predicates, usually a time interval, a date, or an external event
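
A minimal workflow.xml sketch matching the structure described above; the app name, mapper class, and error-message EL expression are illustrative:

    <workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
      <start to="max-temp-mr"/>
      <action name="max-temp-mr">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>MaxTemperatureMapper</value>
            </property>
            <!-- reducer class, input/output dirs, etc. go here -->
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>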