Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5: MapReduce Process Flow

General dev process
  • Write & test map & reduce functions in dev/local
  • Deploy job to cluster
  • Tune job later
The Configuration API
  • A collection of configuration properties and their values is represented by an instance of the Configuration class (org.apache.hadoop.conf package); see the sketch after this list
  • Properties are stored in XML files
  • Default properties are stored in core-default.xml; site-specific overrides live in separate conf files (core-site.xml, hdfs-site.xml, mapred-site.xml)
  • If mapred.job.tracker is set to local, the job runs in local job runner mode
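
A minimal sketch of the Configuration API, assuming a placeholder resource file configuration-1.xml on the classpath that defines color and size properties (the file and property names are illustrative only):

    import org.apache.hadoop.conf.Configuration;

    public class ConfigurationExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // properties are loaded from XML resources; later resources override earlier ones
            conf.addResource("configuration-1.xml");

            // typed getters take a default value used when the property is absent
            String color = conf.get("color", "unknown");
            int size = conf.getInt("size", 0);
            System.out.println("color=" + color + ", size=" + size);
        }
    }
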
Setting up Dev
  • Pseudo-distributed cluster is one whose daemons all run on the local machine
  • Use MRUnit for unit-testing mappers and reducers (see the test sketch after this list)
  • Write separate classes for cleaning the data 
  • Tool and ToolRunner are helpful for debugging and for parsing command-line options (see the driver sketch after the Running on Cluster list)
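
A hedged MRUnit sketch: a small word-count style mapper plus a test that runs it in memory with MapDriver, so no cluster or HDFS is needed. TokenMapper and the test data are illustrative, not code from the book:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenMapperTest {

        // mapper under test: emits (token, 1) for each whitespace-separated token
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        context.write(new Text(token), ONE);
                    }
                }
            }
        }

        @Test
        public void emitsOneForEachToken() throws Exception {
            new MapDriver<LongWritable, Text, Text, IntWritable>()
                    .withMapper(new TokenMapper())
                    .withInput(new LongWritable(0), new Text("hadoop hadoop"))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .runTest();
        }
    }
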
Running on Cluster
  • The job's classes must be packaged into a JAR file
  • setJarByClass() tells Hadoop which JAR to ship by locating the JAR that contains the given class
  • Use Ant or Maven to create jar file
  • On a cluster, map and reduce tasks run in separate JVMs
  • waitForCompletion() launches the job and polls for progress (see the driver sketch after this list)
  • Ex: 275 GB of input data read from 34 GB of compressed files, broken into 101 gzip files
  • Typical job ID format: job_YYYYMMDDHHmm_XXXX (jobtracker start time plus a job counter)
  • Tasks belong to jobs; typical task ID format: task_<jobid>_m_XXXXXX (m -> map, r -> reduce)
  • Use the MapReduce web UI to track jobs: the jobtracker page, the job page, and the tasks page
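
A hedged driver sketch tying the pieces together: Tool/ToolRunner for parsing generic options, setJarByClass() so Hadoop can locate the job JAR, and waitForCompletion() to launch and poll the job. MaxTemperatureDriver is an illustrative name and the mapper/reducer settings are omitted:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MaxTemperatureDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.printf("Usage: %s <input> <output>%n", getClass().getSimpleName());
                return -1;
            }
            Job job = Job.getInstance(getConf(), "Max temperature");
            job.setJarByClass(getClass());                  // locate the JAR containing this class
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // mapper, reducer and output types would be set here
            return job.waitForCompletion(true) ? 0 : 1;     // launch and poll until done
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses generic options such as -conf and -D before calling run()
            System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
        }
    }

After packaging with Ant or Maven, a driver like this is typically launched with hadoop jar <your-jar> MaxTemperatureDriver <input> <output> (jar and path names are placeholders).
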
Debugging Job
  • Use the web UI's tasks page and task details page to analyse failing or slow tasks
  • Use custom counters (see the counter sketch after this list)
  • Hadoop logs: common logs for the system daemons, HDFS audit logs, MapReduce job history, and MapReduce task logs
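
A hedged sketch of a custom counter, assuming a record-cleaning mapper; the enum and the "malformed record" check are illustrative. Counter values are aggregated across tasks and show up grouped by enum type in the web UI and the job output:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordCleaningMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // counters are grouped by the enum type
        enum Quality { MALFORMED_RECORDS }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.trim().isEmpty()) {
                // increments are visible under the Quality counter group
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;
            }
            context.write(new Text(line), NullWritable.get());
        }
    }
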
Tuning Jobs

Can I make it run faster? This analysis is done after the job has been run
  • If mappers are taking only a few seconds each, use fewer mappers and let each one run longer
  • # of reducers should be slightly less than the number of reduce slots in the cluster
  • Combiners: if the operation is commutative and associative, a combiner cuts the map output shuffled to the reducers (see the tuning sketch after this list)
  • Map output compression
  • Serialization other than Writable
  • Shuffle tweaks
  • Profiling Tasks: HPROF profiling tool comes with JDK
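
A hedged tuning sketch showing a combiner, map output compression, and the reducer count being set on a Job. The property names are the newer MapReduce 2 names (older releases use mapred.compress.map.output and mapred.map.output.compression.codec), IntSumReducer stands in for a commutative and associative reduce function, and the reducer count of 8 is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class TuningSketch {
        public static Job configure(Configuration conf) throws Exception {
            // compress intermediate map output to cut shuffle traffic
            // (SnappyCodec needs the native Snappy library on the cluster)
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "tuned job");
            // IntSumReducer is commutative and associative, so it can safely
            // double as the combiner and shrink the data sent to reducers
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(8);   // keep this a little below the cluster's reduce slots
            return job;
        }
    }
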
More than one MapReduce & workflow management

  • ChainMapper/ChainReducer run a chain of mappers within a single task
  • Linear chain job flow (see the JobControl sketch after this list)
    • JobClient.runJob(conf1);
    • JobClient.runJob(conf2);
  • Apache Oozie is used for running dependent jobs; it is composed of a workflow engine and a coordinator engine (Lab)
    • unlike JobControl, Oozie runs as a service
    • a workflow is a DAG of action nodes and control-flow nodes
    • Action nodes perform workflow tasks: moving files in HDFS, running MapReduce, Pig, or Hive jobs
    • Control-flow nodes coordinate the action nodes
    • Workflows are written in XML; a simple example has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node
    • All workflows must have one start and one end node.
    • Packaging, deploying, and running an Oozie workflow job
    • Coordinator jobs are triggered by predicates, usually a time interval, a date, or an external event
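
A hedged sketch of a linear chain of two dependent jobs using JobControl/ControlledJob (job names are placeholders and per-job configuration is omitted); unlike Oozie, JobControl runs inside the client JVM rather than as a service:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class LinearChainSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job first = Job.getInstance(conf, "first stage");     // mapper/reducer/paths set elsewhere
            Job second = Job.getInstance(conf, "second stage");

            ControlledJob cFirst = new ControlledJob(first, null);
            ControlledJob cSecond = new ControlledJob(second, null);
            cSecond.addDependingJob(cFirst);      // second starts only after first succeeds

            JobControl control = new JobControl("linear-chain");
            control.addJob(cFirst);
            control.addJob(cSecond);

            // JobControl runs in the client JVM; poll until every job has finished or failed
            Thread runner = new Thread(control);
            runner.start();
            while (!control.allFinished()) {
                Thread.sleep(1000);
            }
            control.stop();
        }
    }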

