Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5: MapReduce Process Flow

General dev process
  • Write & test map & reduce functions in dev/local
  • Deploy job to cluster
  • Tune job later
The Configuration API
  • A collection of configuration properties and their values is represented by an instance of the Configuration class (org.apache.hadoop.conf package); see the sketch after this list
  • Properties are stored in XML files
  • Default properties are stored in core-default.xml; site-specific overrides live in separate conf files (core-site.xml, hdfs-site.xml, mapred-site.xml)
  • If mapred.job.tracker is set to local, the job runs in local job runner mode
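
A minimal sketch of the Configuration API, assuming a placeholder resource file configuration-1.xml on the classpath that defines color and size properties (the file and property names are illustrative only):

    import org.apache.hadoop.conf.Configuration;

    public class ConfigurationExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // properties are loaded from XML resources; later resources override earlier ones
            conf.addResource("configuration-1.xml");

            // typed getters take a default value used when the property is absent
            String color = conf.get("color", "unknown");
            int size = conf.getInt("size", 0);
            System.out.println("color=" + color + ", size=" + size);
        }
    }
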
Setting up Dev
  • Pseudo-distributed cluster is one whose daemons all run on the local machine
  • Use MRUnit for unit-testing mappers and reducers (see the test sketch after this list)
  • Write separate classes for cleaning the data 
  • Tool and ToolRunner are helpful for debugging and for parsing command-line options (see the driver sketch after the Running on Cluster list)
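
A hedged MRUnit sketch: a small word-count style mapper plus a test that runs it in memory with MapDriver, so no cluster or HDFS is needed. TokenMapper and the test data are illustrative, not code from the book:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenMapperTest {

        // mapper under test: emits (token, 1) for each whitespace-separated token
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        context.write(new Text(token), ONE);
                    }
                }
            }
        }

        @Test
        public void emitsOneForEachToken() throws Exception {
            new MapDriver<LongWritable, Text, Text, IntWritable>()
                    .withMapper(new TokenMapper())
                    .withInput(new LongWritable(0), new Text("hadoop hadoop"))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .runTest();
        }
    }
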
Running on Cluster
  • The job's classes must be packaged into a JAR file
  • setJarByClass() tells Hadoop which JAR to ship by locating the JAR that contains the given class
  • Use Ant or Maven to create jar file
  • On a cluster, map and reduce tasks run in separate JVMs
  • waitForCompletion() launches the job and polls for progress (see the driver sketch after this list)
  • Ex: 275 GB of input data read from 34 GB of compressed files, broken into 101 gzip files
  • Typical job ID format: job_YYYYMMDDHHmm_XXXX (jobtracker start time plus a job counter)
  • Tasks belong to jobs; typical task ID format: task_<jobid>_m_XXXXXX (m -> map, r -> reduce)
  • Use the MapReduce web UI to track jobs: the jobtracker page, the job page, and the tasks page
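
A hedged driver sketch tying the pieces together: Tool/ToolRunner for parsing generic options, setJarByClass() so Hadoop can locate the job JAR, and waitForCompletion() to launch and poll the job. MaxTemperatureDriver is an illustrative name and the mapper/reducer settings are omitted:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MaxTemperatureDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.printf("Usage: %s <input> <output>%n", getClass().getSimpleName());
                return -1;
            }
            Job job = Job.getInstance(getConf(), "Max temperature");
            job.setJarByClass(getClass());                  // locate the JAR containing this class
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // mapper, reducer and output types would be set here
            return job.waitForCompletion(true) ? 0 : 1;     // launch and poll until done
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses generic options such as -conf and -D before calling run()
            System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
        }
    }

After packaging with Ant or Maven, a driver like this is typically launched with hadoop jar <your-jar> MaxTemperatureDriver <input> <output> (jar and path names are placeholders).
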
Debugging Job
  • Use the web UI's tasks page and task details page to analyse failing or slow tasks
  • Use custom counters (see the counter sketch after this list)
  • Hadoop logs: common logs for the system daemons, HDFS audit logs, MapReduce job history, and MapReduce task logs
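
A hedged sketch of a custom counter, assuming a record-cleaning mapper; the enum and the "malformed record" check are illustrative. Counter values are aggregated across tasks and show up grouped by enum type in the web UI and the job output:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordCleaningMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // counters are grouped by the enum type
        enum Quality { MALFORMED_RECORDS }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.trim().isEmpty()) {
                // increments are visible under the Quality counter group
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;
            }
            context.write(new Text(line), NullWritable.get());
        }
    }
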
Tuning Jobs

Can I make it run faster? This analysis is done after the job has been run
  • If mappers are taking only a few seconds each, use fewer mappers and let each one run longer
  • # of reducers should be slightly less than the number of reduce slots in the cluster
  • Combiners: if the operation is commutative and associative, a combiner cuts the map output shuffled to the reducers (see the tuning sketch after this list)
  • Map output compression
  • Serialization other than Writable
  • Shuffle tweaks
  • Profiling Tasks: HPROF profiling tool comes with JDK
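
A hedged tuning sketch showing a combiner, map output compression, and the reducer count being set on a Job. The property names are the newer MapReduce 2 names (older releases use mapred.compress.map.output and mapred.map.output.compression.codec), IntSumReducer stands in for a commutative and associative reduce function, and the reducer count of 8 is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class TuningSketch {
        public static Job configure(Configuration conf) throws Exception {
            // compress intermediate map output to cut shuffle traffic
            // (SnappyCodec needs the native Snappy library on the cluster)
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "tuned job");
            // IntSumReducer is commutative and associative, so it can safely
            // double as the combiner and shrink the data sent to reducers
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(8);   // keep this a little below the cluster's reduce slots
            return job;
        }
    }
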
More than one MapReduce & workflow management

  • ChainMapper/ChainReducer run a chain of mappers within a single task
  • Linear chain job flow (see the JobControl sketch after this list)
    • JobClient.runJob(conf1);
    • JobClient.runJob(conf2);
  • Apache Oozie is used for running dependent jobs; it is composed of a workflow engine and a coordinator engine (Lab)
    • unlike JobControl, Oozie runs as a service
    • a workflow is a DAG of action nodes and control-flow nodes
    • Action nodes perform workflow tasks: moving files in HDFS, running MapReduce, Pig, or Hive jobs
    • Control-flow nodes coordinate the action nodes
    • Workflows are written in XML; a simple example has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node
    • All workflows must have one start and one end node.
    • Packaging, deploying, and running an Oozie workflow job
    • Coordinator jobs are triggered by predicates, usually a time interval, a date, or an external event
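
A hedged sketch of a linear chain of two dependent jobs using JobControl/ControlledJob (job names are placeholders and per-job configuration is omitted); unlike Oozie, JobControl runs inside the client JVM rather than as a service:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class LinearChainSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job first = Job.getInstance(conf, "first stage");     // mapper/reducer/paths set elsewhere
            Job second = Job.getInstance(conf, "second stage");

            ControlledJob cFirst = new ControlledJob(first, null);
            ControlledJob cSecond = new ControlledJob(second, null);
            cSecond.addDependingJob(cFirst);      // second starts only after first succeeds

            JobControl control = new JobControl("linear-chain");
            control.addJob(cFirst);
            control.addJob(cSecond);

            // JobControl runs in the client JVM; poll until every job has finished or failed
            Thread runner = new Thread(control);
            runner.start();
            while (!control.allFinished()) {
                Thread.sleep(1000);
            }
            control.stop();
        }
    }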

