Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 6

Chapter 7 MapReduce Types and Formats

General form of map & reduce 
  • map: (k1, v1) -> list(k2, v2)
  • reduce: (k2, list(v2)) -> list(k3, v3)
  • Keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle and sort phase.
  • Reduce input must have the same types as the map output
  • The number of map tasks is not set directly; it equals the number of splits the input is divided into
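
To make the type algebra concrete, here is a minimal word count sketch (class names are mine, not the book's); the generic parameters are exactly the (k1,v1)/(k2,v2)/(k3,v3) pairs above:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (k1,v1) -> list(k2,v2), with (k1,v1) = (LongWritable, Text)
    // and (k2,v2) = (Text, IntWritable)
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE); // emits list(k2,v2)
          }
        }
      }
    }

    // reduce: (k2, list(v2)) -> list(k3,v3); input types match the map output
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emits (k3,v3)
      }
    }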
Input formats 
  • Data is passed to the mapper by an InputFormat; the InputFormat is a factory of RecordReader objects that extract the (key, value) pairs from the input source
  • Each mapper deals with a single input split, no matter the file size
  • An input split is the chunk of the input processed by a single map, represented by InputSplit
  • InputSplits are created by the InputFormat; FileInputFormat is the base class for file-based formats (TextInputFormat is the default)
  • Can process Text, XML, Binary 
RecordReader 
  • Ensures each (key, value) pair is processed
  • Ensures no (k, v) pair is processed more than once
  • Handles (k, v) records that span split boundaries
Output formats: Text output, binary output 
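
A hedged driver sketch in the new (org.apache.hadoop.mapreduce) API, reusing the word count classes from the sketch above, showing where the input and output formats get wired in:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);   // the default, shown explicitly
        job.setOutputFormatClass(TextOutputFormat.class); // plain text output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }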


Chapter 8 MapReduce Features

Counters
  • Useful for QC, problem diagnostics
  • Built-in counter groups: task & job (plus user-defined counters)
    • Task Counter
      • gather task-level statistics over the task's execution
      • count things like input records, skipped records, output bytes, etc.
      • values are periodically sent across the network to be globally aggregated
    • Job Counter
      • measure job level statistics
    • User defined counters 
      • Counters are defined by a Java enum
      • Counters are global: the framework aggregates them across all mappers and reducers
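
A small sketch of a user-defined counter; the enum, class name, and "malformed record" condition are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // the enum defines a counter group; each constant is one counter
      enum Quality { MALFORMED }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.toString().isEmpty()) {
          context.getCounter(Quality.MALFORMED).increment(1); // aggregated globally
          return;
        }
        // ... normal record processing ...
      }
    }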
Sorting
  • Sort order is controlled by a RawComparator
  • Partial sort, secondary sort (Lab)
Joins

To run a map-side join, use CompositeInputFormat from the org.apache.hadoop.mapreduce.join package (the inputs must be sorted and partitioned in the same way)

Side Data: Read-only data required to process the main data set, e.g., a list of words to skip for wordcount

Distributed Cache: To save network bandwidth, side-data files are copied to each node once per job, not once per task.
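
A sketch of consuming side data in a mapper, assuming the file was shipped with the job as -files stopwords.txt (file name and mapper types are mine); the cache localizes it on each node, where it appears in the task's working directory:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StopWordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Set<String> stopWords = new HashSet<String>();

      @Override
      protected void setup(Context context) throws IOException {
        // localized on this node once per job; read from the working directory
        BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            stopWords.add(line.trim());
          }
        } finally {
          in.close();
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty() && !stopWords.contains(token)) {
            context.write(new Text(token), new IntWritable(1));
          }
        }
      }
    }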

Map-only Jobs: Image processing, ETL, select * from table, distcp, input data sampling

Partitioning 

A MapReduce job will most of the time have more than one reducer, so when a mapper emits a key-value pair it has to be sent to one of the reducers. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers), and a Partitioner is responsible for performing it.

For more info, this article has a detailed explanation:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
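
A minimal custom Partitioner sketch; the routing policy here (first character of the key) is purely illustrative, while the default HashPartitioner applies the same masking-and-modulo idea to the whole key's hashCode:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
          return 0;
        }
        // mask the sign bit so the modulo result is never negative
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }

It is enabled in the driver with job.setPartitionerClass(FirstCharPartitioner.class); any deterministic function of the key works, since equal keys must always land on the same reducer.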

  • Zero reducers: only map tasks
  • One reducer: when the amount of data is small enough to be processed comfortably by one reducer
Combiners
  • Used for reducing intermediate data
  • Runs locally on single mapper’s  o/p
  • Best suited for commutative and associative operations
  • Code identical to Reducer 
  • conf.setCombinerClass(Reducer.class);
  • May or may not run on the output from some or all mappers
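
For reference, the new-API equivalent, reusing the word count reducer from the earlier sketch; summing is safe to combine because partial sums compose: sum(1,1,1,1) = sum(sum(1,1), sum(1,1)):

    job.setCombinerClass(WordCountReducer.class);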

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5 Map Reduce Process Flow

General dev process
  • Write & test map & reduce functions in dev/local
  • Deploy job to cluster
  • Tune job later
The Configuration API
  • A collection of configuration properties and their values is an instance of the Configuration class (org.apache.hadoop.conf package)
  • Properties are stored in xml files
  • Default properties are stored in core-default.xml; there are multiple conf files for different subsystems
  • If mapred.job.tracker is set to local, execution uses the local job runner mode
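
A minimal sketch of the Configuration API; configuration-1.xml and the property names are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("configuration-1.xml"); // later resources override earlier ones
        System.out.println(conf.get("color"));       // String property, null if unset
        System.out.println(conf.getInt("size", 0));  // typed accessor with a default
      }
    }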
Setting up Dev
  • Pseudo-distributed cluster is one whose daemons all run on the local machine
  • Use MRUnit for testing Map/Reduce
  • Write separate classes for cleaning the data 
  • Tool and ToolRunner are helpful for debugging and for parsing CLI options
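
A Tool/ToolRunner skeleton as I understand it (the driver name is mine); ToolRunner runs GenericOptionsParser underneath, so the standard -D, -conf, -fs, and -jt flags all work:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // getConf() already reflects any -D/-conf overrides from the command line
        System.out.println(getConf().get("mapred.job.tracker"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
      }
    }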
Running on Cluster
  • Job's classes must be packaged to a jar file
  • setJarByClass() tells Hadoop to find the jar file for your program by locating the jar containing the given class
  • Use Ant or Maven to create jar file
  • On a cluster, map and reduce tasks run in separate JVMs
  • waitForCompletion() launches the job and polls for progress
  • Ex: 275 GB of input data read from 34 GB of compressed files broken into 101 gzip files
  • Typical job ID format: job_YYYYMMDDTTmm_XXXX
  • Tasks belong to jobs; typical task ID format: task_<jobID>_m_XXXXXX (m -> map, r -> reduce)
  • Use the MR WebUI to track the jobtracker page, tasks page, and task details page
Debugging Job
  • Use the WebUI to analyse the tasks and task details pages
  • Use custom counters
  • Hadoop Logs: Common logs for system daemons, HDFS audit, MapRed Job hist, MapRed task hist
Tuning Jobs

Can I make it run faster? This analysis is done after the job has run
  • If mappers are taking only a few seconds each, use fewer mappers and let each one run longer
  • # of reducers should be slightly less than the number of reduce slots
  • Combiners (for commutative and associative operations) cut the amount of map output shuffled
  • Map output compression (sketch after this list)
  • Consider serialization frameworks other than Writable
  • Shuffle tweaks
  • Profiling Tasks: HPROF profiling tool comes with JDK
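
A hedged sketch of the map output compression tweak; the property names are the Hadoop 2 ones (older releases use mapred.compress.map.output), and Snappy is just one codec choice:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class CompressDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // compress intermediate map output to cut shuffle network traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
      }
    }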
More than one MapReduce & workflow management

  • ChainMapper/ChainReducer to run multiple mappers in a single task
  • Linear chain job flow 
    • JobClient.runJob(conf1);
    • JobClient.runJob(conf2);
  • Apache Oozie is used for running dependent jobs; it is composed of a workflow engine and a coordinator engine (Lab)
    • unlike JobControl, Oozie runs as a service
    • a workflow is a DAG of action nodes and control-flow nodes
    • Action nodes do the work: moving files in HDFS, running MR, Pig, or Hive jobs
    • Control-flow nodes coordinate the action nodes
    • Workflows are written in XML; a minimal one has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node (see the sketch below)
    • All workflows must have one start and one end node
    • Packaging, deploying, and running an Oozie workflow job
    • Coordinator jobs are triggered by predicates, usually a time interval, a date, or an external event
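
A minimal workflow.xml sketch matching the structure described above; the app name, mapper class, and error-message EL expression are illustrative:

    <workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
      <start to="max-temp-mr"/>
      <action name="max-temp-mr">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>MaxTemperatureMapper</value>
            </property>
            <!-- reducer class, input/output dirs, etc. go here -->
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>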