General dev process
- Write & test map & reduce functions in dev/local
- Deploy job to cluster
- Tune job later
- A collection of configuration properties and their values is represented by an instance of the Configuration class (org.apache.hadoop.conf package)
- Properties are stored in XML files
- Default properties are stored in core-default.xml; there are separate configuration files for different subsystems
- If mapred.job.tracker is set to local, jobs execute in local job runner mode (a Configuration sketch follows below)
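A minimal sketch of working with Configuration; the my-site.xml resource is hypothetical, and the property values shown are only illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical resource on the classpath; defaults still come from core-default.xml etc.
    conf.addResource("my-site.xml");

    // Force the local job runner (old-style property name, as used in these notes)
    conf.set("mapred.job.tracker", "local");

    // get()/getInt() return the supplied default when the property is unset
    String tracker = conf.get("mapred.job.tracker", "local");
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    System.out.println(tracker + ", buffer=" + bufferSize);
  }
}
```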
- Pseudo-distributed cluster is one whose daemons all run on the local machine
- Use MRUnit for unit-testing map and reduce functions (see the sketch below)
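A minimal MRUnit sketch; it uses the base (identity) Mapper so it stays self-contained, and the input line is illustrative — substitute your own mapper class and expected output:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class IdentityMapperTest {
  @Test
  public void passesRecordThrough() throws Exception {
    // The identity Mapper simply echoes its input key/value pair
    new MapDriver<LongWritable, Text, LongWritable, Text>()
        .withMapper(new Mapper<LongWritable, Text, LongWritable, Text>())
        .withInput(new LongWritable(0), new Text("some input line"))
        .withOutput(new LongWritable(0), new Text("some input line"))
        .runTest();
  }
}
```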
- Write separate classes for cleaning the data
- Tool and ToolRunner are helpful for debugging and for parsing command-line options (see the skeleton below)
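A minimal Tool/ToolRunner skeleton; the class name and what run() prints are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects -D key=value options parsed by ToolRunner
    Configuration conf = getConf();
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (-D, -conf, -fs, -jt, ...) before calling run()
    int exitCode = ToolRunner.run(new MyDriver(), args);
    System.exit(exitCode);
  }
}
```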
- A job's classes must be packaged into a JAR file
- setJarByClass() tells Hadoop which JAR to ship by locating the JAR that contains the given class
- Use Ant or Maven to create the JAR file
- On a cluster, map and reduce tasks run in separate JVMs
- waitForCompletion() launches the job and polls for progress (a minimal driver sketch follows below)
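A minimal driver sketch showing setJarByClass() and waitForCompletion(); the job name is illustrative and the identity Mapper/Reducer are only placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "my job"); // Job.getInstance(...) in newer releases
    job.setJarByClass(MyJobDriver.class);             // Hadoop locates the JAR containing this class

    // Identity Mapper/Reducer as placeholders -- replace with your own classes
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and poll for progress until it finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```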
- Example: 275 GB of input data read from 34 GB of compressed files broken into 101 gzip files
- Typical job ID format: job_YYYYMMDDHHmm_XXXX (jobtracker start timestamp plus a counter)
- Tasks belong to jobs; typical task ID format: task_<job-id>_m_XXXXXX (m = map, r = reduce)
- Use the MapReduce web UI to track and analyse jobs: the jobtracker page, the tasks page, and the task details page
- Use custom counters to count events of interest (see the sketch below)
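A sketch of a custom counter, assuming a hypothetical mapper and an illustrative "malformed record" condition (an empty line):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Each enum constant becomes a counter, aggregated across all tasks
  enum RecordQuality { MALFORMED, WELL_FORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().isEmpty()) {
      // Counter values show up in the web UI and in the job's final output
      context.getCounter(RecordQuality.MALFORMED).increment(1);
      return;
    }
    context.getCounter(RecordQuality.WELL_FORMED).increment(1);
    context.write(value, new LongWritable(1));
  }
}
```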
- Hadoop logs: common logs for the system daemons, HDFS audit logs, MapReduce job history, and MapReduce task logs
Can I make it run faster? (analysis done after the job has run)
- If mappers are taking only a short time each, reduce the number of mappers and let each one run longer
- The number of reducers should be slightly less than the number of reduce slots
- Combiners: if the operation is associative, a combiner can cut the amount of map output shuffled to the reducers
- Map output compression (a sketch follows this list)
- Serialization other than Writable
- Shuffle tweaks
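A sketch of enabling map output compression, using the pre-2.x property names that match the rest of these notes (newer releases call them mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Compress intermediate map output to reduce shuffle traffic
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);
  }
}
```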
- Profiling tasks: the HPROF profiling tool comes with the JDK (see the sketch below)
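A sketch of turning on HPROF profiling for a few tasks via the old JobConf API; the HPROF parameter string is the usual agentlib form, and the task range is illustrative:

```java
import org.apache.hadoop.mapred.JobConf;

public class ProfilingExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Enable profiling for a small sample of tasks only
    conf.setProfileEnabled(true);
    conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
        + "force=n,thread=y,verbose=n,file=%s");
    conf.setProfileTaskRange(true, "0-2");  // profile map tasks 0..2
  }
}
```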
- ChainMapper/ChainReducer can run a chain of mappers within a single task
- Linear chain job flow: run dependent jobs one after the other (a sketch using the newer Job API follows below):
- JobClient.runJob(conf1);
- JobClient.runJob(conf2);
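A sketch of the same linear chain using the newer Job API, with the per-job configuration elided; the job names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LinearChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job first = new Job(conf, "first");
    // ... configure jar, mapper, reducer, input path, and an intermediate output path

    Job second = new Job(conf, "second");
    // ... configure the second job; its input path is the first job's output path

    // Run the jobs strictly one after the other; stop the chain if the first fails
    if (!first.waitForCompletion(true)) {
      System.exit(1);
    }
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}
```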
- Apache Oozie is used for running dependent jobs; it is composed of a workflow engine and a coordinator engine (Lab)
- Unlike JobControl, Oozie runs as a service
- A workflow is a DAG of action nodes and control-flow nodes
- Action nodes perform workflow tasks: moving files in HDFS, running MapReduce, Pig, or Hive jobs
- Control-flow nodes coordinate the action nodes
- Workflows are written in XML; a simple workflow has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node
- All workflows must have one start node and one end node
- Packaging, deploying, and running an Oozie workflow job
- Coordinator jobs are triggered by predicates: usually a time interval or date, or an external event