Posts

How to read HBase table from Scala Spark

Step 1: Create a dummy table called customers in HBase. For how to populate this table, refer to this tutorial: https://mapr.com/products/mapr-sandbox-hadoop/tutorials/tutorial-getting-started-with-hbase-shell/

hbase(main):004:0> scan '/user/user01/customer'
ROW        COLUMN+CELL
 amiller   column=addr:state, timestamp=1497809527266, value=TX
 jsmith    column=addr:city, timestamp=1497809526053, value=denver
 jsmith    column=addr:state, timestamp=1497809526080, value=CO
 jsmith    column=order:date, timestamp=1497809490021, value=10-18-2014
 jsmith    …

Bill Gates 2030 Vision

I recently watched this video, https://youtu.be/8RETFyDKcw0, in which The Verge interviewed Bill Gates about his vision for 2030. This is the man who predicted that every home would have a PC, which turned out to be true. So what is Mr. Gates's vision for fifteen years from now?

He names four key areas for improvement: health care, farming, banking and education.


Key takeaways: This man is serious about his goals and vision. Each sector has very specific goals, and it is very hard to come up with goals and a vision like that.

Health
- Upstream: inventing new vaccines, specifically for kids under five years old
- Downstream: how do you get them out to kids around the world
- Goal: currently one out of twenty kids dies before the age of 5; this should improve to one in forty

Farming
- Better seeds with resistance to heat and low water, which hints at GMO stuff, but at least educating farmers about the benefits
- Improved credit and loan systems for farmers
- Increase world food productivity

Education
- Emphasis on online learning
- Improve critical programming skills
- Basi…

Hadoop Definitive Guide Byte Sized Notes - 6

Chapter 7 MapReduce Types and Formats
General form of map & reduce
- map: (k1, v1) -> list(k2, v2)
- reduce: (k2, list(v2)) -> list(k3, v3)
- Keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle and sort phase
- Reduce input must have the same types as the map output
- The number of map tasks is not set directly; it equals the number of splits that the input is turned into

Input formats
- Data is passed to the mapper by an InputFormat; an InputFormat is a factory of RecordReader objects that extract the (key, value) pairs from the input source
- Each mapper deals with a single input split, no matter the file size
- An input split is the chunk of the input processed by a single map, represented by InputSplit
- InputSplits are created by the InputFormat; FileInputFormat is the base class for file-based input formats
- Can process text, XML, binary

RecordReader
- Ensures every (key, value) pair is processed
- Ensures a (k, v) pair is not processed more than once
- Handles (k, v) pairs which get split

Output formats: T…
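
To make the general form of map and reduce from these notes concrete, here is a minimal plain-Java sketch of the data flow. It does not use any Hadoop classes; the class name MapReduceFlowSketch, the word-count logic, and the run() helper that stands in for the framework are all assumptions made for illustration. A TreeMap plays the role of the shuffle and sort phase, so each key reaches reduce() once, in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: no Hadoop types, all names are hypothetical.
class MapReduceFlowSketch {

    // map: (k1, v1) -> list(k2, v2). Here k1 is a line offset and v1 the line text.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (k2, list(v2)) -> list(k3, v3). Here it emits one (word, total) pair.
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return List.of(Map.entry(word, sum));
    }

    // Stand-in for the framework: call map on every record, group the values
    // by key in sorted key order, then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(e.getKey(), e.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(List.of("to be or not", "to be")));
    }
}
```

Note that the value type emitted by map() (Integer) is the same type reduce() receives, which mirrors the rule that the reduce input types must match the map output types.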

Hadoop Definitive Guide Byte Sized Notes - 5

Chapter 6 Anatomy of a MapReduce Job Run

The framework used for execution is set by the mapreduce.framework.name property:

- local -> local job runner
- classic -> MR1
- yarn -> MR2

MR1 - Classic

- Client submits the MR job
- JT (JobTracker) coordinates the job run
- JT resides on the master node
- TT (TaskTracker) is actually responsible for instantiating map/reduce tasks

Map Task

- May choose to ignore the input key completely
- Map outputs zero or more K/V pairs
- Typical mappers: convert to upper case, explode mapper, filter mapper, changing keyspace

Reduce Task

- All values associated with a particular intermediate key are guaranteed to go to the same reducer
- Intermediate keys and values are passed to the reducer in sorted key order
- Typical reducers: sum reducer, identity reducer

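The typical mappers and reducers named above can be sketched as plain Java functions. This is only an illustration under my own assumptions: the class name TypicalTasksSketch, the method names, and the exact behaviour of each function (for example, filtering on even values or re-keying by value length) are hypothetical choices, and no Hadoop Mapper or Reducer classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free versions of the typical mappers and reducers.
class TypicalTasksSketch {

    // Upper-case mapper: exactly one output pair per input pair.
    static List<Map.Entry<String, String>> upperCase(String key, String value) {
        return List.of(Map.entry(key.toUpperCase(), value.toUpperCase()));
    }

    // Explode mapper: one output pair per character of the value.
    static List<Map.Entry<String, String>> explode(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (char c : value.toCharArray()) {
            out.add(Map.entry(key, String.valueOf(c)));
        }
        return out;
    }

    // Filter mapper: zero or one output pair; this one keeps only even values.
    static List<Map.Entry<String, Integer>> filterEven(String key, int value) {
        if (value % 2 == 0) {
            return List.of(Map.entry(key, value));
        }
        return List.of();
    }

    // Changing keyspace: ignores the input key and emits (value length, value).
    static List<Map.Entry<Integer, String>> keyByLength(String ignoredKey, String value) {
        return List.of(Map.entry(value.length(), value));
    }

    // Sum reducer: adds up all values that arrived for one key.
    static List<Map.Entry<String, Integer>> sum(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return List.of(Map.entry(key, total));
    }

    // Identity reducer: passes every value through unchanged.
    static <K, V> List<Map.Entry<K, V>> identity(K key, List<V> values) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (V v : values) {
            out.add(Map.entry(key, v));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(explode("foo", "bar"));
        System.out.println(filterEven("a", 3));
        System.out.println(sum("a", List.of(1, 2, 3)));
    }
}
```

The filter mapper shows the "zero or more K/V pairs" point: it can return an empty list, while the explode mapper can return many pairs for a single input.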
Progress in MR1 means

- Reading an input record (in a mapper or reducer)
- Writing an output record (in a mapper or reducer)
- Setting the status description on a reporter (using Reporter’s setStatus() method)
- Incrementing a counter (using Reporter’s incrCounter() method)
- Calling Reporter’s progress()me…

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5 Map Reduce Process Flow

General dev process
- Write and test the map and reduce functions in dev/local
- Deploy the job to the cluster
- Tune the job later

The Configuration API
- A collection of configuration properties and their values is an instance of the Configuration class (org.apache.hadoop.conf package)
- Properties are stored in XML files
- Default properties are stored in core-default.xml; there are multiple conf files for multiple settings
- If mapred.job.tracker is set to local, then execution is in local job runner mode

Setting up Dev
- A pseudo-distributed cluster is one whose daemons all run on the local machine
- Use MRUnit for testing Map/Reduce
- Write separate classes for cleaning the data
- Tool and ToolRunner are helpful for debugging and for parsing the CLI

Running on Cluster
- A job's classes must be packaged into a jar file
- setJarByClass() tells Hadoop to find the jar file for your program
- Use Ant or Maven to create the jar file
- On a cluster, map and reduce tasks run in separate JVMs
- waitForCompletion() launches the job and polls for progress
- E…

Hadoop Definitive Guide Byte Sized Notes - 3

Chapter 4 Hadoop IO

Data Integrity in Hadoop
- CRC-32 is usually used for data integrity, but with low-end hardware it could be the checksum that is corrupt, not the data
- Datanodes are responsible for verifying the data they receive before storing the data and its checksum
- This applies to data that they receive from clients and from other datanodes during replication
- Each datanode keeps a persistent log of checksum verifications

LocalFileSystem
- Hadoop's LocalFileSystem performs client-side checksumming
- Creates a hidden .filename.crc file
- To disable checksums use RawLocalFileSystem: FileSystem fs = new RawLocalFileSystem();

Compression
- Less space is needed to store files, and data transfers faster across the network
- Splittable shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
- Compression algorithms have a space/time trade-off
- Gzip sits in the middle
- BZip compresses better than Gzip but is slower
- LZO, LZ4 and Snappy are faster
- Only BZip is splittable

Codecs
A…
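
As a rough illustration of the data integrity notes above, this sketch computes a CRC-32 checksum when data is "written" and checks it again when the data is "read". It uses java.util.zip.CRC32 from the JDK rather than Hadoop's own checksum code, and the class name ChecksumSketch and the helper methods are assumptions of mine, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// JDK-only illustration of checksum verification; not Hadoop's implementation.
class ChecksumSketch {

    // Compute the CRC-32 checksum of a block of bytes.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True when the data still matches the checksum stored alongside it.
    static boolean verify(byte[] data, long storedChecksum) {
        return crc32(data) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        long stored = crc32(block);                 // checksum kept with the data

        System.out.println(verify(block, stored));  // unchanged data: prints true

        block[0] ^= 0x01;                           // simulate a single flipped bit
        System.out.println(verify(block, stored));  // corrupted data: prints false
    }
}
```

A mismatch only says that the data and the checksum disagree; as the notes point out, on unreliable hardware it could be the stored checksum that was damaged rather than the data.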

Hadoop Definitive Guide Byte Sized Notes - 2

Chapter 3 HDFS
- Hadoop can be integrated with the local file system, S3, etc.
- Writes always happen at the end of the file; data cannot be inserted in between
- Default block size is 64 MB

Name Node
- The name node holds the file system metadata, namespace and tree in memory; without this metadata there is no way to access files in the cluster
- Stores this info on local disk in 2 files: the namespace image and the edit log
- NN keeps track of which blocks make up a file
- Block locations are dynamically updated
- NN must be running at all times
- When a client reads a file, the NN will not be a bottleneck

Secondary Name Node
- Stores a partial copy of the name node’s namespace image and edit log
- In case of name node failure we can merge the namespace image and edit log

HDFS Federation
- Clusters can scale by adding name nodes; each NN manages a portion of the file system namespace
- Namespace volumes are independent; name nodes don’t communicate with each other, and the failure of one doesn’t affect the availability of the other name nodes

High Availability
- Pair of NNs in active-standby …