Hadoop Definitive Guide Byte Sized Notes - 2
Chapter 3 HDFS
- Hadoop's FileSystem abstraction can be backed by stores other than HDFS, e.g. the local file system, S3 etc
- Writes always happen at the end of a file; data cannot be inserted in between existing bytes
- Default block size 64 MB (128 MB in later versions)
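The block size is set per cluster in hdfs-site.xml; a minimal sketch, assuming the Hadoop 1.x property name dfs.block.size (renamed dfs.blocksize in 2.x):

```xml
<configuration>
  <property>
    <!-- HDFS block size in bytes: 64 MB -->
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
```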
Name Node
- Name node holds the file system metadata, the namespace and directory tree, in memory; without this metadata there is no way to access the files in the cluster
- Stores this info on local disk in 2 files: the namespace image & the edit log
- NN keeps track of which blocks make up a file
- Block locations are not persisted; they are rebuilt dynamically from data node block reports
- NN must be running at all times
- When clients read a file, data is served directly by the data nodes, so the NN is not a bottleneck
Secondary Name Node
- Periodically merges the name node’s namespace image with its edit log, keeping a near-complete copy of the namespace image
- In case of name node failure this merged namespace image can be used for recovery, though it lags the primary, so some data loss is possible
HDFS Federation
- Clusters can scale by adding name nodes; each NN manages a portion of the file system namespace
- Namespace volumes are independent: name nodes don’t communicate with each other, and the failure of one does not affect the availability of the others
High Availability
- Pair of NNs in an active-standby config
- NNs use highly available shared storage to share the edit log
- DNs send block reports to both NNs
Fail Over
- A failover controller monitors each NN for failure and manages the transition from active to standby; a graceful failover can also be initiated manually
Fencing
- The previously active name node must be prevented from doing damage; the following methods help prevent it from corrupting the namespace:
- Disable the network port
- Kill all process
- Revoke access to shared storage
File System
- Cannot execute a file in HDFS
- Replication not applicable to directories
- Globbing: wildcards to match multiple files in a directory, e.g. input/ncdc/all/190{1,2}.gz selects the 1901 and 1902 files
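Glob expansion goes through the FileSystem API; a minimal sketch, assuming Hadoop libraries on the classpath and a configured cluster (the class name is illustrative, the path is the sample from the notes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Matches input/ncdc/all/1901.gz and input/ncdc/all/1902.gz
    FileStatus[] matches = fs.globStatus(new Path("input/ncdc/all/190{1,2}.gz"));
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
}
```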
Reading Data
- The FileSystem API contains static factory methods to get a FileSystem instance and open an input stream
- A file in HDFS is represented by a Path object
- Default file system properties are stored in conf/core-site.xml
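A minimal conf/core-site.xml fragment pointing clients at a name node (host and port are placeholders; Hadoop 2.x renames the property fs.defaultFS):

```xml
<configuration>
  <property>
    <!-- Default file system URI -->
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```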
- FSDataInputStream
- FileSystem.open() returns an FSDataInputStream object
- Positioned readable; seek() is expensive
- If a read from a data node fails, the error is reported to the NN and the client reads from the next closest replica
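The read path can be sketched as follows, assuming Hadoop libraries on the classpath and a reachable file system; the second copy after seek(0) shows that seeking works but should be used sparingly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start; valid but expensive
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```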
Writing Data
- FSDataOutputStream is the main class, returned by FileSystem.create()
- If a data node in the write pipeline fails, the pipeline is closed, the partial block on the failed node is deleted, the bad data node is removed from the pipeline, and the NN re-replicates any under-replicated blocks
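A sketch of the write path, copying a local file into HDFS (paths come from the command line; a Hadoop client configuration is assumed):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    FileSystem fs = FileSystem.get(new Configuration());
    // create() returns an FSDataOutputStream that writes through the DN pipeline
    FSDataOutputStream out = fs.create(new Path(args[1]));
    IOUtils.copyBytes(in, out, 4096, true); // true closes both streams
  }
}
```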
Replica Placement
- 1st replica on the client’s node (or a random node if the client is outside the cluster), 2nd replica off-rack on a random node, 3rd replica on the same rack as the 2nd but a different node; applicable to reducers’ output as well
Coherency
- Guarantees the same data is visible across all replicated data nodes: sync() flushes buffered writes so they are visible to new readers
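A sketch of forcing coherency on the write path; sync() was the original call, and later releases expose the equivalent hflush() (assumed here):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));
    out.write("important record\n".getBytes("UTF-8"));
    // Push buffered data out to the data nodes so new readers can see it
    out.hflush();
    out.close();
  }
}
```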
HAR Hadoop Archives
- Typically used for large numbers of small files
- Packs small files into HDFS blocks efficiently, reducing name node memory usage
- Created with the archive tool (hadoop archive)
- Limitations
- No archive compression
- Archives are immutable: to add or remove files, the HAR must be recreated
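Creating and reading a HAR with the archive tool might look like this (paths are examples; a running cluster is assumed):

```shell
# Pack everything under /my/files into files.har stored under /my
hadoop archive -archiveName files.har /my/files /my

# Browse the archive through the har:// scheme
hadoop fs -lsr har:///my/files.har
```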