Hadoop Definitive Guide Byte Sized Notes - 2
Chapter 3 HDFS
- Hadoop's FileSystem abstraction can be backed by stores other than HDFS, e.g. the local file system, S3 etc
- Writes always happen at the end of a file; data cannot be inserted in between existing bytes
- Default block size 64 MB (128 MB in later versions)
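The block size is set per cluster in hdfs-site.xml; a minimal sketch, assuming the Hadoop 1.x property name dfs.block.size (renamed dfs.blocksize in 2.x):

```xml
<configuration>
  <property>
    <!-- HDFS block size in bytes: 64 MB -->
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
```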
Name Node
- Name node holds the file system metadata, the namespace and directory tree, in memory; without this metadata there is no way to access the files in the cluster
- Stores this info on local disk in 2 files: the namespace image & the edit log
- NN keeps track of which blocks make up a file
- Block locations are not persisted; they are rebuilt dynamically from data node block reports
- NN must be running at all times
- When clients read a file, data is served directly by the data nodes, so the NN is not a bottleneck
Secondary Name Node
- Periodically merges the name node’s namespace image with its edit log, keeping a near-complete copy of the namespace image
- In case of name node failure this merged namespace image can be used for recovery, though it lags the primary, so some data loss is possible
HDFS Federation
- Clusters can scale by adding name nodes; each NN manages a portion of the file system namespace
- Namespace volumes are independent: name nodes don’t communicate with each other, and the failure of one does not affect the availability of the others
High Availability
- Pair of NNs in an active-standby config
- NNs use highly available shared storage to share the edit log
- DNs send block reports to both NNs
Fail Over
- A failover controller monitors each NN for failure and manages the transition from active to standby; a graceful failover can also be initiated manually
Fencing
- The previously active name node must be prevented from doing damage; the following methods help prevent it from corrupting the namespace:
- Disable the network port
- Kill all process
- Revoke access to shared storage
File System
- Cannot execute a file in HDFS
- Replication not applicable to directories
- Globbing: wildcards to match multiple files in a directory, e.g. input/ncdc/all/190{1,2}.gz selects the 1901 and 1902 files
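Glob expansion goes through the FileSystem API; a minimal sketch, assuming Hadoop libraries on the classpath and a configured cluster (the class name is illustrative, the path is the sample from the notes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Matches input/ncdc/all/1901.gz and input/ncdc/all/1902.gz
    FileStatus[] matches = fs.globStatus(new Path("input/ncdc/all/190{1,2}.gz"));
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
}
```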
Reading Data
- The FileSystem API contains static factory methods to get a FileSystem instance and open an input stream
- A file in HDFS is represented by a Path object
- Default file system properties are stored in conf/core-site.xml
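A minimal conf/core-site.xml fragment pointing clients at a name node (host and port are placeholders; Hadoop 2.x renames the property fs.defaultFS):

```xml
<configuration>
  <property>
    <!-- Default file system URI -->
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```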
- FSDataInputStream
- FileSystem.open() returns an FSDataInputStream object
- Positioned readable; seek() is expensive
- If a read from a data node fails, the error is reported to the NN and the client reads from the next closest replica
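The read path can be sketched as follows, assuming Hadoop libraries on the classpath and a reachable file system; the second copy after seek(0) shows that seeking works but should be used sparingly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start; valid but expensive
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```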
Writing Data
- FSDataOutputStream is the main class, returned by FileSystem.create()
- If a data node in the write pipeline fails, the pipeline is closed, the partial block on the failed node is deleted, the bad data node is removed from the pipeline, and the NN re-replicates any under-replicated blocks
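A sketch of the write path, copying a local file into HDFS (paths come from the command line; a Hadoop client configuration is assumed):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    FileSystem fs = FileSystem.get(new Configuration());
    // create() returns an FSDataOutputStream that writes through the DN pipeline
    FSDataOutputStream out = fs.create(new Path(args[1]));
    IOUtils.copyBytes(in, out, 4096, true); // true closes both streams
  }
}
```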
Replica Placement
- 1st replica on the client’s node (or a random node if the client is outside the cluster), 2nd replica off-rack on a random node, 3rd replica on the same rack as the 2nd but a different node; applicable to reducers’ output as well
Coherency
- Guarantees the same data is visible across all replicated data nodes: sync() flushes buffered writes so they are visible to new readers
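A sketch of forcing coherency on the write path; sync() was the original call, and later releases expose the equivalent hflush() (assumed here):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));
    out.write("important record\n".getBytes("UTF-8"));
    // Push buffered data out to the data nodes so new readers can see it
    out.hflush();
    out.close();
  }
}
```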
HAR Hadoop Archives
- Typically used for large numbers of small files
- Packs small files into HDFS blocks efficiently, reducing name node memory usage
- Created with the archive tool (hadoop archive)
- Limitations
- No archive compression
- Archives are immutable: to add or remove files, the HAR must be recreated
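Creating and reading a HAR with the archive tool might look like this (paths are examples; a running cluster is assumed):

```shell
# Pack everything under /my/files into files.har stored under /my
hadoop archive -archiveName files.har /my/files /my

# Browse the archive through the har:// scheme
hadoop fs -lsr har:///my/files.har
```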