Saturday, October 25, 2014

Hadoop Definitive Guide Byte Sized Notes - 3

Chapter 4 Hadoop IO
 
Data Integrity in Hadoop
  • For data integrity Hadoop uses CRC-32 checksums; on low-end hardware it may be the checksum that is corrupt rather than the data
  • Datanodes are responsible for verifying the data they receive before storing the data and its checksum
  • Applies to data that they receive from clients and from other datanodes during replication
  • Each datanode keeps a persistent log of checksum verifications
LocalFileSystem
  • Hadoop LocalFileSystem performs client-side checksumming
  • It creates a hidden .filename.crc file alongside each file it writes
  • To disable checksums, use RawLocalFileSystem, as sketched below
  • FileSystem fs = new RawLocalFileSystem();
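A minimal sketch of the above, assuming the Hadoop 1.x-era FileSystem API; using RawLocalFileSystem directly bypasses the client-side .crc checks:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class RawFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = new RawLocalFileSystem();
    fs.initialize(URI.create("file:///"), conf);  // no .crc files are written or verified
    // fs.open()/fs.create() now skip client-side checksumming
  }
}

Alternatively, setting fs.file.impl to org.apache.hadoop.fs.RawLocalFileSystem in the configuration disables checksums globally for file:// URIs.
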
Compression
  • Compression reduces the space needed to store files and speeds up data transfer across the network
  • Splittable shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
  • Compression algorithms trade space against time
    • gzip sits in the middle of the trade-off
    • bzip2 compresses better than gzip but is slower
    • LZO, LZ4 and Snappy are faster but compress less
    • Of these, only bzip2 is splittable
Codecs
  • A codec is the implementation of a compression-decompression algorithm
  • We can compress a file before writing it; when reading a compressed file, CompressionCodecFactory infers the codec from the file name extension, as in the sketch below
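A rough sketch of that lookup (the input path is hypothetical): the factory maps .gz to the gzip codec, and createInputStream() decompresses while reading:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressByExtension {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/input/sample.gz");                        // hypothetical file
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);  // null if extension unknown
    InputStream in = codec.createInputStream(fs.open(path));         // decompresses while reading
    IOUtils.copyBytes(in, System.out, 4096, false);
    IOUtils.closeStream(in);
  }
}
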
Compression and Input Splits
  • Use a container file format such as SequenceFile, RCFile or Avro
  • Use a compression format that supports splitting (e.g. bzip2)
Using Compression in MapReduce
  • To compress the output of a MapReduce job, in the job configuration, set the mapred.output.compress property to true
  • The output of the map phase can also be compressed, which reduces shuffle traffic; see the configuration sketch below
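A configuration sketch, assuming the Hadoop 1.x-era property names mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class CompressionConfig {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // compress the final job output
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
    // compress intermediate map output (a fast codec such as LZO/Snappy is typical)
    conf.setBoolean("mapred.compress.map.output", true);
    return conf;
  }
}
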
Serialization
  • Serialization is the process of converting objects into a byte stream for transmission over the network or for storage; deserialization is the reverse
  • In Hadoop, inter-process communication between nodes is implemented using remote procedure calls (RPCs); the RPC serialization format should be compact, fast, extensible and interoperable
  • Hadoop uses its own serialization format, Writables; Avro is used to overcome the limitations of the Writable interface
  • The Writable Interface 
            The basic interface Hadoop serializable types implement
            IntWritable is the Writable wrapper for a Java int; use set() to assign a value
            IntWritable writable = new IntWritable();
            writable.set(163);
            or 
            IntWritable writable = new IntWritable(163);
  • For writing, Writables use a DataOutput binary stream; for reading, a DataInput binary stream
  • DataOutput turns the value 163 into a byte array; to deserialize, the byte array is read back into an IntWritable (see the sketch at the end of this section)
  • WritableComparable and comparators
    • Comparison of types is crucial for MapReduce
    • IntWritable implements the WritableComparable interface
    • Raw comparators permit comparing records read from a stream without deserializing them into objects (no object-creation overhead)
    • Keys must be WritableComparable because they are passed to the reducer in sorted order
    • There are Writable wrappers for most Java primitives: boolean, byte, short, int, float, long, double; Text is the Writable for UTF-8 sequences
    • BytesWritable is a wrapper for an array of binary data
    • NullWritable is a special type of Writable with a zero-length serialization; no bytes are written to or read from the stream
    • ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of any of these types
  • Writable collections
    • There are six Writable collection types, e.g. use TwoDArrayWritable for a 2D array
    • We can create a custom writable class by implementing WritableComparable (Lab)
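The serialize/deserialize round trip mentioned earlier, as a minimal sketch:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws IOException {
    IntWritable writable = new IntWritable(163);

    // serialize: Writable.write(DataOutput) produces a 4-byte big-endian array
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writable.write(new DataOutputStream(out));
    byte[] bytes = out.toByteArray();               // length == 4

    // deserialize: Writable.readFields(DataInput) repopulates the object
    IntWritable copy = new IntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println(copy.get());                 // prints 163
  }
}
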
Serialization Frameworks
  • Hadoop has an API for pluggable serialization frameworks
Avro
  • Language-neutral data serialization system
Why Avro?
  • Schemas are written in JSON; data is encoded in binary
  • The language-independent schema gives rich schema resolution capabilities
  • .avsc is the conventional extension for an Avro schema, .avro for the data file
  • Avro offers language interoperability; a minimal Java sketch follows
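A minimal sketch of Avro's generic Java API; the schema fields and file name here are made up for illustration. The schema is JSON (normally kept in a .avsc file), and the container file it produces is binary with the schema embedded in the header:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWrite {
  public static void main(String[] args) throws Exception {
    // hypothetical schema, inlined here for brevity
    String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
        + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // binary-encoded .avro container file
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("users.avro"));
    writer.append(user);
    writer.close();
  }
}
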
File-Based Data Structures
  • Hadoop provides higher-level, file-based data structures for holding data, described below
SequenceFile
  • Binary-encoded key-value pairs
  • Plain text is not well suited to storing binary records such as log objects; SequenceFile is
  • Useful for storing binary key-value pairs; .seq is the conventional extension (writer sketch below)
  • SequenceFileInputFormat reads the binary (k,v) pairs
  • SequenceFileAsTextInputFormat presents them as (key.toString(), value.toString())
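A writer sketch (hypothetical path and data, Hadoop 1.x-era createWriter signature):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/numbers.seq");            // hypothetical output
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
      for (int i = 1; i <= 5; i++) {
        writer.append(new IntWritable(i), new Text("value-" + i));  // binary (k,v) pairs
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
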
MapFile
  • A sorted SequenceFile with an index to permit lookups by key
  • Use MapFile.Writer to write (sketch below); .map is the conventional extension
  • e.g. numbers.map/ is a directory whose index file holds a sample of the keys
  • Variations of MapFile
    • SetFile for storing a set of Writable keys
    • ArrayFile with an int index
    • BloomMapFile for a fast get() on sparsely populated files
  • The fix() method is usually used for re-creating corrupted indexes
  • SequenceFiles can be converted to MapFiles
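A writer sketch (hypothetical path and data); keys must be WritableComparable and appended in increasing order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = null;
    try {
      // creates the numbers.map/data and numbers.map/index files
      writer = new MapFile.Writer(conf, fs, "/tmp/numbers.map", IntWritable.class, Text.class);
      for (int i = 1; i <= 5; i++) {
        writer.append(new IntWritable(i), new Text("value-" + i));  // keys in increasing order
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
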

Monday, October 20, 2014

Hadoop Definitive Guide Byte Sized Notes - 2

Chapter 3 HDFS
  • Hadoop can be integrated with the local file system, S3, etc.
  • Writes always happen at the end of a file; data cannot be inserted in between
  • Default block size is 64 MB
Name Node
  • The name node holds the file system metadata, the namespace and tree, in memory; without this metadata there is no way to access files in the cluster
  • It stores this info on local disk in 2 files: the namespace image & the edit log
  • The NN keeps track of which blocks make up a file
  • Block locations are dynamically updated
  • The NN must be running at all times
  • When clients read a file, the NN only supplies block locations, so it does not become a bottleneck
Secondary Name Node 
  • Periodically merges the name node’s namespace image with the edit log and keeps a copy of the merged image
  • In case of name node failure the merged namespace image can be used, though it lags the primary
HDFS Federation
  • Clusters can scale by adding name nodes; each NN manages a portion of the file system namespace
  • Namespace volumes are independent: name nodes don’t communicate with each other, and the failure of one doesn’t affect the availability of the others
High Availability 
  • A pair of NNs runs in an active-standby configuration
  • The NNs use highly available shared storage to share the edit log
  • DNs send block reports to both NNs
Fail Over
  • A failover controller process monitors the NN for failure; when a failure is detected, a failover to the standby is triggered (a graceful failover can also be initiated manually)
Fencing
  • The previously active name node must be prevented from doing damage; the following fencing mechanisms help prevent corruption
    • Disable its network port
    • Kill the name node process
    • Revoke access to shared storage
File System
  • Cannot execute a file in HDFS
  • Replication not applicable to directories 
  • Globbing: wildcards to match multiple files in a directory, e.g. input/ncdc/all/190{1,2}.gz

Reading Data 
  • The FileSystem API contains static factory methods (e.g. FileSystem.get()) to obtain an instance and open an input stream
  • A file in HDFS is represented by a Path object
  • Default file system properties are stored in conf/core-site.xml
    • FSDataInputStream
      • FileSystem.open() returns an FSDataInputStream object
      • It is positioned readable; seek() is expensive
  • If a read over FSDataInputStream fails, the error is reported to the NN and the client reads the block from the next closest datanode; see the read sketch below
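A minimal read sketch (the URI is hypothetical): FileSystem.get() is the static factory and open() returns the FSDataInputStream:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode/user/demo/sample.txt";   // hypothetical file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      in.seek(0);                                  // seek() is supported but expensive
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
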
Writing Data
  • FSDataOutputStream (returned by FileSystem.create()) is the main class; see the write sketch below
  • If a write over FSDataOutputStream fails, the pipeline is closed, partial blocks are deleted, the bad data node is removed from the pipeline, and the NN re-replicates any under-replicated blocks
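A minimal write sketch (the path is hypothetical): create() returns an FSDataOutputStream that feeds the datanode pipeline:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode/user/demo/out.txt";      // hypothetical file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataOutputStream out = fs.create(new Path(uri));     // overwrites by default
    out.writeUTF("hello hdfs");
    out.close();                                           // flushes the remaining packets
  }
}
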
Replica Placement
  • 1st replica on the client’s node (or a random node if the client is outside the cluster), 2nd replica off-rack on a random node, 3rd replica on the same rack as the 2nd but a different node; this applies to reducer output as well
Coherency 
  • Same data across all replicated data nodes; sync() forces buffered data to be flushed to the datanodes so it becomes visible to readers
HAR Hadoop Archives
  • Typically used for large numbers of small files
  • Packs many small files efficiently into larger archive files, reducing name node memory usage
  • Created with the archive tool (hadoop archive command)
  • Limitations
    • No archive compression
    • Archives are immutable; to add or remove files the HAR must be re-created

Friday, October 17, 2014

Hadoop Definitive Guide Byte Sized Notes - 1

Chapter 1

A common way of avoiding data loss is through replication; storage is handled by HDFS, analysis by MapReduce

Disk Drive

  • Seek Time: the time taken to move the disk’s head to a particular location on the disk to read or write data; a typical RDBMS uses a B-Tree, which has good seek characteristics
  • Transfer Rate: disk bandwidth when streaming data; Big Data apps need a high transfer rate because the access pattern is write once, read many times

New in Hadoop 2.X
  • HDFS HA
  • HDFS Federation
  • YARN MR2
  • Secure Authentication
Chapter 2 MapReduce

i/p -> map -> shuffle -> sort -> reduce -> o/p

Some differences between the old (mapred) and new (mapreduce) APIs
  1. package mapred -> mapreduce
  2. the Context object unifies JobConf, OutputCollector, Reporter
  3. JobClient deprecated
  4. map o/p: part-m-nnnnn, reduce o/p: part-r-nnnnn
  5. reduce values changed from Iterator to Iterable; see the reducer sketch below
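An illustrative new-API reducer (class name and types are made up) showing items 2 and 5 above, the unified Context object and Iterable values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {        // Iterable, not Iterator as in the old API
      max = Math.max(max, value.get());
    }
    context.write(key, new IntWritable(max)); // Context replaces OutputCollector/Reporter
  }
}
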
Data Locality
  • To run a map task on the data node where the data resides
  • map task writes output to local disk not HDFS
  • No data locality for reduce, o/p of all mappers gets transferred to reduce node
  • # of reduce tasks are not governed by size of i/p
  • 0 reduce tasks are possible
Combiner
  • performed on the output of the map task
  • works for associative operations: max(0,20,10,15,25) = max( max(0,20,10), max(25,15))
    • boosts performance by pre-aggregating data before it is sent across the wire, so the reducer has a smaller data set to work with; see the job-setup sketch below
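A job-setup sketch; MaxValueReducer is the hypothetical reducer sketched earlier in these notes, reused as the combiner since max is associative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerJob {
  public static Job build(Configuration conf) throws Exception {
    Job job = new Job(conf, "max with combiner");
    job.setCombinerClass(MaxValueReducer.class);  // runs on map output before the shuffle
    job.setReducerClass(MaxValueReducer.class);
    // mapper, input/output formats and paths would be set here as usual
    return job;
  }
}
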

Sunday, June 29, 2014

GasFillUp

There are many websites and apps for managing and handling gas receipts, but I wanted to try building something of my own. I have been working on a responsive web application called GasFillUp to track gas receipts and provide a simple report of gas usage. It's in beta at http://gasfillup.azurewebsites.net. All along the way I was cherry-picking neat and interesting things that others had created, and it has been a great end-to-end learning experience. Things still to do: implement error handling, email support, more charts, and other performance improvements.
This project revolves around one simple use case: one user can have more than one car. The app has three distinct features: vehicle info, data entry, and a dashboard with an infographic-style report that displays historical gas receipt information. Upon registering, users have to create a profile for each vehicle.
Data entry begins by selecting the car.
Initially the dashboard is empty; as the user enters information into the system, it populates with nice infographics.
The app is built using ASP.NET Web Forms, plain old ADO.NET, MySQL, and Bootstrap for responsive interaction. Currently the site is hosted on Microsoft Azure and the database with ClearDB. The hosting platform is free, so it comes with plenty of limitations, but it is a quick way to port a website, and Microsoft Azure looks really promising. The source code is hosted on GitHub; clone/fork it here GitHub

Saturday, June 14, 2014

Apple WWDC 2014 feast for developers!



Phew... look at the list of technologies announced: enhanced Messages, Handoff, SpriteKit, CloudKit, HomeKit, MetalKit, Swift (a cross between JavaScript and Python) and many more. This is the perfect time to jump in and be a part of the Apple family.

If you observe the trend in computing machines, handheld devices (a 64-bit processor in your palm) and servers (throw in 32 GB of RAM) are beefing up like crazy, while the in-betweeners, the MBP/Air, are slowly decaying. How are the big players coping with this paradigm shift in hardware? Microsoft's answer: ditch your PC, adopt the Surface Pro. Apple's answer: Continuity. Apps are shared across devices, and if the Handoff technology really works that seamlessly it would be a game changer.
 
Last year Apple released a 64-bit processor for the iPad and iPhone; this year iOS 8 is loaded with features that can truly bring value to the new ecosystem. What really surprised me was the new programming language, Swift; Objective-C had been the de facto language at Apple for nearly three decades. I believe Swift has the potential to be for Apple what Java and .NET were for the enterprise. Overall, we have barely scratched the surface of the mobile platform/computing, but we are definitely headed in the right direction.

Sunday, February 9, 2014

How to remove extra page with tablix SSRS 2012

Recently I had to create a transaction summary report for my accounts receivable project. To display the transactions I chose a tablix, as shown below


The tablix was grouped on a sequenceid (a rank assigned to all the transactions for one clientid), and I applied a page break at the end of the group


I thought the report was done; however, in print preview the report was producing a blank page between groups. The fix is to ensure

Body Width <= Page Width - (Left Margin + Right Margin)



page width = 8.5in, left margin = 0.164in, right margin = 0.164in

(page width – (left margin + right margin)) = 8.172in

Click on any white space on the report and hit F4 to see body properties



The Width of the body was 8.2in, which exceeded the 8.172in limit and was causing the blank pages, so I reduced it to 8.15in and that fixed the issue.


Thursday, February 6, 2014

Extended Properties in SQL Server

Extended properties (EP) are SQL Server objects that can be used to document other database objects such as tables, views, and functions. Extended properties offer a common, standard place for information about various database objects

Extended properties have three level categories, such as Schema -> Table -> Column. Adding, altering, and dropping extended properties are handled through system stored procedures

sp_addextendedproperty  -> adds a new extended property to a database object
sp_dropextendedproperty -> removes an extended property from a database object
sp_updateextendedproperty -> updates the value of an existing extended property

The above stored procedures have the following parameters

@name = Name of the extended property  
@value = Comments/Description
@level0type = Schema, @level0name = name of schema (dbo)
@level1type = Table,  @level1name = table name
@level2type = Column, @level2name = column name

The level0/1/2 types are not limited to Schema/Table/Column; refer to http://blogs.lessthandot.com/index.php/datamgmt/datadesign/document-your-sql-server-databases/ for a detailed explanation

Consider a sample schema,

create table dbo.Consumers
(
            SeqID int identity(1,1) not null,
            Name varchar(100) not null
)

To add a description for the SeqID column we can use ‘sp_addextendedproperty’ system stored procedure

-- creating
exec sp_addextendedproperty
@name = N'MS_Description',
@value = 'Sequence ID to identify consumers, auto incrementing number ',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID;

-- altering
exec sp_updateextendedproperty
@name = N'MS_Description',
@value = N'Sequence ID to identify consumers, auto incrementing number, seed value 1',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID;

-- dropping
EXEC sp_dropextendedproperty
@name = N'MS_Description',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID



The best use of extended properties is getting the column definitions of a table. I always wrap the following SQL query in a stored procedure and filter it by table name

-- viewing
select col.name as [Column], ep.value as  [Description]
from sys.tables tab inner join sys.columns col
on tab.object_id = col.object_id left join sys.extended_properties ep
on tab.object_id = ep.major_id
and col.column_id = ep.minor_id
and ep.name = 'MS_Description'
where tab.name = @TabName