Monday, December 8, 2014

Hadoop Definitive Guide Byte Sized Notes - 6

Chapter 7 MapReduce Types and Formats

General form of map & reduce 
  • map: (k1,v1) -> list(k2,v2)
  • reduce: (k2, list(v2)) -> list(k3, v3)
  • Keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle and sort phase.
  • Reduce input must have the same types as the map output
  • The number of map tasks is not set explicitly; it equals the number of splits that the input is turned into
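
A minimal sketch of how those type signatures look with the new (org.apache.hadoop.mapreduce) API. The MaxTemperatureMapper/MaxTemperatureReducer names and the "year,temperature" input lines are made up for illustration, not taken from any particular code base.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (k1=LongWritable, v1=Text) -> list(k2=Text, v2=IntWritable)
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split(",");   // assumed input line: "1950,22"
    context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
  }
}

// reduce: (k2=Text, list(v2=IntWritable)) -> list(k3=Text, v3=IntWritable)
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable v : values) {
      max = Math.max(max, v.get());
    }
    context.write(key, new IntWritable(max));   // Text/IntWritable implement WritableComparable, so keys sort fine
  }
}
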
Input formats 
  • Data is passed to the mapper by an InputFormat; an InputFormat is a factory of RecordReader objects that extract the (key, value) pairs from the input source
  • Each mapper deals with single input split no matter the file size
  • Input Split is the chunk of the input processed by a single map, represented by InputSplit
  • InputSplits are created by the InputFormat; FileInputFormat is the default base for file-based input
  • Can process Text, XML, Binary 
RecordReader 
  • Ensure key, value pair is processed
  • Ensure (k, v) not processed more than once
  • Handle (k,v) which get split
Output formats: Text output, binary output 
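
A hedged sketch of wiring input and output formats on a Job with the new API; the paths come from the command line, and since no mapper/reducer is set the identity classes run, so the output types simply match the input format's (LongWritable offset, Text line) pairs.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatWiringDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(FormatWiringDriver.class);

    job.setInputFormatClass(TextInputFormat.class);    // the default for files: key = byte offset, value = line
    job.setOutputFormatClass(TextOutputFormat.class);  // text output: key <tab> value per line

    // identity mapper/reducer are used because none are set, so output types match the input format's
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
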


Chapter 8 MapReduce Features

Counters
  • Useful for QC, problem diagnostics
  • Built-in counters come in two groups, task & job; user-defined counters are also supported
    • Task Counter
      • gather information about task level statistics over execution time
      • count info like input records, skipped records, output bytes, etc.
      • they are sent across the network
    • Job Counter
      • measure job level statistics
    • User defined counters 
      • Counters are defined by a Java enum
      • Counters are global
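
A hedged sketch of a user-defined counter; the enum and the notion of "malformed records" are made up for illustration. The framework aggregates the increments from every task into a single global value shown with the job's counters.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  // user-defined counters are declared with a Java enum
  enum Quality { MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) {
      context.getCounter(Quality.MALFORMED).increment(1);   // shows up in the job's counter listing
      return;
    }
    context.write(new Text(fields[0]), NullWritable.get());
  }
}
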
Sorting
  • Sort order is controlled by a RawComparator
  • Partial sort, secondary sort (Lab)
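
A hedged sketch of plugging in a custom comparator to control the sort order (here simply reversing Text's natural order); ReverseTextComparator is not a Hadoop class, just an example built on WritableComparator, which implements RawComparator.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ReverseTextComparator extends WritableComparator {
  public ReverseTextComparator() {
    super(Text.class, true);   // true -> instances are created so compare(WritableComparable, ...) can be used
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);   // flip the natural (ascending) order
  }
}

// In the driver, this becomes the sort order used for the map output keys during the shuffle:
//   job.setSortComparatorClass(ReverseTextComparator.class);
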
Joins

To run a map-side join, use CompositeInputFormat from the org.apache.hadoop.mapreduce.join package

Side Data: Read-only data required to process the main data set, e.g. a list of words to skip for word count

Distributed Cache: To save network bandwidth, side data files are copied to each node only once per job.
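
A hedged sketch of the side-data idea with the 2.x distributed cache API; the stopwords file name, the "#stopwords" symlink fragment, and the word-count shape of the mapper are all assumptions for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side: ship the side-data file with the job (path is a placeholder)
//   job.addCacheFile(new java.net.URI("/user/me/stopwords.txt#stopwords"));

public class StopWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Set<String> stopWords = new HashSet<>();

  @Override
  protected void setup(Context context) throws IOException {
    // the "#stopwords" fragment above is assumed to be symlinked into the task's working directory
    try (BufferedReader in = new BufferedReader(new FileReader("stopwords"))) {
      String line;
      while ((line = in.readLine()) != null) {
        stopWords.add(line.trim());
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      if (!stopWords.contains(word)) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
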

Map-only Jobs: image processing, ETL, SELECT * FROM table, distcp, input data sampling

Partitioning 

A MapReduce job will contain more than one reducer most of the time. So when a mapper emits a key-value pair, it has to be sent to one of the reducers. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers), and a Partitioner is responsible for performing the partitioning.

For more info, this article has a detailed explanation:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
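
A hedged sketch of a custom Partitioner; routing by the first character of the key is just an example, and FirstLetterPartitioner is a made-up class name.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // every (key, value) whose key starts with the same letter goes to the same reducer
    char first = key.toString().isEmpty() ? '_' : key.toString().charAt(0);
    return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);
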

  • Zero reducers: only map tasks run
  • One reducer: when the amount of data is small enough to be processed comfortably by one reducer
Combiners
  • Used for reducing intermediate data
  • Runs locally on single mapper’s  o/p
  • Best suited for commutative and associative operations
  • Code identical to Reducer 
  • conf.setCombinerClass(Reducer.class);
  • May or may not run on the output from some or all mappers
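
With the new API the same wiring is done on the Job object; this minimal sketch assumes the MaxTemperatureMapper/MaxTemperatureReducer classes sketched in the Chapter 7 notes above (max is commutative and associative, so the reducer can double as the combiner).

import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setMapperClass(MaxTemperatureMapper.class);     // assumed: the mapper sketched earlier
    job.setCombinerClass(MaxTemperatureReducer.class);  // runs on each mapper's local output, possibly 0..n times
    job.setReducerClass(MaxTemperatureReducer.class);   // same class: safe because max is commutative/associative
  }
}
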

Hadoop Definitive Guide Byte Sized Notes - 4

Chapter 5 Map Reduce Process Flow

General dev process
  • Write & test map & reduce functions in dev/local
  • Deploy job to cluster
  • Tune job later
The Configuration API
  • Collection of configuration properties and their values are instance of Configuration class (org.apache.hadoop.conf package)
  • Properties are stored in xml files
  • Default properties are stored in core-default.xml; there are multiple conf files for multiple settings
  • If mapred.job.tracker is set to local then execution is local job runner mode
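
A minimal sketch of the Configuration API; the resource file name and property names here ("my-conf.xml", "color", "size") are placeholders.

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();      // loads core-default.xml, core-site.xml, ...
    conf.addResource("my-conf.xml");               // later resources override earlier ones
    conf.set("color", "yellow");                   // programmatic settings override file values
    System.out.println(conf.get("color"));
    System.out.println(conf.getInt("size", 0));    // 0 is the default if the property is unset
  }
}
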
Setting up Dev
  • Pseudo-distributed cluster is one whose daemons all run on the local machine
  • Use MRUnit for testing Map/Reduce
  • Write separate classes for cleaning the data 
  • Tool and ToolRunner are helpful for debugging and parsing command-line options
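
A hedged sketch of a Tool driver run through ToolRunner, which parses the generic options (-conf, -D key=value, etc.) into the Configuration before run() is called; the ConfDumpTool name is made up, and all it does is print the resulting configuration.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfDumpTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // already populated from -D and -conf options
    for (Map.Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ConfDumpTool(), args));
  }
}
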
Running on Cluster
  • Job's classes must be packaged to a jar file
  • setJarByClass() tells hadoop to find the jar file for your program
  • Use Ant or Maven to create jar file
  • On a cluster map and reduce tasks run in separate JVM
  • waitForCompletion() launches the job and polls for progress
  • Ex: 275 GB of input data read from 34 GB of compressed files broken into 101 gzip files
  • Typical Job id format: job_YYYYMMDDTTmm_XXXX
  • Tasks belong to Jobs, typical TaskID format task_Jobid_m_XXXXXX; m -> map, r -> reduce
  • Use the MR web UI to track the jobtracker page, the tasks page, and the task details page
Debugging Job
  • Use WebUI to analyse tasks & tasks detail page 
  • Use custom counters
  • Hadoop Logs: Common logs for system daemons, HDFS audit, MapRed Job hist, MapRed task hist
Tuning Jobs

Can I make it run faster? Analysis done after job is run
  • If mappers are taking only a short time, use fewer mappers and let each one run longer
  • # of reducers should be less than reduce slots 
  • Combiners (for associative operations) reduce map output
  • Map output compression
  • Serialization other than Writable
  • Shuffle tweaks
  • Profiling Tasks: HPROF profiling tool comes with JDK
More than one MapReduce & workflow management

  • ChainMapper/ChainReducer to run multiple maps
  • Linear chain job flow (a new-API sketch follows this list)
    • JobClient.runJob(conf1);
    • JobClient.runJob(conf2);
  • Apache Oozie used for running dependent jobs, composed of workflow engine, coordinator engine (Lab)
    • unlike JobControl Oozie runs as a service 
    • workflow is a DAG of action node and control flow nodes
    • Action nodes: moving files in HDFS, MR, Pig, Hive
    • Control-flow nodes coordinate the action nodes
    • Workflows are written in XML; a minimal one has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node.
    • All workflows must have one start and one end node.
    • Packaging, deploying and running oozie workflow job
    • Triggered by predicates, usually a time interval, a date, or an external event
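
Relating to the linear chain job flow above, a hedged sketch of the same idea with the new API (the old-API version is the JobClient.runJob(conf1)/runJob(conf2) pair); the actual mapper/reducer/path set-up is elided as comments.

import org.apache.hadoop.mapreduce.Job;

public class LinearChainDriver {
  public static void main(String[] args) throws Exception {
    Job first = Job.getInstance();
    first.setJobName("step-1");
    // ... configure mapper/reducer/paths for step 1 ...
    if (!first.waitForCompletion(true)) {
      System.exit(1);                       // stop the chain if step 1 fails
    }

    Job second = Job.getInstance();
    second.setJobName("step-2");
    // ... step 2 typically reads the output path of step 1 ...
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}
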


Saturday, October 25, 2014

Hadoop Definitive Guide Byte Sized Notes - 3

Chapter 4 Hadoop IO
 
Data Integrity in Hadoop
  • For data integrity CRC-32 is usually used, but with low-end hardware it could be the checksum which is corrupt, not the data
  • Datanodes are responsible for verifying the data they receive before storing the data and its checksum
  • Applies to data that they receive from clients and from other datanodes during replication
  • Each datanode keeps a persistent log of checksum verifications
LocalFileSystem
  • Hadoop LocalFileSystem performs client-side check summing
  • Creates hidden .filename.crc
  • To disable checksum use RawLocalFileSystem
  • FileSystem fs = new RawLocalFileSystem(); fs.initialize(null, conf);
Compression
  • More space to store files, faster data transfer across network
  • Splittable shows whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
  • Compression algo have space/time trade off
    • Gzip sits in middle
    • bzip2 compresses better than gzip but is slower
    • LZO, LZ4, Snappy are faster
    • bzip2 is the only one of these that is splittable
Codecs
  • A codec is the implementation of a compression-decompression algorithm
  • We can compress a file before writing; for reading a compressed file, CompressionCodecFactory is used to infer the codec from the file name extension
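
A hedged sketch of that lookup: CompressionCodecFactory infers the codec from the extension and the matching codec decompresses the stream; the input path comes from the command line.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);                        // e.g. file.gz

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);      // inferred from the .gz/.bz2/... extension
    if (codec == null) {
      System.err.println("No codec found for " + input);
      System.exit(1);
    }

    try (InputStream in = codec.createInputStream(fs.open(input))) {
      IOUtils.copyBytes(in, System.out, conf, false);      // stream decompressed bytes to stdout
    }
  }
}
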
Compression and Input Splits
  • Use a container file format such as SequenceFile, RCFile or Avro
  • Use compression algo which support splitting
Using Compression in MapReduce
  • To compress the output of a MapReduce job, in the job configuration, set the mapred.output.compress property to true
  • We can compress output of a map
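
A hedged sketch of both settings: the FileOutputFormat helpers below are the new-API equivalent of mapred.output.compress, and mapreduce.map.output.compress is (I believe) the 2.x property name for compressing intermediate map output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);    // compress intermediate map output
    Job job = Job.getInstance(conf);

    FileOutputFormat.setCompressOutput(job, true);             // compress the final job output
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  }
}
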
Serialization
  • Serialization is the process of converting objects to a byte stream for transmitting over the network; deserialization is the opposite
  • In Hadoop, inter-process communication between nodes is implemented using remote procedure calls (RPCs); RPC serialization should be compact, fast, extensible, and interoperable
  • Hadoop uses its own serialization format, Writables; Avro is used to overcome the limitations of the Writable interface
  • The Writable Interface 
            Interface for serialization
            IntWritable wrapper for Java int, use set() to assign value
            IntWritable writable = new IntWritable();
            writable.set(163);
            or 
            IntWritable writable = new IntWritable(163);
  • For writing we use DataOutput binary stream and for reading we use DataInput binary stream
  • DataOutput converts 163 into a byte array; to deserialize, the byte array is converted back to an IntWritable
  • WritableComparable and comparators
    • Comparison of types is crucial for MapReduce
    • IntWritable implements the WritableComparable interface
    • This interface permits users to compare records read from a stream without deserializing them into objects (no overhead of object creation)
    • keys must be WritableComparable because they are passed to reducer in sorted order
    • There are Writable wrappers for most Java primitives: boolean, byte, short, int, float, long, double; Text is the Writable for UTF-8 sequences
    • BytesWritable is a wrapper for an array of binary data
    • NullWritable is a special type of Writable, as it has a zero-length serialization; no bytes are written to or read from the stream
    • ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these type
  • Writable collections
    • There are six Writable collection types, e.g. for a 2D array use TwoDArrayWritable
    • We can create a custom writable class that implements WritableComparable (Lab)
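
A hedged sketch of such a custom WritableComparable (a simple pair of ints); the IntPairWritable name and layout are made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class IntPairWritable implements WritableComparable<IntPairWritable> {
  private int first;
  private int second;

  public IntPairWritable() { }                       // Writables need a no-arg constructor

  public IntPairWritable(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {      // serialize to the binary stream
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {   // deserialize in the same order
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPairWritable o) {                   // defines the sort order when used as a key
    int cmp = Integer.compare(first, o.first);
    return cmp != 0 ? cmp : Integer.compare(second, o.second);
  }
}
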
Serialization Frameworks
  • Hadoop has an API for pluggable serialization frameworks
Avro
  • Language-neutral data serialization system
Why Avro?
  • Schemas are written in JSON; data is encoded in binary
  • Language-independent schema has rich schema resolution capabilities
  • .avsc is the conventional extension for an Avro schema, .avro is data
  • Avro’s language interoperability
File-Based Data Structures
  • Special data structure to hold data
SequenceFile
  • Binary encoded key, value pairs
  • To save log files text is not suitable
  • Useful for storing binary key value pairs, .seq ext
  • SequenceFileInputFormat Binary file (k,v)
  • SequenceFileAsTextInputFormat (key.toString(), value.toString())
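
A hedged sketch of writing binary (key, value) pairs with the classic SequenceFile.createWriter() overload; the numbers.seq path and the record contents are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");               // .seq is the conventional extension

    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
      for (int i = 1; i <= 100; i++) {
        writer.append(new IntWritable(i), new Text("line " + i));   // binary-encoded (key, value)
      }
    } finally {
      if (writer != null) {
        writer.close();
      }
    }
  }
}
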
MapFile
  • A sorted SequenceFile with an index to permit lookups by key
  • MapFile.Writer to write, .map ext
  • Numbers.map/index contains keys
  • Variation of mapfile
    • SetFile for storing Writable keys
    • ArrayFile int index
    • BloomMapFile fast get()
  • The MapFile.fix() method is usually used for re-creating corrupted indexes
  • Sequence files can be converted to mapfiles 

Monday, October 20, 2014

Hadoop Definitive Guide Byte Sized Notes - 2

Chapter 3 HDFS
  • Hadoop can be integrated with local file system, S3 etc
  • Writes always happen at the end of the file; data cannot be appended in between
  • Default block size 64 MB
Name Node
  • The name node holds the file system metadata, the namespace and tree, in memory; without this metadata there is no way to access files in the cluster
  • Stores this info on local disk in 2 files: the namespace image & the edit log
  • NN keeps track of which blocks make up a file
  • Block locations are dynamically updated
  • NN must be running at all times
  • When clients read a file, the NN will not be a bottleneck (data is served directly by the datanodes)
Secondary Name Node 
  • Periodically merges the name node's namespace image with the edit log and keeps a copy of the merged image
  • In case of name node failure, the merged namespace image can be used for recovery
HDFS Federation
  • Clusters can scale by adding name nodes, each NN manages a portion of file system namespace 
  • Namespace volumes are independent; name nodes don't communicate with each other, and the failure of one doesn't affect the availability of the others
High Availability 
  • Pair of NN in active-stand by config
  • The NNs use shared storage to share the edit log
  • DNs send block reports to both NNs
Fail Over
  • A process monitors the NNs for failure; when a failure is detected, failover is triggered and the previously active NN is gracefully terminated where possible
Fencing
  • The previously active name node is prevented from doing damage; the following measures help prevent corruption:
    • Disable the network port
    • Kill all its processes
    • Revoke access to shared storage
File System
  • Cannot execute a file in HDFS
  • Replication not applicable to directories 
  • Globbing wild cards to match multiple files in a directory, sample file selection: input/ncdc/all/190{1,2}.gz
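
A hedged sketch of expanding that glob programmatically with FileSystem.globStatus(); the path pattern is the one from the bullet above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] matches = fs.globStatus(new Path("input/ncdc/all/190{1,2}.gz"));
    for (FileStatus status : matches) {
      System.out.println(status.getPath());     // prints the 1901 and 1902 files if present
    }
  }
}
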

Reading Data 
  • The FileSystem API contains static factory methods to open an input stream
  • A file in HDFS is represented by a Path object
  • Default file system properties are stored in conf/core-site.xml  
    • FSDataInputStream
      • FileSystem's open() returns an FSDataInputStream object
      • positioned readable; seek() is expensive
  • If FSDataInputStream fails, it sends the error to the NN and the NN sends the next closest block address
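
A hedged sketch of the read path: FileSystem.get() picks up the configured default file system, open() returns an FSDataInputStream, and seek() repositions the stream (relatively expensive). The file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path("/user/me/sample.txt"));       // placeholder path
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0);                                          // go back to the start and read again
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
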
Writing Data
  • FSDataOutputStream main class
  • If FSDataOutputStream fails, the pipeline is closed, partial blocks are deleted, the bad data node is removed from the pipeline, and the NN replicates the under-replicated blocks
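
A hedged sketch of the write path with FSDataOutputStream; the path and contents are placeholders, and hflush() is (as far as I know) the 2.x call for pushing buffered data to the datanodes, where the older API used sync().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/me/output.txt");            // placeholder path
    try (FSDataOutputStream out = fs.create(path)) {        // data is streamed down the datanode pipeline
      out.writeUTF("hello hdfs");
      out.hflush();                                         // make the written data visible to readers
    }
  }
}
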
Replica Placement
  • 1st replica at random node, 2nd replica off rack to random node, 3rd replica same rack as 2nd but diff node, applicable to reducers output as well
Coherency 
  • Same data across all replicated data nodes, sync()
HAR Hadoop Archives
  • Typically used for lots of small files
  • Packs many small files into HDFS blocks efficiently
  • Archive tool
  • Limitations
    • No archive compression
    • Immutable: to add or remove files, the HAR must be recreated

Friday, October 17, 2014

Hadoop Definitive Guide Byte Sized Notes - 1

Chapter 1

A common way of avoiding data loss is through replication. Storage: HDFS; analysis: MapReduce.

Disk Drive

  • Seek Time: the time taken to move the disk's head to a particular location on the disk to read or write data; a typical RDBMS uses B-Trees, which have good seek characteristics
  • Transfer Rate: the streaming of data (disk bandwidth); for Big Data apps we need a high transfer rate, because the pattern is write once, read many times

New in Hadoop 2.X
  • HDFS HA
  • HDFS Federation
  • YARN MR2
  • Secure Authentication
Chapter 2 MapReduce

i/p -> map -> shuffle -> sort -> reduce -> o/p

Some diff between MR1/MR2
  1. mapred -> mapreduce
  2. Context object unifies JobConf, OutputCollector, Reporter
  3. JobClient deprecated
  4. map o/p: part-m-nnnnn, reduce o/p: part-r-nnnnn
  5. iterator to iterable  
Data Locality
  • To run a map task on the data node where the data resides
  • map task writes output to local disk not HDFS
  • No data locality for reduce, o/p of all mappers gets transferred to reduce node
  • # of reduce tasks are not governed by size of i/p
  • 0 reduce tasks are possible
Combiner
  • performed on the output of the map task
  • associative property: max(0,20,10,15,25) = max( max(0,20,10), max(25,15) )
    • boosts performance by pre-aggregating data before it is sent across the wire, so the reducer has a smaller data set to work with

Sunday, June 29, 2014

GasFillUp

There are many websites and apps for managing and handling gas receipts, but I wanted to try out something of my own. I have been working on a responsive web application called GasFillUp to track gas receipts and provide a simple report to display gas usage. It's in beta at http://gasfillup.azurewebsites.net. All along the way I was cherry-picking neat and interesting ideas from what others have created, and it has been a great end-to-end learning experience. Things to do: implement error handling, email support, more charts, and other performance improvements.
This project revolves around one simple use case: one user can have more than one car. The app has three distinct features: vehicle info, data entry, and a dashboard with an infographic-style report to display historical gas receipt information. Upon registering, users have to create a profile for their vehicle.
Data entry begins by selecting the car.
Initially the dashboard is empty; as the user enters information into the system, the dashboard populates with nice infographics.
The app is built using ASP.NET Web Forms, plain old ADO.NET, MySQL, and Bootstrap for responsive interaction. Currently the site is hosted on Microsoft Azure and the database is with ClearDB. The hosting plan is free, hence the many limitations, but it is a quick way to port a website, and Microsoft Azure looks really promising. The source code is hosted on GitHub; clone/fork it here: GitHub

Saturday, June 14, 2014

Apple WWDC 2014 feast for developers!



Phew... look at the list of technologies announced: enhanced messages, Handoff, SpriteKit, CloudKit, HomeKit, MetalKit, Swift (a cross between JavaScript and Python) and many more. This is the perfect time to jump in and be a part of the Apple family.

If you observe the trend in computing machines, handheld devices (a 64-bit processor in your palm) and servers (throw in 32 GB of RAM) are beefing up like crazy, while the in-betweeners, the MBP/Air, are slowly decaying. How are the big players coping with this paradigm shift in hardware? Microsoft's answer: ditch your PC, adopt the Surface Pro. Apple's answer: Continuity. It's apps being shared across devices, and if the Handoff technology really works that seamlessly, it would be a game changer.
 
Last year Apple released a 64-bit processor for the iPad and iPhone; this year iOS 8 is loaded with stuff which can truly bring value to the new ecosystem. What really surprised me was the new programming language Swift; Objective-C had been the de facto language at Apple for three decades. I believe Swift has the potential to be what Java & .NET were for the enterprise. Overall, we have barely scratched the surface of the mobile platform/computing, and we are definitely headed in the right direction.

Sunday, February 9, 2014

How to remove extra page with tablix SSRS 2012

Recently I had to create a transaction summary report for my accounts receivable project. To display the transactions I chose a tablix as shown below


The grouping for the tablix was on a sequenceid (a rank assigned to all the transactions with one clientid), and I applied a page break at the end of the group


I thought the report was done; however, in the print preview the report was printing a blank page in between the groups. The fix is to make sure that:

Body Width <= Page Width - (Left Margin + Right Margin)



report width= 8.5, left margin = 0.164, right margin = 0.164

(report width – (left margin + right margin)) = 8.172

Click on any white space on the report and hit F4 to see body properties



The body width was 8.2in, which exceeded 8.172in; this threw me off and I was getting blank pages, so I reduced it to 8.15in and that fixed the issue.


Thursday, February 6, 2014

Extended Properties in SQL Server

Extended properties (EP) are SQL Server objects which can be used for documenting other database objects such as tables, views, functions etc. Extended properties offer a common standard place for information about various database objects

Extended properties have three-level categories, such as Schema -> Table -> Column. Adding, altering and dropping extended properties is handled through system stored procedures:

sp_addextendedproperty  -> adds a new extended property to a database object
sp_dropextendedproperty -> removes an extended property from a database object
sp_updateextendedproperty -> updates the value of an existing extended property

The above stored procedures have the following parameters

@name = Name of the extended property  
@value = Comments/Description
@level0type = Schema, @level0name = name of schema (dbo)
@level1type = Table,  @level1name = table name
@level2type = Column, @level2name = column name

The Level0/1/2Type’s are not limited to Schema/Table/Column, refer to http://blogs.lessthandot.com/index.php/datamgmt/datadesign/document-your-sql-server-databases/, for a detailed explanation

Consider a sample schema,

create table dbo.Consumers
(
            SeqID int identity(1,1) not null,
            Name varchar(100) not null
)

To add a description for the SeqID column we can use ‘sp_addextendedproperty’ system stored procedure

-- creating
exec sp_addextendedproperty
@name = N'MS_Description',
@value = 'Sequence ID to identify consumers, auto incrementing number ',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID;

-- altering
exec sp_updateextendedproperty
@name = N'MS_Description',
@value = N'Sequence ID to identify consumers, auto incrementing number, seed value 1',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID;

-- dropping
EXEC sp_dropextendedproperty
@name = N'MS_Description',
@level0type = N'Schema', @level0name = dbo,
@level1type = N'Table',  @level1name = Consumers,
@level2type = N'Column', @level2name = SeqID



The best use of extended properties is getting the column definitions of a table. I always wrap the following SQL query in a stored procedure and filter it by table name (the @TabName parameter):

-- viewing
select col.name as [Column], ep.value as  [Description]
from sys.tables tab inner join sys.columns col
on tab.object_id = col.object_id left join sys.extended_properties ep
on tab.object_id = ep.major_id
and col.column_id = ep.minor_id
and ep.name = 'MS_Description'
where tab.name = @TabName




Wednesday, January 29, 2014

Note on precision & scale in SQL Server

Generally, for storing a value like 123.45 we can use the decimal or numeric data type. The usual notation for these data types is decimal[(p[,s])] & numeric[(p[,s])]; p stands for precision & s stands for scale.

Precision is the total number of digits in the value
Scale is the number of digits to the right of the decimal point
For ex: 12.345 has p = 5 & s = 3

The sample query explains the behavior of the decimal, float & int

declare @t table(
     c1 int,
     c2 float,
     c3 decimal(18,2)
)


insert into @t (c1,c2,c3) select 12.345, 12.345, 12.345
insert into @t (c1,c2,c3) select 10/3, 10/3, 10/3
insert into @t (c1,c2,c3) select 10/3.0, 10/3.0, 10/3.0
insert into @t (c1,c2,c3) select 10/3.0, 10/3.0, round(10/3.0, 2)


select * from @t


c1        c2                    c3
12        12.345                12.35
Note: The value in c3 is rounded to two decimal places; c1 completely ignored the fraction part.

3         3                     3.00
Note: The fraction part is lost because both operands of 10/3 are integers, so integer division is performed.

3         3.333333              3.33
Note: The fraction part is retained because the denominator is a decimal (10/3.0).

3         3.333333              3.33
Note: The same result can also be achieved by rounding the value with the round function.