Friday, November 29, 2013

TSQL Rambling


I wanted to share my thoughts on three interesting things I stumbled across in T-SQL.


1. SQL Unary Operator


Let's create a simple table as follows


declare @TestPlus table
(
Type varchar(2),
TypeName varchar(10),
code int
)


insert into @TestPlus values ('P', 'Procedure', 1)
insert into @TestPlus values ('T', 'Table', 2)
insert into @TestPlus values ('V', 'View', 3)
insert into @TestPlus values ('F', 'Function', 4)


What is the output of the following select queries?


select Type from @TestPlus where Code in ( +1, +++3, +4)


select +++Type from @TestPlus where Type in (+'P', +++'T')


When I saw these queries, for a second I thought it was a trick question about some ++ operator; however, it turns out the result has nothing to do with the "+++" operator. I quickly browsed through MSDN and found that the unary plus has no effect on the evaluated expression:


“Although a unary plus can appear before any numeric expression, it performs no operation on the value returned from the expression.”
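
A quick standalone way to see this for yourself (a minimal sketch):

-- unary plus is a no-op: all three columns return 5
select 5 as plain, +5 as unary_plus, +++5 as stacked_unary_plus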





2. SQL Row Constructor


I was surprised to see a query like this


;with cte as
(
select ClientID, SeqNo
from (values  ('100100', 1), ('100101', 2), ('100102', 3) ) f(ClientID, SeqNo)
)


select * from cte where SeqNo = 3


It turns out that this construct is called a table value constructor (row constructor), a feature introduced in SQL Server 2008.
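
Table value constructors are also handy for multi-row inserts; for example, the four single-row inserts from section 1 could be collapsed into one statement (run in the same batch as the table variable declaration):

insert into @TestPlus (Type, TypeName, code)
values ('P', 'Procedure', 1),
       ('T', 'Table', 2),
       ('V', 'View', 3),
       ('F', 'Function', 4)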



3.  SQL Strict Names


Check out the following queries and spot the odd one out


select top 10  * from [dbo].[TestUpSurdTable]


select top 10  * from [dbo].[TestUpSurdTable]_Final


I was expecting two different result sets; however, the same result set came back twice. It turns out that the text after the closing "]" (the "_Final" suffix here) is parsed as a table alias, so both queries read from [dbo].[TestUpSurdTable].
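
A quick way to confirm this is to write the alias explicitly and compare (same table name as above):

select top 10 * from [dbo].[TestUpSurdTable]_Final

select top 10 * from [dbo].[TestUpSurdTable] as _Final  -- the same query with the alias made explicit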

Sunday, November 17, 2013

Best Data Visualization Libraries


I have observed a trend toward infographic-style reporting in business. Infographic reports are very tangible: you can click, drag, drop, and add elements, which makes the report feel more like a standalone app.

I compiled a short list of JavaScript-based visualization libraries that I have personally used. I will keep updating this list as I find something interesting.

1. Google Charts

2. Data Driven Documents

3. Protovis
http://mbostock.github.io/protovis/

Saturday, November 16, 2013

Visual Studio 2012 Designer Issues

I recently uninstalled Visual Studio 2012 from my PC, which already had VS2010 installed. The uninstall messed up the UI designer DLLs and I got the following errors in SSIS & SSMS:

Couldn’t open SSIS packages


The database diagram designer had trouble opening.



The cause: the DLLs at the following location are corrupted
C:\Program Files (x86)\Common Files\microsoft shared\MSDesigners8

The fix:
1. Repair using the VS2010 installation media, or
2. Copy the MSDesigners8 files from a PC that doesn't have this issue.
The second option is easier; I copied the files from a colleague's PC and replaced the contents of the MSDesigners8 folder.


If you have the same issue, you can download the DLLs here:
https://drive.google.com/file/d/0Bz1hNSwS5Wi4WU95OG5ERTQ4cG8/edit?usp=sharing



Sunday, October 27, 2013

Big Data Camp 27/10/2013

Big Data Camp was my first un-conference on big data. It's really different from a regular conference: the sessions are created on the fly, and walking in and out is welcome. From what I saw, it was non-technical, aimed at the beginner-to-intermediate crowd. The host, Dave, managed the event very eloquently. Overall, it was a good event. I compiled short notes on what I liked.

Talk 1: Hadoop in consumer security products
  • This organization uses Hadoop to analyze the sales of its flagship consumer security products
  • Existing data warehouse products and tools were OK, but although they saw a lot of customers showing up on their site, sales were declining.
  • The product was not at fault; the customer analytics was. The primary touch points or data points (server logs, click streams, etc.) were fed into Hadoop; analytics was performed using SQL, Hive, and Mahout, with linear regression used to model their data.

Talk 2: How to choose a data analytics product
The presenter works at Datapad. "There is no one right way; there is a right way for your data and for your team" kind of made sense. He advocated the use of Python and the pandas library.

Talk 3: Black box vs transparent data modeling
I was trying to figure out what his point was; my understanding is that the functionality of any Google product is basically a black box, and he was not cool with that.

Talk 4: Mainframe + Hadoop
This talk was pretty interesting; the presenter works at Syncsort in CA.
  • 75% of data is stored in mainframes (a stat he quoted)
  • Insurance, retail, and banks store data in mainframes
  • Offload mainframe data and batch process it in Hadoop
  • Mainframes store data in the EBCDIC encoding (new info to me)
  • ADP is hiring people for mainframes, seriously?? (Maybe I should learn COBOL.. heck no!!)

Talk 5: Time series databases – things happening in time
This was cool too: an open-source database called InfluxDB, built around events.
  • HTTP-native; show and do the analytics in a browser
  • Read/write, management, and security over HTTP
  • I asked the presenter how this product differs from Storm, Spark & StreamInsight; he could not answer my question 100%
Some other random tidbits: big data GUI tools, IBM BigSheets, Datameer & Talend

Saturday, October 26, 2013

Python Tools for Data Science


I believe Python is increasingly becoming the de facto language for data science applications. Among the many good features of this language, string manipulation is ridiculously easy and it is a very simple language to pick up. In this post I put together the tools and IDEs related to Python which I found useful for data science applications.

Scientific Computing

IDE for Python


Basic math skills for data science  
  • Linear Algebra
  • Statistics
  • Probability
  • Calculus

Saturday, October 19, 2013

Handling M:M ring in Entity Modeling

Complex business process operations can be effectively visualized with Entity Relationship (ER) modeling. ER modeling is a proven way to design and model relational databases; this design approach is based on Dr. Peter Chen's paper published in 1976. Before discussing the M:M ring problem in ER modeling, I will briefly describe the basics of ER modeling.

Nouns in an English sentence are usually the entities, e.g. Customer, Company, Teacher. The properties of these nouns are the attributes of the entities, e.g. height, weight, date.

Entities can be related to each other; the relationship is usually a verb or verb phrase, e.g. "based on", "for", "bought from", "operator of", "issued for", "stored in", "responsible for", etc.

A simple ER example would be "Customers can place Orders"; this relationship is called 1:M, since one customer can place more than one order (M for many). In some cases, such as a person and a social security number, the relationship is 1:1: one person can have one social security number.


Modeling employment history with entities is a bit of a pickle. For instance, consider the entities Person, Company and Position. A person can work at a company holding a position; or we can say a company hired a person for a position; or a position is held by a person employed at a company. Visually this can be represented as shown below.




It becomes increasingly tricky to answer the following question: for each person, which position was held, at what company, and for how long? This configuration leads to what is known as an M:M ring. The issue can be resolved by adding three other small entities, as shown below.
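
Since the diagrams are not reproduced here, the sketch below shows one common way to break the ring in T-SQL: an associative entity that records who held which position, at which company, and for how long (all table and column names are mine, purely for illustration):

-- hypothetical illustration of breaking the Person/Company/Position M:M ring
create table Person   (PersonID   int primary key, FullName     varchar(100));
create table Company  (CompanyID  int primary key, CompanyName  varchar(100));
create table Position (PositionID int primary key, PositionName varchar(100));

-- associative entity: who held which position, at which company, and for how long
create table EmploymentHistory
(
    PersonID   int not null references Person(PersonID),
    CompanyID  int not null references Company(CompanyID),
    PositionID int not null references Position(PositionID),
    StartDate  date not null,
    EndDate    date null,
    constraint PK_EmploymentHistory primary key (PersonID, CompanyID, PositionID, StartDate)
);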



Wednesday, September 18, 2013

TSQL Errors

Error 1: Cannot define PRIMARY KEY constraint on nullable column in table 

Cause: with SET ANSI_NULL_DFLT_ON enabled, a new column created without an explicit NULL/NOT NULL is nullable

Solution: explicitly state NOT NULL (a minimal repro and fix follow below)
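
A sketch of the repro and fix (table and column names are made up for illustration):

-- repro: KeyID picks up NULL from ANSI_NULL_DFLT_ON
create table dbo.TestPK (KeyID int)

-- this would now raise "Cannot define PRIMARY KEY constraint on nullable column..."
-- alter table dbo.TestPK add constraint PK_TestPK primary key clustered (KeyID)

-- fix: make the column NOT NULL before (or when) the key is defined
alter table dbo.TestPK alter column KeyID int not null
alter table dbo.TestPK add constraint PK_TestPK primary key clustered (KeyID)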


Error 2: The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object 

Cause: Duplicate value for the desired columns

Solution: Run the following query to check for the duplicate key, assuming the unique index is on Col1

select Col1 , count(*)
from Customers 
group by col1
having count(*) > 1

How to use table name as a variable in stored procedure

Sometimes it would be nice to pass the table name as a variable in a select query; it's possible by using a dynamic query, as shown below.

create procedure dbo.Test
@tablename sysname,
@val int
as

DECLARE @cmd AS NVARCHAR(max)
-- QUOTENAME() protects the table name; the value is passed as a real parameter instead of being concatenated
SET @cmd = N'SELECT * FROM ' + QUOTENAME(@tablename) +
N' WHERE Column1 = @val'


EXEC sp_executesql @cmd, N'@val int', @val = @val
GO


-- example call (table and column names are placeholders)
exec dbo.Test @tablename = 'Customers', @val = 3

How to series TSQL 2


6. How to find second highest

SELECT COL 
FROM TABLE 
ORDER BY COL DESC
OFFSET 1 ROWS
FETCH NEXT 1 ROWS ONLY;
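
One caveat: if the highest value appears in more than one row, OFFSET 1 skips only one of those duplicates. A sketch of an alternative that returns the second-highest distinct value (same TABLE/COL placeholders as above):

SELECT MAX(COL) AS SecondHighest
FROM TABLE
WHERE COL < (SELECT MAX(COL) FROM TABLE);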


7. How to use multiple cte

;with cte1
as
(select col1, null as col2 from Tab),

cte2
as 
(
select null as col1, col2 from Tab
)

select * from cte1
union
select * from cte2

8. How to rename a column
EXEC sp_rename 'TABLENAME.OLD_COLUMNNAME', 'NEW_COLUMNNAME', 'COLUMN';

How to series TSQL 1


1. How to change col datatype

-- TransDate was in datetime change to date

alter table Transactions 
alter column TransDate date


2.  How to create primary key on a col

alter table TransactionsArchive 
add constraint  PK_TransactionsArchive_KeyID primary key clustered (key_ID);
GO

3.  How to re-sequence an identity column


alter table List_NotesType
drop constraint PK_List_CommentsType

alter table List_NotesType
drop column NoteType

alter table List_NotesType
add NoteType tinyint identity(1,1)

alter table List_NotesType
add constraint PK_List_NoteTypeID primary key clustered (NoteType);

4. How to change a not null col to null

alter table [Notes] 
alter column Comments varchar(500) NULL

5. How to update existing NULL cols with a value

update Notes 
set Comments = t.Description
from Notes n join List_NotesType t on n.Type = t.NoteType
where Noteid in (select Noteid from Notes where Comments is null)

Tuesday, September 17, 2013

How to use cross apply instead of cursors in SQL Server

I recently ran into a stored procedure with cursor logic, and I wanted to see what the performance gain would be if the cursor logic were rewritten with set-based operations.

Here is a simplified description of the stored procedure. There is a table with five columns: one column is an identity and the primary key (let's call it the ID), three columns (x, y, z) are integers, and the fifth column (a) holds a computed value. The computation is quite complex, so it cannot be declared as a computed column expression. For each row, the values of the three columns (x, y, z) are passed as parameters to a custom function (where the logic is encapsulated), which spits out a calculated value. Finally, for every row, column "a" is updated with the calculated value from the custom function. There are around 7,000 rows in this table.

I put this blogpost on code project as well. Check out the following link
http://www.codeproject.com/Tips/654894/How-to-use-cross-apply-instead-of-cursors-in-SQL-S


Let's create a table called test1 with five columns. For this example let's stick with simple logic: the fifth column is the sum of columns x, y and z.

Step 1: Create the test table  
create table test1 
(
    id int not null identity(1,1), 
    x int,
    y int,
    z int,
    a int null
)

Step 2: Insert dummy data    
insert into test1 values (5,5,5, NULL)
go 10
-- (I inserted 9132 rows, took five minutes for the code to execute) 

Step 3: Using cursors to update the column "a"  

-- variable mapping: @x1 = id, @x2 = x, @x3 = y, @x4 = z, @x5 = a (the value to compute)
declare @x1 int

declare @x2 int
declare @x3 int
declare @x4 int
declare @x5 int
declare c1 cursor local for  
select id, x, y, z, a from test1
open c1
    while (0=0)
        begin
        fetch next from c1 into 
        @x1, @x2,@x3,@x4,@x5
        
        if (@@FETCH_STATUS = -1)
            break
        
        -- your logic
        set @x5 = @x4 + @x2 + @x3
        
    
        update test1
        set test1.a = @x5
        from test1
        where id = @x1
        end
close c1
deallocate c1

-- Execution time: 01:09

Step 4: Reset column "a" 

update test1 set a = null 

Step 5: Create a table-valued function (TVF) as shown below

create function dbo.fnsomelogic (@x int ,@y int, @z int)
returns @val table
(
    q int
)
as 
begin 
    declare @q int 
    set @q = @x + @y +@z    
    insert into @val (q) values (@q)
    return
end

-- tvf can be invoked as shown below
-- select * from dbo.fnsomelogic(1,2,3)
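
As a side note, the same logic could also be written as an inline table-valued function (a single RETURN SELECT, no intermediate table variable), which SQL Server can expand into the calling query and which typically performs better than a multi-statement TVF. A minimal sketch (the function name is made up):

-- inline TVF sketch: same calculation, no table variable
create function dbo.fnsomelogic_inline (@x int, @y int, @z int)
returns table
as
return (select @x + @y + @z as q)
go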

Step 6: Use CROSS APPLY and the TVF to update column "a"

update test1
set test1.a = c.q
from test1 b cross apply dbo.fnsomelogic(b.x,b.y,b.z) c
--Execution Time: (9132 row(s) affected) in less than a second.   

If you observe the messages, the cursor, which is a row-based operator, displays (1 row(s) affected) for every row it updates, unlike the cross apply, which displays (9132 row(s) affected) once. Although the problem seems to be screaming "use cursors", with a little observation a CROSS APPLY along with a table-valued function can boost performance significantly. Relational/set theory concepts are deeply embedded in SQL Server.

Tuesday, September 10, 2013

Stored Procs 101

Stored procedures are programmability features of a database engine.

Advantages of stored procs (sp)

1. Reduce server/client network traffic
2. Stronger security 
3. Modularization & Code re-usability
4. Easier maintenance
5. Improved performance 

-- 1.  basic sp

create procedure TestProc1
as 
begin --optional
select top 5 * from sales.customers 
end -- optional 

-- 2. execute a sp

exec TestProc1 -- or execute TestProc1

-- 3. alter a sp 

alter procedure TestProc1
as
begin 
select top 10 * from sales.customers
end

-- 4. drop a sp

drop procedure TestProc1

-- 5. Rename a sp (note: the "sp_" prefix is reserved for system stored procs, so avoid it for user procs)

exec sp_rename 'TestProc1', 'TestProc2'

-- 6. View the definition of sp

-- approach 1

exec sp_helptext 'TestProc1'

-- approach 2

select object_definition (object_id('TestProc'))

-- approach 3

select definition 
from sys.sql_modules
where object_id = (object_id('TestProc'))

-- 7 sp with one parameter

create procedure TestProc
( -- optional 
@custid int 
) -- optional
as
select * from sales.customers where custid = @custid

-- execute sp with parameters 

exec testproc 3

--or 

exec testProc @custid = 3

-- 8  sp with more than one parameter
create procedure TestProc
@custid int,
@contacttitle varchar(30) 
as
select * from sales.customers where custid = @custid
and contacttitle = @contacttitle

-- 9 proc which returns data, use output keyword

create procedure TestProc
@custid int,
@contacttitle varchar(30),
@companyname varchar(40) output
as
select @companyname = companyname
from sales.customers
where custid = @custid and contacttitle = @contacttitle


declare @c varchar(40)
exec testproc 2, 'Owner', @companyname = @c output
select @c -- print @c

If you don't use the output keyword, the sp will return a result set (a table) when called from .NET code.

If we use the output keyword, a single value is returned, which can be captured as a .NET data type.

Error codes of sp with output keyword

0 : Successful execution
1 : Required param is not specified
2 : Specified param value is not valid
3 : Error has occurred in getting the value
4 : Null value found for variable 


-- 10 recompiling sp
There are three ways to recompile a sp; the first two are shown here and the third follows below

exec testproc with recompile

exec sp_recompile 'testproc'
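
The third way is to define the procedure itself with the recompile option, so a fresh plan is compiled on every execution (shown on a new procedure name so it doesn't clash with the examples above):

create procedure TestProcRecompile
with recompile
as
select top 5 * from sales.customers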



Tuesday, August 27, 2013

Powershell File Copy

Microsoft PowerShell is a nifty scripting tool that is deeply integrated with the .NET Framework. Recently I had an opportunity to write an automated script for copying files newer than a certain date from one location to another. This is a rudimentary task, and there are plenty of tested approaches such as robocopy, but I decided to write my own script in PowerShell.

Check out the implementation details at 
http://www.codeproject.com/Tips/642297/File-copy-using-Microsoft-Powershell

Check out the code at
https://github.com/tkmallik/Powershell/blob/master/PSFileCopy.ps1