Sunday, October 27, 2013

Big Data Camp 27/10/2013

Big Data Camp was my first un-conference in big data. It's really different from a regular conference: the sessions are created on the fly, and walking in and out is welcome. From what I have seen, it was non-technical, focused on the noob to intermediate skill level. The host, Dave, managed the event very eloquently. Overall, it was a good event. I compiled some small notes on what I liked.

Talk 1: Hadoop in consumer security products
  • This organization uses Hadoop to analyze the sales of their flagship consumer security products
  • The existing data warehouse products and tools were OK, but although a lot of customers were showing up on their site, sales were dropping.
  • The product was not at fault, but the customer analytics was. The primary touch points or data points (server logs, click streams etc.) were fed into Hadoop. Analytics was performed using SQL, Hive and Mahout, with linear regression used to model the data.
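
To give a feel for the modeling step, here is a minimal sketch of a least-squares linear regression in plain Python. The talk used Mahout on Hadoop, so this is not their pipeline. The variable names and the numbers (daily site visits vs. units sold) are made up purely for illustration.

```python
# Toy data, invented for illustration only: daily visits and units sold.
visits = [100, 200, 300, 400]
sales = [12, 19, 31, 38]

n = len(visits)
mean_x = sum(visits) / n
mean_y = sum(sales) / n

# Ordinary least squares for a single predictor:
# slope = covariance(x, y) / variance(x)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(visits, sales))
s_xx = sum((x - mean_x) ** 2 for x in visits)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(x):
    """Predicted sales for a given number of visits."""
    return intercept + slope * x


print(f"sales ~ {intercept:.2f} + {slope:.3f} * visits")
```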

Talk 2: How to choose a data analytics product
This presenter works at Datapad. “There is no one right way, there is a right way for your data, for your team” kinda made sense. The talk advocated the use of Python and the pandas library.

Talk 3: Black box vs transparent data modeling
I was trying to figure out what his point was. My understanding: the functionality of any Google product is basically a black box, and he was not cool with that or whatever.

Talk 4: Mainframe + Hadoop
This talk was pretty interesting; the presenter works at Syncsort, a company in CA.
  • 75% of data is stored in mainframes (fact, according to the presenter)
  • Insurance companies, retailers and banks store data in mainframes
  • Offload mainframe data and batch process it in Hadoop
  • Mainframes store data in the EBCDIC character encoding (new info)
  • ADP is hiring people for mainframes, seriously?? (Maybe I should learn COBOL.. heck no!!)
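
For the curious, here is a small sketch of what EBCDIC means in practice, using Python's built-in codecs. I am assuming code page 037 (a common US/Canada EBCDIC variant); real mainframe exports may use a different code page and often include packed-decimal fields that a plain text decode will not handle.

```python
# Stand-in for bytes read from a mainframe export (assumed to be code page 037).
ebcdic_bytes = "HELLO MAINFRAME".encode("cp037")

# Decoding with the matching code page recovers the readable text.
text = ebcdic_bytes.decode("cp037")
print(text)

# The same letter has a different byte value in EBCDIC than in ASCII,
# which is why the data has to be converted before Hadoop tools can read it.
print(hex(ebcdic_bytes[0]), hex("H".encode("ascii")[0]))
```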

Talk 5: Time series database – things happening in time
This was cool too: an open source database called InfluxDB, a database based on events.
  • HTTP native, show and do the analytics in a browser
  • Read/write, management and security over HTTP
  • I asked the presenter how this product is different from Storm, Spark & StreamInsight, but did not get a complete answer
Some other random tidbits: big data GUI tools such as IBM BigSheets, Datameer & Talend

Saturday, October 26, 2013

Python Tools for Data Science


I believe Python is increasingly becoming the de facto language for data science applications. Among the many good features of this language, string manipulation is ridiculously easy, and it is a very simple language to pick up. In this post I put together the Python tools and IDEs which I found useful for data science applications.
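
As a quick illustration of the string manipulation point, here is a small sketch that cleans up a messy comma-separated line. The sample line and the variable names are made up for this example.

```python
# A messy, made-up input line with stray whitespace around the fields.
line = "  2013-10-26, Python Tools ,data science  "

# Split on commas and trim whitespace from each field.
fields = [part.strip() for part in line.split(",")]
date, title, topic = fields

# Build a URL-friendly slug from the title.
slug = "-".join(title.lower().split())

summary = f"{topic.title()} post from {date}"
print(fields)
print(slug)
print(summary)
```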

Scientific Computing

IDE for Python


Basic math skills for data science  
  • Linear Algebra
  • Statistics
  • Probability
  • Calculus

Saturday, October 19, 2013

Handling M:M ring in Entity Modeling

Complex business process operations can be effectively visualized with Entity-Relationship (ER) modeling. ER modeling is a proven way to design and model relational databases; this design approach is based on Dr. Peter Chen's paper published in 1976. Before discussing the M:M ring problem in ER modeling, I will briefly describe the basics of ER modeling.

Nouns in an English sentence are usually the entities, e.g. Customers, Company, Teachers etc. The properties of these nouns are the attributes of the entities, e.g. height, weight, date etc.

Entities can be related to each other. The relation is usually a verb or verb phrase, e.g. “based on”, “for”, “basis of”, “bought from”, “operator of”, “issued for”, “stored in”, “responsible for” etc.

A simple ER example would be “Customers can place Orders”. In this example the relation is called 1:M: one customer can place more than one order (M for many). In other cases, such as a person and a social security number, the relation is 1:1: one person can have one social.
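
Here is a rough sketch of how these two kinds of relations could look as tables, using Python's built-in sqlite3 module. The table and column names are my own assumptions for illustration: a foreign key gives the 1:M relation, and a UNIQUE constraint on the foreign key turns it into 1:1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- 1:M: many orders can point at the same customer.
CREATE TABLE customer_order (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    item TEXT NOT NULL
);
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- 1:1: UNIQUE on the foreign key allows at most one social per person.
CREATE TABLE social (
    person_id INTEGER NOT NULL UNIQUE REFERENCES person(id),
    number TEXT NOT NULL
);
""")

# One customer places two orders.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ann')")
conn.executemany(
    "INSERT INTO customer_order (customer_id, item) VALUES (?, ?)",
    [(1, "book"), (1, "lamp")],
)
order_count = conn.execute(
    "SELECT COUNT(*) FROM customer_order WHERE customer_id = 1"
).fetchone()[0]

# One person gets one social; a second one is rejected by the constraint.
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO social (person_id, number) VALUES (1, '000-00-0000')")
try:
    conn.execute("INSERT INTO social (person_id, number) VALUES (1, '111-11-1111')")
    second_social_allowed = True
except sqlite3.IntegrityError:
    second_social_allowed = False

print(order_count, second_social_allowed)
```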


Modeling employment history in entities is a bit of a pickle. For instance, consider the entities person, company and position. A person can work at a company holding a position, or we can say a company hired a person for a position, or a position is held by a person employed at a company. Visually, this can be represented as shown below.




It becomes increasingly tricky to answer the following question: for each person, we want to track the position held, at what company, and for how long. This configuration leads to what is known as an M:M ring. This issue can be resolved by adding three other small entities as shown below.
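
To make the question concrete, here is a rough sketch using Python's built-in sqlite3 module. It is not necessarily the three-entity design referred to above: it uses one common alternative, a single associative table (which I call employment) that ties a person, a company and a position together with start and end years. All table names, column names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE job_position (id INTEGER PRIMARY KEY, title TEXT);
-- Associative table: one row per stint of a person in a position at a company.
CREATE TABLE employment (
    person_id INTEGER REFERENCES person(id),
    company_id INTEGER REFERENCES company(id),
    position_id INTEGER REFERENCES job_position(id),
    start_year INTEGER,
    end_year INTEGER
);
-- Made-up sample rows.
INSERT INTO person VALUES (1, 'Ann');
INSERT INTO company VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO job_position VALUES (1, 'Analyst'), (2, 'Manager');
INSERT INTO employment VALUES (1, 1, 1, 2005, 2008), (1, 2, 2, 2008, 2013);
""")

# For each person: the position held, at what company, and for how long.
rows = conn.execute("""
    SELECT p.name, jp.title, c.name, e.end_year - e.start_year AS years
    FROM employment AS e
    JOIN person AS p ON p.id = e.person_id
    JOIN job_position AS jp ON jp.id = e.position_id
    JOIN company AS c ON c.id = e.company_id
    ORDER BY e.start_year
""").fetchall()

for row in rows:
    print(row)
```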