Elegant Data

Thursday, March 11, 2021

Add a uuid column to a spark dataframe

Recently, I came across a use case where i had to add a new column uuid in hex to an existing spark dataframe, here are two ways we can achieve that


import pyspark.sql.functions as f
from pyspark.sql.types import StringType

# method 1 use udf 
uuid_udf = f.udf(lambda : str(uuid.uuid4().hex), StringType())
df_with_uuid = df.withColumn('uuid', uuid_udf())

# method 2 use lit 
df_with_uuid = df.withColumn('uuid', f.lit(uuid.uuid4().hex))

Wednesday, August 19, 2020

kwargs vs args in python

What is the difference between kwargs and args?

args is a tuple of anonymous arguments

kwargs is a dictionary of named arguments, it stands for keyword arguments

def func(*args, **kwargs):

print('args: ', args, ' kwargs: ', kwargs)

func('a')

args: ('a',) kwargs: {}

func(a=1,b=2,c=3)

args: () kwargs: {'a':1, 'b':2, 'c':3}

func('x','y',a=1,b=2,c=3)

args: ('x', 'y') kwargs: {'a':1, 'b':2, 'c':3}

Friday, August 10, 2018

Reset password for postgres user and load sample database

In this post, I want talk about three things, first, how to create a password for a postgres user. Second, how to load a csv into a sample database and finally how to access this table using python/psycopg package.
Setup password for Postgres user:
In my previous post, I created a user krishna in Postgres but I haven't come across during installation a password setup step. Quick search in StackOverflow showed this approach

$psql
psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))
Type "help" for help.

krishna=# \password
Enter new password: 
Enter it again: 
krishna=#

Load csv into a Postgres database.
Found this example of postgres site, it has details of land sales in the UK, going back several decades, and is 3.5GB as of August 2016,click here for the file

-- Executing query:
CREATE TABLE land_registry_price_paid_uk(
  transaction uuid,
  price numeric,
  transfer_date date,
  postcode text,
  property_type char(1),
  newly_built boolean,
  duration char(1),
  paon text,
  saon text,
  street text,
  locality text,
  city text,
  district text,
  county text,
  ppd_category_type char(1),
  record_status char(1));
Query returned successfully with no result in 273 msec.

-- Executing query:
COPY land_registry_price_paid_uk FROM '/home/krishna/Downloads/pp_100k.csv' 
with (format csv, encoding 'win1252', header false, null '', quote '"', force_null (postcode, saon, paon, street, locality, city, district));
Query returned successfully: 100000 rows affected, 1.1 secs execution time.

Install psycopg2 pip install psycopg2

$ pip install psycopg2
Collecting psycopg2
  Downloading https://files.pythonhosted.org/packages/7c/e6/d5161798a5e8900f24216cb730f2c2be5e4758a80d35c8588306831c0c99/psycopg2-2.7.5-cp27-cp27mu-manylinux1_x86_64.whl (2.7MB)
    100% |████████████████████████████████| 2.7MB 316kB/s 
Installing collected packages: psycopg2
Successfully installed psycopg2-2.7.5

Example call from python

#!/usr/bin/python
import psycopg2
import sys
import pprint
 
def main():
	conn_string = "host='localhost' dbname='oflc' user='krishna' password='*****'"
	# print the connection string we will use to connect
	print "Connecting to database\n	-&gt;%s" % (conn_string)
 
	# get a connection, if a connect cannot be made an exception will be raised here
	conn = psycopg2.connect(conn_string)
 
	# conn.cursor will return a cursor object, you can use this cursor to perform queries
	cursor = conn.cursor()
 
	# execute our Query
	cursor.execute("select * from land_registry_price_paid_uk limit 10")
 
	# retrieve the records from the database
	records = cursor.fetchall()
 
	# print out the records using pretty print
	# note that the NAMES of the columns are not shown, instead just indexes.
	# for most people this isn't very useful so we'll show you how to return
	# columns as a dictionary (hash) in the next example.
	pprint.pprint(records)
 
if __name__ == "__main__":
	main()

Tuesday, July 31, 2018

Binary Tree in Python - Part 1

In this post I want to write a little bit about the implementation details of Binary Tree's in my favorite language Python. Before we jump in, conceptually what is a binary tree? A binary tree is a data structure with a node and each node having at most two child items.

Example binary tree from wikipedia

But wait isn't this a tree?

A tree in real life has trunk, and roots typically in the ground, branches, stems and leaves but in comp science we look at an inverted tree structure, we start with the root and go down to the leaves as shown in the numeric example above. Every circle is called a Node and each Node can have left and right child items.

How to materialize a Node?

In Python, we materialize a Node by defining a class as shown below

class Node(object):
  def __init__(self, data):
    self.data=data
    self.left=None
    self.right=None

Why should we define a class called Node?

Think of this node as our abstract or user defined data type, this is similar to int, float etc. By defining a Node we can leverage the left and right child items which are also of type Node. Remember, everything in Python is an object. Take for example a variable assignment statement a=1, if you check the type of a we see its of type int, int itself is a class and every class has a constructor so when you create a variable b = int(), this automatically defaults to value zero.

a=1
print type(a)
#

b=int()
print b
#0 

c=int(1)
print c
#1

Going back to our Node class, we can instantiate objects of type Node this way

root=Node(5)
print root.data, root.left, root.right
# 5 None None

We can manually create a Binary tree structure by assigning left and right child items of root node with a variable of type Node

root=Node(5)
print root.data, root.left, root.right
# 5 None None
n4=Node(4)
print n4.data,n4.left,n4.right
# 4 None None 
n3=Node(3)
print n3.data,n3.left,n3.right
# 3 None None
root.left=n3
root.right=n4
print root.data, root.left.data, root.right.data
# 5 3 4

If we examine the type of root.left or root.right we see its an object of class '__main__.Node'. But why should we assign root.left=n3, why can't we say root.left=3? The latter assignment although assigns value three to root's left child its of type int and cannot attach any other Nodes to root.left. Also if you notice, root.left.data and n3.data are same. Originally root.left was None, later we assigned root.left = n3 so root.left is of type Node and hence we can access the data attribute through the dot notation.

print root.left.data, n3.data 
#3 3

In part-2, I will cover how to automatically insert a new Node and other interesting methods and some common interview questions around binary tree

Saturday, July 28, 2018

Setup/Install Postgres 10 in Ubuntu

Just to make sure we have a clean install, check and purge any existing postgres tools.

krishna@dev:~$ dpkg -l | grep postgres
ii postgresql-client 10+190 all front-end programs for PostgreSQL (supported version)
ii postgresql-client-10 10.4-0ubuntu0.18.04 amd64 front-end programs for PostgreSQL 10
ii postgresql-client-common 190 all manager for multiple PostgreSQL client versions

Run purge

krishna@dev:~$ sudo apt-get --purge remove postgresql-client postgresql-client-10 postgresql-client-common
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
postgresql-client* postgresql-client-10* postgresql-client-common*
0 upgraded, 0 newly installed, 3 to remove and 2 not upgraded.
After this operation, 3,423 kB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database ... 203926 files and directories currently installed.)
Removing postgresql-client (10+190) ...
Removing postgresql-client-10 (10.4-0ubuntu0.18.04) ...
Removing postgresql-client-common (190) ...
Processing triggers for man-db (2.8.3-2) ...
(Reading database ... 203671 files and directories currently installed.)
Purging configuration files for postgresql-client-common (190) ...

Install postgres10 using apt-get

krishna@dev:~$ sudo apt-get install postgresql-10
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
postgresql-client-10 postgresql-client-common postgresql-common sysstat
Suggested packages:
locales-all postgresql-doc-10 isag
The following NEW packages will be installed:
postgresql-10 postgresql-client-10 postgresql-client-common postgresql-common sysstat
0 upgraded, 5 newly installed, 0 to remove and 2 not upgraded.
Need to get 4,204 kB/5,167 kB of archives.
After this operation, 20.3 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://us.archive.ubuntu.com/ubuntu bionic/main amd64 postgresql-common all 190 [157 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu bionic-updates/main amd64 postgresql-10 amd64 10.4-0ubuntu0.18.04 [3,752 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu bionic/main amd64 sysstat amd64 11.6.1-1 [295 kB]
Fetched 4,204 kB in 3s (1,497 kB/s)
Preconfiguring packages ...
Selecting previously unselected package postgresql-client-common.
(Reading database ... 203669 files and directories currently installed.)
Preparing to unpack .../postgresql-client-common_190_all.deb ...
Unpacking postgresql-client-common (190) ...
Selecting previously unselected package postgresql-client-10.
Preparing to unpack .../postgresql-client-10_10.4-0ubuntu0.18.04_amd64.deb ...
Unpacking postgresql-client-10 (10.4-0ubuntu0.18.04) ...
Selecting previously unselected package postgresql-common.
Preparing to unpack .../postgresql-common_190_all.deb ...
Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common'
Unpacking postgresql-common (190) ...
Selecting previously unselected package postgresql-10.
Preparing to unpack .../postgresql-10_10.4-0ubuntu0.18.04_amd64.deb ...
Unpacking postgresql-10 (10.4-0ubuntu0.18.04) ...
Selecting previously unselected package sysstat.
Preparing to unpack .../sysstat_11.6.1-1_amd64.deb ...
Unpacking sysstat (11.6.1-1) ...
Setting up sysstat (11.6.1-1) ...

Creating config file /etc/default/sysstat with new version
update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode
Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service.
Processing triggers for ureadahead (0.100.0-20) ...
Setting up postgresql-client-common (190) ...
Processing triggers for systemd (237-3ubuntu10.3) ...
Setting up postgresql-common (190) ...
Adding user postgres to group ssl-cert

Creating config file /etc/postgresql-common/createcluster.conf with new version
Building PostgreSQL dictionaries from installed myspell/hunspell packages...
en_us
Removing obsolete dictionary files:
Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service.
Processing triggers for man-db (2.8.3-2) ...
Setting up postgresql-client-10 (10.4-0ubuntu0.18.04) ...
update-alternatives: using /usr/share/postgresql/10/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode
Setting up postgresql-10 (10.4-0ubuntu0.18.04) ...
Creating new PostgreSQL cluster 10/main ...
/usr/lib/postgresql/10/bin/initdb -D /var/lib/postgresql/10/main --auth-local peer --auth-host md5
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/10/main ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start

Ver Cluster Port Status Owner Data directory Log file
10 main 5432 down postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode
Processing triggers for systemd (237-3ubuntu10.3) ...
Processing triggers for ureadahead (0.100.0-20) ...

Postgres is installed at this point

Data file is stored here: /var/lib/postgresql/10/main
Transaction logs are here: /var/log/postgresql/postgresql-10-main.log
Postgres application userid and group:
id postgres

uid=126(postgres) gid=133(postgres) groups=133(postgres),117(ssl-cert)
Postgres is listening on port: 5432
Database encoding is set to "UTF8" and cluster locale is "en_US.UTF-8"

Start script is
/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start

For some reason, when I kick off the instance I kept getting this error

postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... stopped waiting
pg_ctl: could not start server

Examine the log output.

Ended up restarting the server

postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status

pg_ctl: no server running

postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile restart

pg_ctl: PID file "/var/lib/postgresql/10/main/postmaster.pid" does not exist

Is server running?

starting server anyway

waiting for server to start.... done

server started

postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status

pg_ctl: server is running (PID: 15574)

/usr/lib/postgresql/10/bin/postgres "-D" "/var/lib/postgresql/10/main" "-c" "config_file=/etc/postgresql/10/main/postgresql.conf"

let's create a role/user, so you are not using the application user id

postgres@dev:~$ psql

psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))

Type "help" for help.

postgres=# CREATE ROLE krishna SUPERUSER LOGIN REPLICATION CREATEDB CREATEROLE;

CREATE ROLE

postgres=# CREATE DATABASE krishna OWNER krishna;

CREATE DATABASE

postgres=# \q

postgres@dev:~$ exit

logout

Create database

krishna=# create database oflc;

CREATE DATABASE

List all available databases

krishna=# \l 

                                  List of databases

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   

-----------+----------+----------+-------------+-------------+-----------------------

 krishna   | krishna  | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 

 oflc      | krishna  | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 

 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 

 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +

           |          |          |             |             | postgres=CTc/postgres

 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +

           |          |          |             |             | postgres=CTc/postgres

(5 rows)

Connect to oflc database and create test table

krishna=# \c oflc
You are now connected to database "oflc" as user "krishna".

oflc=# create table test (c1 int);
CREATE TABLE

oflc=# insert into test values (1);
INSERT 0 1

oflc=# select * from test;
c1
----
1
(1 row)

oflc=# drop table test;

DROP TABLE

Other useful commands

oflc=# \c

You are now connected to database "oflc" as user "krishna".

oflc=# \l

List of databases

-----------+----------+----------+-------------+-------------+-----------------------

(5 rows)

oflc=# \conninfo

You are connected to database "oflc" as user "krishna" via socket in "/var/run/postgresql" at port "5432".

oflc=# select * from information_schema.tables where table_schema='public';

oflc=# \q

Finally, shut down server

postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile stop

waiting for server to shut down.... done

server stopped

To make life easier, wrap start|stop commands into a shell script

cat postgres_instance_manager.sh

#!/bin/bash

#set -x

pg_start() {

sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start'

}

pg_status(){

sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status'

}

pg_restart(){

sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile restart'

}

pg_stop(){

sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile restart'

}

while getopts ":srtk" OPTION; do

echo ${OPTION}

case ${OPTION} in

pg_start

;;

pg_restart

;;

pg_status

;;

pg_stop

;;

\?)

echo "Usage: postgres_instance_manager.sh [-s | START] [-r | RESTART] [-t | STATUS] [-k | STOP]"

exit 1

;;

esac

done

Sunday, June 18, 2017

How to read HBase table from Scala Spark

Step 1: Create a dummy table called customers in HBase, please refer this link on how to populate this table https://mapr.com/products/mapr-sandbox-hadoop/tutorials/tutorial-getting-started-with-hbase-shell/

hbase(main):004:0> scan '/user/user01/customer'

ROW                                  COLUMN+CELL                                                                                              

 amiller                             column=addr:state, timestamp=1497809527266, value=TX                                                     

 jsmith                              column=addr:city, timestamp=1497809526053, value=denver                                                  

 jsmith                              column=addr:state, timestamp=1497809526080, value=CO                                                     

 jsmith                              column=order:date, timestamp=1497809490021, value=10-18-2014                                             

 jsmith                              column=order:numb, timestamp=1497809526118, value=6666                                                   

 njones                              column=addr:city, timestamp=1497809525860, value=miami                                                   

 njones                              column=addr:state, timestamp=1497809526151, value=TX                                                     

 njones                              column=order:numb, timestamp=1497809525941, value=5555                                                   

 tsimmons                            column=addr:city, timestamp=1497809525998, value=dallas                                                  

 tsimmons                            column=addr:state, timestamp=1497809526023, value=TX                                                     

4 row(s) in 0.0310 seconds

Step 2: Next is reading this table in Spark, I used spark-shell to read the table and keyValueRDD is what we are looking for

[mapr@maprdemo ~]$ /opt/mapr/spark/spark-2.1.0/bin/spark-shell 

Spark context Web UI available at http://10.0.2.15:4040

Spark context available as 'sc' (master = local[*], app id = local-1497831718510).

Spark session available as 'spark'.

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-mapr-1703

/_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)

Type in expressions to have them evaluated.

Type :help for more information.

scala> 

scala> import org.apache.spark._

import org.apache.spark._

scala> import org.apache.spark.rdd.NewHadoopRDD

import org.apache.spark.rdd.NewHadoopRDD

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.client.HBaseAdmin

import org.apache.hadoop.hbase.client.HBaseAdmin

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.hadoop.fs.Path;

import org.apache.hadoop.fs.Path

scala> import org.apache.hadoop.hbase.HColumnDescriptor

import org.apache.hadoop.hbase.HColumnDescriptor

scala> import org.apache.hadoop.hbase.util.Bytes

import org.apache.hadoop.hbase.util.Bytes

scala> import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.client.Put

scala> import org.apache.hadoop.hbase.client.HTable;

import org.apache.hadoop.hbase.client.HTable

scala> 

scala> 

scala> val tableName="/user/user01/customer"

tableName: String = /user/user01/customer

scala> 

scala> 

scala> val hconf = HBaseConfiguration.create()

hconf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml

scala> 

scala> hconf.set(TableInputFormat.INPUT_TABLE, tableName)

scala> 

scala> 

scala> 

scala> val admin = new HBaseAdmin(hconf)

warning: there was one deprecation warning; re-run with -deprecation for details

admin: org.apache.hadoop.hbase.client.HBaseAdmin = org.apache.hadoop.hbase.client.HBaseAdmin@2093bb6c

scala> 

scala> val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at :38

scala> val result = hBaseRDD.count()

result: Long = 4

scala> 

scala> val resultRDD = hBaseRDD.map(tuple => tuple._2) 

resultRDD: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[1] at map at :40

scala> resultRDD

res1: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[1] at map at :40

scala> val keyValueRDD = resultRDD.map(result =>

     | (Bytes.toString(result.getRow()).split(" ")(0),

     | Bytes.toString(result.value)))

keyValueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at :44

scala> keyValueRDD.collect()

res2: Array[(String, String)] = Array((amiller,TX), (jsmith,denver), (njones,miami), (tsimmons,dallas))

Friday, April 3, 2015

Bill Gates 2030 Vision

I recently watched this video https://youtu.be/8RETFyDKcw0, Verge interviewed Bill Gates about his vision for 2030, this is the man who predicted every home will have a PC, which turned out to be true, what is Mr.Gates vision fifteen years from would be..

Four key areas for improvement health care, farming, banking and education.

Key take aways:

This man is serious about his goals/visions, each sector has very specific goals, its very hard to come up with goals/vision
Health

Upstream: inventing new vaccines specifically for kids less than five years
Downstream: How do you get them out to kids around the worls
Goal: Currently one out of twenty kids dies before age of 5, this should increase to one in forty

Farming

Better seeds with resistance to heat & low water, which hints GMO stuff but at least educating farmers about the benefits
Improved credit & loan systems for farmers
Increase world food productivity

Education

Emphasis on online learning
Improve critical programming skills
Basics of reading & writing
Tablet computing should have a sandbox to test new code

Banking

In my view banking vision is radical and I see it coming soon, for small transactions banks are loosing money
Banking digital infrastructure should be revised to create a utility type service that lets you move money to pay someone else or switch from one bank to another. This calls in for a new regulatory infrastructure where money transfer system is licensed like phone numbers with banks (switching bank accounts should be like switching phone services)