Elegant Data
Thursday, March 11, 2021
Add a uuid column to a spark dataframe
Recently, I came across a use case where I had to add a new column with a hex uuid to an existing Spark dataframe. Here are two ways we can achieve that:
import uuid
import pyspark.sql.functions as f
from pyspark.sql.types import StringType
# method 1: use a udf -- generates a different uuid for every row
uuid_udf = f.udf(lambda: str(uuid.uuid4().hex), StringType())
df_with_uuid = df.withColumn('uuid', uuid_udf())
# method 2: use lit -- note uuid4() is evaluated once on the driver,
# so every row gets the same value
df_with_uuid = df.withColumn('uuid', f.lit(uuid.uuid4().hex))
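As a quick aside, `uuid.uuid4().hex` is just the 32-character hex digest of a random UUID; a minimal standalone sketch, no Spark required:

```python
import uuid

# uuid4() draws a random UUID; .hex drops the dashes, leaving 32 hex characters
u1 = uuid.uuid4().hex
u2 = uuid.uuid4().hex

print(len(u1))    # 32
print(u1 == u2)   # False -- each call yields a new value
```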
Wednesday, August 19, 2020
kwargs vs args in python
What is the difference between kwargs and args?
args is a tuple of anonymous (positional) arguments.
kwargs is a dictionary of named arguments; it stands for keyword arguments.
def func(*args, **kwargs):
    print('args:', args, 'kwargs:', kwargs)

func('a')
# args: ('a',) kwargs: {}
func(a=1, b=2, c=3)
# args: () kwargs: {'a': 1, 'b': 2, 'c': 3}
func('x', 'y', a=1, b=2, c=3)
# args: ('x', 'y') kwargs: {'a': 1, 'b': 2, 'c': 3}
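A common use of *args/**kwargs is forwarding whatever arguments a wrapper receives on to another function; a minimal sketch (the `logged` decorator name is just for illustration):

```python
def logged(fn):
    # wrapper accepts any arguments and forwards them unchanged to fn
    def wrapper(*args, **kwargs):
        print('calling', fn.__name__, 'args:', args, 'kwargs:', kwargs)
        return fn(*args, **kwargs)
    return wrapper

@logged
def add(a, b, scale=1):
    return (a + b) * scale

print(add(2, 3, scale=10))  # 50
```

Because the wrapper never names `a`, `b`, or `scale` explicitly, it works unchanged for any function it decorates.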
Friday, August 10, 2018
Reset password for postgres user and load sample database
In this post, I want to talk about three things: first, how to create a password for a Postgres user; second, how to load a csv into a sample database; and finally, how to access this table using the python/psycopg2 package.
Setup password for Postgres user:
In my previous post, I created a user krishna in Postgres, but I didn't come across a password setup step during installation. A quick search on StackOverflow showed this approach:
$ psql
psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))
Type "help" for help.
krishna=# \password
Enter new password:
Enter it again:
krishna=#

Load csv into a Postgres database.
I found this example on the Postgres site; it has details of land sales in the UK going back several decades, and is 3.5GB as of August 2016. Click here for the file.
-- Executing query:
CREATE TABLE land_registry_price_paid_uk(
    transaction uuid,
    price numeric,
    transfer_date date,
    postcode text,
    property_type char(1),
    newly_built boolean,
    duration char(1),
    paon text,
    saon text,
    street text,
    locality text,
    city text,
    district text,
    county text,
    ppd_category_type char(1),
    record_status char(1));
Query returned successfully with no result in 273 msec.

-- Executing query:
COPY land_registry_price_paid_uk FROM '/home/krishna/Downloads/pp_100k.csv'
    with (format csv, encoding 'win1252', header false, null '', quote '"',
          force_null (postcode, saon, paon, street, locality, city, district));
Query returned successfully: 100000 rows affected, 1.1 secs execution time.
Install psycopg2:
$ pip install psycopg2
Collecting psycopg2
  Downloading https://files.pythonhosted.org/packages/7c/e6/d5161798a5e8900f24216cb730f2c2be5e4758a80d35c8588306831c0c99/psycopg2-2.7.5-cp27-cp27mu-manylinux1_x86_64.whl (2.7MB)
    100% |████████████████████████████████| 2.7MB 316kB/s
Installing collected packages: psycopg2
Successfully installed psycopg2-2.7.5

Example call from python:
#!/usr/bin/python
import psycopg2
import pprint

def main():
    conn_string = "host='localhost' dbname='oflc' user='krishna' password='*****'"
    # print the connection string we will use to connect
    print "Connecting to database\n ->%s" % (conn_string)
    # get a connection; if a connection cannot be made an exception will be raised here
    conn = psycopg2.connect(conn_string)
    # conn.cursor will return a cursor object, you can use this cursor to perform queries
    cursor = conn.cursor()
    # execute our query
    cursor.execute("select * from land_registry_price_paid_uk limit 10")
    # retrieve the records from the database
    records = cursor.fetchall()
    # print out the records using pretty print
    # note that the NAMES of the columns are not shown, just indexes;
    # the next example shows how to return columns as a dictionary (hash)
    pprint.pprint(records)

if __name__ == "__main__":
    main()
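To get rows keyed by column name instead of by index, one option (a sketch, not from the original post) is a small helper that zips the column names from the DB-API `cursor.description` with each row tuple. The sample `description` and `records` below are hypothetical stand-ins for what the cursor would return:

```python
def rows_as_dicts(description, records):
    """Pair column names from a DB-API cursor.description with row tuples."""
    # each entry of description is a sequence whose first element is the column name
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in records]

# hypothetical cursor.description and fetched rows, for illustration only
description = [('city',), ('price',)]
records = [('dallas', 125000), ('denver', 98000)]
print(rows_as_dicts(description, records))
# [{'city': 'dallas', 'price': 125000}, {'city': 'denver', 'price': 98000}]
```

psycopg2 also ships this behavior natively: open the cursor with `conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)` and fetches return dict-like rows directly.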
Tuesday, July 31, 2018
Binary Tree in Python - Part 1
In this post I want to write a little bit about the implementation details of binary trees in my favorite language, Python. Before we jump in, conceptually, what is a binary tree? A binary tree is a data structure made of nodes, where each node has at most two child items.
But wait, isn't this a tree?
Example binary tree from wikipedia
A tree in real life has a trunk, roots (typically in the ground), branches, stems and leaves, but in computer science we look at an inverted tree structure: we start with the root and go down to the leaves, as shown in the numeric example above. Every circle is called a Node, and each Node can have left and right child items.
How to materialize a Node?
In Python, we materialize a Node by defining a class as shown below
class Node(object):
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None
Why should we define a class called Node?
Think of this Node as our abstract, or user-defined, data type, similar to int, float etc. By defining a Node we can leverage the left and right child items, which are also of type Node. Remember, everything in Python is an object. Take for example the assignment statement a=1: if you check the type of a, we see it's of type int. int itself is a class, and every class has a constructor, so when you create a variable b = int(), it automatically defaults to the value zero.
a = 1
print type(a)  # <type 'int'>
b = int()
print b  # 0
c = int(1)
print c  # 1
Going back to our Node class, we can instantiate objects of type Node this way
root = Node(5)
print root.data, root.left, root.right
# 5 None None

We can manually create a binary tree structure by assigning the left and right child items of the root node with a variable of type Node:
root = Node(5)
print root.data, root.left, root.right
# 5 None None
n4 = Node(4)
print n4.data, n4.left, n4.right
# 4 None None
n3 = Node(3)
print n3.data, n3.left, n3.right
# 3 None None
root.left = n3
root.right = n4
print root.data, root.left.data, root.right.data
# 5 3 4

If we examine the type of root.left or root.right, we see it's an object of class '__main__.Node'. But why should we assign root.left=n3; why can't we say root.left=3? The latter assignment does put the value three in root's left child, but that child is then of type int, so we cannot attach any further Nodes under root.left. Also, if you notice, root.left.data and n3.data are the same. Originally root.left was None; later we assigned root.left = n3, so root.left is of type Node, and hence we can access the data attribute through dot notation.
print root.left.data, n3.data
# 3 3

In part-2, I will cover how to automatically insert a new Node, other interesting methods, and some common interview questions around binary trees.
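The type distinction above can be sketched as follows (Python 3 print syntax here, unlike the Python 2 snippets in the post): a child that is a Node can carry further children, while a child that is a plain int is a dead end.

```python
class Node(object):
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

root = Node(5)
root.left = Node(3)   # a Node: we can keep attaching children under it
root.right = 4        # an int: no .left/.right attributes to build on

print(isinstance(root.left, Node))   # True
print(isinstance(root.right, Node))  # False
print(root.left.data)                # 3
```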
Saturday, July 28, 2018
Setup/Install Postgres 10 in Ubuntu
Just to make sure we have a clean install, check and purge any existing postgres tools.
krishna@dev:~$ dpkg -l | grep postgres
ii postgresql-client 10+190 all front-end programs for PostgreSQL (supported version)
ii postgresql-client-10 10.4-0ubuntu0.18.04 amd64 front-end programs for PostgreSQL 10
ii postgresql-client-common 190 all manager for multiple PostgreSQL client versions
Run purge
krishna@dev:~$ sudo apt-get --purge remove postgresql-client postgresql-client-10 postgresql-client-common
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
postgresql-client* postgresql-client-10* postgresql-client-common*
0 upgraded, 0 newly installed, 3 to remove and 2 not upgraded.
After this operation, 3,423 kB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database ... 203926 files and directories currently installed.)
Removing postgresql-client (10+190) ...
Removing postgresql-client-10 (10.4-0ubuntu0.18.04) ...
Removing postgresql-client-common (190) ...
Processing triggers for man-db (2.8.3-2) ...
(Reading database ... 203671 files and directories currently installed.)
Purging configuration files for postgresql-client-common (190) ...
Install postgres10 using apt-get
krishna@dev:~$ sudo apt-get install postgresql-10
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
postgresql-client-10 postgresql-client-common postgresql-common sysstat
Suggested packages:
locales-all postgresql-doc-10 isag
The following NEW packages will be installed:
postgresql-10 postgresql-client-10 postgresql-client-common postgresql-common sysstat
0 upgraded, 5 newly installed, 0 to remove and 2 not upgraded.
Need to get 4,204 kB/5,167 kB of archives.
After this operation, 20.3 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://us.archive.ubuntu.com/ubuntu bionic/main amd64 postgresql-common all 190 [157 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu bionic-updates/main amd64 postgresql-10 amd64 10.4-0ubuntu0.18.04 [3,752 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu bionic/main amd64 sysstat amd64 11.6.1-1 [295 kB]
Fetched 4,204 kB in 3s (1,497 kB/s)
Preconfiguring packages ...
Selecting previously unselected package postgresql-client-common.
(Reading database ... 203669 files and directories currently installed.)
Preparing to unpack .../postgresql-client-common_190_all.deb ...
Unpacking postgresql-client-common (190) ...
Selecting previously unselected package postgresql-client-10.
Preparing to unpack .../postgresql-client-10_10.4-0ubuntu0.18.04_amd64.deb ...
Unpacking postgresql-client-10 (10.4-0ubuntu0.18.04) ...
Selecting previously unselected package postgresql-common.
Preparing to unpack .../postgresql-common_190_all.deb ...
Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common'
Unpacking postgresql-common (190) ...
Selecting previously unselected package postgresql-10.
Preparing to unpack .../postgresql-10_10.4-0ubuntu0.18.04_amd64.deb ...
Unpacking postgresql-10 (10.4-0ubuntu0.18.04) ...
Selecting previously unselected package sysstat.
Preparing to unpack .../sysstat_11.6.1-1_amd64.deb ...
Unpacking sysstat (11.6.1-1) ...
Setting up sysstat (11.6.1-1) ...
Creating config file /etc/default/sysstat with new version
update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode
Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service.
Processing triggers for ureadahead (0.100.0-20) ...
Setting up postgresql-client-common (190) ...
Processing triggers for systemd (237-3ubuntu10.3) ...
Setting up postgresql-common (190) ...
Adding user postgres to group ssl-cert
Creating config file /etc/postgresql-common/createcluster.conf with new version
Building PostgreSQL dictionaries from installed myspell/hunspell packages...
en_us
Removing obsolete dictionary files:
Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service.
Processing triggers for man-db (2.8.3-2) ...
Setting up postgresql-client-10 (10.4-0ubuntu0.18.04) ...
update-alternatives: using /usr/share/postgresql/10/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode
Setting up postgresql-10 (10.4-0ubuntu0.18.04) ...
Creating new PostgreSQL cluster 10/main ...
/usr/lib/postgresql/10/bin/initdb -D /var/lib/postgresql/10/main --auth-local peer --auth-host md5
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/10/main ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start
Ver Cluster Port Status Owner Data directory Log file
10 main 5432 down postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode
Processing triggers for systemd (237-3ubuntu10.3) ...
Processing triggers for ureadahead (0.100.0-20) ...
Postgres is installed at this point
Data file is stored here: /var/lib/postgresql/10/main
Transaction logs are here: /var/log/postgresql/postgresql-10-main.log
Postgres application userid and group:
id postgres
uid=126(postgres) gid=133(postgres) groups=133(postgres),117(ssl-cert)
Postgres is listening on port: 5432
Database encoding is set to "UTF8" and cluster locale is "en_US.UTF-8"
Start script is
/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start
For some reason, when I kick off the instance I kept getting this error
postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
Ended up restarting the server
postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status
pg_ctl: no server running
postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile restart
pg_ctl: PID file "/var/lib/postgresql/10/main/postmaster.pid" does not exist
Is server running?
starting server anyway
waiting for server to start.... done
server started
postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status
pg_ctl: server is running (PID: 15574)
/usr/lib/postgresql/10/bin/postgres "-D" "/var/lib/postgresql/10/main" "-c" "config_file=/etc/postgresql/10/main/postgresql.conf"
let's create a role/user, so you are not using the application user id
postgres@dev:~$ psql
psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))
Type "help" for help.
postgres=# CREATE ROLE krishna SUPERUSER LOGIN REPLICATION CREATEDB CREATEROLE;
CREATE ROLE
postgres=# CREATE DATABASE krishna OWNER krishna;
CREATE DATABASE
postgres=# \q
postgres@dev:~$ exit
logout
Create database
krishna=# create database oflc;
CREATE DATABASE
List all available databases
krishna=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
krishna | krishna | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
oflc | krishna | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)
Connect to oflc database and create test table
krishna=# \c oflc
You are now connected to database "oflc" as user "krishna".
oflc=# create table test (c1 int);
CREATE TABLE
oflc=# insert into test values (1);
INSERT 0 1
oflc=# select * from test;
c1
----
1
(1 row)
oflc=# drop table test;
DROP TABLE
Other useful commands
oflc=# \c
You are now connected to database "oflc" as user "krishna".
oflc=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
krishna | krishna | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
oflc | krishna | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)
oflc=# \conninfo
You are connected to database "oflc" as user "krishna" via socket in "/var/run/postgresql" at port "5432".
oflc=# select * from information_schema.tables where table_schema='public';
oflc=# \q
Finally, shut down server
postgres@dev:~$ /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile stop
waiting for server to shut down.... done
server stopped
To make life easier, wrap start|stop commands into a shell script
cat postgres_instance_manager.sh
#!/bin/bash
#set -x
pg_start() {
sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start'
}
pg_status(){
sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile status'
}
pg_restart(){
sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile restart'
}
pg_stop(){
sudo su - postgres -c '/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile stop'
}
while getopts ":srtk" OPTION; do
echo ${OPTION}
case ${OPTION} in
s)
pg_start
;;
r)
pg_restart
;;
t)
pg_status
;;
k)
pg_stop
;;
\?)
echo "Usage: postgres_instance_manager.sh [-s | START] [-r | RESTART] [-t | STATUS] [-k | STOP]"
exit 1
;;
esac
done
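As a quick sanity check from Python that the server is actually listening, a small stdlib-only sketch; the host and port are assumptions matching a default local install:

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and DNS failures
        return False

print(is_port_open('localhost', 5432))  # True when Postgres is up
```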
Sunday, June 18, 2017
How to read HBase table from Scala Spark
Step 1: Create a dummy table called customers in HBase; please refer to this link on how to populate the table: https://mapr.com/products/mapr-sandbox-hadoop/tutorials/tutorial-getting-started-with-hbase-shell/
hbase(main):004:0> scan '/user/user01/customer'
ROW COLUMN+CELL
amiller column=addr:state, timestamp=1497809527266, value=TX
jsmith column=addr:city, timestamp=1497809526053, value=denver
jsmith column=addr:state, timestamp=1497809526080, value=CO
jsmith column=order:date, timestamp=1497809490021, value=10-18-2014
jsmith column=order:numb, timestamp=1497809526118, value=6666
njones column=addr:city, timestamp=1497809525860, value=miami
njones column=addr:state, timestamp=1497809526151, value=TX
njones column=order:numb, timestamp=1497809525941, value=5555
tsimmons column=addr:city, timestamp=1497809525998, value=dallas
tsimmons column=addr:state, timestamp=1497809526023, value=TX
4 row(s) in 0.0310 seconds
Step 2: Next is reading this table in Spark. I used spark-shell to read the table; keyValueRDD is what we are looking for.
[mapr@maprdemo ~]$ /opt/mapr/spark/spark-2.1.0/bin/spark-shell
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1497831718510).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-mapr-1703
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
scala> import org.apache.spark._
import org.apache.spark._
scala> import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.rdd.NewHadoopRDD
scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
scala> import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Path
scala> import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HColumnDescriptor
scala> import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.util.Bytes
scala> import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Put
scala> import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTable
scala>
scala>
scala> val tableName = "/user/user01/customer"
tableName: String = /user/user01/customer
scala>
scala>
scala> val hconf = HBaseConfiguration.create()
hconf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml
scala>
scala> hconf.set(TableInputFormat.INPUT_TABLE, tableName)
scala>
scala>
scala>
scala> val admin = new HBaseAdmin(hconf)
warning: there was one deprecation warning; re-run with -deprecation for details
admin: org.apache.hadoop.hbase.client.HBaseAdmin = org.apache.hadoop.hbase.client.HBaseAdmin@2093bb6c
scala>
scala> val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:38
scala> val result = hBaseRDD.count()
result: Long = 4
scala>
scala> val resultRDD = hBaseRDD.map(tuple => tuple._2)
resultRDD: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[1] at map at <console>:40
scala> resultRDD
res1: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[1] at map at <console>:40
scala> val keyValueRDD = resultRDD.map(result =>
     | (Bytes.toString(result.getRow()).split(" ")(0),
     | Bytes.toString(result.value)))
keyValueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:44
scala> keyValueRDD.collect()
res2: Array[(String, String)] = Array((amiller,TX), (jsmith,denver), (njones,miami), (tsimmons,dallas))
Friday, April 3, 2015
Bill Gates 2030 Vision
I recently watched this video https://youtu.be/8RETFyDKcw0, in which The Verge interviewed Bill Gates about his vision for 2030. This is the man who predicted every home would have a PC, which turned out to be true; so what would Mr. Gates' vision be fifteen years from now?
Four key areas for improvement: health care, farming, banking and education.
Key takeaways:
- This man is serious about his goals/visions; each sector has very specific goals, and it's very hard to come up with goals/visions this concrete
- Health
- Upstream: inventing new vaccines specifically for kids less than five years
- Downstream: how do you get them out to kids around the world
- Goal: currently one out of twenty kids dies before the age of 5; this should improve to one in forty
- Farming
- Better seeds with resistance to heat & low water, which hints at GMO stuff, but at least educating farmers about the benefits
- Improved credit & loan systems for farmers
- Increase world food productivity
- Education
- Emphasis on online learning
- Improve critical programming skills
- Basics of reading & writing
- Tablet computing should have a sandbox to test new code
- Banking
- In my view the banking vision is radical, and I see it coming soon; for small transactions banks are losing money
- Banking digital infrastructure should be revised to create a utility-type service that lets you move money to pay someone else or switch from one bank to another. This calls for a new regulatory infrastructure where the money transfer system is licensed the way phone numbers are (switching bank accounts should be like switching phone services)